On Friday Deploys: Sometimes that Puppy Needs Murdering (xpost)

(Cross posted from its original source)

‘Tis the season that one of my favorite blog posts gets pulled out and put in rotation, much like “White Christmas” on the radio station. I’m speaking, of course, of “Friday Deploy Freezes are Exactly Like Murdering Puppies” (old link on WP).

This feels like as good a time as any to note that I am not as much of an extremist as people seem to think I am when it comes to Friday deploys, or deploy freezes in general.

(Sometimes I wonder why people think I’m such an extremist, and then I remember that I did write a post about murdering puppies. Ok, ok. Point taken.)

Take this recent thread from LinkedIn, where Michael Davis posted an endorsement of my Puppies article along with his own thoughts on holiday code freezes, followed by a number of smart, thoughtful comments on why this isn’t actually attainable for everyone. Payam Azadi talks about an “icing” and “defrosting” period where you ease into and out of deploy freezes (never heard of this, I like it!), and a few other highly knowledgeable folks chime in with their own war stories and cautionary tales.

It’s a great thread, with lots of great points. I recommend reading it. I agree with all of them!!

For the record, I do not believe that everyone should get rid of deploy freezes, on Fridays or otherwise.

If you do not have the ability to move swiftly with confidence, which in practice means “you can generally find problems in your new code before your customers do”, and which generally comes down to the quality and usability of your observability tooling and your ability to explore high cardinality dimensions in real time (something most teams do not have), then deploy freezes before a holiday or a big event, or hell, even weekends, are probably the sensible thing to do.

If you can’t do the “right” thing, you find a workaround. This is what we do, as engineers and operators.

Deploy freezes are a hack, not a virtue

Look, you know your systems better than I do. If you say you need to freeze deploys, I believe you.

Honestly, I feel like I’ve always been fairly pragmatic about this. The one thing that does get my knickers in a twist is when people adopt a holier-than-thou posture towards their Friday deploy freezes. Like they’re doing it because they Care About People and it’s the Right Thing To Do and some sort of grand moral gesture. Dude, it’s a fucking hack. Just admit it.

It’s the best you can do with the hand you’ve been dealt, and there’s no shame in that! That is ALL I’m saying. Don’t pat yourself on the back, act a little sheepish, and I am so with you.

I think we can have nice things

I think there’s a lot of wisdom in saying “hey, it’s the holidays, this is not the time to be rushing new shit out the door absent some specific forcing function, alright?”

My favorite time of year to be at work (back when I worked in an office) was always the holidays. It was so quiet and peaceful, the place was empty, my calendar was clear, and I could switch gears and work on completely different things, out of the critical line of fire. I feel like I often peaked creatively during those last few weeks of the year.

I believe we can have the best of both worlds: a yearly period of peace and stability, with relatively low change rate, and we can evade the high stakes peril of locks and freezes and terrifying January recoveries.

How? Two things.

Don’t freeze deploys. Freeze merges.

To a developer, ideally, the act of merging their changes back to main and those changes being deployed to production should feel like one singular atomic action, the faster the better, the less variance the better. You merge, it goes right out. You don’t want it to go out, you better not merge.

The worst of both worlds is when you let devs keep merging diffs, checking items off their todo lists, closing out tasks, for days or weeks. All these changes build up like a snowdrift over a pile of grenades. You aren’t going to find the grenades til you plow into the snowdrift on January 5th, and then you’ll find them with your face. Congrats!

If you want to freeze deploys, freeze merges. Let people work on other things. I assure you, there is plenty of other valuable work to be done.
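
If you want to make the merge freeze mechanical instead of relying on a Slack announcement and good intentions, one option is a required CI check that simply fails pull request builds during the freeze window. Here is a minimal sketch in Python; the dates, the override label, and the PR_LABELS environment variable are all invented for illustration, not a reference to anyone’s real tooling.

```python
#!/usr/bin/env python3
"""CI check: fail pull request builds during a declared merge freeze.

Hypothetical sketch: wire this up as a required status check so merges to
main are blocked during the freeze, while deploys of already-merged code
stay possible. Dates, label, and env var names are placeholders.
"""
import os
import sys
from datetime import date, datetime, timezone

FREEZE_START = date(2024, 12, 20)   # adjust to your own holiday calendar
FREEZE_END = date(2025, 1, 2)       # inclusive
FREEZE_OVERRIDE_LABEL = "merge-freeze-override"  # escape hatch for emergencies


def main() -> int:
    today = datetime.now(timezone.utc).date()
    labels = os.environ.get("PR_LABELS", "").split(",")

    if not (FREEZE_START <= today <= FREEZE_END):
        print("No merge freeze in effect. Merge away.")
        return 0
    if FREEZE_OVERRIDE_LABEL in labels:
        print("Merge freeze in effect, but override label present. Allowing.")
        return 0
    print(f"Merge freeze in effect ({FREEZE_START} to {FREEZE_END}). "
          "Park this PR and go work on something else.")
    return 1


if __name__ == "__main__":
    sys.exit(main())
```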

Don’t freeze deploys unless your goal is to test deploy freezes

The second thing is a corollary. Don’t actually freeze deploys, unless your SREs and on call folks are bored and sitting around together, going “wouldn’t this be a great opportunity to test for memory leaks and other systemic issues that we don’t know about due to the frequency and regularity of our deploys?”

If that’s you, godspeed! Park that deploy engine and sit on the hood, let’s see what happens!

People always remember the outages and instability that we trigger with our actions. We tend to forget about the outages and instability we trigger with our inaction. But if you’re used to deploying every day, or many times a day: first, good for you. Second, I bet you a bottle of whiskey that something’s gonna break if you go for two weeks without deploying.

I bet you the good shit. Top shelf. 🥃

This one is so easy to mitigate, too. Just run the deploy process every day or two, but don’t ship new code out.
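
What that might look like in practice, assuming your deploy tooling can look up and redeploy the build that is already live: a scheduled job that re-runs the whole deploy path with zero new code, so the pipeline, the hosts, and the restart machinery all keep getting exercised. The `deploy_tool` commands below are hypothetical stand-ins for whatever your system actually exposes.

```python
#!/usr/bin/env python3
"""Scheduled no-op redeploy: exercise the deploy pipeline without new code.

Hypothetical sketch. Assumes your deploy tooling can (a) report the
currently live build and (b) redeploy it. `deploy_tool` is a placeholder.
"""
import subprocess
import sys

SERVICE = "my-service"  # placeholder service name


def current_release(service: str) -> str:
    """Ask the (hypothetical) deploy tool which build is live right now."""
    out = subprocess.run(
        ["deploy_tool", "current-release", service],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()


def redeploy(service: str, build_id: str) -> None:
    """Re-run the full deploy path for the build that is already live."""
    subprocess.run(
        ["deploy_tool", "deploy", service, "--build", build_id],
        check=True,
    )


if __name__ == "__main__":
    build_id = current_release(SERVICE)
    print(f"Re-deploying {SERVICE} at already-live build {build_id} (no new code).")
    redeploy(SERVICE, build_id)
    sys.exit(0)
```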

Alright. Time for me to go fly to my sister’s house. Happy holidays everyone! May your pagers be silent and your bellies be full, and may no one in your family or friend group mention politics this year!

💜💙💚💛🧡❤️💖
charity

(Photo: Me and Bubba and Miss Pinky Persnickety)

P.S. The title is hyperbole! I was frustrated! I felt like people were intentionally misrepresenting my point and my beliefs, so I leaned into it. Please remember that I grew up on a farm and we ended up eating most of our animals. Possibly I am still adjusting to civilized life. Also, I have two cats and I love them very much and have not eaten either of them yet.


2025 was for AI what 2010 was for cloud (xpost)

The satellite, experimental technology has become the mainstream, foundational tech. (At least in developer tools.) (xposted from new home)

I was at my very first job, Linden Lab, when EC2 and S3 came out in 2006. We were running Second Life out of three datacenters, where we racked and stacked all the servers ourselves. At the time, we were tangling with a slightly embarrassing data problem in that there was no real way for users to delete objects (the Trash folder was just another folder), and by the time we implemented a delete function, our ability to run garbage collection couldn’t keep up with the rate of asset creation. In desperation, we spun up an experimental project to try using S3 as our asset store. Maybe we could make this Amazon’s problem and buy ourselves some time?

Why yes, we could. Other “experimental” projects sprouted up like weeds: rebuilding server images in the cloud, running tests, storing backups, load testing, dev workstations. Everybody had shit they wanted to do that exceeded our supply of datacenter resources.

By 2010, the center of gravity had shifted. Instead of “mainstream engineering” (datacenters) and “experimental” (cloud), there was “mainstream engineering” (cloud) and “legacy, shut it all down” (datacenters).

Why am I talking about the good old days? Because I have a gray beard and I like to stroke it, child. (Rude.)

And also: it was just eight months ago that Fred Hebert and I were delivering the closing keynote at SRECon. The title is “AIOps: Prove It! An Open Letter to Vendors Selling AI for SREs”, which makes it sound like we’re talking to vendors, but we’re not; we’re talking to our fellow SREs, begging them to engage with AI on the grounds that it’s not ALL hype.

We’re saying to a room of professional technological pessimists that AI needs them to engage. That their realism and attention to risk are more important than ever, but in order for their critique to be relevant, accurate, and heard, it has to be grounded in expertise and knowledge. Nobody cares about the person outside taking potshots.

This talk recently came up in conversation, and it made me realize—with a bit of a shock—how far my position has come since then.

That was just eight months ago, and AI still felt like it was somehow separable, a satellite of the tech mainstream. People would gripe about conferences stacking the lineup with AI sessions, and AI getting shoehorned into every keynote.

I get it. I too love to complain about technology, and this is certainly an industry that has seen its share of hype trains: dotcom, cloud, crypto, blockchain, IoT, web3, metaverse, and on and on. I understand why people are cynical—why some are even actively looking for reasons to believe it’s a mirage.

But for me, this year was for AI what 2010 was for the cloud: the year when AI stopped being satellite, experimental tech and started being the mainstream, foundational technology. At least in the world of developer tools.

It doesn’t mean there isn’t a bubble. Of COURSE there’s a fucking bubble. Cloud was a bubble. The internet was a bubble. Every massive new driver of innovation has come with its own frothy hype wave.

But the existence of froth doesn’t disprove the existence of value.

Maybe y’all have already gotten there, and I’m the laggard. 😉 (Hey, it’s an SRE’s job to mind the rear guard.) But I’m here now, and I’m excited. It’s an exciting time to be a builder.


Hello World (xpost from substack)

I recently posted a short note about moving from WordPress to Substack after ten years on WP. A number of people replied, commented, or DM’d me to express their dismay, along the lines of “why are you supporting Nazis?”. A few begged me to reconsider.

So I did. I paused the work I was doing on migration and setup, and I paused the post I was drafting on Substack. I read the LeaveSubstack site, and talked with its author (thank you, Sean 💜). I had a number of conversations with people I consider experts in content creation, and people I consider stakeholders (my coworkers and customers), as well as my own personal Jiminy Cricket, Liz Fong-Jones. I also slept on it.

I’ve decided to stay.

I said I would share my thinking once I made a decision, and it comes down to this: I have a job to do, and I haven’t been doing it.

I have not been doing my job 💔

I’ve gone increasingly dark on social media over the past few years, and while this has been delightful from a personal perspective, I have developed an uncomfortable conviction that I have also been abdicating a core function of my job in doing so.

The world of software is changing—fast. It’s exciting. But it is not enough to have interesting ideas and say them once, or write them down in a single book. You need to be out there mixing it up with the community every day, or at least every week. You need to be experimenting with what language works for people, what lands, what sparks a light in people’s eyes.

You (by which I mean me) also need to be listening more—reading and interacting with other people’s thoughts, volleying back and forth, polishing each other like diamonds.

How many times did we define observability or high cardinality or the sins of aggregation? Cool. How many times have we talked about the ways that AI has made the Honeycomb vision technologically realizable for the first time? Uh, far less. By a factor of thousands.

Write more, engage with mainstream tech

My primary goal is to get back into the mainstream of technical discussion and mix it up a lot more. Unfortunately, to the extent there is a tech mainstream, it still exists on X. I am not ruling out the possibility of returning, but I would strongly prefer not to. I’m going to see if I can do my job by being much more active on LinkedIn and Substack.

My secondary goal is to remove friction and barriers to posting. WordPress just feels so heavyweight. Like I’m trying to craft a web page, not write a quick post. Substack feels more like writing an email. I’ve been trying to make myself post more all year on WP, and it hasn’t happened. I have a lot of shit backed up to talk about, and I think this will help grease the wheels.

There are platforms that are beyond the pale, that exist solely to platform and support Nazis and violent extremists—your Gabs, your Parlers. Substack is very far from being one of those. All of these content platforms exist on some continuum of grey, and governance is hard, hard, hard in an era of mainstreaming right wing extremism.

Substack may not make all the decisions I would make, but I feel like it is a light dove grey, all things considered.

Some mitigations

I have received some tips and done some research on how to minimize the value of my writing to Substack. Here they are.

  • Substack makes money from paid subscriptions, so I don’t accept money. Ever.
  • I am told that if you use email or RSS, it benefits Substack less than if you use the app. RSS feed here.
  • I will set up an auto-poster from Substack to WordPress (at some point… probably whenever I find the time to fix the url rewriter and change domain pointer)

I hope these will allow conscientious objectors to continue to read and engage with my work, but I also understand if not.

A vegan friend of mine once used an especially vivid metaphor to indignantly tell us why no, he could NOT just pick the meat and dairy off his plate and eat the vegetables and grains left behind (they were not cooked together). He said, “If somebody shit on your plate, would you just pick the shit off and keep eating?”

So. If Substack is the shit on your social media plate, and you feel morally obligated to reject anything that has ever so much as touched the domain, I can respect that.

Everyone has to decide which battles are theirs to fight. This one is not mine.

💜💙💚💛🧡❤️💖,
charity.


In Praise of “Normal” Engineers

This article was originally commissioned by Luca Rossi (paywalled) for refactoring.fm, on February 11th, 2025. Luca edited a version of it that emphasized the importance of building “10x engineering teams”. It was later picked up by IEEE Spectrum (!!!), who scrapped most of the teams content and published a different, shorter piece on March 13th.

This is my personal edit. It is not exactly identical to either of the versions that have been publicly released to date. It contains a lot of the source material for the talk I gave last week at #LDX3 in London, “In Praise of ‘Normal’ Engineers” (slides), and a couple weeks ago at CraftConf. 

In Praise of “Normal” Engineers

Most of us have encountered a few engineers who seem practically magician-like, a class apart from the rest of us in their ability to reason about complex mental models, leap to non-obvious yet elegant solutions, or emit waves of high quality code at unreal velocity.

I have run into any number of these incredible beings over the course of my career. I think this is what explains the curious durability of the “10x engineer” meme. It may be based on flimsy, shoddy research, and the claims people have made to defend it have often been risible (e.g. “10x engineers have dark backgrounds, are rarely seen doing UI work, are poor mentors and interviewers”) or have blatantly doubled down on stereotypes (“we look for young dudes in hoodies that remind us of Mark Zuckerberg”). But damn if it doesn’t resonate with experience. It just feels true.

The problem is not the idea that there are engineers who are 10x as productive as other engineers. I don’t have a problem with this statement; in fact, that much seems self-evidently true. The problems I do have are twofold.

Measuring productivity is fraught and imperfect

First: how are you measuring productivity? I have a problem with the implication that there is One True Metric of productivity that you can standardize and sort people by. Consider, for a moment, the sheer combinatorial magnitude of skills and experiences at play:

  • Are you working on microprocessors, IoT, database internals, web services, user experience, mobile apps, consulting, embedded systems, cryptography, animation, training models for gen AI… what?
  • Are you using golang, python, COBOL, lisp, perl, React, or brainfuck? What version, which libraries, which frameworks, what data models? What other software and build dependencies must you have mastered?
  • What adjacent skills, market segments, or product subject matter expertise are you drawing upon…design, security, compliance, data visualization, marketing, finance, etc?
  • What stage of development? What scale of usage? What matters most — giving good advice in a consultative capacity, prototyping rapidly to find product-market fit, or writing code that is maintainable and performant over many years of amortized maintenance? Or are you writing for the Mars Rover, or shrinkwrapped software you can never change?

Also: people and their skills and abilities are not static. At one point, I was a pretty good DBRE (I even co-wrote the book on it). Maybe I was even a 10x DB engineer then, but certainly not now. I haven’t debugged a query plan in years.

“10x engineer” makes it sound like 10x productivity is an immutable characteristic of a person. But someone who is a 10x engineer in a particular skill set is still going to have infinitely more areas where they are normal or average (or less). I know a lot of world class engineers, but I’ve never met anyone who is 10x better than everyone else across the board, in every situation.

Engineers don’t own software, teams own software

Second, and even more importantly: So what? It doesn’t matter. Individual engineers don’t own software, teams own software. The smallest unit of software ownership and delivery is the engineering team. It doesn’t matter how fast an individual engineer can write software, what matters is how fast the team can collectively write, test, review, ship, maintain, refactor, extend, architect, and revise the software that they own.

Everyone uses the same software delivery pipeline. If it takes the slowest engineer at your company five hours to ship a single line of code, it’s going to take the fastest engineer at your company five hours to ship a single line of code. The time spent writing code is typically dwarfed by the time spent on every other part of the software development lifecycle.

If you have services or software components that are owned by a single engineer, that person is a single point of failure.

I’m not saying this should never happen. It’s quite normal at startups to have individuals owning software, because the biggest existential risk that you face is not moving fast enough, not finding product market fit, and going out of business. But as you start to grow up as a company, as users start to demand more from you, and you start planning for the survival of the company to extend years into the future…ownership needs to get handed over to a team. Individual engineers get sick, go on vacation, and leave the company, and the business has got to be resilient to that.

If teams own software, then the key job of any engineering leader is to craft high-performing engineering teams. If you must 10x something, 10x this. Build 10x engineering teams.

The best engineering orgs are the ones where normal engineers can do great work

When people talk about world-class engineering orgs, they often have in mind teams that are top-heavy with staff and principal engineers, or recruiting heavily from the ranks of ex-FAANG employees or top universities.

But I would argue that a truly great engineering org is one where you don’t HAVE to be one of the “best” or most pedigreed engineers in the world to get shit done and have a lot of impact on the business.

I think it’s actually the other way around. A truly great engineering organization is one where perfectly normal, workaday software engineers, with decent software engineering skills and an ordinary amount of expertise, can consistently move fast, ship code, respond to users, understand the systems they’ve built, and move the business forward a little bit more, day by day, week by week.

Any asshole can build an org where the most experienced, brilliant engineers in the world can build product and make progress. That is not hard. And putting all the spotlight on individual ability has a way of letting your leaders off the hook for doing their jobs. It is a HUGE competitive advantage if you can build sociotechnical systems where less experienced engineers can convert their effort and energy into product and business momentum.

A truly great engineering org also happens to be one that mints world-class software engineers. But we’re getting ahead of ourselves, here.

Let’s talk about “normal” for a moment

A lot of technical people got really attached to our identities as smart kids. The software industry tends to reflect and reinforce this preoccupation at every turn, from Netflix’s “we look for the top 10% of global talent” to Amazon’s talk about “bar-raising” or Coinbase’s recent claim to “hire the top .1%”. (Seriously, guys? Ok, well, Honeycomb is going to hire only the top .00001%!)

In this essay, I would like to challenge us to set that baggage to the side and think about ourselves as normal people.

It can be humbling to think of ourselves as normal people, but most of us are in fact pretty normal people (albeit with many years of highly specialized practice and experience), and there is nothing wrong with that. Even those of us who are certified geniuses on certain criteria are likely quite normal in other ways — kinesthetic, emotional, spatial, musical, linguistic, etc.

Software engineering both selects for and develops certain types of intelligence, particularly around abstract reasoning, but nobody is born a great software engineer. Great engineers are made, not born. I just don’t think there’s a lot more we can get out of thinking of ourselves as a special class of people, compared to the value we can derive from thinking of ourselves collectively as relatively normal people who have practiced a fairly niche craft for a very long time.

Build sociotechnical systems with “normal people” in mind

When it comes to hiring talent and building teams, yes, absolutely, we should focus on identifying the ways people are exceptional and talented and strong. But when it comes to building sociotechnical systems for software delivery, we should focus on all the ways people are normal.

Normal people have cognitive biases — confirmation bias, recency bias, hindsight bias. We work hard, we care, and we do our best; but we also forget things, get impatient, and zone out. Our eyes are inexorably drawn to the color red (unless we are colorblind). We develop habits and ways of doing things, and resist changing them. When we see the same text block repeatedly, we stop reading it.

We are embodied beings who can get overwhelmed and fatigued. If an alert wakes us up at 3am, we are much more likely to make mistakes while responding to that alert than if we tried to do the same thing at 3pm. Our emotional state can affect the quality of our work. Our relationships impact our ability to get shit done.

When your systems are designed to be used by normal engineers, all that excess brilliance they have can get poured into the product itself, instead of being wasted on navigating the system.

How do you turn normal engineers into 10x engineering teams?

None of this should be terribly surprising; it’s all well known wisdom. In order to build the kind of sociotechnical systems for software delivery that enable normal engineers to move fast, learn continuously, and deliver great results as a team, you should:

Shrink the interval between when you write the code and when the code goes live.

Make it as short as possible; the shorter the better. I’ve written and given talks about this many, many times. The shorter the interval, the lower the cognitive carrying costs. The faster you can iterate, the better. The more of your brain can go into the product instead of the process of building it.

One of the most powerful things you can do is have a short, fast enough deploy cycle that you can ship one commit per deploy. I’ve referred to this as the “software engineering death spiral” … when the deploy cycle takes so long that you end up batching together a bunch of engineers’ diffs in every build. The slower it gets, the more you batch up, and the harder it becomes to figure out what happened or roll back. The longer it takes, the more people you need, the higher the coordination costs, and the more slowly everyone moves.
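
To put rough numbers on the spiral: batch size is just merge rate times deploy interval, and every extra change in the batch is another suspect to rule out when the deploy goes sideways. A back-of-the-envelope sketch, with rates invented purely for illustration:

```python
# Back-of-the-envelope: how the batch per deploy grows with deploy interval.
# All numbers are invented for illustration.
merges_per_day = 20  # team-wide merge rate

for deploys_per_day in (20, 4, 1, 1 / 5):  # per-commit, 4x/day, daily, weekly
    interval_days = 1 / deploys_per_day
    batch_size = merges_per_day * interval_days
    print(f"deploy every {interval_days:5.2f} days -> "
          f"~{batch_size:5.1f} changes to bisect when something breaks")
```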

Deploy time is the feedback loop at the heart of the development process. It is almost impossible to overstate the centrality of keeping this short and tight.

Make it easy and fast to roll back or recover from mistakes.

Developers should be able to deploy their own code, figure out if it’s working as intended or not, and if not, roll forward or back swiftly and easily. No muss, no fuss, no thinking involved.

Make it easy to do the right thing and hard to do the wrong thing.

Wrap designers and design thinking into all the touch points your engineers have with production systems. Use your platform engineering team to think about how to empower people to swiftly make changes and self-serve, but also remember that a lot of times people will be engaging with production late at night or when they’re very stressed, tired, and possibly freaking out. Build guard rails. The fastest way to ship a single line of code should also be the easiest way to ship a single line of code.

Invest in instrumentation and observability.

You’ll never know — not really — what the code you wrote does just by reading it. The only way to be sure is by instrumenting your code and watching real users run it in production. Good, friendly sociotechnical systems invest heavily in tools for sense-making.

Being able to visualize your work is what makes engineering abstractions accessible to actual engineers. You shouldn’t have to be a world-class engineer just to debug your own damn code.

Devote engineering cycles to internal tooling and enablement.

If fast, safe deploys with guard rails, instrumentation, and highly parallelized test suites are “everybody’s job”, they will end up being nobody’s job. Engineering productivity isn’t something you can outsource. Managing the interfaces between your software vendors and your own teams is both a science and an art. Making it look easy and intuitive is really hard. It needs an owner.

Build an inclusive culture.

Growth is the norm, growth is the baseline. People do their best work when they feel a sense of belonging. An inclusive culture is one where everyone feels safe to ask questions, explore, and make mistakes; where everyone is held to the same high standard, and given the support and encouragement they need to achieve their goals.

Diverse teams are resilient teams.

Yeah, a team of super-senior engineers who all share a similar background can move incredibly fast, but a monoculture is fragile. Someone gets sick, someone gets pregnant, you start to grow and you need to integrate people from other backgrounds and the whole team can get derailed — fast.

When your teams are used to operating with a mix of genders, racial backgrounds, identities, age ranges, family statuses, geographical locations, skill sets, etc — when this is just table stakes, standard operating procedure — you’re better equipped to roll with it when life happens.

Assemble engineering teams from a range of levels.

The best engineering teams aren’t top-heavy with staff engineers and principal engineers. The best engineering teams are ones where nobody is running on autopilot, banging out a login page for the 300th time; everyone is working on something that challenges them and pushes their boundaries. Everyone is learning, everyone is teaching, everyone is pushing their own boundaries and growing. All the time.

By the way — all of that work you put into making your systems resilient, well-designed, and humane is the same work you would need to do to help onboard new engineers, develop junior talent, or let engineers move between teams.

It gets used and reused. Over and over and over again.

The only meaningful measure of productivity is impact to the business

The only thing that actually matters when it comes to engineering productivity is whether or not you are moving the business materially forward.

Which means…we can’t do this in a vacuum. The most important question is whether or not we are working on the right thing, which is a problem engineering can’t answer without help from product, design, and the rest of the business.

Software engineering isn’t about writing lots of lines of code, it’s about solving business problems using technology.

Senior and intermediate engineers are actually the workhorses of the industry. They move the business forward, step by step, day by day. They get to put their heads down and crank instead of constantly looking around the org and solving coordination problems. If you have to be a staff+ engineer to move the product forward, something is seriously wrong.

Great engineering orgs mint world-class engineers

A great engineering org is one where you don’t HAVE to be one of the best engineers in the world to have a lot of impact. But — rather ironically — great engineering orgs mint world class engineers like nobody’s business.

The best engineering orgs are not the ones with the smartest, most experienced people in the world, they’re the ones where normal software engineers can consistently make progress, deliver value to users, and move the business forward, day after day.

Places where engineers can get shit done and have a lot of impact are a magnet for top performers. Nothing makes engineers happier than building things, solving problems, making progress.

If you’re lucky enough to have world-class engineers in your org, good for you! Your role as a leader is to leverage their brilliance for the good of your customers and your other engineers, without coming to depend on their brilliance. After all, these people don’t belong to you. They may walk out the door at any moment, and that has to be okay.

These people can be phenomenal assets, assuming they can be team players and keep their egos in check. Which is probably why so many tech companies seem to obsess over identifying and hiring them, especially in Silicon Valley.

But companies categorically overindex on finding these people after they’ve already been minted, which ends up reinforcing and replicating all the prejudices and inequities of the world at large. Talent may be evenly distributed across populations, but opportunity is not.

Don’t hire the “best” people. Hire the right people.

We (by which I mean the entire human race) place too much emphasis on individual agency and characteristics, and not enough on the systems that shape us and inform our behaviors.

I feel like a whole slew of issues (candidates self-selecting out of the interview process, diversity of applicants, etc) would be improved simply by shifting the focus on engineering hiring and interviewing away from this inordinate emphasis on hiring the BEST PEOPLE and realigning around the more reasonable and accurate RIGHT PEOPLE.

It’s a competitive advantage to build an environment where people can be hired for their unique strengths, not their lack of weaknesses; where the emphasis is on composing teams rather than hiring the BEST people; where inclusivity is a given both for ethical reasons and because it raises the bar for performance for everyone. Inclusive culture is what actual meritocracy depends on.

This is the kind of place that engineering talent (and good humans) are drawn to like a moth to a flame. It feels good to ship. It feels good to move the business forward. It feels good to sharpen your skills and improve your craft. It’s the kind of place that people go when they want to become world class engineers. And it’s the kind of place where world class engineers want to stick around, to train up the next generation.

<3, charity

 


There Is Only One Key Difference Between Observability 1.0 and 2.0

Originally posted on the Honeycomb blog on November 19th, 2024

We’ve been talking about observability 2.0 a lot lately; what it means for telemetry and instrumentation, its practices and sociotechnical implications, and the dramatically different shape of its cost model. With all of these details swimming about, I’m afraid we’re already starting to lose sight of what matters.

The distinction between observability 1.0 and observability 2.0 is not a laundry list, it’s not marketing speak, and it’s not that complicated or hard to understand. The distinction is a technical one, and it’s actually quite simple:

  1. Observability 1.0 has three pillars and many sources of truth, scattered across disparate tools and formats.
  2. Observability 2.0 has one source of truth: wide structured log events, from which you can derive all the other data types.

That’s it. That’s what defines each generation, respectively. Everything else is a consequence that flows from this distinction.

Multiple “pillars” are an observability 1.0 phenomenon

We’ve all heard the slogan, “metrics, logs, and traces are the three pillars of observability.” Right?

Well, that’s half true; it’s true of observability 1.0 tools. You might even say that pillars define the observability 1.0 generation. For every request that enters your system, you write logs, increment counters, and maybe trace spans; then you store telemetry in many places. You probably use some subset (or superset) of tools including APM, RUM, unstructured logs, structured logs, infra metrics, tracing tools, profiling tools, product analytics, marketing analytics, dashboards, SLO tools, and more. Under the hood, these are stored in a variety of formats: unstructured logs (strings), structured logs, time-series databases, columnar databases, and other proprietary storage systems.

Observability 1.0 tools force you to make a ton of decisions at write time about how you and your team would use the data in the future. They silo off different types of data and different kinds of questions into entirely different tools, as many different tools as you have use cases.

Many pillars, many tools.

An observability 2.0 tool does not have pillars.

Your observability 2.0 tool has one unified source of truth

Your observability 2.0 tool stores the telemetry for each request in one place, in one format: arbitrarily-wide structured log events.

These log events are not fired off willy-nilly as the request executes. They are specifically composed to describe all of the context accompanying a unit of work. Some common patterns include canonical logs, organized around each hop of the request; traces and spans, organized around application logic; or traces emitted as pulses for long-running jobs, queues, CI/CD pipelines, etc.
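
To make that concrete, here is roughly what one of those arbitrarily-wide events might look like when it is emitted at the end of a request. This is a sketch, and every field name here is invented for illustration; the point is simply that everything you know about the unit of work lands on one structured event.

```python
import json
import time
import uuid

# A hypothetical canonical log event: one wide, structured event emitted per
# request, carrying all the context gathered while handling it.
event = {
    "timestamp": time.time(),
    "service": "checkout",
    "trace_id": str(uuid.uuid4()),
    "endpoint": "POST /cart/checkout",
    "duration_ms": 187.4,
    "status_code": 200,
    "error": None,
    # High-cardinality fields are welcome here, not something to fear:
    "user_id": "user_8675309",
    "build_id": "2024-12-19.3",
    "feature_flags": ["new_pricing_engine"],
    "cart_items": 4,
    "payment_provider": "stripe",
    "db_query_count": 11,
    "db_total_ms": 62.1,
}

# Ship it to your observability backend however you like; one JSON line per
# unit of work is the essence of it.
print(json.dumps(event))
```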

Structuring your data in this way preserves as much context and connective tissue as possible about the work being done. Once your data is gathered up this way, you can:

  • Derive metrics from your log events
  • Visualize them over time, as a trace
  • Zoom into individual requests, zoom out to long-term trends
  • Derive SLOs and aggregates
  • Collect system, application, product, and business telemetry together
  • Slice and dice and explore your data in an open-ended way
  • Swiftly compute outliers and identify correlations
  • Capture and preserve as much high-cardinality data as you want

The beauty of observability 2.0 is that it lets you collect your telemetry and store it—once—in a way that preserves all that rich context and relational data, and make decisions at read time about how you want to query and use the data. Store it once, and use it for everything.
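
“Decide at read time” can be as mundane as running an aggregation over those same events later. A toy sketch, continuing the hypothetical events above: deriving a per-endpoint error rate and p99 latency after the fact, with no pre-declared metric anywhere in sight.

```python
from statistics import quantiles

def p99_and_error_rate(events, endpoint):
    """Derive latency and error metrics at read time from raw wide events.

    `events` is a list of wide event dicts shaped like the sketch above,
    already fetched from wherever they are stored.
    """
    rows = [e for e in events if e["endpoint"] == endpoint]
    if not rows:
        return None, None
    durations = [e["duration_ms"] for e in rows]
    errors = sum(1 for e in rows if e["status_code"] >= 500)
    # quantiles(n=100) returns 99 cut points; the last one approximates p99.
    p99 = quantiles(durations, n=100)[-1] if len(durations) > 1 else durations[0]
    return p99, errors / len(rows)
```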

Everything else is a consequence of this differentiator

Yeah, there’s a lot more to observability 2.0 than whether your data is stored in one place or many. Of course there is. But everything else is unlocked and enabled by this one core difference.

Here are some of the other aspects of observability 2.0, many of which have gotten picked up and discussed elsewhere in recent weeks:

  • Observability 1.0 is how you operate your code; observability 2.0 is about how you develop your code
  • Observability 1.0 has historically been infra-centric, and often makes do with logs and metrics software already emits, or that can be extracted with third-party tools
  • Observability 2.0 is oriented around your application code, the software at the core of your business
  • Observability 1.0 is traditionally focused on MTTR, MTTD, errors, crashes, and downtime
  • Observability 2.0 includes those things, but it’s about holistically understanding your software and your users—not just when things are broken
  • To control observability 1.0 costs, you typically focus on limiting the cardinality of your data, reducing your log levels, and reducing the cost multiplier by eliminating tools.
  • To control observability 2.0 costs, you typically reach for tail-based or head-based sampling (see the sketch after this list)
  • Observability 2.0 complements and supercharges the effectiveness of other modern development best practices like feature flags, progressive deployments, and chaos engineering.
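
Here is that sampling sketch. “Head-based” means the keep/drop decision happens up front, usually keyed off the trace ID so an entire trace is kept or dropped together; “tail-based” means deciding after the trace is complete, once you know whether it was slow or broken. A minimal illustration, with sample rates invented for the example:

```python
import hashlib

def keep_head_based(trace_id: str, sample_rate: int = 20) -> bool:
    """Head sampling: decide up front, keep roughly 1 in `sample_rate` traces.

    Hashing the trace ID keeps the decision consistent across services."""
    digest = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
    return digest % sample_rate == 0

def keep_tail_based(trace: dict, sample_rate: int = 100) -> bool:
    """Tail sampling: decide once the whole trace has been seen.

    Keep everything interesting; heavily sample the boring successes."""
    if trace["error"] or trace["duration_ms"] > 1000:
        return True
    return keep_head_based(trace["trace_id"], sample_rate)
```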

The reason observability 2.0 is so much more effective at enabling and accelerating the entire software development lifecycle is because the single source of truth and wide, dense, cardinality-rich data allow you to do things you can’t in an observability 1.0 world: slice and dice on arbitrary high-cardinality dimensions like build_id, feature flags, user_id, etc. to see precisely what is happening as people use your code in production.

In the same way that whether a database is a document store, a relational database, or a columnar database has an enormous impact on the kinds of workloads it can do, what it excels at and which teams end up using it, the difference between observability 1.0 and 2.0 is a technical distinction that has enduring consequences for how people use it.

These are not hard boundaries; data is data, telemetry is telemetry, and there will always be a certain amount of overlap. You can adopt some of these observability 2.0-ish behaviors (like feature flags) using 1.0 tools, to some extent—and you should try!—but the best you can do with metrics-backed tools will always be percentile aggregates and random exemplars. You need precision tools to unlock the full potential of observability 2.0.

Observability 1.0 is a dinner knife; 2.0 is a scalpel.

Why now? What changed?

If observability 2.0 is so much better, faster, cheaper, simpler, and more powerful, then why has it taken this long to emerge on the landscape?

Observability 2.0-shaped tools (high cardinality, high dimensionality, explorable interfaces, etc.) have actually been de rigueur on the business side of the house for years. You can’t run a business without them! It was close to 20 years ago that columnar stores like Vertica came on the scene for data warehouses. But those tools weren’t built for software engineers, and they were prohibitively expensive at production scale.

FAANG companies have also been using tools like this internally for a very long time. Facebook’s Scuba was famously the inspiration for Honeycomb—however, Scuba ran on giant RAM disks as recently as 2015, which means it was quite an expensive service to run. The falling cost of storage, bandwidth, and compute has made these technologies viable as commodity SaaS platforms, at the same time as the skyrocketing complexity of systems, driven by microservices and decoupled architecture patterns, has made them mandatory.

Three big reasons the rise of observability 2.0 is inevitable

Number one: our systems are exploding in complexity along with power and capabilities. The idea that developing your code and operating your code are two different practices that can be done by two different people is no longer tenable. You can’t operate your code as a black box, you have to instrument it. You also can’t predict how things are going to behave or break, and one of the defining characteristics of observability 1.0 was that you had to make those predictions up front, at write time.

Number two: the cost model of observability 1.0 is brutally unsustainable. Instead of paying to store your data once, you pay to store it again and again and again, in as many different pillars or formats or tools as you have use cases. The post-ZIRP era has cast a harsh focus on a lot of teams’ observability bills—not only the outrageous costs, but also the reality that as costs go up, the value you get out of them is going down.

Yet the cost multiplier angle is in some ways the easiest to fix: you bite the bullet and sacrifice some of your tools. Cardinality is even more costly, and harder to mitigate. You go to bed Friday night with a $150k Datadog bill and wake up Monday morning with a million dollar bill, without changing a single line of code. Many observability engineering teams spend an outright majority of their time just trying to manage the cardinality threshold—enough detail to understand their systems and solve users’ problems, not so much detail that they go bankrupt.

And that is the most expensive part of all: engineering cycles. The cost of the time engineers spend laboring below the value line—trying to understand their code, their telemetry, their user behaviors—is astronomical. Poor observability is the dark matter of engineering teams. It’s why everything we do feels so incredibly, grindingly slow, for no apparent reason. Good observability empowers teams to ship swiftly, consistently, and with confidence.

Number three: a critical mass of developers have seen what observability 2.0 can do. Once you’ve tried developing with observability 2.0, you can’t go back. That was what drove Christine and me to start Honeycomb, after we experienced this at Facebook. It’s hard to describe the difference in words, but once you’ve built software with fast feedback loops and real-time, interactive visibility into what your code is doing, you simply won’t go back.


It’s not just Honeycomb; observability 2.0 tools are going mainstream

We’re starting to see a wave of early startups building tools based on these principles. You’re seeing places like Shopify build tools in-house using something like Clickhouse as a backing store. DuckDB is now available in the open-source realm. I expect to see a blossoming of composable solutions in the next year or two, in the vein of ELK stacks for o11y 2.0.

Jeremy Morrell recently published the comprehensive guide to observability 2.0 instrumentation, and it includes a vendor-neutral overview of your options in the space.

There are still valid reasons to go with a 1.0 vendor. Those tools are more mature, fully featured, and most importantly, they have a more familiar look and feel to engineers who have been working with metrics and logs their whole career. But engineers who have tried observability 2.0 are rarely willing to go back.

Beware observability 2.0 marketing claims

You do have to be a little bit wary here. There are lots of observability 1.0 vendors who talk about having a “unified observability platform” or having all your data in one place. But what they actually mean is that you can pay for all your tools in one unified bill, or present all the different data sources in one unified visualization.

The best of these vendors have built a bunch of elaborate bridges between their different tools and storage systems, so you can predefine connection points between e.g. a particular metric and your logging tool or your tracing tool. This is a massive improvement over having no connection points between datasets, no doubt. But a unified presentation layer is not the same thing as a unified data source.

So if you’re trying to clear a path through all the sales collateral and marketing technobabble, you only need to ask one question: how many times is your data going to be stored?

Is there one source of truth, or many?


Generative AI is not going to build your engineering team for you

Originally posted on the Stack Overflow blog on June 10th, 2024

When I was 19 years old, I dropped out of college and moved to San Francisco. I had a job offer in hand to be a Unix sysadmin for Taos Consulting. However, before my first day of work I was lured away to a startup in the city, where I worked as a software engineer on mail subsystems.

I never questioned whether or not I could find work. Jobs were plentiful, and more importantly, hiring standards were very low. If you knew how to sling HTML or find your way around a command line, chances were you could find someone to pay you.

Was I some kind of genius, born with my hands on a computer keyboard? Assuredly not. I was homeschooled in the backwoods of Idaho. I didn’t touch a computer until I was sixteen and in college. I escaped to university on a classical performance piano scholarship, which I later traded in for a peripatetic series of nontechnical majors: classical Latin and Greek, musical theory, philosophy. Everything I knew about computers I learned on the job, doing sysadmin work for the university and CS departments.

In retrospect, I was so lucky to enter the industry when I did. It makes me blanch to think of what would have happened if I had come along a few years later. Every one of the ladders my friends and I took into the industry has long since vanished.

The software industry is growing up

To some extent, this is just what happens as an industry matures. The early days of any field are something of a Wild West, where the stakes are low, regulation nonexistent, and standards nascent. If you look at the early history of other industries—medicine, cinema, radio—the similarities are striking.

There is a magical moment with any young technology where the boundaries between roles are porous and opportunity can be seized by anyone who is motivated, curious, and willing to work their asses off.

It never lasts. It can’t; it shouldn’t. The amount of prerequisite knowledge and experience you must have before you can enter the industry swells precipitously. The stakes rise, the magnitude of the mission increases, the cost of mistakes soars. We develop certifications, trainings, standards, legal rites. We wrangle over whether or not software engineers are really engineers.

Software is an apprenticeship industry

Nowadays, you wouldn’t want a teenaged dropout like me to roll out of junior year and onto your pager rotation. The prerequisite knowledge you need to enter the industry has grown, the pace is faster, and the stakes are much higher, so you can no longer learn literally everything on the job, as I once did.

However, it’s not like you can learn everything you need to know at college either. A CS degree typically prepares you better for a life of computing research than life as a workaday software engineer. A more practical path into the industry may be a good coding bootcamp, with its emphasis on problem solving and learning a modern toolkit. In either case, you don’t so much learn “how to do the job” as you do “learn enough of the basics to understand and use the tools you need to use to learn the job.”

Software is an apprenticeship industry. You can’t learn to be a software engineer by reading books. You can only learn by doing…and doing, and doing, and doing some more. No matter what your education consists of, most learning happens on the job—period. And it never ends! Learning and teaching are lifelong practices; they have to be, the industry changes so fast.

It takes a solid seven-plus years to forge a competent software engineer. (Or as most job ladders would call it, a “senior software engineer”.) That’s many years of writing, reviewing, and deploying code every day, on a team alongside more experienced engineers. That’s just how long it seems to take.

What does it mean to be a “senior engineer”?

Here is where I often get some very indignant pushback to my timelines, e.g.:

“Seven years?! Pfft, it took me two years!”

“I was promoted to Senior Software Engineer in less than five years!”

Good for you. True, there is nothing magic about seven years. But it takes time and experience to mature into an experienced engineer, the kind who can anchor a team. More than that, it takes practice.

I think we have come to use “Senior Software Engineer” as shorthand for engineers who can ship code and be a net positive in terms of productivity, and I think that’s a huge mistake. It implies that less senior engineers must be a net negative in terms of productivity, which is untrue. And it elides the real nature of the work of software engineering, of which writing code is only a small part.

To me, being a senior engineer is not primarily a function of your ability to write code. It has far more to do with your ability to understand, maintain, explain, and manage a large body of software in production over time, as well as the ability to translate business needs into technical implementation. So much of the work is around crafting and curating these large, complex sociotechnical systems, and code is just one representation of these systems.

What does it mean to be a senior engineer? It means you have learned how to learn, first and foremost, and how to teach; how to hold these models in your head and reason about them, and how to maintain, extend, and operate these systems over time. It means you have good judgment, and instincts you can trust.

Which brings us to the matter of AI.

We need to stop cannibalizing our own future

It is really, really tough to get your first role as an engineer. I didn’t realize how hard it was until I watched my little sister (new grad, terrific grades, some hands on experience, fiendishly hard worker) struggle for nearly two years to land a real job in her field. That was a few years ago; anecdotally, it seems to have gotten even harder since then.

This past year, I have read a steady drip of articles about entry-level jobs in various industries being replaced by AI. Some of which absolutely have merit. Any job that consists of drudgery such as converting a document from one format to another, reading and summarizing a bunch of text, or replacing one set of icons with another, seems pretty obviously vulnerable. This doesn’t feel all that revolutionary to me, it’s just extending the existing boom in automation to cover textual material as well as mathy stuff.

Recently, however, a number of execs and so-called “thought leaders” in tech seem to have genuinely convinced themselves that generative AI is on the verge of replacing all the work done by junior engineers. I have read so many articles about how junior engineering work is being automated out of existence, or that the need for junior engineers is shriveling up. It has officially driven me bonkers.

All of this bespeaks a deep misunderstanding about what engineers actually do. By not hiring and training up junior engineers, we are cannibalizing our own future. We need to stop doing that.

Writing code is the easy part

People act like writing code is the hard part of software. It is not. It never has been, it never will be. Writing code is the easiest part of software engineering, and it’s getting easier by the day. The hard parts are what you do with that code—operating it, understanding it, extending it, and governing it over its entire lifecycle.

A junior engineer begins by learning how to write and debug lines, functions, and snippets of code. As you practice and progress towards being a senior engineer, you learn to compose systems out of software, and guide systems through waves of change and transformation.

Sociotechnical systems consist of software, tools, and people; understanding them requires familiarity with the interplay between software, users, production, infrastructure, and continuous changes over time. These systems are fantastically complex and subject to chaos, nondeterminism and emergent behaviors. If anyone claims to understand the system they are developing and operating, the system is either exceptionally small or (more likely) they don’t know enough to know what they don’t know. Code is easy, in other words, but systems are hard.

The present wave of generative AI tools has done a lot to help us generate lots of code, very fast. The easy parts are becoming even easier, at a truly remarkable pace. But it has not done a thing to aid in the work of managing, understanding, or operating that code. If anything, it has only made the hard jobs harder.

It’s easy to generate code, and hard to generate good code

If you read a lot of breathless think pieces, you may have a mental image of software engineers merrily crafting prompts for ChatGPT, or using Copilot to generate reams of code, then committing whatever emerges to GitHub and walking away. That does not resemble our reality.

The right way to think about tools like Copilot is more like a really fancy autocomplete or copy-paste function, or maybe like the unholy love child of Stack Overflow search results plus Google’s “I’m Feeling Lucky”. You roll the dice, every time.

These tools are at their best when there’s already a parallel in the file, and you want to just copy-paste the thing with slight modifications. Or when you’re writing tests and you have a giant block of fairly repetitive YAML, and it repeats the pattern while inserting the right column and field names, like an automatic template.

However, you cannot trust generated code. I can’t emphasize this enough. AI-generated code always looks quite plausible, but even when it kind of “works”, it’s rarely congruent with your wants and needs. It will happily generate code that doesn’t parse or compile. It will make up variables, method names, function calls; it will hallucinate fields that don’t exist. Generated code will not follow your coding practices or conventions. It is not going to refactor or come up with intelligent abstractions for you. The more important, difficult or meaningful a piece of code is, the less likely you are to generate a usable artifact using AI.

You may save time by not having to type the code in from scratch, but you will need to step through the output line by line, revising as you go, before you can commit your code, let alone ship it to production. In many cases this will take as much or more time as it would take to simply write the code—especially these days, now that autocomplete has gotten so clever and sophisticated. It can be a LOT of work to bring AI-generated code into compliance and coherence with the rest of your codebase. It isn’t always worth the effort, quite frankly.

Generating code that can compile, execute, and pass a test suite isn’t especially hard; the hard part is crafting a code base that many people, teams, and successive generations of teams can navigate, mutate, and reason about for years to come.

How working engineers really use generative AI

So that’s the TLDR: you can generate a lot of code, really fast, but you can’t trust what comes out. At all. However, there are some use cases where generative AI consistently shines.

For example, it’s often easier to ask ChatGPT to generate example code using unfamiliar APIs than to read the API docs—the corpus was trained on repositories where the APIs are being used for real life workloads, after all.

Generative AI is also pretty good at producing code that is annoying or tedious to write, yet tightly scoped and easy to explain. The more predictable a scenario is, the better these tools are at writing the code for you. If what you need is effectively copy-paste with a template—any time you could generate the code you want using sed/awk or vi macros—generative AI is quite good at this.

It’s also very good at writing little functions for you to do things in unfamiliar languages or scenarios. If you have a snippet of Python code and you want the same thing in Java, but you don’t know Java, generative AI has got your back.

Again, remember, the odds are 50/50 that the result is completely made up. You always have to assume the results are incorrect until you can verify them by hand. But these tools can absolutely accelerate your work in countless ways.

Generative AI is a little bit like a junior engineer

One of the engineers I work with, Kent Quirk, describes generative AI as “an excitable junior engineer who types really fast”. I love that quote—it leaves an indelible mental image.

Generative AI is like a junior engineer in that you can’t just roll their code straight into production. You are responsible for it—legally, ethically, and practically. You still have to take the time to understand it, test it, instrument it, retrofit it stylistically and thematically to fit the rest of your code base, and ensure your teammates can understand and maintain it as well.

The analogy is a decent one, actually, but only if your code is disposable and self-contained, i.e. not meant to be integrated into a larger body of work, or to survive and be read or modified by others.

And hey—there are corners of the industry like this, where most of the code is write-only, throwaway code. There are agencies that spin out dozens of disposable apps per year, each written for a particular launch or marketing event and then left to wither on the vine. But that is not most software. Disposable code is rare; code that needs to work over the long term is the norm. Even when we think a piece of code will be disposable, we are often (urf) wrong.

But generative AI is not a member of your team

In that particular sense—generating code that you know is untrustworthy—GenAI is a bit like a junior engineer. But in every other way, the analogy fails. Because adding a person who writes code to your team is nothing like autogenerating code. That code could have come from anywhere—Stack Overflow, Copilot, whatever. You don’t know, and it doesn’t really matter. There’s no feedback loop, no person on the other end trying iteratively to learn and improve, and no impact to your team vibes or culture.

To state the supremely obvious: giving code review feedback to a junior engineer is not like editing generated code. Your effort is worth more when it is invested into someone else’s apprenticeship. It’s an opportunity to pass on the lessons you’ve learned in your own career. Even just the act of framing your feedback to explain and convey your message forces you to think through the problem in a more rigorous way, and has a way of helping you understand the material more deeply.

And adding a junior engineer to your team will immediately change team dynamics. It creates an environment where asking questions is normalized and encouraged, where teaching as well as learning is a constant. We’ll talk more about team dynamics in a moment.

The time you invest into helping a junior engineer level up can pay off remarkably quickly. Time flies. ☺️ When it comes to hiring, we tend to valorize senior engineers almost as much as we underestimate junior engineers. Neither stereotype is helpful.

We underestimate the cost of hiring seniors, and overestimate the cost of hiring juniors

People seem to think that once you hire a senior engineer, you can drop them onto a team and they will be immediately productive, while hiring a junior engineer will be a tax on team performance forever. Neither is true. Honestly, most of the work that most teams have to do is not that difficult, once it’s been broken down into its constituent parts. There’s plenty of room for lower level engineers to execute and flourish.

The grossly simplified perspective of your accountant goes something like this. “Why should we pay $100k for a junior engineer to slow things down, when we could pay $200k for a senior engineer to speed things up?” It makes no sense!

But you know and I know—every engineer who is paying attention should know—that’s not how engineering works. This is an apprenticeship industry, and productivity is defined by the output and carrying capacity of each team, not each person.

There are lots of ways a person can contribute to the overall velocity of a team, just like there are lots of ways a person can sap the energy out of a team or add friction and drag to everyone around them. These do not always correlate with the person’s level (at least not in the direction people tend to assume), and writing code is only one way.

Furthermore, every engineer you hire requires ramp time and investment before they can contribute. Hiring and training new engineers is a costly endeavor, no matter what level they are. It will take any senior engineer time to build up their mental model of the system, familiarize themselves with the tools and technology, and ramp up to speed. How long? It depends on how clean and organized the codebase is, their past experience with your tools and technologies, how good you are at onboarding new engineers, and more—but likely around 6-9 months. They probably won’t reach cruising altitude for about a year.

Yes, the ramp will be longer for a junior engineer, and yes, it will require more investment from the team. But it’s not indefinite. Your junior engineer should be a net positive within roughly the same time frame, six months to a year, and they develop far more rapidly than more senior contributors. (Don’t forget, their contributions may vastly exceed the code they personally write.)

You do not have to be a senior engineer to add value

In terms of writing and shipping features, some of the most productive engineers I’ve ever known have been intermediate engineers. Not yet bogged down with all the meetings and curating and mentoring and advising and architecture, their calendars not yet pockmarked with interruptions, they can just build stuff. You see them put their headphones on first thing in the morning, write code all day, and cruise out the door in the evening having made incredible progress.

Intermediate engineers sit in this lovely, temporary state where they have gotten good enough at programming to be very productive, but they are still learning how to build and care for systems. All they do is write code, reams and reams of code.

And they’re energized…engaged. They’re having fun! They aren’t bored with writing a web form or a login page for the 1000th time. Everything is new, interesting, and exciting, which typically means they will do a better job, especially under the light direction of someone more experienced. Having intermediate engineers on a team is amazing. The only way you get them is by hiring junior engineers.

Having junior and intermediate engineers on a team is a shockingly good inoculation against overengineering and premature complexity. They don’t yet know enough about a problem to imagine all the infinite edge cases that need to be solved for. They help keep things simple, which is one of the hardest things to do.

The long term arguments for hiring junior engineers

If you ask, nearly everybody will wholeheartedly agree that hiring junior engineers is a good thing…and someone else should do it. This is because the long-term arguments for hiring junior engineers are compelling and fairly well understood.

  1. We need more senior engineers as an industry
  2. Somebody has to train them
  3. Junior engineers are cheaper
  4. They may add some much-needed diversity
  5. They are often very loyal to companies who invest in training them, and will stick around for years instead of job hopping
  6. Did we already mention that somebody needs to do it?

But long-term thinking is not a thing that companies, or capitalism in general, are typically great at. Framed this way, it makes it sound like you hire junior engineers as a selfless act of public service, at great cost to yourself. Companies are much more likely to want to externalize costs like those, which is how we got to where we are now.

The short term arguments for hiring junior engineers

However, there are at least as many arguments to be made for hiring junior engineers in the short term—selfish, hard-nosed, profitable reasons for why it benefits the team and the company to do so. You just have to shift your perspective slightly, from individuals to teams, to bring them into focus.

Let’s start here: hiring engineers is not a process of “picking the best person for the job”. Hiring engineers is about composing teams. The smallest unit of software ownership is not the individual, it’s the team. Only teams can own, build, and maintain a corpus of software. It is inherently a collaborative, cooperative activity.

If hiring engineers was about picking the “best people”, it would make sense to hire the most senior, experienced individual you can get for the money you have, because we are using “senior” and “experienced” as a proxy for “productivity”. (Questionable, but let’s not nitpick.) But the productivity of each individual is not what we should be optimizing for. The productivity of the team is all that matters.

And the best teams are always the ones with a diversity of strengths, perspectives, and levels of expertise. A monoculture can be spectacularly successful in the short term—it may even outperform a diverse team. But monocultures do not scale well, and they do not adapt to unfamiliar challenges gracefully. The longer you wait to diversify, the harder it will be.

We need to hire junior engineers, and not just once, but consistently. We need to keep feeding the funnel from the bottom up. Junior engineers only stay junior for a couple years, and intermediate engineers turn into senior engineers. Super-senior engineers are not actually the best people to mentor junior engineers; the most effective mentor is usually someone just one level ahead, who vividly remembers what it was like in your shoes.

A healthy, high-performing team has a range of levels

A healthy team is an ecosystem. You wouldn’t staff a product engineering team with six DB experts and one mobile developer. Nor should you staff it with six staff+ engineers and one junior developer. A good team is composed of a range of skills and levels.

Have you ever been on a team packed exclusively with staff or principal engineers? It is not fun. That is not a high-functioning team. There is only so much high-level architecture and planning work to go around, there are only so many big decisions that need to be made. These engineers spend most of their time doing work that feels boring and repetitive, so they tend to over-engineer solutions and/or cut corners—sometimes at the same time. They compete for the “fun” stuff and find reasons to pick technical fights with each other. They chronically under-document and under-invest in the work that makes systems simple and tractable.

Teams that only have intermediate engineers (or beginners, or seniors, or whatever) will have different pathologies, but similar problems with contention and blind spots. The work itself has a wide range in complexity and difficulty—from simple, tightly scoped functions to tough, high-stakes architecture decisions. It makes sense for the people doing the work to occupy a similar range.

The best teams are ones where no one is bored, because every single person is working on something that challenges them and pushes their boundaries. The only way you can get this is by having a range of skill levels on the team.

The bottleneck we face is hiring, not training

The bottleneck we face now is not our ability to train up new junior engineers and give them skills. Nor is it about juniors learning to hustle harder; I see a lot of solid, well-meaning advice on this topic, but it’s not going to solve the problem. The bottleneck is giving them their first jobs. The bottleneck consists of companies who see them as a cost to externalize, not an investment in their—the company’s—future.

After their first job, an engineer can usually find work. But getting that first job, from what I can see, is murder. It is all but impossible—if you didn’t graduate from a top college, and you aren’t entering the feeder system of Big Tech, then it’s a roll of the dice, a question of luck or who has the best connections. It was rough before the chimera of “Generative AI can replace junior engineers” rose up from the swamp. And now…oof.

Where would you be, if you hadn’t gotten into tech when you did?

I know where I would be, and it is not here.

The internet loves to make fun of Boomers, the generation that famously coasted to college, home ownership, and retirement, then pulled the ladder up after them while mocking younger people as snowflakes. “Ok, Boomer” may be here to stay, but can we try to keep “Ok, Staff Engineer” from becoming a thing?

Nobody thinks we need fewer senior engineers

Lots of people seem to think we don’t need junior engineers, but nobody is arguing that we need fewer senior engineers, or will need fewer senior engineers in the foreseeable future.

I think it’s safe to assume that anything deterministic and automatable will eventually be automated. Software engineering is no different—we are ground zero! Of course we’re always looking for ways to automate and improve efficiency, as we should be.

But large software systems are unpredictable and nondeterministic, with emergent behaviors. The mere existence of users injects chaos into the system. Components can be automated, but complexity can only be managed.

Even if systems could be fully automated and managed by AI, the fact that we cannot understand how AI makes decisions is a huge, possibly insurmountable problem. Running your business on a system that humans can’t debug or understand seems like a risk so existential that no security, legal or finance team would ever sign off on it. Maybe some version of this future will come to pass, but it’s hard to see it from here. I would not bet my career or my company on it happening.

In the meantime, we still need more senior engineers. The only way to grow them is by fixing the funnel.

Should every company hire junior engineers?

No. You need to be able to set them up for success. Some factors that disqualify you from hiring junior engineers:

  • You have less than two years of runway
  • Your team is constantly in firefighting mode, or you have no slack in your system
  • You have no experienced managers, or you have bad managers, or no managers at all
  • You have no product roadmap
  • Nobody on your team has any interest in being their mentor or point person

The only thing worse than never hiring any junior engineers is hiring them into an awful experience where they can’t learn anything. (I wouldn’t set the bar quite as high as Cindy does in this article; while I understand where she’s coming from, it is so much easier to land your second job than your first job that I think most junior engineers would frankly choose a crappy first job over none at all.)

Being a fully distributed company isn’t a complete dealbreaker, but it does make things even harder. I would counsel junior engineers to seek out office jobs if at all possible. You learn so much faster when you can soak up casual conversations and technical chatter, and you lose that working from home. If you are a remote employer, know that you will need to work harder to compensate for this. I suggest connecting with others who have done this successfully (they exist!) for advice.

I also advise companies not to start by hiring a single junior engineer. If you’re going to hire one, hire two or three. Give them a cohort of peers, so it’s a little less intimidating and isolating.

Nobody is coming to fix our problems for us

I have come to believe that the only way this will ever change is if engineers and engineering managers across our industry take up this fight and make it personal.

Most of the places I know of that have a program for hiring and training entry-level engineers have it only because an engineer decided to fight for it. Engineers—sometimes engineering managers—were the ones who made the case and pushed for resources, then designed the program, interviewed and hired the junior engineers, and set them up with mentors. This is not an exotic project, it is well within the capabilities of most motivated, experienced engineers (and good for your career as well).

Finance isn’t going to lobby for this. Execs aren’t likely to step in. The more a person’s role inclines them to treat engineers like fungible resources, the less likely they are to understand why this matters.

AI is not coming to solve all our problems and write all our code for us—and even if it were, it wouldn’t matter. Writing code is but a sliver of what professional software engineers do, and arguably the easiest part. Only we have the context and the credibility to drive the changes we know form the bedrock of great teams and engineering excellence.

Great teams are how great engineers get made. Nobody knows this better than engineers and EMs. It’s time for us to make the case, and make it happen.


The Cost Crisis in Observability Tooling

Originally posted on the Honeycomb blog on January 24th, 2024

The cost of services is on everybody’s mind right now, with interest rates rising, economic growth slowing, and organizational budgets increasingly feeling the pinch. But I hear a special edge in people’s voices when it comes to their observability bill, and I don’t think it’s just about the cost of goods sold. I think it’s because people are beginning to correctly intuit that the value they get out of their tooling has become radically decoupled from the price they are paying.

In the happiest cases, the price you pay for your tools is “merely” rising at a rate several times faster than the value you get out of them. But that’s actually the best case scenario. For an alarming number of people, the value they get actually decreases as their bill goes up.

Observability 1.0 and the cost multiplier effect

Are you familiar with this chestnut?

“Observability has three pillars: metrics, logs, and traces.”

This isn’t exactly true, but it’s definitely true of a particular generation of tools—one might even say definitionally true of a particular generation of tools. Let’s call it “observability 1.0.”

From an evolutionary perspective, you can see how we got here. Everybody has logs… so we spin up a service for log aggregation. But logs are expensive and everybody wants dashboards… so we buy a metrics tool. Software engineers want to instrument their applications… so we buy an APM tool. We start unbundling the monolith into microservices, and pretty soon we can’t understand anything without traces… so we buy a tracing tool. The front-end engineers point out that they need sessions and browser data… so we buy a RUM tool. On and on it goes.

Logs, metrics, traces, APM, RUM. You’re now paying to store telemetry five different ways, in five different places, for every single request. And a 5x multiplier is on the modest side of the spectrum, given how many companies pay for multiple overlapping tools in the same category. You may also be collecting:

  • Profiling data
  • Product analytics
  • Business intelligence data
  • Database monitoring/query profiling tools
  • Mobile app telemetry
  • Behavioral analytics
  • Crash reporting
  • Language-specific profiling data
  • Stack traces
  • CloudWatch or hosting provider metrics
  • …and so on.

So, how many times are you paying to store data about your user requests? What’s your multiplier? (If you have one consolidated vendor bill, this may require looking at your itemized bill.)

There are many types of tools, each gathering slightly different data for a slightly different use case, but underneath the hood there are really only three basic data types: metrics, unstructured logs, and structured logs. Each has its own distinctive trade-offs when it comes to how much it costs and how much value you can get out of it.

Metrics

Metrics are the great-granddaddy of telemetry formats; tiny, fast, and cheap. A “metric” consists of a single number, often with tags appended. All of the context of the request gets discarded at write time; each individual metric is emitted separately. This means you can never correlate one metric with another from the same request, or select all the metrics for a given request ID, user, or app ID, or ask arbitrary new questions about your metrics data.
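
As a toy illustration of that write-time context loss (no particular metrics library; `emit_metric` and `do_work` are made-up names), consider what a metrics-only instrumentation path actually records:

```python
# Toy sketch, not any real metrics client: each metric is a bare number plus
# tags, emitted on its own. `emit_metric` and `do_work` are hypothetical.
def handle_request(request):
    emit_metric("api_requests", 1, tags={"service": "billing"})
    status = do_work(request)
    if status == 503:
        emit_metric("errors_503", 1, tags={"service": "billing"})
    # Nothing ties the error counter back to this particular request, user,
    # or app ID -- that context was discarded the moment the counters fired.
    return status
```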

Metrics-based tools include vendors like Datadog and open-source projects like Prometheus. RUM tools are built on top of metrics to understand browser user sessions; APM tools are built on top of metrics to understand application performance.

When you set up a metrics tool, it generally comes prepopulated with a bunch of basic metrics, but the useful ones are typically the custom metrics you emit from your application.

Your metrics bill is usually dominated by the cost of these custom metrics. At minimum, your bill goes up linearly with the number of custom metrics you create. Which is unfortunate, because to restrain your bill from unbounded growth, you have to regularly audit your metrics, do your best to guess which ones are going to be useful in the future, and prune any you think you can afford to go without. Even in the hands of experts, these tools require significant oversight.

Linear cost growth is the goal, but it’s rarely achieved. The cost of each metric varies wildly depending on how you construct it, what the values are, how often it gets hit, etc. I’ve seen a single custom metric cost $30k per month. You probably have dozens of custom metrics per service, and it’s almost impossible to tell how much each of them costs you. Metrics bills tend to be incredibly opaque (possibly by design).

Nobody can understand their software or their systems with a metrics tool alone, because the metric is extremely limited in what it can do. No context, no cardinality, no strings… only basic static dashboards. For richer data, we must turn to logs.

Unstructured logs

You can understand much more about your code with logs than you can with metrics. Logs are typically emitted multiple times throughout the execution of the request, with one or a small number of nouns per log line, plus the request ID. Unstructured logs are still the default, although this is slowly changing.

The cost of unstructured logs is driven by a few things:

  • Write amplification. If you want to capture lots of rich context about the request, you need to emit a lot of log lines. If you are printing out just 10 log lines per request, per service, and you have half a dozen services, that’s 60 log events for every request.
  • Noisiness. It’s extremely easy to accidentally blow up your log footprint yet add no value—e.g., by putting a print statement inside a loop instead of outside the loop. Here, the usefulness of the data goes down as the bill shoots up.
  • Constraints on physical resources. Due to the write amplification of log lines per request, it’s often physically impossible to log everything you want to log for all requests or all users—it would saturate your NIC or disk. Therefore, people tend to use blunt instruments like these to blindly slash the log volume:
    • Log levels
    • Consistent hashes
    • Dumb sample rates

When you emit multiple log lines per request, you end up duplicating a lot of raw data; sometimes over half the bits are consumed by request ID, process ID, timestamp. This can be quite meaningful from a cost perspective.
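
Here is a rough sketch of what that amplification and duplication looks like in practice, using Python’s standard `logging` module and an invented checkout flow:

```python
import logging

log = logging.getLogger("checkout")

# One request, many narrow lines, each repeating the same request_id (plus
# whatever timestamp/process metadata your formatter prepends).
def handle_checkout(request_id, cart):
    log.info("request_id=%s starting checkout", request_id)
    log.info("request_id=%s cart_size=%d", request_id, len(cart))
    log.info("request_id=%s payment provider selected", request_id)
    log.info("request_id=%s payment authorized", request_id)
    log.info("request_id=%s order written to db", request_id)
    log.info("request_id=%s checkout complete", request_id)
    # Six lines per request, per service: multiply by request rate and
    # service count, and much of what you store is repeated boilerplate.
```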

All of these factors can be annoying. But the worst thing about unstructured logs is that the only thing you can do to query them is full text search. The more data you have, the slower it becomes to search that data, and there’s not much you can do about it.

Searching your logs over any meaningful length of time can take minutes or even hours, which means experimenting and looking around for unknown-unknowns is prohibitively time-consuming. You have to know what to look for in order to find it. Once again, as your logging bill goes up, the value goes down.

Structured logs

Structured logs are gaining adoption across the industry, especially as OpenTelemetry picks up steam. The nice thing about structured logs is that you can actually do things with the data other than slow, dumb string searches. If you’ve structured your data properly, you can perform calculations! Compute percentiles! Generate heatmaps!

Tools built on structured logs are so clearly the future. But just taking your existing logs and adding structure isn’t quite good enough. If all you do is stuff your existing log lines into key/value pairs, the problems of amplification, noisiness, and physical constraints remain unchanged—you can just search more efficiently and do some math with your data.
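
As a small illustration of what “doing math” on structured logs looks like—a sketch assuming newline-delimited JSON events with a `duration_ms` field, which is an assumption about your format, not a requirement:

```python
import json
import statistics

# Sketch: once log lines are key/value structured, you can compute instead
# of grepping. Assume each line is a JSON object with route and duration_ms.
lines = [
    '{"request_id": "a1", "route": "/checkout", "duration_ms": 112}',
    '{"request_id": "b2", "route": "/checkout", "duration_ms": 987}',
    '{"request_id": "c3", "route": "/checkout", "duration_ms": 140}',
    '{"request_id": "d4", "route": "/checkout", "duration_ms": 95}',
    '{"request_id": "e5", "route": "/login",    "duration_ms": 23}',
]

events = [json.loads(line) for line in lines]
checkout = [e["duration_ms"] for e in events if e["route"] == "/checkout"]

# statistics.quantiles(n=20) returns 19 cut points; the last one is the p95.
p95 = statistics.quantiles(checkout, n=20)[-1]
print(f"p95 checkout latency: {p95:.0f} ms")
```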

There are a number of things you can and should do to your structured logs in order to use them more effectively and efficiently. In order of achievability:

  • Instrument your code using the principles of canonical logs, collecting all the vital characteristics of a request into one wide, dense event (see the sketch after this list). It is difficult to overstate the value of doing this, for reasons of usefulness and usability as well as cost control.
  • Add trace IDs and span IDs so you can trace your code using the same events instead of having to use an entirely separate tool.
  • Feed your data into a columnar storage engine so you don’t have to predefine a schema or indexes to decide which dimensions future you can search or compute based on.
  • Use a storage engine that supports high cardinality, with an explorable interface.
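
Here’s a minimal sketch of what a canonical log line can look like in application code. This is illustrative only: the field names, the `do_work` call, and the `BUILD_SHA` constant are all assumptions, and in real life you’d ship the event to your telemetry pipeline rather than printing it.

```python
import json
import time
import uuid

BUILD_SHA = "abc123"  # placeholder; normally injected at build time

def handle_request(request):
    # One wide, dense event per request, built up as the request executes
    # and emitted exactly once at the end.
    event = {
        "timestamp": time.time(),
        "trace_id": request.headers.get("x-trace-id", str(uuid.uuid4())),
        "span_id": str(uuid.uuid4()),
        "service": "checkout",
        "route": request.path,
        "user_id": request.user_id,   # high-cardinality fields welcome
        "app_id": request.app_id,
        "build_sha": BUILD_SHA,
    }
    start = time.monotonic()
    try:
        response = do_work(request)          # hypothetical handler
        event["status_code"] = response.status_code
        return response
    except Exception as exc:
        event["status_code"] = 500
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = (time.monotonic() - start) * 1000
        print(json.dumps(event))             # stand-in for "send to pipeline"
```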

If you go far enough down this path of enriching your structured events, instrumenting your code with the right data, and displaying it in real time, you will reach an entirely different set of capabilities, with a cost model so distinct it can only be described as “observability 2.0.” More on that in a second.

Ballooning costs are baked into observability 1.0

To recap: high costs are baked into the observability 1.0 model. Every pillar has a price.

You have to collect and store your data—and pay to store it—again and again and again, for every single use case. Depending on how many tools you use, your observability bill may be growing at a rate 3x faster than your traffic is growing, or 5x, or 10x, or even more.

It gets worse. As your costs go up, the value you get out of your tools goes down.

  • Your logs get slower and slower to search.
  • You have to know what you’re searching for in order to find it.
  • You have to use blunt-force sampling techniques to keep log volume from blowing up.
  • Any time you want to be able to ask a new question, you first have to commit new code and deploy it.
  • You have to guess which custom metrics you’ll need and which fields to index in advance.
  • As volume goes up, your ability to find a needle in the haystack—any unknown-unknowns—goes down commensurately.

And nothing connects any of these tools. You cannot correlate a spike in your metrics dashboard with the same requests in your logs, nor can you trace one of the errors. It’s impossible. If your APM and metrics tools report different error rates, you have no way of resolving this confusion. The only thing connecting any of these tools is the intuition and straight-up guesses made by your most senior engineers. Which means that the cognitive costs are immense, and your bus factor risks are very real. The most important connective data in your system—connecting metrics with logs, and logs with traces—exists only in the heads of a few people.

At the same time, the engineering overhead required to manage all these tools (and their bills) rises inexorably. With metrics, an engineer needs to spend time auditing your metrics, tracking people down to fix poorly constructed metrics, and reaping those that are too expensive or don’t get used. With logs, an engineer needs to spend time monitoring the log volume, watching for spammy or duplicate log lines, pruning or consolidating them, choosing and maintaining indexes.

But all this time spent wrangling observability 1.0 data types isn’t even the costliest part. The most expensive part is the unseen costs inflicted on your engineering organization as development slows down and tech debt piles up, due to low visibility and thus low confidence.

Is there an alternative? Yes.

The cost model of observability 2.0 is very different

Observability 2.0 has no three pillars; it has a single source of truth. Observability 2.0 tools are built on top of arbitrarily-wide structured log events, also known as spans. From these wide, context-rich structured log events you can derive the other data types (metrics, logs, or traces).

Since there is only one data source, you can correlate and cross-correlate to your heart’s content. You can switch fluidly back and forth between slicing and dicing, breaking down or grouping by events, and viewing them as a trace waterfall. You don’t have to worry about cardinality or key space limitations.

You also effectively get infinite custom metrics, since you can append as many as you want to the same events. Not only does your cost not go up linearly as you add more custom metrics, your telemetry just gets richer and more valuable the more key-value pairs you add! Nor are you limited to numbers; you can add any and all types of data, including valuable high-cardinality fields like “App Id” or “Full Name.”

Observability 2.0 has its own amplification factor to consider. As you instrument your code with more spans per request, the number of events you have to send (and pay for) goes up. However, you have some very powerful tools for dealing with this: you can perform dynamic head-based sampling or even tail-based sampling, where you decide whether or not to keep the event after it’s finished, allowing you to capture 100% of slow requests and other outliers.
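
For instance, a tail-based sampling decision can be as simple as the sketch below; the thresholds and field names are illustrative, not prescriptive.

```python
import random

def should_keep(event, baseline_rate=0.05, slow_ms=1000):
    """Decide after the request has finished whether to keep its event."""
    if event.get("error") or event.get("status_code", 200) >= 500:
        return True                          # keep 100% of errors
    if event.get("duration_ms", 0) >= slow_ms:
        return True                          # keep 100% of slow outliers
    return random.random() < baseline_rate   # sample the boring bulk at ~5%

# Sampled events should record their sample rate (e.g. 20 for 1-in-20) so
# downstream tools can re-weight counts accordingly.
```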

Engineering time is your most precious resource

But the biggest difference between observability 1.0 and 2.0 won’t show up on any invoice. The difference shows up in your engineering team’s ability to move quickly, with confidence.

Modern software engineering is all about hooking up fast feedback loops. And observability 2.0 tooling is what unlocks the kind of fine-grained, exploratory experience you need in order to accelerate those feedback loops.

Where observability 1.0 is about MTTR, MTTD, reliability, and operating software, observability 2.0 is what underpins the entire software development lifecycle, setting the bar for how swiftly you can build and ship software, find problems, and iterate on them. Observability 2.0 is about being in conversation with your code, understanding each user’s experience, and building the right things.

Observability 2.0 isn’t exactly cheap either, although it is often less expensive. But the key difference between o11y 1.0 and o11y 2.0 has never been that either is cheap; it’s that with observability 2.0, when your bill goes up, the value you derive from your telemetry goes up too. You pay more money, you get more out of your tools. As you should.

Interested in learning more? We’ve written at length about the technical prerequisites for observability with a single source of truth (“observability 2.0” as we’ve called it here). Honeycomb was built to this spec; ServiceNow (formerly Lightstep) and Baselime are other vendors that qualify.

CORRECTION: The original version of this document said that “nothing connects any of these tools.” If you are using a single unified vendor for your metrics, logging, APM, RUM, and tracing tools, this is not strictly true. Vendors like New Relic or Datadog now let you define certain links between your traces and metrics, which allows you to correlate between data types in a few limited, predefined ways. This is better than nothing! But it’s very different from the kind of fluid, open-ended correlation capabilities that we describe with o11y 2.0. With o11y 2.0, you can slice and dice, break down, and group by your complex data sets, then grab a trace that matches any specific set of criteria at any level of granularity. With o11y 1.0, you can define a metric up front, then grab a random exemplar of that metric, and that’s it. All the limitations of metrics still apply; you can’t correlate any metric with any other metric from that request, app, user, etc, and you certainly can’t trace arbitrary criteria. But you’re right, it’s not nothing. 😸


LLMs Demand Observability-Driven Development

Originally posted on the Honeycomb blog on September 20th, 2023

Our industry is in the early days of an explosion in software using LLMs, as well as (separately, but relatedly) a revolution in how engineers write and run code, thanks to generative AI.

Many software engineers are encountering LLMs for the very first time, while many ML engineers are being exposed directly to production systems for the very first time. Both types of engineers are finding themselves plunged into a disorienting new world—one where a particular flavor of production problem they may have encountered occasionally in their careers is now front and center.

Namely, that LLMs are black boxes that produce nondeterministic outputs and cannot be debugged or tested using traditional software engineering techniques. Hooking these black boxes up to production introduces reliability and predictability problems that can be terrifying. It’s important to understand this, and why.

100% debuggable? Maybe not

Software is traditionally assumed to be testable, debuggable, and reproducible, depending on the flexibility and maturity of your tooling and the complexity of your code. The original genius of computing was one of constraint: by radically constraining language and mathematics to a defined set, we could create algorithms that would run over and over and always return the same result. In theory, all software is debuggable. However, there are lots of things that can chip away at that beauteous goal and make your software mathematically less than 100% debuggable, like:

  • Adding concurrency and parallelism.
  • Certain types of bugs.
  • Stacking multiple layers of abstractions (e.g., containers).
  • Randomness.
  • Using JavaScript (HA HA).

There is a much longer list of things that make software less than 100% debuggable in practice. Some of these things are related to cost/benefit tradeoffs, but most are about weak telemetry, instrumentation, and tooling.

If you have only instrumented your software with metrics, for example, you have no way of verifying that a spike in api_requests and an identical spike in 503 errors are for the same events (i.e., you are getting a lot of api_requests returning 503) or for a disjoint set of events (the spike in api_requests is causing general congestion causing a spike in 503s across ALL events). It is mathematically impossible; all you can do is guess. But if you have a log line that emits both the request_path and the error_code, and a tool that lets you break down and group by arbitrary dimensions, this would be extremely easy to answer. Or if you emit a lot of events or wide log lines but cannot trace them, or determine what order things executed in, there will be lots of other questions you won’t be able to answer.
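
To make that concrete, here is a toy version of the wide-event approach; the events and field names are invented, but the point is that the “same events or disjoint events?” question reduces to a group-by:

```python
from collections import Counter

# Invented events: one wide event per request, carrying both the request
# path and the resulting error code on the same record.
events = [
    {"request_path": "/api/export", "error_code": 503},
    {"request_path": "/api/export", "error_code": 503},
    {"request_path": "/api/login",  "error_code": 200},
    {"request_path": "/api/export", "error_code": 200},
]

by_path_and_code = Counter((e["request_path"], e["error_code"]) for e in events)
for (path, code), count in by_path_and_code.most_common():
    print(f"{path} -> {code}: {count}")

# If the 503s pile up under a single path, the two spikes are the same
# events; if they're spread evenly across paths, you're looking at general
# congestion instead.
```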

There is another category of software errors that are logically possible to debug, but prohibitively expensive in practice. Any time you see a report from a big company that tracked down some obscure error in a kernel or an ethernet device, you’re looking at one of the rare entities with 1) enough traffic for these one in a billion errors to be meaningful, and 2) enough raw engineering power to dedicate to something most of us just have to live with.

But software is typically understandable because we have given it structure and constraints.

IF (); THEN (); ELSE () is testable and reproducible. Natural languages, on the other hand, are infinitely more expressive than programming languages, query languages, or even a UI that users interact with. The most common and repeated patterns may be fairly predictable, but the long tail your users will create is very long, and they expect meaningful results there, as well. For complex reasons that we won’t get into here, LLMs tend to have a lot of randomness in the long tail of possible results.

So with software, if you ask the exact same question, you will always get the exact same answer. With LLMs, you might not.

LLMs are their own beast

Unit testing involves asserting predictable outputs for defined inputs, but this obviously cannot be done with LLMs. Instead, ML teams typically build evaluation systems to evaluate the effectiveness of the model or prompt. However, to get an effective evaluation system bootstrapped in the first place, you need quality data based on real use of an ML model. With software, you typically start with tests and graduate to production. With ML, you have to start with production to generate your tests.

Even bootstrapping with early access programs or limited user testing can be problematic. It might be ok for launching a brand new feature, but it’s not good enough for a real production use case.

Early access programs and user testing often fail to capture the full range of user behavior and potential edge cases that may arise in real-world usage when there are a wide range of users. All these programs do is delay the inevitable failures you’ll encounter when an uncontrolled and unprompted group of end users does things you never expected them to do.

Instead of relying on an elaborate test harness to give you confidence in your software a priori, it’s a better idea to embrace a “ship to learn” mentality and release features earlier, then systematically learn from what is shipped and wrap that back into your evaluation system. And once you have a working evaluation set, you also need to figure out how quickly the result set is changing.

Phillip gives this list of things to be aware of when building with LLMs:

  • Failure will happen—it’s a question of when, not if.
  • Users will do things you can’t possibly predict.
  • You will ship a “bug fix” that breaks something else.
  • You can’t really write unit tests for this (nor practice TDD).
  • Latency is often unpredictable.
  • Early access programs won’t help you.

Sound at all familiar? 😂

Observability-driven development is necessary with LLMs

Over the past decade or so, teams have increasingly come to grips with the reality that the only way to write good software at scale is by looping in production via observability—not by test-driven development, but observability-driven development. This means shipping sooner, observing the results, and wrapping your observations back into the development process.

Modern applications are dramatically more complex than they were a decade ago. As systems get increasingly complex, and nondeterministic outputs and emergent properties become the norm, the only way to understand them is by instrumenting the code and observing it in production. LLMs are simply on the far end of a spectrum that has become ever more unpredictable and unknowable.

Observability—both as a practice and a set of tools—tames that complexity and allows you to understand and improve your applications. We have written a lot about what differentiates observability from monitoring and logging, but the most important bits are 1) the ability to gather and store telemetry as very wide events, ordered in time as traces, and 2) the ability to break down and group by any arbitrary, high-cardinality dimension. This allows you to explore your data and group by frequency, input, or result.

In the past, we used to warn developers that their software usage patterns were likely to be unpredictable and change over time; now we inform you that if you use LLMs, your data set is going to be unpredictable, and it will absolutely change over time, and you must have a way of gathering, aggregating, and exploring that data without locking it into predefined data structures.

With good observability data, you can use that same data to feed back into your evaluation system and iterate on it in production. The first step is to use this data to evaluate the representativity of your production data set, which you can derive from the quantity and diversity of use cases.

You can make a surprising number of improvements to an LLM-based product without even touching any prompt engineering, simply by examining user interactions, scoring the quality of the responses, and acting on the correctable errors (mainly data model mismatches and parsing/validation failures). You can fix or handle these manually in the code, which also gives you a bunch of test cases proving that your corrections actually work! These tests will not verify that a particular input always yields a correct final output, but they will verify that a correctable LLM output can indeed be corrected.
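
A minimal sketch of that loop, assuming the LLM is supposed to return a JSON ticket with `title` and `priority` fields—the schema, the repair rule, and the test are all invented for illustration:

```python
import json

REQUIRED_FIELDS = {"title", "priority"}

def parse_llm_ticket(raw_output: str) -> dict:
    """Validate LLM output against the data model we expected, repairing
    cheap, predictable mismatches in code."""
    data = json.loads(raw_output)                 # parsing check: may raise
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Correctable mismatch: the model likes "High"/"HIGH"; our schema wants ints.
    if isinstance(data["priority"], str):
        data["priority"] = {"low": 3, "medium": 2, "high": 1}[data["priority"].lower()]
    return data

def test_priority_string_is_corrected():
    # Verifies that the correction works -- not that the model is always right.
    fixed = parse_llm_ticket('{"title": "Fix login", "priority": "High"}')
    assert fixed["priority"] == 1
```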

You can go a long way in the realm of pure software, without reaching for prompt engineering. But ultimately, the only way to improve LLM-based software is by adjusting the prompt, scoring the quality of the responses (or relying on scores provided by end users), and readjusting accordingly. In other words, improving software that uses LLMs can only be done by observability and experimentation. Tweak the inputs, evaluate the outputs, and every now and again, consider your dataset for representivity drift.

Software engineers who are used to boolean/discrete math and TDD now need to concern themselves with data quality, representivity, and probabilistic systems. ML engineers need to spend more time learning how to develop products and concern themselves with user interactions and business use cases. Everyone needs to think more holistically about business goals and product use cases. There’s no such thing as an LLM that gives good answers that don’t serve the business reason it exists, after all.

So, what do you need to get started with LLMs?

Do you need to hire a bunch of ML experts in order to start shipping LLM software? Not necessarily. You cannot (there aren’t enough of them), you should not (this is something everyone needs to learn), and you don’t want to (these are changes that will make software engineers categorically more effective at their jobs). Obviously you will need ML expertise if your goal is to build something complex or ambitious, but entry-level LLM usage is well within the purview of most software engineers. It is definitely easier for software engineers to dabble in using LLMs than it is for ML engineers to dabble in writing production applications.

But learning to write and maintain software in the manner of LLMs is going to transform your engineers and your engineering organizations. And not a minute too soon.

The hardest part of software has always been running it, maintaining it, and understanding it—in other words, operating it. But this reality has been obscured for many years by the difficulty and complexity of writing software. We can’t help but notice the upfront cost of writing software, while the cost of operating it gets amortized over many years, people, and teams, which is why we have historically paid and valued software engineers who write code more than those who own and operate it. When people talk about the 10x engineer, everyone automatically assumes it means someone who churns out 10x as many lines of code, not someone who can operate 10x as much software.

But generative AI is about to turn all of these assumptions upside down. All of a sudden, writing software is as easy as sneezing. Anyone can use ChatGPT or other tools to generate reams of code in seconds. But understanding it, owning it, operating it, extending and maintaining it… all of these are more challenging than ever, because in the past, the way most of us learned to understand software was by writing it.

What can we possibly do to make sure our code makes sense and works, and is extendable and maintainable (and our code base is consistent and comprehensible) when we didn’t go through the process of writing it? Well, we are in the early days of figuring that out, too. 🙃

If you’re an engineer who cares about your craft: do code reviews. Follow coding standards and conventions. Write (or generate) tests for it. But ultimately, the only way you can know for sure whether or not it works is to ship it to production and watch what happens.

This has always been true, by the way. It’s just more true now.

If you’re an engineer adjusting to the brave new era: take some of that time you used to spend writing lines of code and reinvest it back into understanding, shipping under controlled circumstances, and observing. This means instrumenting your code with intention, and inspecting its output. This means shipping as soon as possible into the production environment. This means using feature flags to decouple deploys from releases and gradually roll new functionality out in a controlled fashion. Invest in these—and other—guardrails to make the process of shipping software more safe, fine-grained, and controlled.
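
As one example of those guardrails, a percentage-based feature flag can be as small as the sketch below; no particular flag vendor is assumed, and the flag name, rollout table, and checkout functions are made up.

```python
import hashlib

# Config you can change without deploying code: the new path ships dark,
# then the rollout percentage is raised gradually while you watch telemetry.
ROLLOUT_PERCENT = {"new_checkout_flow": 5}

def flag_enabled(flag: str, user_id: str) -> bool:
    percent = ROLLOUT_PERCENT.get(flag, 0)
    # Hash the user into a stable 0-99 bucket so each user sees a consistent variant.
    bucket = int(hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < percent

def checkout(user):
    if flag_enabled("new_checkout_flow", user.id):
        return new_checkout_flow(user)   # hypothetical new code path
    return old_checkout_flow(user)       # hypothetical existing path
```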

Most of all, it means developing the habit of looking at your code in production, through the lens of your telemetry, and asking yourself: does this do what I expected it to do? Does anything else look weird?

Or maybe I should say “looking at your systems” instead of “looking at your code,” since people might confuse the latter with an admonition to “read the code.” The days when you could predict how your system would behave simply by reading lines of code are long, long gone. Software behaves in unpredictable, emergent ways, and the important part is observing your code as it’s running in production, while users are using it. Code in a buffer can tell you very little.

This future is a breath of fresh air

This, for once, is not a future I am afraid of. It’s a future I cannot wait to see manifest. For years now, I’ve been giving talks on modern best practices for software engineering—developers owning their code in production, testing in production, observability-driven development, continuous delivery in a tight feedback loop, separating deploys from releases using feature flags. No one really disputes that life is better, code is better, and customers are happier when teams adopt these practices. Yet, only 11% of teams can deploy their code in less than a day, according to the DORA report. Only a tiny fraction of teams are operating in the way everybody agrees we all should!

Why? The answers often boil down to organizational roadblocks, absurd security/compliance policies, or lack of buy-in/prioritizing. Saddest of all are the ones who say something like, “our team just isn’t that good” or “our people just aren’t that smart” or “that only works for world-class teams like the Googles of the world.” Completely false. Do you know what’s hard? Trying to build, run, and maintain software on a two month delivery cycle. Running with a tight feedback loop is so much easier.

Just do the thing

So how do teams get over this hump and prove to themselves that they can have nice things? In my experience, only one thing works: when someone joins the team who has seen it work before, has confidence in the team’s abilities, and is empowered to start making progress against those metrics (which they tend to try to do, because people who have tried writing code the modern way become extremely unwilling to go back to the bad old ways).

And why is this relevant?

I hypothesize that over the course of the next decade, developing with LLMs will stop being anything special, and will simply be one skill set of many, alongside mobile development, web development, etc. I bet most engineers will be writing code that interacts with an LLM. I bet it will become not quite as common as databases, but up there. And while they’re doing that, they will have to learn how to develop using short feedback loops, testing in production, observability-driven development, etc. And once they’ve tried it, they too may become extremely unwilling to go back.

In other words, LLMs might ultimately be the Trojan Horse that drags software engineering teams into the modern era of development best practices. (We can hope.)

In short, LLMs demand we modify our behavior and tooling in ways that will benefit even ordinary, deterministic software development. Ultimately, these changes are a gift to us all, and the sooner we embrace them the better off we will be.


Ritual Brilliance: How a pair of Shrek ears shaped Linden Lab culture by making failure funny — and safe

[Originally posted on the now-defunct “Roadmap: A Magazine About Work” website, on May 30th, 2023. A pretty, nicely-formatted PDF version of this article can be downloaded here. Thanks to Molly McArdle for editing!]

If you talk to former Lindens about the company’s culture—and be careful, because we will do so at length—you will eventually hear about the Shrek ears.

When you saw a new person wearing the Shrek ears, a matted green-felt headband with ogre ears on it, you introduced yourself, congratulated them warmly, and begged to hear the story of how they came to be wearing them. Then you welcomed the new person to the team (“You’re truly one of us now!”) and shared a story about a time when you did something even dumber than they did.

My first job after (dropping out of) college was at Linden Lab, the home of Second Life. I joined in 2004 and stayed for nearly six years, during which the company grew from around 25 nerds in a room to around 400 employees who worked out of offices in Brighton, San Francisco, Menlo Park, and Singapore, or their own homes—wherever they were.

When I think back on that time now, almost two decades later, I’m puzzled by the Shrek ears phenomenon. I wasn’t exactly powerful then, at barely 20 years old. Not only was this my first real job, I was also the first woman engineer, and I made tons of mistakes. Shouldn’t I have found the practice of being systematically singled out and spotlighted for my errors humiliating, shaming, and traumatic?

Yet I remember loving the tradition and participating with joy and vigor. Everyone else seemed to love it, too. The practice spread beyond engineering and out into the rest of the company, not by fiat but because individual people would voluntarily track down the Shrek ears and put them on their own head. (I’m not imagining this, right?)

Step 1, break production; Step 2, put on Shrek ears

Here’s how it worked: The first time an engineer broke production or caused a major outage, they would seek out the ears and put them on for the day. The ears weren’t a mark of shame—they were a badge of honor! Everyone breaks production eventually, if they’re working on something meaningful.

If people saw you wearing the ears, they would eagerly ask, “What happened? How did you find the problem? What was the fix?” Then they would regale you with their own stories of breaking production or tell you about the first outage they caused. If the person was self-flagellating or being too hard on themselves, the Shrek ears gave their colleagues an excuse to kindly but firmly correct it on the spot. It was Linden’s way of saying, Hey, we don’t do that here: “You did the reasonable thing! How can we make the system better, so the next person doesn’t stumble into the same trap?”

In those days, Linden was running a massively distributed system across multiple data centers on three continents, and doing so without the help of DevOps, CI/CD, GitHub, virtualization, the cloud, or infrastructure as code. We had an incredibly high-performing operations team, with a thousand-to-one server-to-ops engineer ratio, which was a real achievement in the days when the role required doing everything from racking and stacking boxes in the colocation center to developing your own automation software.

Failures were just fucking inevitable. In a world like that, devoid of the entire toolchain ecosystem we’ve come to rely on, you just had to learn to roll with it, absorb the hits, and keep moving fast. You could only test so much in staging; it was more important to get it out into production and watch it—understand it—there. It was better to invest in swift recovery, graceful degradation, and decoupling services than to focus on trying to prevent anything from going wrong. (Still is, as a matter of fact.)

This might all sound a little overwrought to you—maybe even dangerous or irresponsible. Didn’t we care about quality? Were we bad engineers?

The Shrek ears were “blameless retros” before there were blameless retros

I assure you, we cared. The engineers I worked with at Linden were of at least as high a caliber as the engineers I later worked with at Facebook (and a whole lot more diverse). In this specific place and time, the Shrek ears were what we needed to alleviate paralysis and fear of production, and to encourage the sharing of knowledge—even if anecdotal—about our systems.

In retrospect, the Shrek ears were a brilliant piece of social jujitsu. There was an element of shock value or contrarianism in celebrating outages instead of getting all worked up about them. But the larger purpose of the ears was to reset people’s expectations (especially in the case of new hires) and reprogram them with a different set of values: Linden’s values.

In the years since those early days at Linden, the industry has developed an entire language and set of practices around dealing with the aftermath of incidents: blameless post mortems, retrospectives, and so on. But those tools weren’t available to us at the time. What we did have was the Shrek ears. A couple of times a month something would break, the ears would be claimed, and we would all go around reminding one another that failure is both inevitable and ridiculous, and that no one is going to get mad at you or fire you when it happens.

Failure is always a question of when, not if

It’s important to note that you never saw anyone get teased or shamed for wearing the ears or for breaking production. There was a script to follow, and we all knew it. We learned it from watching others put on the ears, or by donning them ourselves. On a day when the Shrek ears had appeared, people would gather around at lunch or at the bar after work and swap war stories, one-upping one another and laughing uproariously.

Every new engineer was told, “If you never break production, you probably aren’t doing anything that really matters or taking enough risks.”

It’s also important to emphasize that the ears were opt-in, not opt-out. You didn’t have to do it. And if you did take them, you could expect a wave of sympathy, good humor, and support. It affirmed that you deserved to be here, that you were part of the team.

And though the Shrek ears started in engineering, people in sales, marketing, accounting, and other departments picked them up over the years. It was a process of voluntary adoption, not a top-down policy. Someone would announce in IRC that they were wearing the ears today, and why. The camaraderie and laughter that ensued were infectious—and made it easier and easier over time for people to be transparent about what wasn’t working.

Rituals exist to instill values and train culture

In Rituals for Work, Kursat Ozenc defines rituals as “actions that a person or group does repeatedly, following a similar pattern or script, in which they’ve imbued symbolism and meaning.” Ritual exists to instill a value, create a mindset, or train a reflex.

And this particular ritual was extremely effective at taking lots of scared engineers and teaching them, very quickly:

✨ It is safe to fail ✨
✨ Failure is constant ✨
✨ Failure is fucking hilarious ✨

At Linden, failure was not something to be ashamed of or to hide from your teammates. We understood that it’s not something that happens only to careless or inexperienced people. In fact, the senior people have the funniest fuckups—because what they are trying to do is insanely hard. The Shrek ears taught us that you fail, you laugh, you drink whiskey, you move on.

Other companies had similar rituals around the same time—Etsy famously had the “three-armed sweater,” which they would pass around to whoever had last broken production. But I’ve never again worked at a place where mistakes were discussed as freely and easily across the entire company as they were at Linden Lab. And I think the Shrek ears had a lot to do with that.

Their point was never to single out the person who had made a mistake and humiliate them, but the exact opposite. By putting on the ears, you said not just “Hi, I made a mistake” but also “I’m going to be brave about it, so we can all collectively learn and improve.” It was a ritualized act of bravery rewarded by affirmation, empathy, and acceptance. At Linden, the Shrek ears weren’t just a terrific tool for promoting team coherence and creating a sense of belonging. They also provided structure to help individuals and teams recover from scary events, and even traumas.

In so many ways, Linden Lab was ahead of its time

Linden was an extremely strange workplace when I was there, and it inspired unusually strong devotion, which we self-deprecatingly referred to as “the Kool-Aid.” It can be difficult to convey just how radical and weird it was at the time because the world has changed so much since then, and so many of the company’s “weird” philosophies have since gone mainstream. (Though not all: using “Kool-Aid” as a casual phrase to denote “excessive enthusiasm” or “cult-like devotion” is now recognized by many as being in poor taste. After all, people actually died at the Jonestown massacre.)

In a lot of ways, Linden culture (and Second Life technology) was profoundly, recognizably modern, and similar to the best workplaces of today, 20+ years later.

Philip Rosedale, Linden’s founder and CEO, is an inventor and technologist who believed it was every inch as interesting and important to experiment with company culture as with the virtual worlds we built. Except we did it all from scratch: building the technology and the culture together. And this led us down some weird rabbit holes, such as a cron job that rsynced the entire file system down over thousands of live servers every night. And the Shrek ears.

There was a period when “Choose your own work” was a company core value, and there were effectively no managers. (Not every experiment worked!) We went all-in on a fully distributed company culture at a time when practically no one else had. We ran a massively distributed, high-concurrency virtual world at a time before microservices, sharded databases, config management, virtualization, AWS, or SRE and DevOps.

I can understand why people now find this story horrifying

With the distance of time, I get why the Shrek ears might make you recoil. If you think “That sounds awful! What kind of monsters would do that to each other?”—you are far from alone. Any time I mention the story in public, a sizable minority of people are aghast and appalled. Representative quotes include:

“I hope you realize how many people you traumatized by doing this to them.”

“I wonder how many introverted people found this excruciating but were too
afraid to say so.”

“Office bullying is fucked up even with cute Shrek ears.”

Even:

“We heard about the Shrek ears from an engineer we interviewed. He was telling us how great they were, but we were all so horrified that we declined to hire him because of it.”

And they’re right. It sounds awful to us now. It really does! It sounds like we were singling people out for their failures, like a dunce cap. I wouldn’t be surprised to someday learn that, in fact, a small number of people did feel pressured into wearing the ears, or hated them and were too afraid to say something. But how do we account for the fact that this tradition was so deeply beloved by so many—and that we are still fondly reminiscing about it more than 15 years later? It had a purpose.

Linden Lab was an incredibly progressive company for its time: very anti-hierarchical, very much about empowering people to be creative and independent. It also was by far the most diverse company I’ve ever worked in (other than Honeycomb, which I cofounded and where I’m CTO), with lots of women and genderqueer and trans people and people of color. We were way out on the sensitive end of the spectrum for tech at that time. It’s tough to square this knowledge of what Linden was like as a place with the reactions some people outside the organization have to the Shrek ears.

I think this is, above all, a sign of progress. So many questionable practices that were ordinary back then—like referring to everyone as “guys,” using terms like “master/slave” for replication, or throwing alcohol-sloshed parties—are now rightfully frowned upon. We have become more sensitive to people’s differences and more clued into the power dynamics of the workplace. It’s far from perfect, but it is a lot better.

As a ritual, the Shrek ears were powerful and did the job. They were also fun—proving once again that making something goofy is the best way to make it stick. But I can’t imagine plopping Shrek ears on a new hire who has just broken production in 2023. And honestly, I think that’s probably a good thing. It’s time for new rituals.

Ritual Brilliance: How a pair of Shrek ears shaped Linden Lab culture by making failure funny — and safe

Deploys Are The ✨WRONG✨ Way To Change User Experience

This piece was first published on the honeycomb.io blog on 2023-03-08.

….

I’m no stranger to ranting about deploys. But there’s one thing I haven’t sufficiently ranted about yet, which is this: Deploying software is a terrible, horrible, no good, very bad way to go about the process of changing user-facing code.

It sucks even if you have excellent, fast, fully automated deploys (which most of you do not). Relying on deploys to change user experience is a problem because it fundamentally confuses and scrambles up two very different actions: Deploys and releases.

Deploy

“Deploying” refers to the process of building, testing, and rolling out changes to your production software. Deploying should happen very often, ideally several times a day. Perhaps even triggered every time an engineer lands a change.

Everything we know about building and changing software safely points to the fact that speed is safety and smaller changes make for safer deploys. Every deploy should apply a small diff to your software, and deploys should generally be invisible to users (other than minor bug fixes).

Release

“Releasing” refers to the process of changing user experience in a meaningful way. This might mean anything from adding functionality to adding entire product lines. Most orgs have some concept of above or below the fold where this matters. For example, bug fixes and small requests can ship continuously, but larger changes call for a more involved process that could mean anything from release notes to coordinating a major press release.

A tale of cascading failures

Have you ever experienced anything like this?

Your company has been working on a major new piece of functionality for six months now. You have tested it extensively in staging and dev environments, even running load tests to simulate production. You have a marketing site ready to go live, and embargoed articles on TechCrunch and The New Stack that will be published at 10:00 a.m. PST. All you need to do now is time your deploy so the new product goes live at the same time.

It takes about three hours to do a full build, test, and deploy of your entire system. You’ve deployed as much as possible in advance, and you’ve already built and tested the artifacts, so all you have to do is a streamlined subset of the deploy process in the morning. You’ve gotten it down to just about an hour. You are paranoid, so you decide to start an hour early. So you kick off the deploy script at 8:00 a.m. PST… and sit there biting your nails, waiting for it to finish.

SHIT! 20 minutes through the deploy, there’s a random flaky SSH timeout that causes the whole thing to cancel and roll back. You realize that by running a non-standard subset of the deploy process, some of your error handling got bypassed. You frantically fix it and restart the whole process.

Your software finishes deploying at 9:30 a.m., 30 minutes before the embargoed articles go live. Visitors to your website might be confused in the meantime, but better to finish early than to finish late, right? 😬

Except… as 10:00 a.m. rolls around, and new users excitedly begin hitting your new service, you suddenly find that a path got mistyped, and many requests are returning 500. You hurriedly merge a fix and begin the whole 3-hour long build/test/deploy process from scratch. How embarrassing! 🙈

Deploys are a terrible way to change user experience

The build/release/deploy process generally has a lot of safeguards and checks baked in to make sure it completes correctly. But as a result…

  • It’s slow
  • It’s often flaky
  • It’s unreliable
  • It’s staggered
  • The process itself is untestable
  • It can be nearly impossible to time it right
  • It’s very all or nothing—the norm is to roll back completely upon any error
  • Fixing a single character mistake takes the same amount of time as doubling the feature set!

Changing user-visible behaviors and feature sets using the deploy process is a great way to get egg on your face. Because the process is built for shipping large batches of code and artifacts; user experience gets changed only as a side effect.

So how should you change user experience?

By using feature flags.

Feature flags: the solution to many of life’s software problems

You should deploy your code continuously throughout the day or week. But you should wrap any large, user-visible behavior changes behind a feature flag, so you can release that code by flipping a flag.

This enables you to develop safely without worrying about what your users see. It also means that turning a feature on and off no longer requires a diff, a code review, or a deploy. Changing user experience is no longer an engineering task at all.

Deploys are an engineering task, but releases can be done by product managers—even marketing teams. Instead of trying to calculate when to begin deploying by working backwards from 10:00 a.m., you simply flip the switch at 10:00 a.m.
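
To make that concrete, here is a deliberately tiny sketch in Go. The flag name, the environment-variable “flag store,” and the page-rendering function are all invented for illustration; real flag systems are much fancier, but the shape is the same: the new code path ships with an ordinary deploy and stays dark until someone flips the flag.

    package main

    import (
        "fmt"
        "os"
    )

    // newLandingPageEnabled is the "release switch." Here it is just an
    // environment variable; in real life it might be LaunchDarkly, a config
    // table, or anything else you can change without shipping a new build.
    func newLandingPageEnabled() bool {
        return os.Getenv("FLAG_NEW_LANDING_PAGE") == "on"
    }

    func renderLandingPage() string {
        if newLandingPageEnabled() {
            return "shiny new landing page" // deployed days ago, dark until now
        }
        return "current landing page" // existing behavior, untouched
    }

    func main() {
        fmt.Println(renderLandingPage())
    }

In a real system the flag check would hit a flag service or config store so the flip takes effect without a restart, but the division of labor is the point: the deploy happened whenever it happened, and flipping the flag at 10:00 a.m. is the release.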

Testing in production, progressive delivery

The benefits of decoupling deploys and releases extend far beyond timely launches. Feature flags are a critical tool for apostles of testing in production (spoiler alert: everybody tests in production, whether they admit it or not; good teams are aware of this and build tools to do it safely). You can use feature flags to do things like:

  • Enable the code for internal users only
  • Show it to a defined subset of alpha testers, or a randomized few
  • Gradually ramp up the percentage of users who see the new code. This is super helpful when you aren’t sure how much load it will place on a backend component (there’s a small sketch of this just after the list)
  • Build a new feature, but only turn it on for a couple of “early access” customers who are willing to deal with bugs
  • Ship a perf improvement that should be logically bulletproof (and invisible to the end user), but do it safely: roll it out flagged off, then do progressive delivery starting with users/customers/segments that are low risk if something’s fucked up
  • Do something timezone-related in a batch process, and test it out on New Zealand (small audience, timezone far away from your engineers in PST) first
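
Here is a rough sketch of the percentage-ramp idea in Go. The flag name, the user IDs, and the hardcoded 10% are all invented for the example; the one real trick is hashing the user ID so each user gets a stable answer while you turn the dial up.

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // inRollout deterministically buckets users 0-99 by hashing the flag name
    // plus the user ID, so a given user keeps seeing the same variant as the
    // rollout ramps from 1% to 10% to 100%.
    func inRollout(flagName, userID string, percent uint32) bool {
        h := fnv.New32a()
        h.Write([]byte(flagName + ":" + userID))
        return h.Sum32()%100 < percent
    }

    func main() {
        const rolloutPercent = 10 // would normally come from your flag service
        for _, user := range []string{"alice", "bob", "carol", "dave"} {
            fmt.Printf("%s sees new code: %v\n", user,
                inRollout("use-new-cache", user, rolloutPercent))
        }
    }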

Allowing beta testing, early adoption, etc. is a terrific way to prove out concepts, involve development partners, and have some customers feel special and extra engaged. And feature flags are a veritable Swiss Army Knife for practicing progressive delivery.

It becomes a downright superpower when combined with an observability tool (a real one that supports high cardinality, etc.), because you can:

  • Break down and group by flag name plus build id, user id, app id, etc.
  • Compare performance, behavior, or return code between identical requests with different flags enabled
  • For example: “requests to /export with the flag ‘USE_CACHING’ enabled are 3x slower than requests to /export without that flag, and 10% of them now return a 402”

It’s hard to emphasize enough just how powerful it is when you have the ability to break down by build ID and feature flag value and see exactly what the difference is between requests where a given flag is enabled vs. requests where it is not.
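
What that looks like at the instrumentation level is pretty mundane: stamp the build ID and every flag the request evaluated onto the request’s event. This is a hand-rolled sketch; the field names and the JSON-to-stdout “exporter” are stand-ins, not any particular vendor’s API.

    package main

    import (
        "encoding/json"
        "os"
    )

    func main() {
        // One wide event per request. Because build_id and the flag values
        // ride along on every event, "group by flag.use_caching and compare
        // latency" becomes a query, not a new instrumentation project.
        event := map[string]any{
            "endpoint":          "/export",
            "duration_ms":       212,
            "status_code":       402,
            "user_id":           "user-1234",
            "build_id":          os.Getenv("BUILD_ID"), // stamped in at build/deploy time
            "flag.use_caching":  true,
            "flag.new_exporter": false,
        }
        // Stand-in for sending the event to your observability tool.
        json.NewEncoder(os.Stdout).Encode(event)
    }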

It’s very challenging to test in production safely without feature flags; with them, the possibilities are endless. Feature flags are a scalpel, where deploys are a chainsaw. The two complement each other, and each has its place.

“But what about long-lived feature branches?”

Long-lived branches are the traditional way that teams develop features, and do so without deploying or releasing code to users. This is a familiar workflow to most developers.

But there is much to be said for continuously deploying code to production, even if you aren’t exposing new surface area to the world. There are lots of subterranean dependencies and interactions that you can test and validate all along.

There’s also something very different, psychologically, about working in long-lived branches versus deploying continuously. As one of our engineering directors, Jess Mink, says:

There’s something very different, stress and motivation-wise. It’s either, ‘my code is in a branch, or staging env. We’re releasing, I really hope it works, I’ll be up and watching the graphs and ready to respond,’ or ‘oh look! A development customer started using my code. This is so cool! Now we know what to fix, and oh look at the observability. I’ll fix that latency issue now and by the time we scale it up to everyone it’s a super quiet deploy.’

Which brings me to another related point. I know I just said that you should use feature flags for shipping user-facing stuff, but being able to fix things quickly makes you much more willing to ship smaller user-facing fixes. As our designer, Sarah Voegeli, said:

With frequent deploys, we feel a lot better about shipping user-facing changes via deploy (without necessarily needing a feature flag), because we know we can fix small issues and bugs easily in the next one. We’re much more willing to push something out with a deploy if we know we can fix it an hour or two later if there’s an issue.

Everything gets faster, which instills more confidence, which means everything gets faster. It’s an accelerating feedback loop at the heart of your sociotechnical system.

“Great idea, but this sounds like a huge project. Maybe next year.”

I think some people have the idea that this has to be a huge, heavyweight project that involves signing up for a SaaS, forking over a small fortune, and changing everything about the way they build software. While you can do that—and we’re big fans/users of LaunchDarkly in particular—you don’t have to, and you certainly don’t have to start there.

As Mike Terhar from our customer success team says, “When I build them in my own apps, it’s usually just something in a ‘configuration’ database table. You can make a config that can enable/disable, or set a scope by team, user, region, etc.”
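
A minimal sketch of what Mike is describing, in Go against a plain SQL table. The table name, columns, and scoping scheme are invented for illustration (and the driver import is left out); the point is just that a “feature flag system” can start life as one small table and one query.

    package flags

    import "database/sql"
    // A real program would also import a driver, e.g. _ "github.com/lib/pq".

    // Hypothetical schema:
    //   CREATE TABLE feature_flags (
    //     name    TEXT NOT NULL,
    //     scope   TEXT NOT NULL DEFAULT '*',  -- '*', 'team:payments', 'region:eu', ...
    //     enabled BOOLEAN NOT NULL DEFAULT FALSE,
    //     PRIMARY KEY (name, scope)
    //   );

    type Store struct{ DB *sql.DB }

    // Enabled prefers the most specific matching row, falls back to the global
    // '*' scope, and treats a missing row as "off."
    func (s *Store) Enabled(name, scope string) (bool, error) {
        var enabled bool
        err := s.DB.QueryRow(
            `SELECT enabled FROM feature_flags
              WHERE name = $1 AND scope IN ($2, '*')
              ORDER BY CASE WHEN scope = $2 THEN 0 ELSE 1 END
              LIMIT 1`, name, scope).Scan(&enabled)
        if err == sql.ErrNoRows {
            return false, nil // no row: the flag is off
        }
        return enabled, err
    }

If you outgrow it, swap the query for a call to a vendor’s SDK later; the call sites don’t have to change.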

You don’t have to get super fancy to decouple deploys from releases. You can start small. Eliminate some pain today.

In conclusion

Decoupling your deploys and releases frees your engineering teams to ship small changes continuously, instead of sitting on branches for a dangerous length of time. It empowers other teams to own their own roadmaps and move at their own pace. It is better, faster, safer, and more reliable than trying to use deploys to manage user-facing changes.

If you don’t have feature flags, you should embrace them. Do it today! 🌈

Deploys Are The ✨WRONG✨ Way To Change User Experience

The Future of Ops is Platform Engineering

First published on 2022-09-30 at https://www.honeycomb.io/blog/future-ops-platform-engineering.

Two years ago I wrote a piece in The New Stack about the Future of Ops Careers. Towards the end, I wrote:

The reality is that jack-of-all-trades systems infrastructure jobs are slowly vanishing: the world doesn’t need thousands of people who can expertly tune postfix, SpamAssassin, and ClamAV—the world has Gmail. (…)

Building infrastructure and operational expertise used to be bundled together into a single role. But the industry is now bifurcating along an infrastructure fault line, and the overlap between infrastructure-oriented engineers and operationally-minded engineers is swiftly eroding. Engineers who love this work increasingly have a choice to make. Either you can 1) go deep on infrastructure by joining a company that does infrastructure as a service, or 2) go broad on operability by joining a company to help them do as little infrastructure as possible.

I described the second category as “operations engineering minus the infrastructure,” dedicated to evaluating and assembling a production stack of third-party platform providers, enabling software engineers to self-serve their services and own their own code in production. I said:

  • Your job will be to aggressively minimize the cycles your org devotes to infrastructure by finding effective ways to outsource or minimize infra labor. Your job is to NOT go deep if there is any workable alternative.
  • Your job will be to work cross-functionally with all the other software engineering teams, looking for ways to speed up their time to value and helping them own their own code in production.
  • Your job will be to move past the kludgey old models of “outsourcing” to sophisticated understandings of how and where to leverage abstractions that can radically accelerate development.

That second category I was describing now has a name. We call those teams “platform engineering.”

The fifty-year arc of software careers

In the beginning, there were people who wrote and ran software. At some point, we spun away ops skills from dev skills into two different professions, but that turned out to be a ginormous mistake, so along came DevOps to reunify them. Nowadays, ops as an independent profession is in the process of fading out. Companies are spinning down their ops teams left and right. Engineers who formerly identified as sysadmins or operations have turned into DevOps engineers, and soon there will just be “software people” again. This is the way of things.

Please note that this is NOT the same thing as saying “ops is dead,” or “ops skills are no longer valuable or needed.”[1] Our systems are only getting more complex, more difficult to operate, and simultaneously more critical to life on earth, which means that operational excellence has never been more desperately needed (and if you don’t respect that, 🌈 you deserve to suffer 🌈).

The industry story of the past three to five years has been us trying to figure out how to help software engineers own their own code in production,[2] phasing out dedicated ops teams, and aggressively outsourcing as much infrastructure as possible.

As we should. Developer cycles are the scarcest resource in your company, and you want to spend as many of those as possible on your core product: the crown jewel, the code that makes you a business. Money is cheaper than engineering cycles, and teams that are focused on their core business will always outperform teams whose focus is spread across dozens of non-revenue-generating projects. Let someone else build and run all the dependencies and adjacencies.

Before: some engineers wrote code, and some engineers ran code.

Now: all engineers write code, and all engineers run the code they write.

Platform engineering is what stands between you and darkness

When you start talking about putting software engineers on call for their own code, and generally being more involved in production, some percentage of the time you will hear back a guttural wail of despair: “You can’t expect me to know EVERYTHING about EVERYTHING!”

Quite right; we can’t. Platform engineering teams are part of the answer to this perfectly reasonable complaint. It’s not that you’re being asked to do or understand more in toto, but the distribution of labor and responsibility is shifting:

Before: some engineers wrote code, and some engineers ran code.

Now: all engineers write code, and all engineers run the code they write—but we divide the areas of responsibility by layer or function.

The emergence of a minimum viable self-serve tier

In the earliest days of a company, your first few engineers end up bootstrapping an infrastructure by reading AWS docs or blog posts, or asking a friend for recommendations to get started. They might start by setting up a managed container service, or configuring Terraform, and for a while everybody deploys and owns their own code, just as god intended.

But cognitive limits kick in pretty quickly. The maze of APIs and SDKs and components out there is simply bewildering, even for an experienced ops hand. Before long, it becomes someone’s job to make good decisions, pick a suite of compute and storage options that serve the team’s needs, and write some tooling that pulls everything into a coherent whole—which, at a minimum, lets you:

  1. Run tests and generate new artifacts
  2. Deploy artifacts, version them, and roll back
  3. Instrument, monitor, and debug
  4. Store data somewhere, manage schemas and migrations
  5. Adjust capacity as needed
  6. Define and commit all components (and their relationships) as code

Once these are built, it should be trivial for an engineer to come along and spin up a new service using templates and components from existing services. It should be much simpler and easier to use the blessed paths than anything else, and there should be friction if you go off the beaten path.

Congratulations! You’ve just been platformed 🎉. One of the key principles of any developer platform is that it should be easy to do the right things, and hard to do the wrong things.

The differences between platform engineering and traditional ops

Platform teams are typically staffed by engineers who are comfortable writing software. Not just scripting and automation, but writing tests and doing code reviews. Platform teams also operate much more like product development teams do, with product managers (and occasionally, designers, developer advocates, or UX researchers).

This doesn’t mean that everybody on a platform team has to have originally been a software engineer; in fact, a super common failure condition for platform teams is simply thinking all they need to do is hire software engineers to build developer tools. A strong platform team has an equally deep grounding in operations experience and software development. Individuals who are experts in both areas are fairly rare, but you can pull together a strong, well-rounded team by assembling a mix of SWEs (with some ops experience) and ops or DevOps engineers (with some software experience) and having them learn and grow from each other.

Platform teams are decidedly cloud-native; their work mostly involves platforms built atop the cloud itself—PaaS, IaaS, everything-aaS, serverless, and so forth.

Ops/DevOps teams are oriented around managing infrastructure, often several generations of infrastructure. Their turf is everything from data centers and bare metal up through virtualization, containers, and the cloud (they aren’t so much cloud-native as cloud-enabled). They measure themselves on things like SLOs and the DORA metrics. You know they’re doing a good job if the system is up/available and users are happy.

Platform teams are oriented around providing a good experience for developers to self-serve and self-manage their code. The more swiftly and easily developers can move, the better your platform team. Operational excellence, in the platform model, is actually more the responsibility of the other engineering teams (and/or an adjacent SRE team) than that of the platform team.

Platform teams typically work higher up the stack than operations, DevOps, or SRE teams do, and they involve a great deal less infrastructure. If anything, platform teams are bent on paying other people to run as much shit as possible, preserving their own scarce development cycles for their core product.

Here is a somewhat tongue-in-cheek table of the similarities and differences between the archetypes.

Platform engineers vs. DevOps engineers

% of job spent writing code
  Platform engineer: > 50%
  Ops/DevOps engineer: < 50%

Rest of time spent
  Platform engineer: Gathering product requirements, doing user research, architecture discussions, optimizing internal workflows, researching new tools and developer productivity ideas, reviewing other teams’ diffs for impact, performance tuning, helping other engineers own & scale their code, fixing CI/CD pipelines.
  Ops/DevOps engineer: Fixing cron jobs, automating old setup docs, converting PXE/rsync to Chef/Puppet, converting Chef/Puppet to Terraform, converting VMs to containers, deploying software, debugging broken deploys, writing monitoring checks, doing retros, building out new services, pairing with software engineers to understand and debug their code, investigating weird shit, documentation, etc.

Responsible for
  Platform engineer: Enabling internal teams to self-serve their ability to run and own their code in production. Creating standard, reusable components and processes. Defining golden paths.
  Ops/DevOps engineer: Infrastructure capacity planning, scaling, performance tuning, upgrading. Reliability and resiliency, SLOs and monitoring/alerting. Delivering quality experience to customers.

Builds for
  Platform engineer: Internal developer teams
  Ops/DevOps engineer: Customers

Development style
  Platform engineer: Infrastructure as a product
  Ops/DevOps engineer: Infrastructure as code

Works with product managers
  Platform engineer: Yes
  Ops/DevOps engineer: No

Works with UX researchers or designers
  Platform engineer: Sometimes
  Ops/DevOps engineer: No

Dashboards & graphs
  Platform engineer: Uses APM, observability, tracing. Cares a lot about instrumentation and OpenTelemetry.
  Ops/DevOps engineer: Uses metrics, logs, dashboards; monitoring, alerting, and agent/sidecar/blackbox telemetry.

What ‘coding’ means to them
  Platform engineer: Developing new features & services, writing tests. These are (primarily) software people who do systems.
  Ops/DevOps engineer: Automation, configuration, DSLs, extending and debugging existing code. These are systems people who do software.

Preferred language
  Platform engineer: Go, Rust
  Ops/DevOps engineer: Python, Ruby

Time spent in Linux
  Platform engineer: Hardly any
  Ops/DevOps engineer: A lot

Succeeds when
  Platform engineer: Developers can easily choose good defaults, self-serve their infra, and own their own code in production.
  Ops/DevOps engineer: Infrastructure is scalable, secure, cost-effective, reliable, and customers are happy.

Native terrain
  Platform engineer: Serverless, *aaS, APIs for everything (cloud-native and above).
  Ops/DevOps engineer: Instances, VMs, containers, regions, multi-cloud (everything “below,” but up to and including the cloud).

Databases
  Platform engineer: Uses hosted DBs
  Ops/DevOps engineer: Runs their own, blending automation & DBA expertise

SSH
  Platform engineer: No
  Ops/DevOps engineer: Yes

Shell
  Platform engineer: REPL
  Ops/DevOps engineer: bash/zsh

Mantra
  Platform engineer: “Run Less Software”
  Ops/DevOps engineer: “Cattle, Not Pets”

What about DevOps vs. SRE?

Countless words have been spilled on the difference between DevOps and SRE,[3] which I won’t rehash.

Here’s what I’ll say: DevOps, to me, feels like a relevant concept for companies that have a lot of infrastructure to wrangle. Companies that do in fact have dev teams and ops teams, or dev teams and DevOps teams (🙄), tend to have a lot of operational shit to automate, test, and run. They use config management, virtualization, and containers, often managing several generations worth of technology, possibly even down to data centers and bare metal. DevOps is for companies that have some combination of bare metal, VMs, regions, AZs, multi-cloud, networking devices, self-managed databases, etc.

DevOps is capacious. It contains multitudes. DevOps writes code, and DevOps has a fuckload of code to manage.

It is also on its way to becoming irrelevant. We are swiftly entering a post-DevOps world.

SRE, to me, feels different. I associate SRE with very large companies, where they mostly have software engineers owning their own code in production, but maybe still struggle with it a bit. SREs are often embedded within software engineering teams or product groups, and they focus a lot on, well, reliability, as the name suggests.

This means they do less infrastructure jockeying or automating (although they still do some coding). They typically have a lot to say about instrumentation, monitoring and observability, and cross-functional coordination. They run incident response and do blameless retros, and they tend to be experts at scaling.

If a company has both a DevOps team and SRE, typically I expect to see the SRE team more on the frontlines, involved with incidents, telemetry, etc., and DevOps teams more on the backburner, slinging pipes and plumbing.

Observability engineering as a case study

In the same piece I referenced earlier, I also wrote about the role of observability teams. I said they should largely no longer be running their own monitoring and graphing software in-house. Yet there is still a place for observability teams to exist: they remain a critical link between outsourced solutions and internal developer needs.

That team should write libraries, generate examples, and drive standardization, ushering in consistency, predictability, and usability. They should partner with internal teams to evaluate use cases. They should partner with your vendors as roadmap stakeholders. They might also write glue code and helper modules to connect disparate data sources and create cohesive visualizations. Basically, that team becomes an integration point between your organization and the outsourced work.

I originally wrote this about observability, but it could just as easily be used to describe platform engineering as a whole. This is the role—being the bridge between other vendors and your own core software. It’s a very high-leverage place to sit.

Ops is dead, long live ops

I’ve spent a lot of time thinking about this because we’ve had such a hard time nailing down exactly who the Honeycomb customer is. Sometimes our buyer is an ops team buying it for their SWEs, sometimes it’s SREs in the midst of an outage, sometimes it’s a VP or director of engineering, or an architect, or a CTO, or a “full stack” engineering team, or even a product manager. It is hard to form a snappy answer out of that list.

The first couple of questions every new go-to-market candidate asks us are “who is your buyer?” and “how do we help them?” To which I respond with a five-minute ramble where I list every persona above and each of their pain points. Hardly the concrete answer they would like to receive.

As it goes, sociotechnical trends come and go. A year ago, Christine and I were speculating that platform engineering might be on the verge of consolidating the necessary ingredients that make up our ideal buyer:

  1. Writing and shipping code, and needing to understand their own code
  2. Positioned to help other teams with their instrumentation patterns and tooling
  3. Firmly cloud-native+ and untethered to hardware or traditional infrastructure

To my delight, since that conversation, these trends have only accelerated—and I, for one, welcome our new platform engineering overlords to the observability table. ☺️

If you’d like to learn more about platform engineering, we’ll be running a Twitter space on ✨ October 20th ✨ at 12:00 p.m. PT. Come join us! I’ll be there along with two colleagues and we’ll be answering your questions and shedding more light on the topic.


[1] I do hear people saying that, and it used to make me fucking furious, but now I just smugly remind myself how much self-inflicted suffering they are in for. Disrespecting operational expertise is the shortest path to never again sleeping through the night.

[2] It is rather incredible how rapidly this idea has taken off. When we started talking about putting developers on call for their code in 2016, people got seriously angry with us. Before that, the only Twitter mention I could find of putting devs on call was one by (of course) Adrian Cockcroft, but by 2019-2020 it had stopped being controversial and soon became common wisdom.

[3] I actually wrote one of those myself: “DevOps vs. SRE: Delayed Coverage of the Dumbest War.” LMAO. I think Liz had the final word on this back in … 2017? 2018? … when she said something like “class SRE implements DevOps.” And yes, DevOps is a philosophy or a methodology and not a job title, etc.

The Future of Ops is Platform Engineering

Why On-Call Pain Is A Sociotechnical Problem

Cross-posted from leaddev.com

Most people hate being on call, because most on-call rotations are terrible.

Pager bombs, flappy alerts, false positives going off night and day, sleepless nights… Who can blame them? Small wonder that so many people develop a Pavlovian response to the sound of their Pagerduty ringtone. Alert goes off; adrenaline soars.

Conventional wisdom tells us that being on call means you put your whole life on hold, then spend all week lurching between firefighting and false alarms as you get progressively more sleep-deprived. It sucks, but that’s just what you get when you own your code in production. Right?

Noooooo. Wrong wrongy wrong wrong. Being on call should not be a constant cycle of things breaking down and firefighting, or alerts going off at all hours. This is not ‘normal.’ These are telltale signs of a fragile system and lack of alert discipline.

If on-call is a constant source of pain at your organization, that is a PROBLEM. It’s a five-alarm fire. You should drop what you’re doing and fix it with urgency.

An eternally miserable on-call rotation is a violation of the pact we make to support these systems:

  1. It is engineering’s job to own their code in production.
  2. It is management’s job to make sure it doesn’t suck.

This is a two-way handshake. If management isn’t holding up their end, if they don’t allocate enough time to fix the underlying problems – if they run a feature factory that never stops to refactor or invest in reliability work – then on-call will never get better, and you should leave.

On-call rotations are sociotechnical systems

On-call rotations are a classic example of a sociotechnical problem. A sociotechnical system consists of three elements: in this case that’s your production system, the people who operate it, and the tools they use to enact change on it.

You cannot solve sociotechnical problems with purely people solutions or with purely technical solutions. You need to use both.

The technical problems are usually easier to diagnose. You need to automate failovers, instrument your code, build and test self-repairing code, audit your indexes, etc. The social problems can be trickier to spot, but here’s a tip: they usually manifest as organizational problems.

Some engineers spend their entire career actively avoiding roles where they would have to be on call. Other engineers cling to the safety buffer of ops teams on call for their code, so that only manual escalations reach them.

Responsibility for your code is increasingly non-optional

This is becoming a harder line to hold, as the consensus has shifted decisively towards engineers owning their own code in production. Our systems are becoming exponentially more complex, and feedback loops are tightening. The people best equipped to own software in production are the people who built it. And in order to own it effectively, they need to close the loop by receiving the signal when something breaks.

But the point is not to invite software engineers into the same circle of hell that ops engineers have traditionally inhabited. This isn’t an act of vengeance. The point is that tightening these feedback loops is how we make systems better. Being on call shouldn’t have to destroy your social life or your sleep schedule.

Yes, engineering owns their software. But ensuring that engineering’s time is respected and their rest time valued is on management. It’s management’s job to make sure time is allocated to fixing recurring or known issues – and that they don’t kick the proverbial can down the road to later turn into tech debt. If reliability or productivity is suffering, managers need to reassign engineering cycles away from feature work. Managers’ performance should be evaluated by the four DORA metrics, as well as a fifth: how often is their team alerted outside of working hours?

It’s reasonable to be woken up two to three times a year when on call. But more than that is not okay. It’s management’s responsibility to ensure enough resources are dedicated to maintaining system stability, and they should be held accountable – not the on-call engineers.

Humans doing human things

We all have lives outside of work – families, doctor appointments, dentist visits, and so on. Instead of being surprised when things come up, we can predict the ways people’s lives will conflict with on-call duty and come up with ways to ease the burden.

  • Kids. I would never ask a new parent to be on call. Being woken up by ONE instrument of chaos is all anyone should ever have to cope with at any given time.
  • Sleepy brain. People are never going to be at their best when they are woken up in the middle of the night. We should make sure alert text, documentation, and steps are all clear, simple, and otherwise tuned for 2 a.m. brain fog.
  • Getting sleep. Sometimes people struggle getting back to sleep, or they were up all night dealing with something. Establish that 1) no one is EVER to be on call two nights in a row after a bad night, and 2) they are entitled to sleep in, come in late, leave early – whatever works best to help them catch up on their sleep.
  • Anxiety. I’ve managed people before who had high anxiety about being on call. They were perfectly willing, but it didn’t matter how quiet the pager was – their anxiety knowing it was on made it impossible to sleep. We tried it for a while, and it wasn’t getting better, so we ultimately found other ways for them to pull their weight.

If someone is absolutely unable to participate in on-call rotations, well, it happens. If it’s a temporary situation, you might want to let it go. But if it’s a permanent thing, like in the ‘anxiety’ example above, the team should address this by finding other ways for that person to do their share of maintenance work.

For example, they could be in charge of failed builds or maintain the dev environment. What matters is that 1) the team as a whole feels like it’s a fair distribution of labor, and 2) there are enough people left in the on-call rotation that no one is overly burdened.

Technical stumbling blocks

  • Un-owned code. Everything you support, and every alert that can fire, should have a team that owns it.
  • Shared fate. Conversely, you may have architectural issues that make it impossible to isolate and alert only the owners. If you have ten different on-call rotations for various areas of the code base, but any time the database gets slow all ten of you get paged, this is a bad situation.
  • SLOs. As you scale up, there will come a point where you can no longer alert on individual services or symptoms. They will simply drown you. At this point, you need to migrate your alerts over to Service Level Objectives. SLOs align your engineering pain directly with user pain (there’s a small sketch of this just after the list).
  • Paging too early. Ah, this always sounds like such a great idea. ‘Wouldn’t it be great if we could catch it and alert someone before the users are impacted?’ But it’s not. It’s a recipe for flappy alerts and aggravation. Alert when users are impacted, not before.
  • Two lanes. You need two types of alerts: ‘WAKE ME UP’ and ‘Deal with this later.’ No more, no less. Keep the list of ‘wake me up’ alerts short, crisp, and carefully curated. Put everything that needs to be dealt with ‘soon’ in the second lane, and have your on-call engineer sweep through it at the start and end of each day. If it doesn’t need to be acted on within the next day, it probably shouldn’t be an alert.
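
For the SLO bullet above, here is a back-of-the-envelope sketch of what an SLO-based alert actually computes. The numbers and the 14x threshold are illustrative; this is the “burn rate” style of alerting popularized by the Google SRE workbook, not anything vendor-specific.

    package main

    import "fmt"

    // burnRate answers: how fast are we spending our error budget?
    // 1.0x means "exactly on pace to use the whole budget"; higher is worse.
    func burnRate(badEvents, totalEvents, sloTarget float64) float64 {
        if totalEvents == 0 {
            return 0
        }
        errorRate := badEvents / totalEvents
        errorBudget := 1 - sloTarget // e.g. 0.001 for a 99.9% SLO
        return errorRate / errorBudget
    }

    func main() {
        // Last hour: 120 failed requests out of 90,000, against a 99.9% SLO.
        rate := burnRate(120, 90000, 0.999)
        fmt.Printf("burn rate: %.1fx\n", rate) // ~1.3x: burning, but slowly

        // Rule of thumb: page a human only for fast burns; slow burns go in
        // the "deal with this later" lane.
        if rate > 14 {
            fmt.Println("WAKE ME UP")
        } else {
            fmt.Println("deal with this later")
        }
    }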

On-call problems are often organizational problems

Sometimes people don’t want to be on call, and it’s not due to life events. This is a bit trickier to address, because these cases are usually the result of organizational problems that present themselves as on-call problems. For example:

  • Tribal knowledge, or the ‘bus factor.’ You’re the debugger of last resort because you’ve been responsible for a mission-critical component of the system from the very beginning. The team tried training new people, but you still get called every time something goes wrong, and it’s not clear whether the issue would get fixed if you weren’t available (or how long it would take them without you).
  • Individual ownership vs. team ownership. Software is owned by teams, not by individuals. In an ideal world, this means everyone on the team is capable of debugging and maintaining all the systems they collectively own. In the real world, this means everything is at least understood by more than one engineer.
  • Too little – or too much – coverage. If you only have three or four people in the rotation, that’s too much of your life spent lugging around a laptop. Tossing all 20-30 engineers into a single rotation is also the wrong way to go; engineers won’t be on call often enough to stay familiar with the systems. The ideal on-call rotation has seven to eight people; five is a bare minimum. With eight people, you are on call for a highly sustainable one week out of every two months.
  • Lower the barriers to asking for help, swapping times, covering for each other, etc. When someone asks for help with their on-call shift, thank them for asking. If the on-call shift isn’t that arduous, it’s really no big deal to back someone up for the duration of a movie.
  • Appointing primary/secondary on-call engineers can be really helpful here. Only the primary needs to get alerted and lug their laptop around, but they have a designated point person to tag if they need to run to the grocery store, drive through the boonies, or otherwise go offline for a while.
  • Put managers on call. I’m not generally a fan of putting managers in the rotation, but they really are the ideal backup, especially when it comes to picking up the pager the day after someone has had a rough night. This serves multiple purposes: it helps keep the manager fresh, it exposes them to the reality of what on-call is currently like, and their time doesn’t have to be swapped for someone else’s.

The next time someone doesn’t want to be on call, it may be time to take a closer look at your organization as a whole to see whether the problem really is resource allocation, risk mitigation, or something else.

Making on-call costs tangible

On the topic of paying people more to be on call: there are loads of opinions here – it’s a very fraught topic. I generally come down on the side of ‘no, it’s part of the job,’ just like it is for doctors. With one big exception.

If you’re having a hard time getting upper management to understand the value of spending engineering cycles on the infrastructure and reliability work that needs to be done, instead of just cranking features… by all means, pay people for being on call.

Pay them for every event they have to respond to.

Pay them well.

Pay them so goddamn well the finance team starts squawking about the need to pay down that reliability debt.

If that’s the only way you can make it real for them, well, use the tools you’ve got. Engineers should never have to quietly suffer the pain of flaky software and unhappy users alone. Give management pain too until they take their jobs seriously enough to see that reliability issues get fixed.

Why On-Call Pain Is A Sociotechnical Problem

The Truth About “MEH-TRICS”

First published on 2022-04-13 at https://www.honeycomb.io/blog/truth-about-meh-trics-metrics

A long time ago, in a galaxy far, far away, I said a lot of inflammatory things about metrics.

“Metrics are shit salad.”

“Metrics are simply nerfed dimensions.”

“Metrics suck,” “metrics are legacy,” “metrics and time series aggregates will fucking kneecap you.”

I cannot tell a lie; Twitter will testify that I’ve spent the past six years ragging on metrics. So much so that ever since we launched Honeycomb Metrics last year, our poor solution architects have been encountering skeptics in the field who repeat my quotes back to them and ask, dubiously, whether Honeycomb Metrics are any good or not, and whether we genuinely plan on investing in it or not, given our known anti-metrics sympathies.

That’s a great question. 😊

Metrics aren’t worthless; they’re just limited.

Metrics are a mature technology that’s been around for over 30 years, and they have some real advantages. They’re tiny, fast, and cheap; you can hold a bunch of them in memory as counters, summaries, and gauges. They aggregate well and take up a fixed amount of storage space. The entire monitoring industry is built on top of metrics.

When it comes to questions like “How heavy is the write load on my hard drive?” or “What is the temperature or fan status inside my chassis?” or “What is the traffic rate in and out of this interface on my switch?”, metrics are what you should use. In fact, pretty much any time you want to know the health of a system or component in toto, metrics are the right tool.

Because that’s what metrics do best—report statistics in aggregate, from the perspective of any system or component. They can tell you that your Ruby HTTP worker pool is 70% utilized or that your nginx webserver is returning 502s 1% of the time. What they can’t tell you is what this means for any one of your users, applications, delivery vehicles, and so forth.

Until recently, metrics-based tools or logs were the only game in town. People were trying to sell us metrics tools for observability use cases, and that’s what got my goat so badly. If you simply append “… for observability” to each of my inflammatory statements, then I stand by them completely.

“Metrics are shit salad … for observability.”

Yup, rings true.

You’re never going to make a metrics tool like Prometheus or Datadog into an observability tool. You’re just not. Observability is about unknown-unknowns, while metrics are a tool for known-unknowns.

If you need a refresher on the differences between observability and monitoring, I’ll refer you to pieces like this, this, and this. What I want to talk about here is slightly different. In a post-observability world, what is the true and proper place for metrics tooling?

Metrics and observability have different use cases.

Metrics aren’t completely useless, even if you have a robust observability presence. We still use metrics at Honeycomb to this day for certain workloads—and always will because they’re the right tool for the job.

There are two kinds of workloads, roughly speaking: your code—the code you write, review, ship, debug and maintain on a daily basis. And other people’s code—the code you have to run and use in order to support your code. Some examples of the latter might be: Linux, Docker, MySQL, Amazon RDS, Kafka, AWS Lambda, GCP gateways, memcached, CI/CD pipelines, Kubernetes, etc.

Your code is your crown jewels, the code you need to survive and succeed as a business. It changes constantly—many times per week, if not per day. You are expected to understand its inner workings intimately, and spend lots of time chasing down bugs or understanding and reproducing behavior. You care about the way it performs and interacts with each and every individual user, with changing infrastructure state, and under a variety of different load conditions.

That is why your code demands observability. In order to understand your software, you must first instrument it, in a way that collects lots of rich context and bundles it up around each event end-to-end. Then you need to stream those events into a tool that lets you slice and dice and trace and explore with support for high-cardinality and high-dimensionality data. That’s the only way you’re going to be able to correlate errors, track down outliers, and reflect each user’s experience.

But what about the rest of the software? You can’t instrument Amazon RDS, and only crazy people would instrument, rebuild, and repackage things like Kafka or Docker or nginx. The whole point of third-party software is that you DON’T USE IT until it’s stable enough to be taken more or less for granted. Sure, you roll updates, but usually on the order of months or years—not every day. You don’t need to be intimately familiar with its inner workings because you aren’t changing it every day. Those aren’t your crown jewels.

You do care about their health though, only differently. You care about whether you need to provision more capacity or not. You care about knowing how hard you’re hammering on the underlying hardware or hypervisor. That’s why metrics and monitoring are the right tools to use for third-party code. They don’t let you peer under the hood in the same way, or slice and dice in the same way, but that’s okay. You shouldn’t have to.

With third-party stuff, you don’t care about the code, you care about the health of the service. In aggregate.

(There are some kinds of in-between software, like databases, where event-level information is super useful for debugging things like slow queries and lock percentages, and you can use various black box techniques to approximate observability without instrumentation. But in general this model holds up quite well.)

In a post-observability world, what are metrics for?

I’ve often pointed out that observability is built on top of arbitrarily wide structured data blobs, and that metrics, logs, and traces can be derived from those blobs while the reverse is not true—you can’t take a bunch of metrics and reformulate a rich event.
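
A toy illustration of that asymmetry (the fields and numbers here are made up): given wide events, a metric is just an aggregation you compute on the way out; given only the metric, the questions you actually care about are gone.

    package main

    import "fmt"

    // Event is an arbitrarily wide structured blob: one per request.
    type Event struct {
        Endpoint   string
        UserID     string
        BuildID    string
        DurationMS float64
        StatusCode int
    }

    func main() {
        events := []Event{
            {"/export", "user-1", "build-811", 93, 200},
            {"/export", "user-2", "build-811", 4012, 500},
            {"/export", "user-3", "build-812", 87, 200},
        }

        // Deriving a metric from events is easy: it's just an aggregation.
        var errors int
        for _, e := range events {
            if e.StatusCode >= 500 {
                errors++
            }
        }
        fmt.Printf("error rate: %.1f%%\n", 100*float64(errors)/float64(len(events)))

        // But you cannot go the other way. If all you stored was the counter,
        // "which user, which build, which endpoint?" is unanswerable.
    }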

And yes, people who have observability typically find themselves using metrics and dashboards less and less. They’re simply not as versatile or useful as events that you can slice and dice and manipulate in infinite ways. And you can derive aggregates and trends from the events you have stored.

But metrics will always be useful for understanding third-party software, from the perspective of the service, cluster, or node. They will always be the right tool for the job when it comes to software interfacing with hardware. And they can be super complementary when you are investigating your code using events and instrumentation.

If you’re an engineer writing and shipping code, you’re never not going to want to know if your change caused memory usage to triple, or CPU utilization to skyrocket, or disk usage or network throughput to saturate. That’s why we built Honeycomb Metrics as an overlay, a way to enhance or validate your understanding of the impact your code changes have had on the underlying system.

Metrics are also valuable as a bridge to the past. People have been instrumenting software for metrics for 30 years—they’re never going away completely, and not everything can or should be reinstrumented with events. Lots of people already have robust monitoring systems that slurp in millions of metrics. Nobody wants to have to redo all that work just because they’re moving to a different tool, so people tend to point their metrics firehose at Honeycomb as a way of getting started as they roll observability out into their code.

The Truth About “MEH-TRICS”