Ten Platform Commandments

October 24, 2018July 14, 2023 mipsytipsyengineering, platform engineering, vendors3 Comments

On Monday I gave a talk at DOES18 called “All the World’s a Platform”, where I talked about a bunch of the lessons learned by using and abusing and running and building platforms at scale.

I promised to do a blog post with the takeaways, so here they are.

Platform Commandment #1: Any time you have to think about one particular user, you have failed in some way. It doesn’t scale. Just a few one-offs a day will drag you down and drown your forward momentum.

Corollary: you will always have to do this every day. Solution: turn one-offs into a support problem, not an engineering problem.

Platform Commandment #2: keep your critical path as small and independent as possible. Have explicit tiers of importance. You cannot care about everything equally, sacrifices must be made.

Example: at Parse the core API was tier 1, push was tier 2, website was somewhere down around tier 10. We always knew what to bring up and care about first.

Platform Commandment #3: It is the job of the platform to protect itself at all costs, including at the expense of your app.

Platform Commandment #4: Remember that your platform is a magical black box to your users. You can’t expect them to behave reasonably without feedback loops and a rich mental model. Help them out — esp your super-users. It will save you time if you can help them help themselves.

Platform Commandment #5: Always expose a visible request id, shard id, uuid, trace id, any other relevant diagnostic information in user-visible errors. Up to the point where it reveals too much exploitable information about your service, which is probably much farther than you think. Poorly obfuscated infrastructure decisions are usually less of a threat to your business than befuddled users are.

Platform Commandment #6: Your observability must center your users’ perspective, not your own. The health of the system doesn’t matter. The health of every request, and every high-cardinality grouping of requests — those are what matter.

You must be able to care about and inspect the perf and quality from the perspective of every single application and/or user and their users, as richly as though theirs was the *only* application. In real-time.

Dashboards are practically useless unless you can drill down into them. Top-10 lists are useless — your biggest customers may not be your most important customers.

Solution: Invest in tooling (like Honeycomb) that lets you slice and dice on dimensions of arbitrary cardinality, so you can do things like a) break down by one uuid out of millions, b) break down by endpoint, latency percentile, raw query, data store, etc — to see what the experience actually looks like for that user, not for a high level aggregate like a dashboard.

Platform Commandment #7: Use end-to-end checks to traverse all the key code paths and architecture paths.

You will be tempted to disable them because they seem flappy and flaky and need to be fixed. But this is actually what your users are suffering through every day they use your platform. Don’t disable them, fix them.

Platform Commandment #8: Invest early in every kind of throttle, blacklist, velvet rope, in-flight rewrite, custom url/error responder, content inspection, etc … both partial and total, for every slice of events or users. You will need all these fine-grained controls to keep your platform alive for 99.9% of users while you debug the .1% who are outliers and bad actors.

Platform Commandment #9: And use a multi-threaded language ffs.

Platform Commandment #10: USE YOUR OWN PLATFORM. For work, if possible. Feel the pain that you inflict on others.

Bonus Commandment: all cotenancy isolation guarantees are bullshit**

**from a perf standpoint, not security

Post-mortem: feminist advice meltdown (March 2nd)

March 10, 2018March 10, 2018 mipsytipsymanagement, tech culture15 Comments

Okay! As of today it’s been one week since I wrote some advice and the internet exploded in my face, so now it’s time to do what I always do: post mortem that shit.

This is going to be long. I erred by making my first post too short, so I’m going to ship $(allthedetail) this time. Duly warned.

What happened?

Around 8 am on Friday, March 2nd, after pulling an all-nighter, I decided to pound out a quick blog post that has been on my todo list forever: the only advice I feel equipped to give on how to succeed in tech.

My advice, in brief, was this:

as a junior engineer, tough it out. work hard, learn everything, earn your stripes.
stay technical. don’t get sucked into an offramp unless you are god damn sure you want out for good.
once you are senior, use your power to advocate for others and fuck that shit up.

Money, power, credibility. This is the best way I know how to earn these things. This is what worked for me and most of the senior technical women I know and admire.

First of all: I don’t think there should be anything controversial at all about this advice. It’s good advice, if a bit bluntly put. Pick your battles, show strategic impact, leverage your influence into power and use that power to fuck shit up in the manner of your choosing.

The fact is, we are far too chickenshit about telling young women straight up how to succeed at work. We praise them for all kinds of dumb shit and second shift work and emotional labor that has little if any strategic impact to the bottom line, and wonder why they’re burned out and resentful.

We live in a fallen world. I didn’t make it this way, I just want to help you level up to be a powerful destroyer being so you can make it better.

So I hit “publish”.

Around 9:30 am, Camille Fournier gave me a bunch of unsolicited criticism. Unfortunately, due to some sour personal history with Camille I was extremely not disposed to receive this from her. I can be a resentful little shit: as soon as she told me to change it in certain ways, it was the last fucking thing in the world I was going to do.

For a few hours, all the feedback was good. People liked my advice to stay technical (“god I wish someone had told me that 15 years ago”) and my pointing out the loophole that lets women advocate for each other without being penalized.

A few people nailed what I was trying to say even better than I did:

https://twitter.com/peterbourgon/status/969654547697168385

This has been my biggest struggle with some of the talking points - it sounds as if I’m completely deprived of any agency in how my life pans out. Which I find very disturbing - since it then makes me feel like I have no control over anything. That’s very disempowering.
— Cindy Sridharan (@copyconstruct) March 2, 2018

But by the end of the day I was receiving a steady stream of angry tweets from people I had never heard of, with objections that seemed puzzling and ridiculous to me.

They were acting as though the sum total of my advice had been ordering bullied and abused people to just shut up and tough it out. Soon I was getting tweets accusing me of trashing all diversity work, trashing all women, only being out for myself and my own career, erasing sexual assault, being insensitive and destructive to people of color, and on and on.

People were subtweeting me like crazy, or DM’ing me telling me how much they liked my piece but were afraid to say so in public. Others were harassing my engineering managers and people who follow me.

I have never received textual scrutiny of this type before, where every single word was turned over and macerated and peered at for evidence of traitorous views. It sucks. (And it’s pretty hypocritical, to say the least … some of these same women who were gleefully bashing me for clumsy words remain good friends with men who are actual known harassers and abusers of women.)

Lots of people wanted me to take the post down immediately, or publish a retraction or correction immediately. Some prominent feminists publicly chided me and refused to talk to me until I repented of my sins. 🙄

https://twitter.com/vaurorapub/status/970544808086331392

Let’s be clear. I have no problem admitting my errors and making amends. I do it all the fucking time. But I am disinclined to grovel before a howling mob. It wasn’t even clear to me what I had done wrong, given all the contradictory noises.

So I decided to wait a week before responding, so I could talk to people and figure out what to take away from the mess.

(Also last week: traveled to multiple continents, flew a few dozen hours, wrote multiple talks, delivered presentations at various conferences and meetups, visited and pitched to potential customers, managed a handful of teams, fit 1x1s in between hops and time zones and you know just tried to do my fucking job while dealing with crazed nuts screaming abuse at me online.)

I had a couple of hard but helpful conversations with people like Alice Goldfuss and Courtney Nash, who took the time to walk me through ways that what I wrote may be misinterpreted or wrongly received. This feedback can mostly be bucketed into the following categories:

“Assume the reader knows nothing about you and considers you hostile until proven otherwise.” Well shit, I am not used to writing defensively. I live my life in high trust, high transparency environments and prefer it that way.
“Your advice doesn’t apply to $x.” True! I didn’t bracket it in layers of padding — “this is just what worked for me”, “may not apply to every situation” — because I thought that was freaking obvious.
“It sounds like you are shit talking all diversity efforts.” No, but I was waving vaguely in the direction of some very cynical and tired feelings on the subject. I’m pretty over corporate diversity issues and pinkwashing that doesn’t expand opportunity or share power.
“It sounds like you are shitting on all women.” Oof. This is the one that is really painful, because this is the one I have been working hard on for close to 20 years… and should have seen coming. I did intend to put some space between myself and women in tech, because I don’t exactly identify as a woman.. exactly. I grew up fundamentalist and misogynist af, and have been working hard to recover from that ever since I left home at 15.
“Maybe you shouldn’t give advice to women at all.” Courtney challenged me on whether I should speak to women, given my ambivalence wrt my own gender identification. Which is an interesting question that I have pondered a lot.

This was all desperately inevitable and predictable, however, and I made some unforced errors. So let’s talk about what I do and don’t regret about all this, and what I would or would not do differently.

Regrets/No regrets

NO REGRETS: giving the advice. It’s good advice, it needed to be said. I’m tired of seeing women burn themselves out on shitty corporate diversity work that only diverts their energy from amassing real power and strategic impact. Not sorry.

REGRETS: I was sloppy about waving in the direction of my gender issues. I intended to put some space between myself and “women’s issues”, because I don’t exactly identify as a woman, exactly, and I have always felt uncomfortable in women’s spaces. Given the historic devaluation of women’s spaces and issues, I should have been clearer. I am sorry.

NO REGRETS: I think it’s fine for me to give advice to women if they ask, which they do. After all I was raised as a woman, have always been read and treated as one, and assumed there was no other option for 30+ years. I get to speak. Not sorry.

SEMI-REGRETS: I still can’t figure how anyone managed to project into my piece that I was slamming all diversity work. I said somewhat colorfully that a lot of the advice didn’t work for me and wasn’t my favorite thing to dwell on, in the same grouchy grumbly tone that I use when bitching about query planners and terraform variable interpolation. I don’t think this would have been a big deal if the frenzy hadn’t gotten whipped up, but if anyone genuinely felt hurt or dismissed by it, I can be sorry for that.

REGRETS: the impact it had on my poor engineering managers and other people who work with me. They are still being asked to denounce me or defend me and their decision to work with me. So deeply not ok. I am sorry — but that’s really on you, internet assholes.

BIGGEST REGRETS: any accidental cover given to misogynists. By far the most annoying thing about the brouhaha has been when men with toxic views compliment me because they think I’m agreeing with them. I am NOT, so get off me. Sorry not sorry.

SADDEST REGRETS: my plummeting opinion of the feminist internet trash mob. I am a feminist and damn proud of it, but I am also disgusted by the hyperperformative boundary policing of certain self-proclaimed “tech feminists”. If your great joy in life is roving the interwebs looking for any toes pressing a line so you can rapturously castigate them and shun them until they have licked your boots and begged for forgiveness … if you love performing elaborate outrage rituals and whipping up a frenzy of whispers or a witch hunt… then:

Fuck. The Fuck. Off. You are an embarrassment. This is about your ego, and your manufactured grievance machines are Not Helping.

I honestly thought these feminist pile-on mobs were a right-wing fantasy, and I’m sad that I was wrong. I’m also pretty sad about all the folks who know me and have every reason to know better. In my world you check in with your friends before leaping to judgment, and you help teach each other when you’re being stupid. A pretty dismal number of people I would have called friends just leapt excitedly into the fray passing judgment.

So now I know more about who my friends are.

In conclusion

Why even stick my neck out? I guessed something might go wrong, I just didn’t know what. So why?

Because I want to help, dammit. The farther I get in my career the more time I spend pondering how to bring others along with me, how to open the gates a little wider.

I’ve gotten to do a few things. I have tried to create an equitable, respectful working environment where everyone can do their best work, with managers who are passionate about diversity and strong where I am weak.

But … I have felt very often alienated by the messaging and attempts to help women. I can’t be the only one who responds more to a strategic message than an empathetic one, who feels condescended to and patronized by the mainstream corporate efforts.

I can’t be the only one who feels simmering resentment every time I get held up as a successful “woman in tech” (the world’s worst participation trophy). I don’t want a fucking consolation prize. I want to sweep the competition, I want to change the world. I can’t be the only one who hungers for power, money and credibility.

I know I’m not, actually. I know because they are telling me. The response has been at least 100-1 positive in private — from junior women especially — thanking me for being brutally honest and treating them like adults, like equals. (I’ve been told there are armies of women who feel dreadfully hurt but too afraid to say so. Pity if true, as they say.)

There has always been tension between the people who see the world as it is and fight to succeed in it, and the people who opt out and refuse to participate because it’s compromised. The world needs us both. So shut the fuck up and let the kids pick for themselves.

And maybe stop persecuting the people who stand with you.

charity.

P.S. check out Jen Andre’s eloquent restatement of it all. it’s so great.

Charity todo items

If I ever again write anything about women or diversity, have someone I trust proof before publishing
Remember how much of my audience doesn’t know shit about me, and won’t or can’t assume the best of my intentions
Wrap statements in exception handlers about this being my experience blah blah
Try not to let people get under my skin and spark a personal reaction
Derive somewhat less pleasure from smacking down assholes on the internet, even when they deserve it. [ASPIRATIONAL] [WONTFIX]

Money, power and credibility

March 2, 2018June 10, 2020 mipsytipsymanagement, tech culture6 Comments

I don’t really do “women stuff” (awkward umbrella term for gender-segregated events and spaces). I don’t really identify with any gender and I find a lot of the advice to be condescending and overly delicate, and it’s just a really boring thing to think and talk about. For me.

But I’m feeling guilty after turning down a bunch of requests to do shit for International Women’s Day next week. So I’m gonna do a thing I’ve been avoiding doing for years, and write down my (deeply problematic but practical) advice.

Toughen up. For your first 10 years or 3 jobs in the industry, you’re a junior contributor. You need them way more than they need you, so suck it up. Try not to dwell on the bullshit. Work hard and level up and always angle for more money and power when you can.
Stay technical. There are a thousand paved ramps out of engineering roles and only a few hard paths back in. Technical excellence is currency in this industry, even more so if your credibility is gonna get challenged again and again. So don’t stop engineering til you’re great at it.
Use your power for good. Once you become a senior contributor — and i’m not talking bullshit titles but real seniority, when people are coming to you for help far more than you go to them — then you can afford to get sensitive. …On behalf of others. The research convincingly shows that women get punished for advocating for themselves, but not for advocating for others. It’s a sweet loophole, use it.

If you feel like table flipping out of tech, just remember the rest of the world is at LEAST as sexist as tech is, but without the money and power and ridiculous life-coddling. Where exactly do you think you’re going to go?

Don’t quit tech: quit your job. There are LOTS of tolerable-to-great companies out there. If you stay and suffer, you’re just rewarding the shitholes with your presence. Don’t reward the shitholes any more than you can help it.

Learn shit, save your money, amass great power. Then use it to fuck shit up.

DevOps vs SRE: delayed coverage of the dumbest war

June 30, 2016July 1, 2016 mipsytipsydevops, operations, sre, tech culture, velocity18 Comments

Last week was the West Coast Velocity conference. I had a terrific time — I think it’s the
best Velocity I’ve been to yet. I also slipped in quite late, the evening before last, to catch Gareth’s session on DevOps vs SRE.

I had to catch it, because Gareth Rushgrove (of DevOps Weekly glory) was taunting @lusis and me about it on the Internet.

my plan for this @velocityconf talk is increasingly to make @mipsytipsy and @lusis blush
— Gareth Rushgrove (@garethr) June 16, 2016

And it was worth it! Holy crap, this was such a fun barnburner of a talk, with Gareth schizophrenically arguing both for and against the key premise of the talk, which was about “Google Infrastructure for Everyone Else (GIFEE)” and whether SRE is a) the highest, noblest goal that we should all aspire towards, or b) mostly irrelevant to anyone outside the Google confines.

Which Gareth won? Check out the slides and judge for yourself. 🙃

At some point in his talk, though, Gareth tossed out something like “Charity probably already has a blog post on this drafted up somewhere.” And I suddenly remembered “Fuck! I DO!” it’s been sitting in my Drafts for months god dammit.

So this is actually a thing I dashed off back in April, after CraftConf. Somebody asked me for my opinion on the internet — always a dangerous proposition — and I went off on a bit of a rant about the differences and similarities between DevOps and SRE, as philosophies and practices.

Time passed and I forgot about it, and then decided it was too stale. I mean who really wants to read a rehash of someone’s tweetstorm from two months ago?

Well Gareth, apparently.

Anyway: enjoy.

SRE vs DevOps: TWO PHILOSOPHIES ENTER, BOTH ARE PHENOMENALLY SUCCESSFUL AND MUTUALLY DUBIOUS OF ONE ANOTHER

So in case you haven’t noticed, Google recently published a book about Site Reliability Engineering: How Google Runs Production Systems. It contains some really terrific wisdom on how to scale both systems and orgs. It contains chapters written by dear friends of mine. It’s a great book, and you should buy it and read it!

It also has some really fucking obnoxious blurbs. Things like about how “ONLY GOOGLE COULD HAVE DONE THIS”, and an whiff of snobbery throughout the book as though they actually believe this (which is far worse if true).

You can’t really blame the poor blurb’ers, but you can certainly look askance at a massive systems engineering org when it seems as though they’ve never heard of DevOps, or considered how it relates to SRE practices, and may even be completely unaware of what the rest of the industry has been up to for the past 10-plus years. It’s just a little weird.

So here, for the record, is what I said about it.

1) a lot of the philosophical volleying between devops / SRE comes down to a failure to recognize the overwhelming power of context.
— Charity Majors (@mipsytipsy) May 1, 2016

Google is a great company with lots of terrific engineers, but you can only say they are THE

BEST at what they do if you’re defining what they do tautologically, i.e. “they are the best at making Google run.” Etsyans are THE BEST at running Etsy, Chefs are THE BEST at building Chef, because … that’s what they do with their lives.

Context is everything here. People who are THE BEST at Googling often flail and flame out in early startups, and vice versa. People who are THE BEST at early-stage startup engineering are rarely as happy or impactful at large, lumbering, more bureaucratic companies like Google. People who can operate equally well and be equally happy at startups and behemoths are fairly rare.

And large companies tend to get snobby and forget this. They stop hiring for unique strengths and start hiring for lack of weaknesses or “Excellence in Whiteboard Coding Techniques,” and congratulate themselves alot about being The Best. This becomes harmful when it translates into to less innovation, abysmal diversity numbers, and a slow but inexorable drift into dinosaurdom.

2) operations engineering is a specialized skill set *at large scale* or *on hard ops problems*. many -- most? companies don't have those.
— Charity Majors (@mipsytipsy) May 1, 2016

Everybody thinks their problems are hard, but to a seasoned engineer, most startup problems are not technically all that hard. They’re tedious, and they are infinite, but anyone can figure this shit out. The hard stuff is the rest of it: feverish pace, the need to reevaluate and reprioritize and reorient constantly, the total responsibility, the terror and uncertainty of trying to find product/market fit and perform ten jobs at once and personally deliver to your promises to your customers.

At a large company, most of the hardest problems are bureaucratic. You have to come to terms with being a very tiny cog in a very large wheel where the org has a huge vested interest in literally making you as replicable and replaceable as possible. The pace is excruciatingly slow if you’re used to a startup. The autonony is … well, did I mention the politics? If you want autonomy, you have to master the politics.

3) the outcomes associated with operations (reliability, scalability, operability) are the responsibility of *everyone* from support to CEO.
— Charity Majors (@mipsytipsy) May 1, 2016

Everyone. Operational excellence is everyone’s job. Dude, if you have a candidate come in and they’re a jerk to your office manager or your cleaning person, don’t fucking hire that person because having jerks on your team is an operational risk (not to mention, you know, like moral issues and stuff).

But the more engineering-focused your role is, the more direct your impact will be on operational outcomes.

4) therefore, the more literate you are with operational skills, the more effective and powerful you can be -- esp as a software engineer.
— Charity Majors (@mipsytipsy) May 1, 2016

As a software engineer, developing strong ops chops makes you powerful. It makes you better at debugging and instrumentation, building resiliency and observability into your own systems and interdependent systems, and building systems that other people can come along and understand and maintain long after you’re gone.

As an operations engineer, those skills are already your bread and butter. You can increase your power in other ways, like by leveling up at software engineering skills like test coverage and automation, or DBA stuff like query optimization and storage engine internals, or by helping the other teams around you level up on their skills (communication and persuasion are chronically underrecognized as core operations engineering skills).

5) specialization is not a bad thing. specialization is how we scale and do capitalism! the problem is when this becomes compartmentalizing.
— Charity Majors (@mipsytipsy) May 1, 2016

This doesn’t mean that everyone can or should be able to do everything. (I can’t even SAY the words “full stack engineer” without rolling my eyes.) Generalists are awesome! But past a certain inflection point, specialization is the only way an org can scale.

It’s the only way you make room for those engineering archetypes who only want to dive deep, or who really really love refactoring, or who will save the world then disappear for weeks. Those engineers can be incredibly valuable as part of a team … but they are most valuable in a large org where you have enough generalists to keep the oars rowing along in the meantime.

6) so: Google SRE has an incredibly powerful set of best practices, that enable them to run the largest site in the world incredibly well.
— Charity Majors (@mipsytipsy) May 1, 2016

So, back to Google. They’ve done, ahem, rather well for themselves. Made shitbuckets of money, pushed the boundaries of tech, service hardly ever goes down. They have operational demands that most of us never have seen and never will, and their engineers are definitely to be applauded for doing a lot of hard technical and cultural labor to get there.

So why did this SRE book ruffle a few feathers?

7) we can learn a lot of things from google SRE. but if you just blindly apply them to your org you will fuck your org up, because context.
— Charity Majors (@mipsytipsy) May 1, 2016

Mostly because it comes off a little tone deaf in places. I’m not personally pissed off by
the google SRE book, actually, just a little bemused at how legitimately unaware they seem to be about … anything else that the industry has been doing over the past 10 years, in terms of cultural transformation, turning sysadmins into better engineers, sharing on-call rotations, developing processes around empathy and cross-functionality, engineering best practices, etc.

If you try and just apply Google SRE principles to your own org according to their prescriptive model, you’re gonna be in for a really, really bad time.

However, it happens that Jen Davis and Katherine Daniels just published a book called Effective DevOps, which covers a lot of the same ground with a much more varied and inclusive approach. And one of the things they return to over and over again is the power of context, and how one-size-fits-all solutions simply don’t exist, just like unisex OSFA t-shirts are a dirty fucking lie.

Google insularity is … a thing. On the one hand it’s great that they’re opening up a bit! On the other hand it’s a little bit like when somebody barges onto a mailing list and starts spouting without skimming any of the archives. And don’t even get me started on what happens when you hire long, longterm ex-Googlers back into to the real world.

So, so many of us have had this experience of hiring ex-Googlers who automatically assume that the way Google does a thing is CORRECT, not just contextually appropriate. Not just right for Google, but right for everyone, always. Which is just obviously untrue. But the reassimilation process can be quite long and exhausting when the Kool-Aid is so strong.

8) DevOps as a philosophy is much more sensitive to context than SRE philosophy, because it grew from a broader collaborative base.
— Charity Majors (@mipsytipsy) May 1, 2016

Because yeah, this is a conversation and a transformation that the industry has been having for a long time now. Compared with the SRE manifesto, the DevOps philosophy is much more crowd-sourced, more flexible, and adaptable to organizations of all stages of developments, with all different requirements and key business differentiators, because it’s benefited from loud, mouthy contributors who aren’t all working in the same bubble all along.

And it’s like Google isn’t even aware this was happening, which is weird.

9) that's it, basically all i'm saying is "all blanket statements are false" including probably this one 🙂 #devops #sre
— Charity Majors (@mipsytipsy) May 1, 2016

Orrrrrr, maybe I’m just a wee bit annoyed that I’ve been drawn into this position of having to defend “DevOps”, after many excellent years spent being grumpy about the word and the 10000010101 ways it is used and abused.

(Tell me again about your “DevOps Engineering Team”, I dare you.)

(^^ thanks to @kellan and others who particularly influenced/clarified my thinking around #8, the crowdsourcing of devops)
— Charity Majors (@mipsytipsy) May 1, 2016

P.S. I highly encourage you to go read the epic hours-long rant by @matthiasr that kicked off the whole thing. some of which I definitely endorse and some of which not, but I think we could go drink whiskey and yell about this for a week or two easy breezy <3

https://twitter.com/matthiasr/status/726703064799961088

Anyway what the fuck do I know, I’ve never worked in the Google lair, so maybe I am just under-equipped to grasp the true glory, majesty and superiority of their achievements over us all.

Or maybe they should go read Katherine and Jen’s book and interact with the “UnGoogled” once in a while. ☺️

Operational Best Practices #serverless

May 31, 2016June 1, 2016 mipsytipsyaws, devops, operations, serverless29 Comments

This post is part two of my recap of last week’s terrific Serverless conference. If you feel like getting bitchy with me about what serverless means or #NoOps or whatever, please refer back to the prequel post, where I talked about operations engineering in the modern world.

*Then* you can get bitchy with me. (xoxoxxooxo)

The title of my talk was:

the tooth fairy is a useful abstraction for the pain of life and rewards of having people who care about you ^_^

— Charity Majors (@mipsytipsy) May 23, 2016

The theme of my talk was basically: what should software engineers know and care about when it comes to operations in a world where we are outsourcing more and more core functionality?

If you care about running a quality service or product, or providing your customers with a reasonable level of service, you have to care about operational concerns like design, resiliency, instrumentation and debuggability. No matter how many abstractions there are between you and the bare metal.

If you chose a provider, you do not get to just point your finger at them in the post mortem and say it’s their fault. You chose them, it’s on you. It’s tacky to blame the software or the service, and besides your customers don’t give a shit whose “fault” it is.

So given an infinite number of things to care about, where do you start?

What is your mission, and what are your differentiators?

The first question must always be: what is your mission? Your mission is not writing software. Your mission is delivering whatever it is your customers are paying you for, and you use software to get there. (Code is kind of a liability so you should write as little of it as necessary. hey!! sounds like a good argument for #serverless!)

Second: what are your core differentiators? What are the things that you are doing that are unique, and difficult to replicate, or the things where you have to actually be world class experts in those things?

Those are the things that you will have the hardest time outsourcing, or that you should think about very carefully before outsourcing.

Facts

You can outsource labor, but you can’t outsource caring. And nobody but you is in the position to think about your core differentiators and your product in a holistic way.

If you’re a typical early startup, you’re probably using somewhere between 5 and 20 SaaS products to get rid of some of the crap work and offload it to dedicated teams who can do it better than you can, much more cheaply, so you are freed up to work on your core value proposition.

GOOD.

But you still have to think about things like reliability, your security model, your persistent storage models, your query performance, how all these lovely services talk to each other, how you’re going to debug them, how you’re going to repro when things go wrong, etc. You still own these things, even if you don’t run them.

For example, take AWS Lambda. It’s a pretty great service on many dimensions. It’s an early version of the future. It is also INCREDIBLY irritating and challenging to debug in a practically infinite number of insanity-inducing ways.

** Important side note — I’m talking about actual production systems. Parse, Heroku, Lambda, etc are GREAT for prototyping and can take you a long, long way. Early stage startups SHOULD optimize for agility and rapid developer iteration, not reliability. Thx to @joeemison for reminding me that i left that out of the recap.

Focus on the critical path

Your users don’t care if your internal jenkins builds are broken. They don’t care about a whole lot of things that you have to care about … eventually. They do care a lot if your product isn’t actually functional. Which means you have to think through the behavioral and failure characteristics of the providers you’re relying on in any user visible fashion.

Ask lots of questions if you can. (AWS often won’t tell you much, but smaller providers will.) Find out as much as you can about their cotenancy model (shared hardware or isolation?), their typical performance variance (run your own tests, don’t trust their claims), and the underlying storage systems.

Think about how you can bake in resiliency from the user’s perspective, that doesn’t rely on provider guarantees. If you’re on mobile, can you provide a reasonable offline experience? Like Parse did a lot of magic here in the APIs, where it would back off and retry saves if there were any errors.

Can you fail over to another provider if one is down? Is it even worth it at your company’s stage of maturity and engineering resources to invest in this?

How willing are you to be locked into a vendor or provider, and what is the story if you find yourself forced to switch? Or if that service goes away, as so many, many, many of them have done and will do. (RIP, parse.com.)

Tradeoffs

Listen, outsourcing is awesome. I do it as much as I can. I’m literally helping build a service that provides outsourced metrics, I believe in this version of the future! It’s basically the latest iteration of capitalism in a nutshell: increased complexity –> increased specialization –> you pay other people to do the job better than you –> everybody wins.

But there are tradeoffs, so let’s be real.

The service, if it is smart, will put strong constraints on how you are able to use it, so they are more likely to deliver on their reliability goals. When users have flexibility and options it creates chaos and unreliability. If the platform has to choose between your happiness vs thousands of other customers’ happiness, they will choose the many over the one every time — as they should.

Limits may mysteriously change or be invented as they are discovered, esp with fledgling services. You may be desperate for a particular feature, but you can’t build it. (This is why I went for Kafka over Kinesis.)

You need to think way more carefully and more deeply about visibility and introspection up front than you would if you were running your own services, because you have no ability to log in and use strace or gdb or tail a logfile or run any system profiling commands when things go dark.

In the best case, you’re giving up some control and quality in exchange for experts doing the work better than you could for cheaper (e.g. i’m never running a fucking physical data center again, jesus. EC24lyfe). In a common worse case, it’s less reliable than what you would build AND it’s also opaque AND you can’t tell if it’s down for you or for everyone because frankly it’s just massively harder to build a service that works for thousands/millions of use cases than for any one of them individually.

Stateful services

Ohhhh and let’s just briefly talk about state.

The serverless utopia mostly ignores the problems of stateful services. If pressed they will usually say DynamoDB, or Firebase, or RDS or Aurora or something.

Real question how does state get persisted with #serverless?

I understand scale out of stateless servers, but who stores the state?

— Caitie McCaffrey (@caitie) May 26, 2016

This is a big, huge, deep, wide lake of crap to wade in to so all I’m going to say is that there is no such thing as having the luxury of not having to understand how your storage systems work. Queries will get slow, and you’ll need to be able to figure out why and fix them. You’ll hit scaling cliffs where suddenly a perfectly-usable app just starts timing everything out because of that extra second of latency coming from …

¯\_(ツ)_/¯

The hardware underlying your instance will degrade (there’s a server somewhere under all those abstractions, don’t forget). The provider will have mysterious failures. They will be better than you, probably, but less inclined to give you satisfactory progress updates because there are hundreds or thousands or millions of you all clamoring.

The more you understand about your storage system (and the more you stay in the lane of how it was intended to be used), the happier you’ll be.

In conclusion

These trends are both inevitable and, for the most part, very good news for everyone.

Operations engineering is becoming a more fascinating and specialized skill set. The best engineers are flocking to solve category problems — instead of building the same system at company after company, they are building SaaS solutions to solve it for the internet at large. Just look at the massive explosion in operational software offerings over the past 5-6 years.

This means that the era of the in-house dedicated ops team, which serves as an absorbent buffer for all the pain of software development, is mostly on its way out the door. (And good riddance.)

People are waking up to the fact that software quality improves when feedback loops are tighter for software engineers, which means being on call and owning services end to end. The center of gravity is shifting towards engineering teams owning the services they built.

This is awesome! You get to rent engineers from Google, AWS, Pagerduty, Pingdom, Heroku, etc for much cheaper than if you hired them in-house — if you could even get them, which you probably can’t because talent is scarce.

But the flip side of this is that application engineers need to get better at thinking in traditionally operations-oriented ways about reliability, architecture, instrumentation, visibility, security, and storage. Figure out what your core differentiators are, and own the shit out of those.

Nobody but you can care about your mission as much as you can. Own it, do it. Have fun.

WTF is operations? #serverless

May 31, 2016May 31, 2016 mipsytipsyaws, devops, operations, serverless, tech culture40 Comments

I just got back from the very first ever @serverlessconf in NYC. I have a soft spot for well-curated single-track conferences, and the organizers did an incredible job. Major kudos to @iamstan and team for pulling together such a high-caliber mix of attendees as well as presenters.

I’m really honored that they asked me to speak. And I had a lot of fun delivering my talk! But in all honesty, I turned it down a few times — and then agreed, and then backed out, and then agreed again at the very last moment. I just had this feeling like the attendees weren’t going to want to hear what I was gonna say, or like we weren’t gonna be speaking the same language.

Which … turned out to be mmmmostly untrue. To the organizers’ credit, when I expressed this concern to them, they vigorously argued that they wanted me to talk *because* they wanted a heavy dose of real talk in the mix along with all the airy fairy tales of magic and success.

So #serverless is the new cloud or whatever

Hi, I’m grouchy and I work with operations and data and backend stuff. I spent 3.5 years helping Parse grow from a handful of apps to over a million. Literally building serverless before it was cool TYVM.

So when I see kids saying “the future is serverless!” and “#NoOps!” I’m like okay, that’s cute. I’ve lived the other side of this fairytale. I’ve seen what happens when application developers think they don’t have to care about the skills associated with operations engineering. When they forget that no matter how pretty the abstractions are, you’re still dealing with dusty old concepts like “persistent state” and “queries” and “unavailability” and so forth, or when they literally just think they can throw money at a service to make it go faster because that’s totally how services work.

I’m going to split this up into two posts. I’ll write up a recap of my talk in a sec, but first let’s get some things straight. Like words. Like operations.

What is operations?

Let’s talk about what “operations” actually means, in the year 2016, assuming a reasonably high-functioning engineering environment.

At a macro level, operational excellence is not a role, it’s an emergent property. It is how you get shit done.

Operations is the sum of all of the skills, knowledge and values that your company has built up around the practice of shipping and maintaining quality systems and software. It’s your implicit values as well as your explicit values, habits, tribal knowledge, reward systems. Everybody from tech support to product people to CEO participates in your operational outcomes, even though some roles are obviously more specialized than others.

Saying you have an ops team who is solely responsible for reliability is about as silly as saying that “HR defines and owns our company culture!” No. Managers and HR professionals may have particular skills and responsibilities, but culture is an emergent property and everyone contributes (and it only takes a couple bad actors to spoil the bushel).

Thinking about operational quality in terms of “a thing some other team is responsible for” is just generally not associated with great outcomes. It leads to software engineers who are less proficient or connected to their outcomes, ops teams who get burned out, and an overall lower quality of software and services that get shipped to customers.

These are the specialized skill sets that I associate with really good operations engineers. Do these look optional to you?

It depends on your mission, but usually these are not particularly optional. If you have customers, you need to care about these things. Whether you have a dedicated ops team or not. And you need to care about the tax it imposes on your humans too, especially when it comes to the cognitive overhead of complex systems.

So this is my definition of operations. It doesn’t have to be your definition. But I think it is a valuable framework for helping us reason about shipping quality software and healthy teams. Especially given the often invisible nature of operations labor when it’s done really well. It’s so much easier to notice and reward shipping shiny features than “something didn’t break”.

The inglorious past

Don’t get me wrong — I understand why “operations” has fallen out of favor in a lot of crowds. I get why Google came up with “SRE” to draw a line between what they needed and what the average “sysadmin” was doing 10+ years ago.

Ops culture has a number of well-known and well-documented pathologies: hero/martyr complexes, risk aversion, burnout, etc. I understand why this is offputting and we need to fix it.

Also, historically speaking, ops has attracted a greater proportion of nontraditional oddballs who just love debugging and building things — fewer Stanford CS PhDs, more tinkerers and liberal arts majors and college dropouts (hi). And so they got paid less, and had less engineering discipline, and burned themselves out doing too much ad hoc labor.

But — this is no longer our overwhelming reality, and it is certainly not the reality we are hurtling towards. Thanks to the SRE movement, and the parallel and even more powerful & diverse open source DevOps movement, operations engineers are … engineers. Who specialize in infrastructure. And there’s more value than ever in empathy and fluid skill sets, in engineers who are capable of moving between disciplines and translating between specialties. This is where the “full-stack developer” buzzword comes from. It’s annoying, but reflects a real craving for generalist skill sets.

The BOFH stereotype is dead. Some of the most creative cultural and technical changes in the technical landscape are being driven by the teams most identified with operations and developer tooling. The best software engineers I know are the ones who consistently value the impact and lifecycle of the code they ship, and value deployment and instrumentation and observability. In other words, they rock at ops stuff.

The Glorious Future

And so I think it’s time to bring back “operations” as a term of pride. As a thing that is valued, and rewarded. As a thing that every single person in an org understands as being critical to success. Every organization has unique operational needs, and figuring out what they are and delivering on them takes a lot of creativity and ingenuity on both the cultural and technical front.

“Operations” comes with baggage, no doubt. But I just don’t think that distance and denial are an effective approach for making something better, let alone trash talking and devaluing the skill sets that you need to deliver quality services.

You don’t make operational outcomes magically better by renaming the team “DevOps” or “SRE” or anything else. You make it better by naming it and claiming it for what it is, and helping everyone understand how their role relates to your operational objectives.

And now that I have written this blog post I can stop arguing with people who want to talk about “DevOps Engineers” and whether “#NoOps” is a thing and maybe I can even stop trolling them back about the nascent “#NoDevs” movement. (Haha just kidding, that one is too much fun.)

I mean how hard can it be to just glue together APIs that other people have written and support and scale? 🤔 #serverless #NoDevs
— Charity Majors (@mipsytipsy) May 23, 2016

sufficiently advanced trolling is indistinguishable from thought leadering. #serverless #NoDevs
— Charity Majors (@mipsytipsy) May 23, 2016

Part 2: Operations in a #Serverless World

Nail polish: the superior paint

April 4, 2016April 4, 2016 mipsytipsypretty4 Comments

Here is a thing that more people need to know: nail polish is the toughest, brightest, sparkliest, most durable paint in the world. It makes everything prettier and brighter and it lasts forever.

I have a tradition of always painting a new keyboard every time I start a new job. Please witness exhibit A, my freshly painted Hound keyboard & num pad:

Doesn’t that just make you happy to look at? The idea of typing on a plain beige or black keyboard all day long is … actually one of the more depressing things I can imagine.

Now check out the one I painted when I started at Parse and used nearly every day for almost 4 years:

old-school microsoft natural keyboard, RIP

Still pretty cute, right? It’s a little dingy and a few chips here or there but is there literally any other paint on earth that would have taken this abuse for four years? (And you can’t even tell how pretty the shimmery holographic silver keys are in this pic.)

Dude, it gets better. My work laptop:

work laptop, aluminum MBP with hound swag

And since I got my work and personal laptops around the same time, I made them fraternal twins.

IMG_4811 (1) — personal laptop, dark grey MBA

Yo, this shit can bang around in my laptop bag for *years* and never chip or wear off or fade. Nail polish is the most badass paint in the entire world.

(The only thing I will not paint is my fingernails, I hate painted nails.)

I really really wish I had saved pictures of all the laptops and keyboards and monitors and mice and cell phones and other crap I have painted over the last decade or so. But here are a few of the pics and pieces I still have lying around:

former personal laptop

rainbow helmet

flowery helmet

flowery helmet side 2

after riding over GG bridge

a former work laptop

another old work laptop

shopkick keyboard

People keep asking me where I get my laptop “stickers”, or my awesome keyboards, and now you know.

Nail polish. It is the shit.

Pro tip #1: if you want to paint bright colors on a black background, you need to lay down a very light base first. Surprisingly, white doesn’t work well, it’s not opaque enough and ends up looking streaky. You need something reflective, like silver aluminum foil or silver chrome — maybe this? Dunno, I have a big stash so I haven’t really bought anything new in years and can’t recommend specific brands.

Pro Tip #2: cheap polish is A-OK. Those 99 cent bottles are fine, just select for opacity. Sheer colors don’t work very well, if you have to apply more than 2-3 coats it will get globby and weird as well as being tedious.

Pro Tip #3: sketch your designs out with a pencil before you paint them (easily erasable), and apply a top coat the next day to make everything even shinier and longer-lasting.

somewhere in the presidio

some random magazine holders

helmet on a sunny day

pretty little shelves

Please send me your pics. 🙂

Happy painting!!

Two weeks with Terraform

February 23, 2016 mipsytipsycloudformation, devops, terraform39 Comments

I’ve been using terraform regularly for 2-3 weeks now. I have terraformed in rage, I have terraformed in delight. I thought it might be helpful to share some of my notes and lessons learned.

Why Terraform?

Because I am fucking sick and tired of not having versioned infrastructure. Jesus christ, the ways my teams have bent over backwards to fake infra versioning after the fact (nagios checks running ec2 diffs, anyone?).

Because I am starting from scratch on a green field project, so I have the luxury of experimenting without screwing over existing customers. Because I generally respect Hashicorp and think they’re on the right path more often than not.

If you want versioned infra, you basically get to choose between 1) AWS CloudFormation and its wrappers (sparkleformation, troposphere), 2) chef-provisioner, and 3) Terraform.

The orchestration space is very green, but I think Terraform is the standout option. (More about why later.) There is precious little evidence that TF was developed by or for anyone with experience running production systems at scale, but it’s … definitely not as actively hostile as CloudFormation, so it’s got that going for it.

First impressions

Stage one: my terraform experiment started out great. I read a bunch of stuff and quickly spun up a VPC with public/private subnets, NAT, routes, IAM roles etc in < 2 days. This would be nontrivial to do in two days *without* learning a new tool, so TOTAL JOY.

Stage two: spinning up services. This is where I started being like … “huh. Has anyone ever actually used this thing? For a real thing? In production?” Many of the patterns that seemed obvious and correct to me about how to build robust AWS services were completely absent, like any concept of a subnet tier spanning availability zones. I did some inexcusably horrible things with variables to get the behavior I wanted.

Stage three: … modules. Yo, all I wanted to do was refactor a perfectly good working config into modules for VPC, security groups, IAM roles/policies/users/groups/profiles, S3 buckets/configs/policies, autoscaling groups, policies, etc, and my entire fucking world just took a dump for a week. SURE, I was a TF noob making noob mistakes, but I could not believe how hard it was to debug literally anything..

This is when I started tweeting sad things.

my kingdom for a printf, or a linter, or — dear god — the ability to step through the run and print out types & values.

— Charity Majors (@mipsytipsy) February 10, 2016

The best (only) way of debugging terraform was just reading really, really carefully, copy-pasting back and forth between multiple files for hours to get all the variables/outputs/interpolation correct. Many of the error messages lack any context or line numbers to help you track down the problem. Take this prime specimen:

Error downloading modules: module aws_vpc: Error loading .terraform
/modules/77a846c64ead69ab51558f8c5be2cc44/main.tf: Error reading 
config for aws_route_table[private]: parse error: syntax error

Any guesses? Turned out to be a stray ‘}’ on line 105 in a different file, which HCL vim syntax highlighting thought was A-OK. That one took me a couple hours to track down.

Or this:

aws_security_group.zookeeper_sg: cannot parse '' as int: 
strconv.ParseInt: parsing "": invalid syntax

Which *obviously* means you didn’t explicitly define some inherited port as an int, so there’s a string somewhere there lurking in your tf tree. (*Obviously* in retrospect, I mean, after quite a long time poking haplessly about.)

Later on I developed more sophisticated patterns for debugging terraform. Like, uhhh, bisecting my diffs by commenting out half of the lines I had just added, then gradually re-adding or re-commenting out more lines until the error went away.

Security groups are the worst for this. SO MANY TIMES I had security group diffs run cleanly with “tf apply”, but then claim to be modifying themselves over and over. Sometimes I would track this down to having passed in a variable for a port number or range, e.g. “cidr_blocks = [“${var.ip_range}”]”. Hard-coding it to “cidr_blocks [“10.0.0.0/8″]” or setting the type explicitly would resolve the problem. Or if I accidentally entered a CIDR range that AWS didn’t like, like 10.0.20.0/20 instead of 10.0.16.0/20. The change would apply and usually it would work, it just didn’t think it had worked, or something. TF wasn’t aware there was a problem with the run so it would just keep “successfully” reapplying the diff every time it ran.

Some advice for TF noobs

As @phinze told me, “modules are basically like functions — a variable is an argument, output is a return value”. This was helpful, because that was completely unintuitive to me when I started refactoring. It took a few days of wrestling with profoundly inscrutable error messages before modules really clicked for me.
Strings. Lists. You can only pass variables around as strings. Split() and join() are your friends. Oh my god I would sell so many innocent children for the ability to pass maps back and forth between modules.
No interpolation for resource names makes me so sad. Basically you can either use local variable maps, or multiple lists and just … run those index counters like a boss I guess..
Use AWS termination protection for stateful services or anything risky once you’re in production. Use create_before_destroy on resources like ASG launch configs. Use “don’t destroy” where you must — but as sparingly as possible, because that basically breaks the entire TF model.
If you change the launch config for an ASG, like replacing the AMI for example, you might expect TF to kick off an instance recycle. It will not. You must manually terminate the instances to pick up the new config.
If you’re collaborating with a team — ok, even if you’re not — find a remote place to store the tfstate files. Try S3 or github, or shell out for Atlas. Local state on laptops is for losers.
TF_LOG=DEBUG has never once been helpful to me. I can only assume it was written for the Hashicorp developers, not for those of us using the product.

Errors returned by AWS are completely opaque. Like “You were not allowed to apply this update”. Huh? Ok well if it fails on “tf plan”, it’s probably a bad terraform config. If it successfully plans but fails on “tf apply”, your AWS logic is probably at fault.

Terraform does not do a great job of surfacing AWS errors.

For example, here is some terraform output:

tf output: "* aws_route_table.private: InvalidNatGatewayID.NotFound
: The natGateway ID 'nat-0e5f4ea507113b423' does not exist"

Oh!~ Okay, I go to the AWS console and track down that NAT gateway object and find this:

"Elastic IP address [eipalloc-8583b7e1] is already associated"

Hey, that seems useful! Seems like TF just timed out bringing up one of the route tables, so it tried assigning the same EIP twice. It would be nice to surface more of this detail into the terraform output, I hate having to resort to a web console.

Last but not least: one time I changed the comment string on a security group, and “tf plan” went into an infinite dependency loop. I had to roll back the change, run terraform destroy against all resources in a bash for loop, and create an new security group with all new instances/ASGs just to change the comment string. You cannot change comment strings or descriptions for resources without the resources being destroyed. This seems PROFOUNDLY weird to me.

Wrapper scripts

Lots of people seem to eventually end up wrapping terraform with a script. Why?

There is no concept of a $TF_ROOT. If you run tf from the wrong directory, it will do some seriously confusing and screwed up shit (like duping your config, but only some of it).
If you’re running in production, you prob do not want people to be able to accidentally “terraform destroy” the world with the wrong environment
You want to enforce test/staging environments, and promotion of changes to production after they are proven good
You want to automatically re-run “tf plan” after “tf apply” and make sure your resources have converged cleanly.
So you can add slack hooks, or hipchat hooks, or github hooks.
Ummm, have I mentioned that TF can feel somewhat undebuggable? Several people have told me they create rake tasks or YML templates that they then generate .tf files from so they can debug those when things break. (Erf …)

Okay, so …..

God, it feels I’ve barely gotten started but I should probably wrap it up.[*] Like I said, I think terraform is best in class for infra orchestration. And orchestration is a thing that I desperately want. Orchestration and composability are the future of infrastructure.

But also terraform is green as fuck and I would not recommend it to anyone who needs a 4-nines platform.

Simply put, there is a lot of shit I don’t want terraform touching. I want terraform doing as little as possible. I have already put a bunch of things into terraform that I plan on taking right back out again. Like, you should never be running a userdata.sh script after TF has bootstrapped a node. Yuck.. That is a job for your cfg management, or possibly a job for packer or a custom firstboot script, but never your orchestration tool! I have already stuffed a bunch of Route53 DNS into TF and I will be ripping that right back out soon. Terraform should not be managing any kind of dynamic data. Or service registry, or configs, or ….

Terraform is fantastic for defining the bones of your infrastructure. Your networking, your NAT, autoscaling groups, the bits that are robust and rarely change. Or spinning up replicas of production on every changeset via Travis-CI or Jenkins — yay! Do that!

But I would not feel safe making TF changes to production every day. And you should delegate any kind of reactive scaling to ASGs or containers+scheduler or whatever. I would never want terraform to interfere with those decisions on some arbitrary future run.

Which is why it is important to note that terraform does not play nicely with others. It wants to own the whole thing. Monkeypatching TF onto an existing infra is kind of horrendous. It would be nice if you could tag certain resources or products as “this is managed by some other system, thx”.

So: why terraform?

Well, it is fairly opinionated. It’s actively developed by some really smart people. It’s moving fast and has most of the momentum in the space. It’s composable and interacts well with other players iff you make some good life choices. (Packer, for example, is amazing, by far the most unixy utility of the Hashicorp library.)

Just look at the rate of bug fixes and releases for Terraform vs CloudFormation. Set aside crossplatform compatibility etc, and just look at the energy of the respective communities. Not even a fair fight.

Want more? Ok, well I would rather adopt one opinionated philosophy for my infrastructure, supplementing where necessary, than duct tape together fifty different half baked philosophies about how software and infrastructure should work and spend all my time mediating their conflicts. (This is one of my beefs with CloudFormation: AWS has no opinions, only slobbering, squidlike, directionless-flopping optionalities. And while we’re on the topic it also has nothing like “tf plan” for previewing changes, so THAT’S PRETTY STUPID TOO.)

I do have some concerns about Hashicorp spreading themselves too thin on too many products. Some of those products probably shouldn’t exist. Meh.

Terraform has a ways to go before it feels predictable and debuggable, but I think it’s heading in the right direction. It’s been a fun couple weeks and I’m excited to start contributing to the ecosystem and integrating with other components, like chef-solo & consul.

[*] OMGGGGGGG, I never even got to the glorious horrors of the terraforming gem and how you are most definitely going to end up manually editing your *.tfstate files. Ahahahahaa.

[**] Major thanks to @phinze, @solarce, @ascendantlogic, @lusis, @progrium and others who helped me limp through my first few weeks.

How to survive an acquisition

February 3, 2016February 3, 2016 mipsytipsytech culture20 Comments

Last week Facebook announced that they will be shutting down Parse.

I have so, so many feelings. But this isn’t about my feelings now. This is about what I wish I had known when we got acquired.

When they told us we were getting acquired, it felt like a wrecking ball to the gut. Nothing against Facebook, it just wasn’t what I signed up for. Big company, bureaucracy, 3-4 hour daily commute, and goddammit Parse was my *baby*. What the fuck was this big company going to do to my baby?

Here’s the first thing I wish I had known: this is normal. Trauma is the norm. Acquisitions are *always* massively disruptive and upsetting, even the good ones.

This was all compounded by the jarring disconnect between internal and external perceptions of the acquisition. The surreality of being written up in the press and having everyone act like this was the greatest thing that ever happened to us, versus the shock and dismay and confusion that many of us were actually experiencing.

Here are a few more things I wish I had understood.

You don’t own your product anymore. Your product is now “strategic alignment”. Look at yourself in the mirror and repeat that five times every morning.

Your customers are not your customers anymore. Your customer is now your corporate overlord. You can resist this and try to serve your old customers first, but it will wear your engineering team out and eventually drive you mad.

Cultures will clash. Yours will lose. You *must* learn the tribal and hierarchal games of the new org if you want to succeed. They don’t make sense to you, and you didn’t choose into them, but you must learn them anyway if you want to succeed.

Assume good intent. Aggressively assume good intent. If a big company buys up your tiny startup, lots of people are going to condescend to you and assume they know how to solve your own problems better than you do. In retrospect, I realize that a lot of tiny acquired startups *don’t* know what the fuck they’re doing and so the behemoth just gets used to assuming that everyone is like that. Just grit your teeth and take it.

And if you honestly can’t see yourself ever embracing the new parent company, whether for cultural or ethical or technical reasons, you should leave sooner rather than later. (UNLESS it’s for the golden handcuffs in which case for god’s sake try to emotionally disengage and do not be a people manager.)

One of Facebook’s internal slogans is something like “always do what’s best for Facebook”. I never gave a shit about Facebook’s mission; I cared about Parse’s. I remain incredibly proud of everything we accomplished for Parse, MongoDB, and RocksDB. If only our strategic alignments had, well, aligned.

2015 in recaps and life changes

December 29, 2015December 30, 2015 mipsytipsy1 Comment

2015 is dead, long live 2016

A pretty serious list of ridiculous/amazing shit happened in 2015. My face was on a fucking pillar. I got to meet KEITH BOSTIC. A bunch of my favorite engineers got dressed up in cheetah kigurumi pajamas and ran around playing booth babes. And that’s just a three day sampling of mischief.

IMG_4213 (1) — THIS IS MY FACE ON A PILLAR. you are allowed to find this incredibly creepy.

There are at least three key things about 2015 that I feel like I am always going to associate with this year, so that’s what I want to write about. (I’ll do a boring second post later with links to all the talks, articles etc for the year.)

The three things that stood out for me this year were:

Conference Overload
MongoDB + RocksDB (aka “mongo grows up”)
Leaving Parse.

The Year of Conference Insanity

In 2015 I gave talks at 21 conferences, which was … excessive.

This wasn’t really intentional. But I was bored and restless and unhappy a lot and had many erratically available spare cycles. Being a manager at a big company makes this depressing Swiss cheese out of your calendar. You can’t really carve out the heads-down time you would need in order to do challenging technical work, but it’s easy to go on lots of short trips to conferences. And travel is nice. So I just kind of said “yes” to everything.

I really appreciate all the brilliant people and organizers who welcomed me to their events this year, and all the many random, unexpectedly lovely connections I made there.

Next year will be very different.

IMG_4135 (1) — that one day i packed lighter fluid in to work to burn things because reasons

MongoDB + RocksDB

When I started running MongoDB in 2012, we were on version 2.0 and it had a single lock. Yep, just the one! This was a SUPER fun way to try and build a sophisticated multi-tenant platform.

Mongo was rolling out performance improvements with every release. But shit actually got real when they put the storage engine API on the roadmap and acquired WiredTiger. The Facebook RocksDB team enthusiastically pitched in too, providing feedback on the API design and ultimately delivering a full-fledged implementation of MongoDB with RocksDB storage engine.

the RocksDB “marketing” team, with bonus Peter Zaitsev

Parse spent the past 1.5 years being the alpha customer for MongoRocks. We developed a load testing framework for replaying production workloads, evaluated TokuMX and WiredTiger as well as Rocks, worked with the RocksDB team (aka Igor Canadi) to iron the kinks out of the MongoRocks implementation and figure out how to run and monitor the damn thing, and then got it rolled it out to 100% of our production replica sets in under a year. (Worth it? with Rocks we used 1/10th the storage space and writes were 50-200x faster compared to mmap.)

Asya, Domas and Mark Callaghan at the inaugural MongoDB Storage Engine Summit

I’ve said this before but I really believe it: 2015 is the year that MongoDB grew up and became a “real database”. This was Mongo’s InnoDB moment. Remember how much shit everyone used to give MySQL back in the MyISAM days? Hey guess what, software actually can get better! This is one reason it has been a special delight getting to work with Mark Callaghan. Mark was one of those engineers who turned MySQL from a punch line into a “real database” which powers much of the internet. So it was kind of like having this guy around.

I loved having a front row seat to mongo’s awkward adolescent phase. I loved getting to play a small role in promoting storage engine diversity. With Percona now offering enterprise support for both MyRocks and MongoRocks, I think the project is in good hands, and I’m really really fucking proud of this.

Leaving Parse

I said goodbye to Parse in early September. This was one of those emotionally overwhelming life changes that somehow manages to feel both too soon and way, way overdue. I spent 3.5 years working on Parse — one year pre-acquisition, 2.5 years at Facebook.

IMG_4631 (1) — my going-away cake. yes, it is a unicorn shitting rainbows with a purple tutu and sparkly silver slippers. thank you nancy!!

I like to think of myself as this crusty backend engineer who sighs wearily and doesn’t give a shit about product, but the fact is I fell in love with Parse on the very first day. I am so, so proud of what we built. We built a thing that people really cared about. We empowered so many developers to build and create and make whole new businesses on our platform.

There is nothing as heady as getting to pour yourself into a product that people are passionate about, tackle problems that are hard and might actually be impossible, with a team that blows your socks off with their brilliance and joy and hilariousness on a daily basis.

the amazing parse production engineering team. my boys 4 lyfe

So yeah, it was hard to let go. But it was time. The 3-4 hour commute was fucking killing me, and it had been a long, *long* time since I felt like I was learning new things or pushing my boundaries as an engineer or a leader. I need a certain amount of chaos and panic in my life. Being comfortable for too long just makes me miserable.

Parse is going to be a hard act to follow. Thanks for raising the bar, motherfuckers.

In summary

2015 was ridiculous and awesome, and I want 2016 to be ridiculous and awesome but in some radical and fantastically new ways. I have many hopes and intentions for 2016. but I’ll save that for another time.

Oh but my single greatest achievement of 2015? is *clearly* this tattoo.

12 months, 11 appts, about 30 hours of inking. thank you micah riot.

Hello world

December 27, 2015April 18, 2025 mipsytipsy2 Comments

“Start a blog” has been on my todo list for so long, it’s going to leave a gaping void in my life when I check it off.

There are some things I have been wanting to talk about for a while now, in a way that’s less ephemeral than twitter. Like how the tech industry is terrible at interviewing and hiring, and some ideas for making it slightly less terrible. Or the appalling proliferation of new ways to bootstrap your infrastructure over the past couple years. Tips for how to survive when your startup gets acquired. Stuff like that.

The first thing on the list tho will be a wrap-up of all the talks and articles and shit that I did in 2015, because Caitie did it and it was amazing and inspiring. <3

This post is basically for me to try and figure out how to use wordpress themes. But here, just so that it isn’t 100% devoid of content let me state for the record where “@mipsytipsy” comes from. When I was in school we had an assembly language class where we had to use SPIM to emulate the MIPS R3000 processor. The class was wretched, but “spimmy” became my irc nick. During my Everquest years I played an enchanter named Mipsy Tipsy, and now I’m just too lazy to come up with anything new.

A surprising number of people still call me “Spimmy” in real life. Heh.

photo credit: i don’t remember whose hands these were. maybe @mosheroperandi?