Outsource Your O11y: Now Roll It Out And Keep Them Happy (part 3/3)

This is part three of a three-part series of guest posts:

  1. How To Be A Champion, on how to choose a third-party vendor and champion them successfully to your security team.  (George Chamales)
  2. Get Aligned With Security, how to work with your security team to find the best possible outcome for all sides (Lilly Ryan)
  3. Now Roll It Out And Keep Them Happy, on how to operationalize your service by rolling out the integration and maintaining it — and the relationship with your security team — over the long run (Andy Isaacson)

All this pain will someday be worth it.  🙏❤️  charity + friends


“Now Roll It Out And Keep Them Happy”

This is the third in a series of blog posts; previously we analyzed the security challenges of using a third party service, and we worked together with the security team to build empathy to deliver the project.  You might want to read those first, since we are going to build on a lot of the ideas there to ship and maintain this integration.

Ready for launch

You’ve convinced the security team and other stakeholders, you’ve gotten the integration running, you’re getting promising results from dev-test or staging environments… now it’s time to move from proof-of-concept to full implementation.  Depending on your situation this might be a transition from staging to production, or it might mean increasing a feature flipper flag from 5% to 100%, or it might mean increasing coverage of an integration from one API endpoint to cover your entire developer footprint.

Taking into account Murphy’s Law, we expect that some things will go wrong during the rollout.  Perhaps during coverage, a developer realizes that the schema designed to handle the app’s event mechanism can’t represent a scenario, requiring a redesign or a hacky solution.  Or perhaps the metrics dashboard shows elevated error rates from the API frontend, and while there’s no smoking gun, the ops oncall decides to rollback the integration Just In Case it’s causing the incident.

This gives us another chance to practice empathy — while it’s easy, wearing the champion hat, to dismiss any issues found by looking for someone to blame, ultimately this poisons trust within your organization and will hamper success.  It’s more effective, in the long run (and often even in the short run), to find common ground with your peers in other disciplines and teams, and work through to solutions that satisfy everybody.

Keeping the lights on

In all likelihood as integration succeeds, the team will rapidly develop experts and expertise, as well as idiomatic ways to use the product.  Let the experts surprise you; folks you might not expect can step up when given a chance.  Expertise flourishes when given guidance and goals; as the team becomes comfortable with the integration, explicitly recognize a leader or point person for each vendor relationship.  Having one person explicitly responsible for a relationship lets them pay attention to those vendor emails, updates, and avoid the tragedy of the “but I thought *you* were” commons.  This Integration Lead is also a center of knowledge transfer for your organization — they won’t know everything or help every user come up to speed, but they can help empower the local power users in each team to ramp up their teams on the integration.

As comfort grows you will start to consider ways to change your usage, for example growing into new kinds of data.  This is a good time to revisit that security checklist — does the change increase PII exposure to your vendor?  Would the new data lead to additional requirements such as per-field encryption?  Don’t let these security concerns block you from gaining valuable insight using the new tool, but do take the chance to talk it over with your security experts as appropriate.

Throughout this organic growth, the Integration Lead remains core to managing your changing profile of usage of the vendor they shepherd; as new categories of data are added to the integration, the Lead has responsibility to ensure that the vendor relationship and risk profile are well matched to the needs that the new usage (and presumably, business value) is placing on the relationship.

Documenting the Intergation Lead role and responsibilities is critical. The team should know when to check in, and writing it down helps it happen.  When new code has a security implication, or a new use case potentially amplifies the cost of an integration, bringing the domain expert in will avoid unhappy surprises.  Knowing how to find out who to bring in, and when to bring them in, will keep your team getting the right eyes on their changes.

Security threats and other challenges change over time, too.  Collaborating with your security team so that they know what systems are in use helps your team take note of new information that is relevant to your business. A simple example is noting when your vendors publish a breach announcement, but more complex examples happen too — your vendor transitions cloud providers from AWS to Azure and the security team gets an alert about unexpected data flows from your production cluster; with transparency and trust such events become part of a routine process rather than an emergency.

It’s all operational

Monitoring and alerting is a fact of operations life, and this has to include vendor integrations (even when the vendor integration is a monitoring product.)  All of your operations best practices are needed here — keep your alerts clean and actionable so that you don’t develop pager fatigue, and monitor performance of the integration so that you don’t get blindsided by a creeping latency monster in your APIs.

Authentication and authorization are changing as the threat landscape evolves and industry moves from SMS verification codes to U2F/WebAuthn.  Does your vendor support your SSO integration?  If they can’t support the same SSO that you use everywhere else and can’t add it — or worse, look confused when you mention SSO — that’s probably a sign you should consider a different vendor.

A beautiful sunset

Have a plan beforehand for what needs to be done should you stop using the service.  Got any mobile apps that depend on APIs that will go away or start returning permission errors?  Be sure to test these scenarios ahead of time.

What happens at contract termination to data stored on the service?  Do you need to explicitly delete data when ceasing use?

Do you need to remove integrations from your systems before ending the commercial relationship, or can the technical shutdown and business shutdown run in parallel?

In all likelihood these are contingency plans that will never be needed, and they don’t need to be fully fleshed out to start, but a little bit of forethought can avoid unpleasant surprises.

Year after year

Industry best practice and common sense dictate that you should revisit the security questionnaire annually (if not more frequently). Use this chance to take stock of the last year and check in — are you getting value from the service?  What has changed in your business needs and the competitive landscape? 

It’s entirely possible that a new year brings new challenges, which could make your current vendor even more valuable (time to negotiate a better contract rate!) or could mean you’d do better with a competing service.  Has the vendor gone through any major changes?  They might have new offerings that suit your needs well, or they may have pivoted away from the features you need. 

Check in with your friends on the security team as well; standards evolve, and last year’s sufficient solution might not be good enough for new requirements.

 

Andy thinks out loud about security, society, and the problems with computers on Twitter.


 

❤️ Thanks so much reading, folks.  Please feel free to drop any complaints, comments, or additional tips to us in the comments, or direct them to me on twitter.

Have fun!  Stay (a little bit) Paranoid!!

— charity

Outsource Your O11y: Now Roll It Out And Keep Them Happy (part 3/3)

Operational Best Practices #serverless

This post is part two of my recap of last week’s terrific Serverless conference.  If you feel like getting bitchy with me about what serverless means or #NoOps or whatever, please refer back to the prequel post, where I talked about operations engineering in the modern world.

*Then* you can get bitchy with me.  (xoxoxxooxo)

The title of my talk was:

Screen Shot 2016-05-30 at 8.43.39 PM

The theme of my talk was basically: what should software engineers know and care about when it comes to operations in a world where we are outsourcing more and more core functionality?

If you care about running a quality service or product, or providing your customers with a reasonable level of service, you have to care about operational concerns like design, resiliency, instrumentation and debuggability.  No matter how many abstractions there are between you and the bare metal.

If you chose a provider, you do not get to just point your finger at them in the post mortem and say it’s their fault.  You chose them, it’s on you.  It’s tacky to blame the software or the service, and besides your customers don’t give a shit whose “fault” it is.

So given an infinite number of things to care about, where do you start?

What is your mission, and what are your differentiators?

The first question must always be: what is your mission?  Your mission is not writing software.  Your mission is delivering whatever it is your customers are paying you for, and you use software to get there.  (Code is kind of a liability so you should write as little of it as necessary.  hey!! sounds like a good argument for #serverless!)

Second: what are your core differentiators?  What are the things that you are doing that are unique, and difficult to replicate, or the things where you have to actually be world class experts in those things?

Those are the things that you will have the hardest time outsourcing, or that you should think about very carefully before outsourcing.

Screen Shot 2016-05-30 at 7.57.06 PM

Facts

You can outsource labor, but you can’t outsource caring.  And nobody but you is in the position to think about your core differentiators and your product in a holistic way.

If you’re a typical early startup, you’re probably using somewhere between 5 and 20 SaaS products to get rid of some of the crap work and offload it to dedicated teams who can do it better than you can, much more cheaply, so you are freed up to work on your core value proposition.

GOOD.

But you still have to think about things like reliability, your security model, your persistent storage models, your query performance, how all these lovely services talk to each other, how you’re going to debug them, how you’re going to repro when things go wrong, etc.  You still own these things, even if you don’t run them.

For example, take AWS Lambda.  It’s a pretty great service on many dimensions.  It’s an early version of the future.  It is also INCREDIBLY irritating and challenging to debug in a practically infinite number of insanity-inducing ways.

** Important side note — I’m talking about actual production systems.  Parse, Heroku, Lambda, etc are GREAT for prototyping and can take you a long, long way.  Early stage startups SHOULD optimize for agility and rapid developer iteration, not reliability.  Thx to @joeemison for reminding me that i left that out of the recap.

Screen Shot 2016-05-30 at 8.03.01 PM

Focus on the critical path

Your users don’t care if your internal jenkins builds are broken.  They don’t care about a whole lot of things that you have to care about … eventually.  They do care a lot if your product isn’t actually functional.  Which means you have to think through the behavioral and failure characteristics of the providers you’re relying on in any user visible fashion.

Ask lots of questions if you can.  (AWS often won’t tell you much, but smaller providers will.)  Find out as much as you can about their cotenancy model (shared hardware or isolation?), their typical performance variance (run your own tests, don’t trust their claims), and the underlying storage systems.

Think about how you can bake in resiliency from the user’s perspective, that doesn’t rely on provider guarantees.  If you’re on mobile, can you provide a reasonable offline experience?  Like Parse did a lot of magic here in the APIs, where it would back off and retry saves if there were any errors.

Can you fail over to another provider if one is down?  Is it even worth it at your company’s stage of maturity and engineering resources to invest in this?

How willing are you to be locked into a vendor or provider, and what is the story if you find yourself forced to switch?  Or if that service goes away, as so many, many, many of them have done and will do.  (RIP, parse.com.)

Screen Shot 2016-05-30 at 8.11.10 PM

Tradeoffs

Listen, outsourcing is awesome.  I do it as much as I can.  I’m literally helping build a service that provides outsourced metrics, I believe in this version of the future!  It’s basically the latest iteration of capitalism in a nutshell: increased complexity –> increased specialization –> you pay other people to do the job better than you –> everybody wins.

But there are tradeoffs, so let’s be real.

The service, if it is smart, will put strong constraints on how you are able to use it, so they are more likely to deliver on their reliability goals.  When users have flexibility and options it creates chaos and unreliability.  If the platform has to choose between your happiness vs thousands of other customers’ happiness, they will choose the many over the one every time — as they should.

Limits may mysteriously change or be invented as they are discovered, esp with fledgling services.  You may be desperate for a particular feature, but you can’t build it.  (This is why I went for Kafka over Kinesis.)

You need to think way more carefully and more deeply about visibility and introspection up front than you would if you were running your own services, because you have no ability to log in and use strace or gdb or tail a logfile or run any system profiling commands when things go dark.

In the best case, you’re giving up some control and quality in exchange for experts doing the work better than you could for cheaper (e.g. i’m never running a fucking physical data center again, jesus.  EC24lyfe).  In a common worse case, it’s less reliable than what you would build AND it’s also opaque AND you can’t tell if it’s down for you or for everyone because frankly it’s just massively harder to build a service that works for thousands/millions of use cases than for any one of them individually.

Screen Shot 2016-05-30 at 8.21.36 PM.png

Stateful services

Ohhhh and let’s just briefly talk about state.

The serverless utopia mostly ignores the problems of stateful services.  If pressed they will usually say DynamoDB, or Firebase, or RDS or Aurora or something.

This is a big, huge, deep, wide lake of crap to wade in to so all I’m going to say is that there is no such thing as having the luxury of not having to understand how your storage systems work.  Queries will get slow, and you’ll need to be able to figure out why and fix them.  You’ll hit scaling cliffs where suddenly a perfectly-usable app just starts timing everything out because of that extra second of latency coming from …

¯\_(ツ)_/¯

The hardware underlying your instance will degrade (there’s a server somewhere under all those abstractions, don’t forget).  The provider will have mysterious failures.  They will be better than you, probably, but less inclined to give you satisfactory progress updates because there are hundreds or thousands or millions of you all clamoring.

The more you understand about your storage system (and the more you stay in the lane of how it was intended to be used), the happier you’ll be.

Screen Shot 2016-05-30 at 8.28.29 PM

In conclusion

These trends are both inevitable and, for the most part, very good news for everyone.

Operations engineering is becoming a more fascinating and specialized skill set.  The best engineers are flocking to solve category problems — instead of building the same system at company after company, they are building SaaS solutions to solve it for the internet at large.  Just look at the massive explosion in operational software offerings over the past 5-6 years.

This means that the era of the in-house dedicated ops team, which serves as an absorbent buffer for all the pain of software development, is mostly on its way out the door.  (And good riddance.)

People are waking up to the fact that software quality improves when feedback loops are tighter for software engineers, which means being on call and owning services end to end.  The center of gravity is shifting towards engineering teams owning the services they built.

This is awesome!  You get to rent engineers from Google, AWS, Pagerduty, Pingdom, Heroku, etc for much cheaper than if you hired them in-house — if you could even get them, which you probably can’t because talent is scarce.

Screen Shot 2016-05-30 at 8.39.42 PM

But the flip side of this is that application engineers need to get better at thinking in traditionally operations-oriented ways about reliability, architecture, instrumentation, visibility, security, and storage.  Figure out what your core differentiators are, and own the shit out of those.

Nobody but you can care about your mission as much as you can.  Own it, do it.  Have fun.

 

Operational Best Practices #serverless