DevOops 2019 SPb (29.10.2019)

The curse of infrastructure team

Infrastructure is hard: if you make it too hard to use, nobody will use it; if you make it too easy to use, everyone will abuse it. How do we find a balance? You will find out in this session.

You’re in luck: your job is a dream job. Your job is to make infrastructure tools for other developers. No business bullshit, no coding of online order forms for web marketplaces. Every day you masterfully apply Helm charts to your Kubernetes cluster and send distributed traces to your fault-tolerant Jaeger instance. Your user base consists of skilled and professional engineers who read documentation and optimally use cloud resources. Sounds like a dream?

In this talk, Alexey will tell a chilling story of the Curse of Infrastructure Team. A heartbreaking story of one line of code that generates hundreds of gigabytes of log records per day. A scary story of engineers who create dozens of temporary VMs and forget to delete them afterward. A terrifying story of millions of unique metrics that nobody ever uses.

A team that creates infrastructure for other teams is doomed. It is squeezed between two mutually exclusive goals: making its tools as easy to use as possible, and avoiding the devastating results of engineers wasting the resources they got their hands on so easily. Infrastructure teams learn to create services that bring other teams more value than nuisance: services with convenient APIs, reliable on-call engineers and so on. But when a service finally reaches this level of quality, users overrun it, selfishly draining common resources and causing the infrastructure team immense pain.

We can cope with this, and Alexey has some ideas about how. But is there a happy ending to the story of the Curse of Infrastructure Team?