Heisenbug 2019 St. Petersburg (17.05.2019)

A systematic approach to building reliable distributed systems


We’ll look at how applying TLA+ and randomized testing can catch hard-to-find bugs in the designs and implementations of distributed systems. Attendees will see the utility of these techniques and learn where to start so they can apply them themselves.

Building distributed systems is hard. Even mature projects can continue to have design defects and implementation bugs.

Design defects are the worst because they cost the most to repair: fixing them means reworking and retesting large amounts of code. We’ll take a look at how we can rigorously check designs with TLA+ in order to catch defects before coding begins.
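
To make "checking a design" concrete, here is a minimal sketch in Python. It is not TLA+ and not part of the talk; it only imitates, on a toy scale, the exhaustive state exploration that a model checker such as TLC performs. The design being checked (two processes that each read a shared counter and then write back the value plus one) and all names in the code are hypothetical, chosen for illustration.

```python
from collections import deque

# Hypothetical toy design: each process reads the counter, then writes local + 1.
# A state is (counter, proc0, proc1); each proc is (program_counter, local_copy).
INIT = (0, ("read", 0), ("read", 0))

def next_states(state):
    """Yield every state reachable by letting one process take one step."""
    counter, *procs = state
    for i, (pc, local) in enumerate(procs):
        if pc == "read":
            new_proc, new_counter = ("write", counter), counter
        elif pc == "write":
            new_proc, new_counter = ("done", local), local + 1
        else:
            continue  # this process has finished
        new_procs = list(procs)
        new_procs[i] = new_proc
        yield (new_counter, *new_procs)

def invariant(state):
    """Once all processes are done, the counter must equal their number."""
    counter, *procs = state
    if all(pc == "done" for pc, _ in procs):
        return counter == len(procs)
    return True

def check(init):
    """Breadth-first search over all interleavings; return a shortest
    counterexample trace, or None if the invariant always holds."""
    parent = {init: None}
    queue = deque([init])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for nxt in next_states(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None

trace = check(INIT)
if trace is not None:
    for step in trace:
        print(step)
```

The search finds an interleaving in which both processes read before either writes, so one increment is lost. The defect is found in the design itself, before any production code exists, which is the kind of result a TLA+ specification checked with TLC is meant to give for real protocols.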

Implementation bugs can be less costly to fix, but extremely costly to end users and to the project’s reputation. We’ll look at finding implementation bugs with different types of testing, such as rigorous randomized testing with failure injection. We’ll also look at how to automate this testing process and log the history of events that led to a failure.
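
As a rough illustration of randomized testing with failure injection, the Python sketch below runs a toy system many times with different random seeds, injects dropped acknowledgements, and records a history of events for each run. Everything in it is an assumption made for the example: the toy client/server, the deliberate bug (the server applies retried requests twice), and the names `run_once` and `find_failure`. It is not the tooling presented in the talk.

```python
import random

def run_once(seed, num_ops=5, drop_prob=0.2):
    """Run one randomized scenario; return (ok, history).

    Hypothetical toy system: a client sends num_ops increment requests and
    retries any request whose acknowledgement is lost. The server is
    deliberately buggy: it applies a retried request a second time.
    """
    rng = random.Random(seed)      # seeded, so any failure can be replayed
    history = []                   # ordered log of events for diagnosis
    server_value = 0
    for op in range(num_ops):
        acked = False
        while not acked:
            server_value += 1      # bug: increment is not idempotent
            history.append(("apply", op, server_value))
            if rng.random() < drop_prob:
                history.append(("drop_ack", op))   # injected failure
            else:
                history.append(("ack", op))
                acked = True
    ok = server_value == num_ops   # each request should count exactly once
    return ok, history

def find_failure(max_seeds=1000):
    """Try seeds in order; return (seed, history) of the first failing run."""
    for seed in range(max_seeds):
        ok, history = run_once(seed)
        if not ok:
            return seed, history
    return None

result = find_failure()
if result is not None:
    seed, history = result
    print("failing seed:", seed)
    for event in history:
        print(event)
```

Because each run is driven by a seeded generator, a failing seed replays the same event history every time, which turns an intermittent failure into a reproducible one.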

The target audience is anyone involved in distributed systems development and testing.

The takeaway is that building reliable distributed systems requires a systematic approach, but all the tools we need are available to us today.