Heisenbug 2019 Moscow (05.12.2019 - 06.12.2019)

CrashMonkey & Ace: Systematically Testing File-System Crash Consistency


The talk discusses the importance of crash consistency in storage systems and how we built an efficient infrastructure for finding crash-consistency bugs.

Problem description: Crash consistency is the ability of a storage system to recover to a correct state after a crash caused by power loss or a kernel panic. A correct state means that the internal metadata structures are consistent and that the files and directories persisted before the crash are neither lost nor corrupted. The consequences of crash-consistency bugs can be disastrous: loss of user data, or a system left unusable after recovery. However, there is little to no crash-consistency testing today for widely used Linux file systems such as ext4, xfs, btrfs, and F2FS. Linux file-system developers use xfstests, an ad-hoc collection of correctness tests, for regression testing. xfstests contains around 500 correctness tests that apply to all POSIX file systems. Of these, only about 30 (5%) are crash-consistency tests. Thus, file-system developers have no easy way to systematically test the crash consistency of their file systems.

Solution: We present a new approach to testing file-system crash consistency: bounded black-box crash testing (B3). B3 tests the file system in a black-box manner using workloads of file-system operations. Since the space of possible workloads is infinite, B3 bounds this space with parameters such as the number of file-system operations and the set of operations to include, and exhaustively generates the workloads within the bounded space. Each workload is tested on the target file system by simulating power-loss crashes while the workload executes and automatically checking whether the file system recovers to a correct state after each crash. B3 builds on insights from our study of crash-consistency bugs reported in Linux file systems over the last five years. We observed that most reported bugs can be reproduced with small workloads of three or fewer file-system operations on a newly created file system, and that all reported bugs result from crashes after fsync()-related system calls. We built the tool CrashMonkey to demonstrate the effectiveness of this approach. CrashMonkey revealed 10 new crash-consistency bugs in widely used, mature Linux file systems, seven of which had existed in the kernel since 2014. It also revealed a data-loss bug in a verified file system, FSCQ. The new bugs have severe consequences, such as broken rename atomicity and loss of persisted files.
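
To make the B3 loop described above concrete, here is a minimal sketch in Python. It is not the CrashMonkey or Ace implementation: the operation vocabulary, the in-memory `ToyFS` model, the deliberately broken `BuggyFS`, and every function name are assumptions made purely for illustration. It enumerates every workload within a bound, simulates a crash after each fsync, and checks that the synced file survives intact.

```python
from itertools import product

# Illustrative subset of operations and file names (assumed, not Ace's real vocabulary).
OPS = ("creat", "write", "unlink", "fsync")
FILES = ("A", "B")


class ToyFS:
    """In-memory stand-in for a file system: volatile state plus what survives a crash."""

    def __init__(self):
        self.volatile = {}  # name -> contents visible before the crash
        self.durable = {}   # name -> contents that survive power loss

    def apply(self, op, name):
        if op == "creat":
            self.volatile.setdefault(name, "")
        elif op == "write" and name in self.volatile:
            self.volatile[name] += "x"
        elif op == "unlink":
            self.volatile.pop(name, None)
        elif op == "fsync" and name in self.volatile:
            self.persist(name)

    def persist(self, name):
        self.durable[name] = self.volatile[name]

    def crash_and_recover(self):
        # Everything volatile is lost; only durable state comes back.
        return dict(self.durable)


class BuggyFS(ToyFS):
    """Deliberately broken model: fsync persists the file but drops its data."""

    def persist(self, name):
        self.durable[name] = ""


def generate_workloads(max_ops):
    """Exhaustively enumerate every sequence of 1..max_ops (op, file) steps."""
    steps = list(product(OPS, FILES))
    for length in range(1, max_ops + 1):
        yield from product(steps, repeat=length)


def check_workload(fs_factory, workload):
    """Run on a fresh file system, crash after each fsync, verify the synced file."""
    fs = fs_factory()
    for op, name in workload:
        fs.apply(op, name)
        if op == "fsync" and name in fs.volatile:
            expected = fs.volatile[name]
            if fs.crash_and_recover().get(name) != expected:
                return False
    return True


def run_b3(fs_factory, max_ops=3):
    """Return the workloads within the bound that expose a violation."""
    return [w for w in generate_workloads(max_ops) if not check_workload(fs_factory, w)]


if __name__ == "__main__":
    print("ToyFS violations:", len(run_b3(ToyFS)))
    for workload in run_b3(BuggyFS):
        print("BuggyFS fails on:", workload)
```

In this toy model the seeded bug only shows up once a file has been created, written, and then synced, so a bound of three operations is exactly enough to expose it. A real harness would replace `ToyFS` with a block-level record-and-replay layer under an actual file system.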

Program committee's comment:

A lot can go wrong in file systems, and testing them is not easy. The talk examines an approach to this problem.