What experiments to run on a Chaos Day

Context

It can be tempting to launch into chaos engineering with the desire to break things in diverse and spectacular ways, and see what happens.

This approach is definitely chaotic and may generate new insights, but it is not the desired type of chaos and is likely to yield a poor return on investment. To avoid this pitfall, remind yourself and the team why you’re doing chaos engineering: to improve system resilience through learning how the whole system (product, process, people) responds to injected failures.

Improved resilience comes through learning, and learning comes in many forms. For example, brainstorming experiments will help participants learn about the focal product’s architecture and characteristics, and they may draw on this learning in the next production incident, leading to a reduced time to recover (measured over time, the Mean Time To Recover, MTTR, is one of the four key indicators of software delivery performance). Experiments should therefore be identified, selected and designed to optimise for learning potential, rather than to maximise inflicted damage, the number of remediation items identified, or the time to recover.

Experiment themes

Experiments take many forms and provide different types of lessons. Depending on the team’s context, it can be useful to group the experiments around a particular theme, or just aim for diversity. In running Chaos Days at various organisations we’ve seen the following themes emerge:
