Who to involve in a Chaos Day

Once you have decided which service(s) to experiment on, you should identify three groups of people:

  1. Stakeholders are anyone connected with the service who would benefit from the insights created by a Chaos Day. It’s not just the team’s engineers who can gain new insights: designers, product owners and delivery managers all contribute to a system’s design, implementation and operation and therefore also have a stake in its resilience. For example, a designer and a product owner both need to be aware of how failures ripple through a system and ultimately impact end-users.

  2. Responders are whoever normally responds to service incidents. If you’re practising the principle of You Build It, You Run It (YBIYRI; for more on this, see our You Build It, You Run It playbook, by Steve Smith and Bethan Timmins), then this will be the team that owns the service. In other cases, it may be a support or operations team removed from the engineering team. To maximise the realism and learning opportunities of a Chaos Day, include responders as much as possible throughout the process.

  3. Agents of chaos are the engineers who will design and run the actual experiments. For a team’s first Chaos Day, these people should be two or more of the most experienced engineers on the team(s). If several teams are involved, then take the most experienced engineer from each team. This has two benefits. Firstly, it uncovers knowledge gaps within the team, because these engineers cannot be contacted during the Chaos Day itself. Secondly, the experiments they design tend to reflect real-world failures more closely, and explore system failure modes that other engineers know less well.

Common problems in identifying these groups are:

  1. Not all collaborators are identified (e.g. teams that own downstream systems) and/or involved in the Chaos Day process. Nora Jones describes this as Trap 2 of the 8 traps of Chaos Engineering. Although involving everyone in the various meetings is costly, people learn new things throughout the process, as different mental models are discussed and experiments are proposed, executed and reviewed.

  2. The team struggles to cope without their most experienced engineer. In one Chaos Day we ran, the responding team had to drag their agent of chaos back in to resolve the failure that the agent had induced. If you think this might happen to your team, then consider making the experiment design and execution process open to the team, as described in the Experiment Design and Execution sections.
