Learning from a Chaos Day

Insights that help improve resilience are generated at every step of a Chaos Day, from sketching the system architecture at the beginning to responding to simulated failures on the day itself.

As with real production incidents, further lessons can be extracted by holding post-mortem-style retrospectives after the event to analyse what happened. Some great resources on running incident reviews include:

If several teams were involved in the Chaos Day, ask each team to run their own Chaos retrospective, then feed their top insights into a team-of-teams retrospective.

Chaos retrospective format

Supporting a production service is a team sport, so ideally involve the whole team in the Chaos retrospective, not just the Agents of Chaos and responders. Split the time into sections, or even separate meetings, focusing on:

Sharing the knowledge

Chaos Days surface a lot of knowledge that can help teams improve service resilience. To get further benefit, share this knowledge across the organisation, making it easy for other teams to find and consume. Consider writing publicly about what you’ve done and learnt: chaos engineering is still in its infancy within software engineering, and making experience reports public will help this important practice mature.
