Chaos Day Playbook
Equal ExpertsContact UsPlayBooks
  • Creating Chaos
  • What & Why
  • 5-minute guide
  • Complementary approaches
    • Running a mini chaos event
  • Ready for chaos?
  • How
    • Timeline
    • Who to involve in a Chaos Day
    • What experiments to run on a Chaos Day
      • Experiment brainstorm
      • Experiment design and preparation
    • When to run a Chaos Day
    • Where to run a Chaos Day
    • How a Chaos Day unfolds
    • Learning from a Chaos Day
  • What next?
  • Licence
  • Contributing
    • Contributors
    • How to contribute
Powered by GitBook
On this page
  • Chaos retrospective format
  • Sharing the knowledge

Was this helpful?

  1. How

Learning from a Chaos Day

PreviousHow a Chaos Day unfoldsNextWhat next?

Last updated 2 years ago

Was this helpful?

Insights that help improve resilience are generated at every step of a Chaos Day, from sketching the system architecture at the beginning to responding to simulated failures on the day itself.

As with real production incidents, further lessons can be extracted by holding post-mortem style retrospectives after the event, to analyse what happened. Some great resources on running incident reviews include:

  • Jeli.io's

If several teams were involved in the Chaos Day, ask each team to run their own Chaos retrospective, then feed their top insights into a team-of-teams retrospective.

Chaos retrospective format

Supporting a production service is a team sport, so ideally involve the whole team in the Chaos retrospective, not just the Agents of Chaos and responders. Split the time into sections, or even separate meetings, focusing on:

What was learnt from the experiments? Use your standard incident review process, remembering to focus on surfacing new knowledge about system behaviour rather than generating a long list of improvement tasks. If ideas for improvement do come up, note them and assign an owner to review them later (see item 3). Don’t get drawn into solutionising, because that needs a separate thinking space, and will take valuable time away from documenting what was learnt.

What would improve the next Chaos Day? Spend the last 10 minutes of the meeting on this, noting down for future reference what changes would make the next Chaos Day even more successful.

What improvement considerations/ideas should progress onto a team’s backlog?

Sharing the knowledge

Chaos Days surface a lot of knowledge that can help teams improve service resilience. To get further benefit, share this knowledge across the organisation, making it easy for other teams to find and consume. Consider writing publically about what you’ve done and learnt - chaos engineering is still in its infancy in the software engineering discipline and making experience reports public will help this important practice mature.

Etsy’s facilitation guide
Google’s SRE handbook
incident review guide