Chaos Day Playbook
Equal Experts

Complementary approaches

Chaos Days are one of many tools for improving system resilience. Others include:

  1. AWS Game Days. AWS runs these days to teach design and diagnosis techniques for improving resilience, using an AWS-based fake production service. They are intense and great fun, but they don’t teach you anything about your own system.

  2. Per-feature chaos testing. When a team builds a new feature, they run manual or automated experiments as part of its testing to explore the feature’s impact on system resilience. This can be a good way to introduce chaos-engineering principles, and it helps teams shift-left their operability thinking (i.e., consider it earlier in the engineering process, instead of when the first product issue hits).

  3. Purple team security exercises. These exercises help to identify vulnerabilities and weaknesses in a product by simulating the behaviours and techniques of malicious attackers in the most realistic way possible.

  4. Automated failure injection. Tools such as AWS’s Fault Injection Simulator, Gremlin and Netflix’s Chaos Monkey can be used to inject regular but random failures, testing the system’s response on an ongoing basis.

  5. Production incidents. Treat production incidents as learning opportunities, or in the words of John Allspaw, “Incidents are unplanned investments”. If managed well (see Google’s SRE book and Etsy’s debriefing guide), they yield valuable, firsthand insights, because everything about the chaos is real. Live issues can be costly to the business, so it pays to extract as much value from them as possible through a better understanding of the system and possible resilience improvements.

  6. Running a mini chaos event, as described next.
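
To make the idea of automated failure injection more concrete, here is a minimal Python sketch of a per-feature experiment. It is not how any of the tools named above work internally. Every name in it (`with_failure_injection`, `fetch_price`, `get_price_with_fallback`, `run_experiment`) is hypothetical and exists only for illustration. The sketch wraps a stand-in dependency so that a configurable fraction of calls fail at random, then checks that the feature degrades gracefully instead of crashing.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of a real dependency error during an experiment."""


def with_failure_injection(call, failure_rate, rng):
    """Wrap `call` so that roughly `failure_rate` of invocations raise InjectedFailure."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure("simulated dependency failure")
        return call(*args, **kwargs)
    return wrapped


def fetch_price(item_id):
    # Hypothetical downstream dependency; a real one would make a network call.
    return {"item": item_id, "price": 100}


def get_price_with_fallback(fetch, item_id):
    """Feature under test: return a degraded answer rather than propagate the failure."""
    try:
        return fetch(item_id)
    except InjectedFailure:
        # Assumption for this sketch: a missing price is an acceptable degraded response.
        return {"item": item_id, "price": None}


def run_experiment(calls=1000, failure_rate=0.2, seed=42):
    """Run the feature against the flaky dependency; return (responses, degraded responses)."""
    rng = random.Random(seed)  # seeded so an experiment run can be repeated
    flaky_fetch = with_failure_injection(fetch_price, failure_rate, rng)
    results = [get_price_with_fallback(flaky_fetch, i) for i in range(calls)]
    degraded = sum(1 for r in results if r["price"] is None)
    return len(results), degraded


if __name__ == "__main__":
    total, degraded = run_experiment()
    print(f"{total} responses returned, {degraded} of them degraded")
```

Passing the random generator in explicitly keeps the experiment repeatable, which helps when a failure found during one run needs to be reproduced while a fix is developed.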
