# Chaos Day Playbook

Equal Experts
## Experiment design and preparation

We often use Trello as a collaboration tool for Chaos Days, with each card representing a single experiment. Our template card uses the following format:

| Heading | Example |
| --- | --- |
| Experiment title: `<Component> <Failure> <Expected result>` | Third-party service Y being unavailable leads to our service X queuing and retrying requests |
| What failure are you invoking? | Response time for requests from service X to Y starts exceeding 10 seconds. Simulate by stopping service Y. |
| Expected impact | No visible client impact for N minutes, during which service X will queue and retry requests whilst giving clients an OK response. After N minutes the response changes to "Service not available". |
| Expected response | Queue and retry kicks in, then alerts fire if 10 "Service not available" responses occur within 60 seconds. The team's support person will respond to the alert within 15 minutes and use the runbook to investigate connectivity to service Y. |
| Rollback steps | Restart service Y. |
| Other experiments to run in parallel | Intermittent failures in the queuing platform. |
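
If you track experiments programmatically rather than on a board, the card's fields map naturally onto a small record type. A minimal sketch in Python (the class and field names are our own, not part of the playbook):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExperimentCard:
    """One Trello-style card describing a single chaos experiment."""
    title: str                 # <Component> <Failure> <Expected result>
    failure: str               # what failure are you invoking, and how?
    expected_impact: str       # what clients should (not) see, and for how long
    expected_response: str     # alerts, people, and runbooks expected to engage
    rollback_steps: str        # how to restore normal service
    parallel_experiments: List[str] = field(default_factory=list)

card = ExperimentCard(
    title="Service Y unavailable -> service X queues and retries requests",
    failure="Stop service Y so X->Y response time exceeds 10 seconds",
    expected_impact="No client impact for N minutes, then 'Service not available'",
    expected_response="Alert after 10 failures in 60s; support follows the runbook",
    rollback_steps="Restart service Y",
    parallel_experiments=["Intermittent failures in the queuing platform"],
)
```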

Having designed the experiment, the next step is to determine how to execute it. This requires considering:

  1. What load does the system need to be under to trigger the failure and make it observable?

  2. For the host environment, how will the load occur? For example, if the experiment is running in production, will steady-state production load be sufficient, or does additional synthetic load need injecting? If a pre-production environment is used, how would a production load be simulated?

  3. When the failure occurs in the given environment, will configured alerts automatically fire at the error level generated by the induced load, or will the alert threshold need temporarily adjusting?

  4. How will the system be modified to simulate or cause the failure condition? Will an engineer need to create a script or program to cause the condition? Can this be programmed so the failure can be run repeatedly? Consider whether any third-party or cloud-provided chaos engineering tools might help.

  5. Which environment will the experiment be run in?
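
On the alert-threshold point above, the rule from our template card ("10 Service not available responses within 60 seconds") is a sliding-window check, and modelling it up front tells you whether the induced failure rate will actually fire the alert or whether the threshold needs temporary adjusting. An illustrative Python sketch (the threshold and window come from our example card; the function is our own, not the playbook's tooling):

```python
from collections import deque

ALERT_THRESHOLD = 10    # failures needed to trigger the alert
ALERT_WINDOW_SECS = 60  # sliding window, in seconds

def should_alert(failure_times, now,
                 window=ALERT_WINDOW_SECS, threshold=ALERT_THRESHOLD):
    """Return True when `threshold` failures occurred within the last `window` seconds.

    `failure_times` is a deque of failure timestamps, oldest first; stamps
    older than the window are discarded as a side effect.
    """
    while failure_times and now - failure_times[0] > window:
        failure_times.popleft()
    return len(failure_times) >= threshold

# Ten failures one second apart fall inside the 60-second window...
failures = deque(float(t) for t in range(10))
print(should_alert(failures, now=9.0))        # -> True
# ...but a single failure two minutes ago has aged out.
print(should_alert(deque([0.0]), now=120.0))  # -> False
```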

When designing experiments, be mindful of the probe effect (see also *Chaos Engineering*, p224); the means of injecting the failure condition might alter the system in such a way as to cause alternative or unrealistic chaos, thus distorting the experiment.
