Experiment brainstorm
Unless you’re running a mini Chaos Event, schedule a planning session at least two weeks before a Chaos Day. This allows time for the selected experiments to be designed, implemented and tested ahead of Chaos Day.
Bring the participants together and run a brainstorming session with the above context in mind. Ask the team to sketch out the system architecture and its steady-state behaviour, such as typical traffic levels and expected error rates. A boxes-and-arrows sketch is fine; it just needs to be high-level enough to visualise and reason about easily, yet detailed enough to identify components that might fail. If the number of participants is small and time allows, have each person draw their own sketch, then compare them one by one, starting with the sketch drawn by the person least familiar with the system.
This activity can improve the resilience of the system because it corrects misconceptions within the team and gives each member deeper architectural and behavioural understanding, which they can draw on during the next production incident.
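If you want to make the steady state measurable rather than anecdotal, a small automated check can capture the agreed baselines. The sketch below is purely illustrative: the metric names, job labels and thresholds are hypothetical, and it assumes a Prometheus-compatible HTTP API is available.
```python
# Illustrative steady-state check; metric names, job labels and thresholds
# are hypothetical and assume a Prometheus-compatible HTTP API.
import requests

PROM_URL = "http://prometheus.internal:9090/api/v1/query"  # hypothetical endpoint

STEADY_STATE = {
    # PromQL query (hypothetical)                        -> acceptable range
    'sum(rate(http_requests_total{job="checkout"}[5m]))': (50, 500),    # requests/s
    'job:http_error_ratio:rate5m{job="checkout"}':        (0.0, 0.01),  # error ratio
}

def check_steady_state() -> bool:
    """Return True if every baseline metric is within its expected range."""
    healthy = True
    for query, (low, high) in STEADY_STATE.items():
        response = requests.get(PROM_URL, params={"query": query}, timeout=10)
        results = response.json()["data"]["result"]
        value = float(results[0]["value"][1]) if results else float("nan")
        if not (low <= value <= high):
            print(f"Outside steady state: {query} = {value} (expected {low}-{high})")
            healthy = False
    return healthy

if __name__ == "__main__":
    check_steady_state()
```
Running a check like this before and during experiments gives the group a shared, objective answer to “have we returned to normal?”.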
Before brainstorming about failures, decide what the scope of experiments should be, considering the questions below (a sketch of how the agreed scope might be recorded follows this list):
Which parts of the system do you not have control over? Failures will be out of scope if their remediations are outside your team’s control (e.g. an internal failure within a cloud provider, such as an AWS network outage).
Which components are off-limits for experimentation? If another team has critical work in progress that requires component X to be highly available on Chaos Day, it might be best to preserve team harmony by declaring that component to be off-limits.
What is the desired blast radius? Failures in one component or a whole service will often impact connected components and downstream services. Decide whether there should be a constraint on how far failures are allowed to ripple.
Is there a desired focus area or theme to orient experiments around? Review the experiment themes and decide if Chaos Day will be a smorgasbord or a set menu of experiment types.
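Once agreed, it helps to record the scope somewhere everyone can see on the day. The format below is only a sketch; the component names, limits and theme are hypothetical.
```python
# Hypothetical record of the agreed experiment scope for a Chaos Day.
EXPERIMENT_SCOPE = {
    "out_of_our_control": [             # failures whose remediation we don't own
        "cloud provider internal failures (e.g. AWS network outage)",
    ],
    "off_limits_components": [          # preserve other teams' critical work
        "component-x",
    ],
    "blast_radius": {
        "max_downstream_services": 2,   # how far failures may ripple
        "environments": ["staging"],    # where experiments may run
    },
    "theme": "dependency failures",     # set menu rather than smorgasbord
}
```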
With a common architecture diagram and scope agreed upon and visible to everyone, the next step is to brainstorm experiments. Ask each participant to spend 5-10 minutes brainstorming 3-8 failures they’d like to learn more about, writing each failure on a sticky note (or virtual equivalent) and placing it on the architecture diagram near the target component/connection. For each failure, they should consider the failure event (what, where, how), expected impact and anticipated response.
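For example, a single sticky note (or its virtual equivalent) might capture something like the following; the failure, service names and responses shown are hypothetical:
```python
# A hypothetical candidate failure, as it might appear on one sticky note.
candidate_failure = {
    "failure_event": {
        "what": "order-service loses its database connection pool",
        "where": "order-service -> orders-db",
        "how": "block outbound traffic to the database port for 10 minutes",
    },
    "expected_impact": "new orders queue up; order history reads fail",
    "anticipated_response": "circuit breaker opens, alert fires within 5 minutes, "
                            "on-call engineer follows the failover runbook",
}
```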
At the end of the brainstorm, group similar ideas and discuss groups of particular interest (use dot-voting if necessary). Firm facilitation is required here, as these discussions can easily get lost in detail. Keep the group focused on reaching the point where they can agree on the top 4-8 experiments that optimise for learning within the agreed scope and theme.
The prioritisation process should be a discussion in which the following attributes are identified and compared for each experiment:
Impact - if the failure really happened, how significant would the business and/or reputational damage be? How would this change over the course of the failure, until normal service is resumed?
Likelihood - how likely is the failure? Has it happened before? What conditions would have to be present for it to occur, and how likely are those conditions? Don’t spend time on failures whose conditions imply a much wider impact, such as national or planet-scale outages, but equally, don’t rule out failures that merely seem unlikely. At one client we used a highly available cloud database service, fully managed by a leading public cloud provider. We had absolute confidence it would never fail, until one day a regular infrastructure-as-code deployment deleted the database and its backups (ironically due to cloud configuration changes designed to increase resilience). Question all assumptions!
Expected response (MTTR) - how do you expect the system and the team around it to respond when the failure occurs? Will a circuit breaker kick in and requests be retried later? Will alerts fire? Will the team quickly notice and know what to do?
Mental model - how well understood are the failure and the impacted architectural parts? Complex systems that are hard to reason about are also hard to support, especially when something goes wrong.
Organisational benefits, such as cross-team relationships - resilient systems are built and operated by teams. Failures that span multiple teams require a cross-team response. How well do the impacted teams know each other? Do they have established points of contact and communication protocols?
Having shortlisted 4-8 experiments, agree on an owner for each experiment and set the expectation that, before the Chaos Day, the owners will have fleshed out their experiments to the point that they are confident executing them. The next section outlines a template we’ve found useful in guiding experiment design.
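It can also help to record the outcome of the discussion in a simple, shared structure so the shortlist, owners and reasoning stay visible in the run-up to Chaos Day. The entry below is hypothetical:
```python
# Hypothetical record of the prioritised shortlist: one entry per experiment,
# with an owner and rough notes against the attributes discussed above.
shortlist = [
    {
        "experiment": "orders-db connection loss",
        "owner": "Priya",                 # fleshes out the design before Chaos Day
        "impact": "orders queue; reputational damage grows after ~30 minutes",
        "likelihood": "medium - connection pool exhaustion nearly caused this before",
        "expected_response": "circuit breaker plus failover runbook, never exercised",
        "mental_model": "weak - recovery path poorly understood, so plenty to learn",
        "organisational_benefit": "exercises escalation with the platform team",
    },
    # ... one entry per shortlisted experiment (aim for 4-8)
]
```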