Service reliability in You Build It You Run It

Last updated 11 months ago

An on-call product team offers 24/7 production support, and they can modify all aspects of a digital service:

  • Alert definitions

  • Code

  • Configuration

  • Data

  • Deployments

  • Infrastructure definitions

  • Logging

  • Monitoring dashboards

For governance, product teams are responsible for the day-to-day work of availability protection and availability restoration. They track their costs on a team-by-team basis, and make them visible to senior leaders.

Availability protection in You Build It You Run It

A product team also updates their digital services by adding infrastructure capacity, making configuration changes, applying fixes, and updating alerts as necessary. This is prioritised alongside feature development, as it is the product team themselves who are on-call.

Availability restoration in You Build It You Run It

An on-call product team is responsible for availability restoration. This means reactively responding to production alerts in and out of working hours, including evenings, weekends, and bank holidays. All service alerts are expected to be resolved by the on-call product team, and incident response is consistently prioritised over feature development.

Responders observe their real-time service telemetry data, to understand the drift from normal to abnormal operating conditions. They diagnose the incident via their heuristics, innate service knowledge, and telemetry data. They attempt to restore availability via additional infrastructure, code changes, data fixes, configuration changes, rollbacks, and telemetry changes.

For a high-priority incident, the entire team may swarm on incident response to minimise service unavailability. One team member acts as incident commander, to coordinate with other product teams involved in incident response, and to manage communications with senior stakeholders. Alternatively, You Build It You Run It is compatible with ITIL v3, and the team could invite an incident manager to act as incident commander for the duration of an incident.

Service reliability costs in You Build It You Run It

| Cost Type | Frequency | Description | Impact | TCO % |
|---|---|---|---|---|
| Setup cost | One-off | Launch costs incurred in: license purchases; product team time for telemetry install; product team time for on-call schedule agreement; product team time to set up live access; product team time for any operational training necessary | Capex cost | Medium |
| Transition cost | One-off | Launch costs incurred in: product team time for runbooks | Capex cost | Low |
| Run cost | Ongoing | Regular costs incurred in product team time for: deploying code changes; applying data fixes; adding infrastructure capacity; monitoring operating conditions; performing rollbacks; updating telemetry tools; doing on-call standby out of hours | Capex cost | Medium |
| Incident cost | Per incident | Incident response costs incurred in product team time for: investigating and diagnosing problems; identifying and agreeing on solutions, including code changes, configuration updates, and adding infrastructure capacity; on-call callout out of hours | Capex cost | Low to medium |
| Opportunity cost | Per incident | Can be measured as the cost of delay between incident start and incident finish. Caused by service unavailability, missed opportunities with customers, and delays in further feature development | Revenue loss and costs incurred | Low |
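As a rough illustration of the opportunity cost row, the cost of delay between incident start and incident finish could be estimated as below. This is a minimal sketch, not a method prescribed by the playbook: the function name `cost_of_delay`, the flat `hourly_loss` rate, and the example figures are all assumptions made for illustration.

```python
from datetime import datetime


def cost_of_delay(incident_start: datetime, incident_finish: datetime,
                  hourly_loss: float) -> float:
    """Estimate opportunity cost as incident duration times an assumed hourly loss.

    `hourly_loss` is a hypothetical flat rate covering lost revenue and
    delayed feature work; real losses are rarely this uniform.
    """
    if incident_finish < incident_start:
        raise ValueError("incident_finish must not precede incident_start")
    duration_hours = (incident_finish - incident_start).total_seconds() / 3600
    return duration_hours * hourly_loss


# Hypothetical incident: 90 minutes of unavailability at an assumed 2,000 per hour.
start = datetime(2024, 3, 1, 22, 0)
finish = datetime(2024, 3, 1, 23, 30)
print(cost_of_delay(start, finish, hourly_loss=2000.0))  # 3000.0
```

A flat hourly rate keeps the arithmetic transparent for a per-team cost report. A team with strongly seasonal traffic would probably want a rate that varies by time of day.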

You Build It You Run It co-exists in a hybrid operating model with Ops Run It, and it's important to have specialist operational teams as well as generalist, cross-functional product teams. The same operational enablers offer their scarce, deep expertise to on-call product teams for digital services, and to the application support team for foundational systems. For example, a shared DBA team can assist with database provisioning, performance, and operations.

An on-call product team is responsible for availability protection. This means proactively monitoring service telemetry and updating digital services during working hours. A product team observes service health checks, logs, and metrics plumbed into different dashboards, in telemetry tools such as AWS CloudWatch or Grafana. They can alter the monitorable events whenever necessary, to improve the information value of their telemetry data.

When an availability target is breached, an on-call product team developer receives an automated alert, via an incident response platform such as PagerDuty or VictorOps. The responder acknowledges the alert, and a ticket is automatically created in a ticketing system such as ServiceNow, by the incident response platform. The responder classifies and prioritises the incident, and adds more team members to the incident as required.
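The alert flow described above (target breached, alert raised, responder acknowledges, ticket created, incident classified) can be sketched as a small in-memory model. This is an illustrative sketch only: it does not call any real incident response or ticketing API, and the names `Incident`, `check_availability`, `acknowledge`, the `INC` ticket format, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    service: str
    availability: float          # measured availability, e.g. 0.995
    target: float                # availability target, e.g. 0.999
    acknowledged_by: Optional[str] = None
    ticket_id: Optional[str] = None
    priority: Optional[str] = None
    responders: List[str] = field(default_factory=list)


def check_availability(service: str, successes: int, total: int,
                       target: float) -> Optional[Incident]:
    """Return an Incident when measured availability is below target, else None."""
    if total == 0:
        return None  # no traffic in the window, so nothing to measure
    availability = successes / total
    if availability < target:
        return Incident(service, availability, target)
    return None


def acknowledge(incident: Incident, responder: str, ticket_number: int) -> Incident:
    """Record the acknowledgement and stand in for automatic ticket creation."""
    incident.acknowledged_by = responder
    incident.responders.append(responder)
    incident.ticket_id = f"INC{ticket_number:07d}"  # hypothetical ticket format
    return incident


# Hypothetical window: 9,950 successful requests out of 10,000 against a 99.9% target.
incident = check_availability("checkout", successes=9_950, total=10_000, target=0.999)
if incident is not None:
    acknowledge(incident, responder="alice", ticket_number=42)
    incident.priority = "P1"            # responder classifies and prioritises
    incident.responders.append("bob")   # and adds more team members as required
    print(incident.ticket_id, incident.priority, incident.responders)
```

In a real setup the incident response platform would handle the breach check and ticket creation. The sketch only makes the order of responsibilities explicit.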