Service reliability in You Build It You Run It
An on-call product team offers 24/7 production support, and they can modify all aspects of a digital service:
Alert definitions
Code
Configuration
Data
Deployments
Infrastructure definitions
Logging
Monitoring dashboards
You Build It You Run It co-exists with Ops Run It in a hybrid operating model, so it's important to have specialist operational teams as well as generalist, cross-functional product teams. The same operational enablers offer their scarce, deep expertise to on-call product teams for digital services, and to the application support team for foundational systems. For example, a shared DBA team can assist with database provisioning, performance, and operations.
For governance, the product teams are responsible for day-to-day work in availability protection and availability restoration. They track their costs on a team-by-team basis, and make them visible to senior leaders.
An on-call product team is responsible for availability protection. This means proactively monitoring service telemetry and updating digital services during working hours. A product team observes service health checks, logs, and metrics plumbed into different dashboards, in telemetry tools such as AWS CloudWatch or Grafana. They can alter the monitorable events whenever necessary, to improve the information value of their telemetry data.
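As an illustration of owning alert definitions as code, the sketch below assumes AWS CloudWatch managed via boto3. The alarm name, namespace, metric, threshold, and SNS topic ARN are hypothetical placeholders a team would replace with its own.

```python
# A minimal sketch of a product team managing one of its own alert definitions,
# assuming AWS CloudWatch via boto3. All names, thresholds, and ARNs below are
# hypothetical placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

cloudwatch.put_metric_alarm(
    AlarmName="checkout-p99-latency-high",       # hypothetical alarm name
    Namespace="DigitalService/Checkout",         # hypothetical custom namespace
    MetricName="LatencyP99",                     # hypothetical metric, in milliseconds
    Statistic="Average",
    Period=60,                                   # evaluate the metric every 60 seconds
    EvaluationPeriods=5,                         # alert after 5 consecutive breaches
    Threshold=500.0,                             # hypothetical latency target
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",                # absent telemetry is itself a warning sign
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:on-call-alerts"],  # placeholder SNS topic
)
```

Because the team owns this definition, it can tune the threshold or evaluation window whenever the information value of an alert degrades.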
A product team also updates their digital services by adding infrastructure capacity, making configuration changes, applying fixes, and updating alerts as necessary. This is prioritised alongside feature development, as it is the product team themselves who are on-call.
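For example, adding infrastructure capacity might be as simple as raising the desired size of an auto scaling group. This is a hedged sketch assuming AWS Auto Scaling via boto3; the group name and capacity figure are hypothetical.

```python
# A rough sketch of adding infrastructure capacity, assuming an AWS auto scaling
# group managed via boto3. Group name and desired capacity are hypothetical.
import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

autoscaling.set_desired_capacity(
    AutoScalingGroupName="checkout-service-prod",  # hypothetical auto scaling group
    DesiredCapacity=6,                             # scale up to 6 instances
    HonorCooldown=False,                           # apply the change immediately
)
```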
An on-call product team is responsible for availability restoration. This means reactively responding to production alerts in and out of working hours, including evenings, weekends, and bank holidays. All service alerts are expected to be resolved by the on-call product team, and incident response is consistently prioritised over feature development.
When an availability target is breached, an on-call product team developer receives an automated alert via an incident response platform such as PagerDuty or VictorOps. The responder acknowledges the alert, and the incident response platform automatically creates a ticket in a ticketing system such as ServiceNow. The responder classifies and prioritises the incident, and adds more team members as required.
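As a rough sketch of that alerting hand-off, the snippet below shows how a monitoring check might raise a trigger event against the PagerDuty Events API v2. The routing key and service names are hypothetical, and the ServiceNow ticket would be created by the incident response platform's own integration rather than by this code.

```python
# A hedged sketch of raising an alert when an availability target is breached,
# assuming the PagerDuty Events API v2. Routing key and service names are
# hypothetical placeholders.
import requests

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def trigger_alert(routing_key: str, summary: str, source: str) -> str:
    """Send a trigger event to PagerDuty and return its deduplication key."""
    event = {
        "routing_key": routing_key,      # integration key for the team's on-call service
        "event_action": "trigger",
        "payload": {
            "summary": summary,          # e.g. "checkout availability below 99.9% target"
            "source": source,            # e.g. "checkout-service-prod"
            "severity": "critical",
        },
    }
    response = requests.post(PAGERDUTY_EVENTS_URL, json=event, timeout=10)
    response.raise_for_status()
    return response.json()["dedup_key"]
```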
Responders observe their real-time service telemetry data, to understand the drift from normal to abnormal operating conditions. They diagnose the incident via their heuristics, innate service knowledge, and telemetry data. They attempt to restore availability via additional infrastructure, code changes, data fixes, configuration changes, rollbacks, and telemetry changes.
For a high priority incident, the entire team may swarm on incident response, to minimise service unavailability. One team member acts as incident commander, to coordinate with other product teams involved in incident response, and to manage communications with senior stakeholders. Alternatively, You Build It You Run It is 100% compatible with ITIL v3, and an incident manager could be invited by the team to act as incident commander for the duration of an incident.
| Cost type | Frequency | Description | Impact | TCO % |
| --- | --- | --- | --- | --- |
| Setup cost | One-off | Launch costs incurred in license purchases, and in product team time for telemetry install, on-call schedule agreement, live access setup, and any operational training necessary | Capex cost | Medium |
| Transition cost | One-off | Launch costs incurred in product team time for runbooks | Capex cost | Low |
| Run cost | Ongoing | Regular costs incurred in product team time for deploying code changes, applying data fixes, adding infrastructure capacity, monitoring operating conditions, performing rollbacks, updating telemetry tools, and doing on-call standby out of hours | Capex cost | Medium |
| Incident cost | Per incident | Incident response costs incurred in product team time for investigating and diagnosing problems, identifying and agreeing on solutions (including code changes, configuration updates, and adding infrastructure capacity), and on-call callouts out of hours | Capex cost | Low to medium |
| Opportunity cost | Per incident | Can be measured as the cost of delay between incident start and incident finish, caused by service unavailability, missed opportunities with customers, and delays in further feature development | Revenue loss and costs incurred | Low |
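To make these cost types concrete, a team could roll them up into a first-year total cost of ownership figure for senior leaders. The sketch below uses hypothetical placeholder figures purely for illustration, not benchmarks.

```python
# A rough first-year TCO roll-up for an on-call product team. Every figure here
# is a hypothetical placeholder, not a benchmark.
setup_cost = 20_000            # one-off: licenses, telemetry install, training
transition_cost = 5_000        # one-off: runbooks
run_cost_per_month = 8_000     # ongoing: deployments, monitoring, on-call standby
incidents_per_year = 12
incident_cost = 2_000          # per incident: response time plus out-of-hours callouts
opportunity_cost = 3_000       # per incident: cost of delay during unavailability

first_year_tco = (
    setup_cost
    + transition_cost
    + 12 * run_cost_per_month
    + incidents_per_year * (incident_cost + opportunity_cost)
)

print(f"First year TCO: {first_year_tco:,}")  # 181,000 with these placeholder figures
```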