What is Ops Run It?

Many organisations have an IT department with segregated Delivery and Operations functions. Delivery teams in the former are responsible for building services, and operations teams in the latter are responsible for running services. We call this Ops Run It. We often find it's entwined with long-established IT management standards such as ITSM and ITIL v3.

Ops Run It has been a de facto operating model for decades, and was codified by the COBIT management framework in 1996. COBIT recommended functionally-oriented teams of specialists, working within separate Plan, Build, and Run phases. That was justified by the unavoidably high compute and transaction costs of on-premises software in the 1990s. That pre-Internet rationale doesn't hold true today.

Ops Run It creates a hard divide between Delivery and Operations, powered by radically different incentives. Delivery teams are short-lived and project-based. They're told to move fast and achieve their deliverables. Operations teams are long-lived, and they're told to move carefully and provide reliability.

It's important to remember that every organisation is different, and every Ops Run It implementation is different.

Recognising the problem

A large telco customer I worked with was having trouble with its monolithic systems. Its delivery and operations teams were 100% outsourced.

  • They were unable to achieve the deployment throughput they needed to meet demand. The operations team managed overnight releases, with a delivery team available to help solve release issues.

  • They often had reliability problems after large deployments and peak trading events. Incidents were managed by an operations team with 'pass the buck' handovers to delivery teams, and their Mean Time To Repair (MTTR) was six hours. Delivery teams were asked to provide unpaid L3 support, which led to resentment. In addition, the commercial model for incident response meant it wasn't financially viable to pay the outsourced operations team to fix incidents that weren't P1 or P2.

  • They had a strong predisposition towards a rush to fix. Many fix actions remained on a problem backlog without ever being prioritised or implemented.

The special character and the million pound outage
