MLOps

Solid data foundations

Have a store of good quality, ground-truth historical data that is accessible by your data scientists

A machine learning solution is fundamentally dependent on the data used to train it. Maintaining and operating an ML solution requires that the data used to develop the model or algorithm is available to its maintainers, who need it to monitor performance, validate that the model continues to perform well, and find improvements. Furthermore, in many cases the algorithm is modelling an external world that is undergoing change; the maintainers will want to update or retrain the model to reflect those changes, and so will need regular updates to the data.

The data needs to be accessible to data science teams, and it must also be made available to the automated processes that have been set up to retrain the model.
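
For example, a scheduled retraining job can read labelled historical data directly from the shared store. The sketch below is a minimal illustration, assuming a SQL warehouse reachable via SQLAlchemy; the connection string, table, columns and choice of model are all hypothetical placeholders, not a prescribed implementation.

```python
# A minimal sketch of an automated retraining job reading fresh,
# labelled historical data from a shared warehouse. The connection
# string, table and columns are hypothetical placeholders.
from datetime import date, timedelta

import pandas as pd
from sqlalchemy import create_engine, text
from sklearn.linear_model import LogisticRegression

engine = create_engine("postgresql://user:pass@warehouse/analytics")

def load_training_data(days: int = 90) -> pd.DataFrame:
    """Pull recent labelled examples from the shared data store."""
    cutoff = date.today() - timedelta(days=days)
    query = text(
        "SELECT age, tenure_days, spend_30d, churned "
        "FROM labelled_customers WHERE snapshot_date >= :cutoff"
    )
    return pd.read_sql(query, engine, params={"cutoff": cutoff})

def retrain() -> LogisticRegression:
    """Batch retraining: fit a fresh model on recent labelled data."""
    df = load_training_data()
    X, y = df[["age", "tenure_days", "spend_30d"]], df["churned"]
    return LogisticRegression().fit(X, y)
```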

In most applications of ML, ground-truth data needs to be captured alongside the input data: the real-world outcomes are what turn raw inputs into labelled examples, so it is essential to record both and to be able to join them later.
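
One common pattern, sketched below, is to log each prediction with an identifier at inference time and then record the ground-truth outcome against the same identifier once it is known. This uses sqlite3 purely to keep the example self-contained; the schema and function names are illustrative assumptions.

```python
# A minimal sketch of capturing ground truth alongside model inputs.
# Predictions are logged with an ID at inference time; outcomes are
# recorded later against the same ID, so the two can be joined into
# labelled training data. sqlite3 keeps the example self-contained.
import json
import sqlite3
import uuid
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE predictions (id TEXT, features TEXT, "
             "prediction REAL, predicted_at TEXT)")
conn.execute("CREATE TABLE outcomes (id TEXT, actual REAL, observed_at TEXT)")

def log_prediction(features: dict, prediction: float) -> str:
    """Record the inputs and the model's output at inference time."""
    pid = str(uuid.uuid4())
    conn.execute("INSERT INTO predictions VALUES (?, ?, ?, ?)",
                 (pid, json.dumps(features), prediction,
                  datetime.now(timezone.utc).isoformat()))
    return pid

def log_outcome(pid: str, actual: float) -> None:
    """Record the ground-truth outcome once it becomes known."""
    conn.execute("INSERT INTO outcomes VALUES (?, ?, ?)",
                 (pid, actual, datetime.now(timezone.utc).isoformat()))

# Labelled training data is then the join of the two tables:
# SELECT p.features, o.actual FROM predictions p JOIN outcomes o ON p.id = o.id
```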

It is common to create data warehouses, data lakes or data lakehouses, and associated data pipelines, to store this data. Our Data Pipelines playbook covers our approach to providing this data.

The diagram below shows the two processes involved in building machine learning systems and the data they need to access (a code sketch of both processes follows the list):

  • An evaluation process that makes predictions (model scoring). This may be real-time.

  • A batch process that retrains the model based on fresh historical data.
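
Below is a minimal sketch of how these two processes might share a published model, assuming a simple file-based store in place of a real model registry; the paths and function names are illustrative:

```python
# A minimal sketch of the two processes above sharing a model artefact.
# In practice the model would live in a model registry and the data in
# a warehouse or lake; the paths and names here are illustrative.
import pickle
from pathlib import Path

MODEL_PATH = Path("models/latest.pkl")

def retrain_batch(load_training_data, train) -> None:
    """Batch process: retrain on fresh historical data, publish the model."""
    X, y = load_training_data()      # reads from the shared data store
    model = train(X, y)
    MODEL_PATH.parent.mkdir(parents=True, exist_ok=True)
    MODEL_PATH.write_bytes(pickle.dumps(model))

def score(features: list) -> float:
    """Evaluation process: load the published model and make a prediction.

    A real-time service would cache the model rather than reload it
    on every request."""
    model = pickle.loads(MODEL_PATH.read_bytes())
    return model.predict([features])[0]
```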
