Testing through the ML pipeline
An ML pipeline is a complex mixture of software (including complex mathematical procedures), infrastructure, and data storage, and we want to be able to test any changes rapidly before promoting them to production.
We have found the following test types to be valuable:
Contract testing - if the model is deployed as a microservice endpoint, then we should apply standard validation of the inputs it accepts and the outputs it returns; a sketch follows this list.
Unit testing - many key functions, such as data transformations or the mathematical functions within the ML model, are stateless and can easily be covered by unit tests (example below).
Infrastructure tests - e.g. checking that the Flask/FastAPI service hosting the model starts up and shuts down cleanly; see the sketch after this list.
‘ML smoke test’ - we have found it useful to test deployed models against a small set of known results. This can flush out a wide range of problems. We don’t recommend a large number - around five is usually right. For some types of model, e.g. regression models, the result will change every time the model is trained, so the test should check that the result is within bounds rather than matching a precise value. A sketch is included below.
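For contract testing, a minimal sketch using FastAPI’s test client; the `/predict` endpoint, payload shape, and schemas are illustrative assumptions rather than code from any particular service:

```python
# Contract test sketch: the endpoint, payload shape and schemas are
# illustrative assumptions, not taken from any particular service.
from fastapi import FastAPI
from fastapi.testclient import TestClient
from pydantic import BaseModel

app = FastAPI()  # in practice, import the real app instead of defining one here


class PredictRequest(BaseModel):
    feature_a: float
    feature_b: str


class PredictResponse(BaseModel):
    prediction: float
    model_version: str


@app.post("/predict", response_model=PredictResponse)
def predict(request: PredictRequest) -> PredictResponse:
    # Stand-in for the real model call.
    return PredictResponse(prediction=0.5, model_version="1.0.0")


client = TestClient(app)


def test_predict_returns_contracted_schema():
    # A well-formed request must return 200 and a body matching the response schema.
    response = client.post("/predict", json={"feature_a": 1.2, "feature_b": "red"})
    assert response.status_code == 200
    PredictResponse.model_validate(response.json())  # raises if the contract is broken


def test_predict_rejects_malformed_input():
    # Missing required fields should yield a validation error, not a prediction.
    response = client.post("/predict", json={"feature_a": 1.2})
    assert response.status_code == 422
```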
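Because the key transformations are stateless, plain pytest-style unit tests are usually enough. A sketch with a hypothetical `scale_features` function, shown only to illustrate the pattern:

```python
# Unit test sketch for a stateless transformation; scale_features is a
# hypothetical pure function used to illustrate the pattern.
import numpy as np
import pytest


def scale_features(x: np.ndarray, mean: float, std: float) -> np.ndarray:
    """Standardise a feature column given a precomputed mean and std."""
    if std == 0:
        raise ValueError("std must be non-zero")
    return (x - mean) / std


def test_scale_features_is_deterministic():
    x = np.array([1.0, 2.0, 3.0])
    result = scale_features(x, mean=2.0, std=1.0)
    np.testing.assert_allclose(result, np.array([-1.0, 0.0, 1.0]))


def test_scale_features_rejects_zero_std():
    with pytest.raises(ValueError):
        scale_features(np.array([1.0]), mean=0.0, std=0.0)
```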
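For the infrastructure tests, a sketch of checking that a FastAPI service starts up and shuts down cleanly; the lifespan logic here is a stand-in for real model loading:

```python
# Infrastructure test sketch: verifies that the app's startup and shutdown
# hooks run without error. The lifespan body is a placeholder for real
# model loading, not code from any particular service.
from contextlib import asynccontextmanager

from fastapi import FastAPI
from fastapi.testclient import TestClient


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: in a real service this would load the model artefact.
    app.state.model = object()
    yield
    # Shutdown: release the model / close connections.
    app.state.model = None


app = FastAPI(lifespan=lifespan)


@app.get("/health")
def health():
    return {"status": "ok", "model_loaded": app.state.model is not None}


def test_app_starts_and_shuts_down():
    # TestClient used as a context manager runs the startup and shutdown hooks.
    with TestClient(app) as client:
        response = client.get("/health")
        assert response.status_code == 200
        assert response.json()["model_loaded"] is True
    # After the context exits, shutdown has run and the model is released.
    assert app.state.model is None
```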
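For the ML smoke test, a sketch that checks around five known inputs against expected results within bounds; `load_model`, `predict_one`, and the example cases are hypothetical placeholders:

```python
# ML smoke test sketch: a handful of known cases, each checked against a
# tolerance rather than an exact value, since retraining shifts the outputs.
# load_model, predict_one and the example cases are hypothetical placeholders.
import pytest

from my_project.model import load_model  # hypothetical model loader

# (input features, expected prediction, allowed absolute tolerance)
SMOKE_CASES = [
    ({"age": 34, "income": 52_000}, 0.72, 0.10),
    ({"age": 61, "income": 18_000}, 0.31, 0.10),
    ({"age": 25, "income": 95_000}, 0.88, 0.10),
    ({"age": 47, "income": 41_000}, 0.55, 0.10),
    ({"age": 19, "income": 12_000}, 0.20, 0.10),
]


@pytest.mark.parametrize("features,expected,tolerance", SMOKE_CASES)
def test_model_smoke(features, expected, tolerance):
    model = load_model()
    prediction = model.predict_one(features)
    # Check the result is within bounds rather than an exact match.
    assert prediction == pytest.approx(expected, abs=tolerance)
```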
In addition to the tests above, which are typical for any complex piece of software, the performance of the model itself is critical to any machine learning solution. Model performance testing is undertaken by data scientists on an ad-hoc basis throughout the initial prototyping phase. Before a new model is released, you should validate that it performs at least as well as the existing one: test the new model against a known data set and compare its performance to a specified threshold or to previous versions, as sketched below.
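A sketch of such a performance gate, assuming hypothetical loaders for the reference dataset and the candidate and production models; the metric (AUC) and thresholds are illustrative:

```python
# Performance gate sketch: compares the candidate model to an absolute
# threshold and to the currently deployed model on a known, held-out dataset.
# load_reference_dataset, load_candidate_model and load_production_model are hypothetical.
from sklearn.metrics import roc_auc_score

from my_project.data import load_reference_dataset                          # hypothetical
from my_project.model import load_candidate_model, load_production_model    # hypothetical

MIN_AUC = 0.80  # illustrative absolute threshold


def test_candidate_model_meets_performance_bar():
    X, y = load_reference_dataset()

    candidate = load_candidate_model()
    candidate_auc = roc_auc_score(y, candidate.predict_proba(X)[:, 1])

    production = load_production_model()
    production_auc = roc_auc_score(y, production.predict_proba(X)[:, 1])

    # Must clear the absolute bar...
    assert candidate_auc >= MIN_AUC
    # ...and perform at least as well as the model it replaces (small slack for noise).
    assert candidate_auc >= production_auc - 0.01
```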
We don’t usually do load testing on our models as part of the CI/CD process. In a modern architecture load is typically handled by auto-scaling, so we usually monitor and alert rather than test. In some use cases, such as retail, where there are days of peak demand (e.g. Black Friday), load testing takes place as part of the overall system testing; a minimal sketch follows.
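Where that peak-demand load testing is needed, a minimal sketch using Locust against a hypothetical `/predict` endpoint (the payload and host are illustrative):

```python
# Load test sketch using Locust; the endpoint and payload are illustrative.
# Run with e.g.: locust -f locustfile.py --host https://model.example.com
from locust import HttpUser, between, task


class PredictionUser(HttpUser):
    # Each simulated user waits 1-3 seconds between requests.
    wait_time = between(1, 3)

    @task
    def predict(self):
        self.client.post(
            "/predict",
            json={"feature_a": 1.2, "feature_b": "red"},
        )
```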