A machine learning solution is fundamentally dependent on the data used to train it. To maintain and operate an ML solution, the data used to develop the model/algorithm must be available to the maintainers. They will need the data to monitor performance, validate that the model continues to perform well and find improvements. Furthermore, in many cases the algorithm is modelling an external world that is undergoing change, so the maintainers will want to update or retrain the model to reflect these changes and will need updated data to do so.
The data needs to be accessible to data science teams, and it also needs to be available to the automated processes that have been set up to retrain the model.
In most applications of ML, ground-truth data (the actual outcomes that predictions are compared against) needs to be captured alongside the input data, and it is essential to store these data points as well.
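As a rough illustration of capturing ground truth alongside inputs, the sketch below records each prediction under a request identifier and attaches the real outcome when it becomes known. The names `PredictionRecord`, `PredictionLog`, `request_id` and `basket_value` are hypothetical and chosen for this example only; a real system would write to a warehouse or lake rather than an in-memory dictionary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PredictionRecord:
    request_id: str
    features: dict
    prediction: float
    # Ground truth usually arrives later, once the real outcome is known.
    ground_truth: Optional[float] = None


class PredictionLog:
    """In-memory stand-in (an assumption) for a real data store."""

    def __init__(self):
        self._records = {}

    def log_prediction(self, request_id, features, prediction):
        # Store the model input together with what the model predicted.
        self._records[request_id] = PredictionRecord(request_id, features, prediction)

    def log_ground_truth(self, request_id, outcome):
        # Attach the observed outcome to the matching prediction, if any.
        record = self._records.get(request_id)
        if record is None:
            return False
        record.ground_truth = outcome
        return True

    def labelled_examples(self):
        # Only records with ground truth can be used for validation or retraining.
        return [
            (r.features, r.ground_truth)
            for r in self._records.values()
            if r.ground_truth is not None
        ]


log = PredictionLog()
log.log_prediction("req-1", {"basket_value": 42.0}, 0.8)
log.log_prediction("req-2", {"basket_value": 7.5}, 0.1)
log.log_ground_truth("req-1", 1.0)  # outcome for req-2 is not yet known
```

Keying both the input and the outcome on the same identifier is what makes it possible to join them later, even when the ground truth arrives days or weeks after the prediction.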
It is common to create data warehouses, data lakes or data lakehouses and associated data pipelines to store this data. Our data pipeline playbook covers our approach to providing this data.
The diagram below shows the two processes involved in building machine learning systems and the data they need to access:
An evaluation process that makes predictions (model scoring). This may be real-time.
A batch process that retrains the model, based on fresh historical data.
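The two processes can be sketched as a pair of functions that share one data store. This is a minimal illustration under stated assumptions, not a reference implementation: `MeanModel`, `score`, `retrain` and the list-based `store` are hypothetical names, and the toy model simply predicts the mean of the targets it has seen.

```python
from statistics import mean


class MeanModel:
    """Toy model (an assumption): predicts the mean of its training targets."""

    def __init__(self, value=0.0):
        self.value = value

    def predict(self, features):
        return self.value


def score(model, features, store):
    """Evaluation process: make a prediction and record the input for later use."""
    prediction = model.predict(features)
    store.append({"features": features, "prediction": prediction, "target": None})
    return prediction


def retrain(store):
    """Batch process: fit a fresh model on every record that has ground truth."""
    targets = [row["target"] for row in store if row["target"] is not None]
    if not targets:
        raise ValueError("no labelled data available for retraining")
    return MeanModel(mean(targets))


store = []  # stands in for the warehouse or lake both processes read and write
model = MeanModel(0.0)

# Scoring happens first, possibly in real time.
score(model, {"x": 1}, store)
score(model, {"x": 2}, store)

# Ground truth is filled in once the outcomes are known.
store[0]["target"] = 10.0
store[1]["target"] = 20.0

# The batch process then retrains on the fresh historical data.
model = retrain(store)
```

The point of the sketch is the data flow: the scoring process writes inputs and predictions, and the retraining process reads the same records once they have been labelled.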