A machine learning solution is fundamentally dependent on the data used to train it. To maintain and operate an ML solution, the data used to develop the model or algorithm must be available to the maintainers. They will need the data to monitor performance, validate continued performance and find improvements. Furthermore, in many cases the algorithm is modelling an external world that is undergoing change; the maintainers will want to update or retrain the model to reflect these changes, and so will need ongoing data updates.
The data needs to be accessible to data science teams, and it will also need to be made available to the automated processes that have been set up for retraining the model.
In most applications of ML, ground-truth data - the outcomes that were actually observed - needs to be captured alongside the input data; these data points are essential for evaluating and retraining the model.
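As a minimal sketch of what capturing ground truth alongside inputs might look like, the snippet below logs each prediction with an identifier so that the observed outcome can be joined to it later. The file-based JSONL store, function names and field names are illustrative assumptions rather than a prescribed design.

```python
import json
import uuid
from datetime import datetime, timezone

def log_prediction(features: dict, prediction: float, log_path: str = "predictions.jsonl") -> str:
    """Append a prediction record so it can later be joined with the observed outcome."""
    record = {
        "prediction_id": str(uuid.uuid4()),           # join key for the ground truth captured later
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "features": features,                         # the inputs the model actually saw
        "prediction": prediction,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record["prediction_id"]

def log_ground_truth(prediction_id: str, actual: float, log_path: str = "ground_truth.jsonl") -> None:
    """Record the observed outcome once it becomes known, keyed by prediction_id."""
    with open(log_path, "a") as f:
        f.write(json.dumps({"prediction_id": prediction_id, "actual": actual}) + "\n")
```

In practice these records would land in whatever warehouse or lake the team already uses; the important part is the join key that links each prediction to its eventual outcome.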
The diagram below shows the two processes involved in building machine learning systems and the data they need to access:
An evaluation process that makes predictions (model scoring). This may be real-time.
A batch process that retrains the model, based on fresh historical data.
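A minimal sketch of these two processes, assuming a scikit-learn model persisted with joblib and fresh historical data (including captured ground truth) available as a CSV extract; the file locations, column names and choice of estimator are placeholders.

```python
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

MODEL_PATH = "model.joblib"   # artefact location shared by both processes (placeholder)

def score(features: pd.DataFrame) -> pd.Series:
    """Evaluation process: load the current model and return predictions (may sit behind a real-time API)."""
    model = joblib.load(MODEL_PATH)
    return pd.Series(model.predict(features), index=features.index)

def retrain(history_path: str = "training_data.csv") -> None:
    """Batch process: refit the model on fresh historical data and publish the new artefact."""
    history = pd.read_csv(history_path)                       # inputs joined with captured ground truth
    X, y = history.drop(columns=["label"]), history["label"]  # "label" is an assumed ground-truth column
    model = LogisticRegression(max_iter=1000).fit(X, y)
    joblib.dump(model, MODEL_PATH)
```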
It is common to create data warehouses, data lakes or data lakehouses and associated data pipelines to store this data. Our data pipelines playbook covers our approach to providing this data.
We want to be able to amend how our machine learning models consume data and integrate with other business systems in an agile fashion, as the data environment, downstream IT services and needs of the business change. Just like any piece of working software, a machine learning solution should adopt continuous delivery practices to enable regular updates of those integrations in production. Teams should use Continuous Integration and Deployment (CI/CD) approaches, utilise Infrastructure as Code (Terraform, Ansible, Packer, etc.) and work in small batches to get fast feedback - which is key to keeping a continuous improvement mindset.
ML solutions differ from standard software delivery because, in addition to everything we normally monitor to check that the software is working correctly, we also want to know that the algorithm is performing as expected. In machine learning, performance is inherently tied to the accuracy of the model. Which measure of accuracy is the right one is a non-trivial question - we won't go into it here, except to say that the Data Scientists usually define an appropriate performance measure.
The performance of the algorithm should be evaluated throughout its lifecycle:
During development of the model - it is an inherent part of initial algorithm development to measure how well different approaches work, and to settle on the right way to measure performance.
At initial release - when the model has reached an acceptable level of performance, that performance should be recorded as a baseline and the model can be released into production.
In production - the algorithm performance should be monitored throughout the lifetime to detect if it has started performing badly as a result of data drift or concept drift.
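As an illustration of monitoring in production, the sketch below compares a live metric against the baseline recorded at release and flags degradation. Accuracy is used purely as a stand-in for whatever measure the data scientists have defined, and the baseline figure, threshold and alerting mechanism are assumptions.

```python
from sklearn.metrics import accuracy_score

BASELINE_ACCURACY = 0.87   # hypothetical figure recorded as the baseline at initial release
ALERT_THRESHOLD = 0.05     # tolerated drop below the baseline before we alert

def check_model_performance(y_true, y_pred) -> bool:
    """Compare live accuracy against the release baseline; return True if performance is still acceptable."""
    live_accuracy = accuracy_score(y_true, y_pred)
    degraded = live_accuracy < BASELINE_ACCURACY - ALERT_THRESHOLD
    if degraded:
        # In a real system this would raise an alert through your monitoring stack
        print(f"ALERT: live accuracy {live_accuracy:.3f} is below baseline {BASELINE_ACCURACY:.3f}")
    return not degraded
```

Run on a schedule against recently captured ground truth, a check like this gives early warning of data drift or concept drift.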
We believe that an ML service should be developed and treated as a product, meaning that you should apply the same behaviours and standards as you would when developing any other software product. These behaviours include:
Identify, profile and maintain an active relationship with the end-users of your ML service. Work with your users to identify requirements that feed into your development backlog, involve your users in validation of features and improvements, notify them of updates and outages, and in general, work to keep your users happy.
Maintain a roadmap of features and improvements. Continue to improve your service throughout its lifetime.
Provide good user documentation.
Actively test your service.
Capture the iterations of your service as versions and help users migrate to newer versions. Clearly define how long you will support versions of your service, and whether you will run old and new versions concurrently.
Understand how you will retire your service, or support users if you choose not to actively maintain it any longer.
Have an operability strategy for your service. Build in telemetry that is exposed through monitoring and alerting tools, so you know when things go wrong - see the sketch after this list. Use this data to gain an understanding of how your users actually use your service.
Define who is supporting your service and provide runbooks that help support recovery from outages.
Provide a mechanism for users to submit bugs and unexpected results, and work toward providing fixes for these in future releases.
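As a small sketch of the kind of telemetry that can be built in, the example below exposes a request counter and a latency histogram using the prometheus_client library. The metric names, port and placeholder predict function are illustrative assumptions, and any comparable monitoring stack would do.

```python
import time

from prometheus_client import Counter, Histogram, start_http_server

# Metrics exposed for scraping by a monitoring stack such as Prometheus/Grafana
PREDICTION_REQUESTS = Counter("prediction_requests_total", "Number of prediction requests served")
PREDICTION_LATENCY = Histogram("prediction_latency_seconds", "Time taken to produce a prediction")

@PREDICTION_LATENCY.time()
def predict(features: dict) -> float:
    """Placeholder prediction function instrumented with telemetry."""
    PREDICTION_REQUESTS.inc()
    return 0.0  # a real implementation would call the model here

if __name__ == "__main__":
    start_http_server(8000)   # metrics served at http://localhost:8000/metrics
    predict({"example": 1})
    time.sleep(60)            # keep the process alive briefly so the endpoint can be scraped
```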
Developing a machine learning model is a creative, experimental process. The data scientists need to explore the data and understand the features/fields in the data. They may choose to do some feature engineering - processing on those fields - perhaps creating aggregations such as averages over time, or combining different fields in ways that they believe will create a more powerful algorithm. At the same time they will be considering what the right algorithmic approach should be - selecting from their toolkit of classifiers, regressors, unsupervised approaches etc. and trying out different combinations of features and algorithms against the datasets they have been provided with. They need a set of tools to explore the data, create the models and then evaluate their performance.
Ideally, this environment should:
Provide access to required historical data sources (e.g. through a data warehouse or similar).
Provide tools such as notebooks to view and process the data.
Allow them to add additional data sources of their own choosing (e.g. in the form of CSV files) - see the sketch after this list.
Allow them to utilise their own tooling where possible, e.g. non-standard Python libraries.
Make collaboration with other data scientists easy e.g. provide shared storage or feature stores.
Have scalable resources depending on the size of the job (e.g. in AWS Sagemaker you can quickly specify a small instance or large GPU instance for deep learning).
Be able to surface the model for early feedback from users before full productionisation.
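The sketch below illustrates the first few points: pulling historical data from a shared warehouse and combining it with a data source the data scientist has supplied themselves. The connection URL, table, file and join key are all hypothetical; warehouse access would normally be provided by the platform team.

```python
import os

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical warehouse connection - credentials supplied by the platform team
engine = create_engine(os.environ["WAREHOUSE_URL"])   # e.g. "postgresql://user:pass@warehouse:5432/dwh"

# Historical data from the shared warehouse...
sales = pd.read_sql("SELECT * FROM sales WHERE sale_date >= '2023-01-01'", engine)

# ...combined with an extra data source the data scientist has brought along as a CSV file
regions = pd.read_csv("region_mapping.csv")

df = sales.merge(regions, on="region_code", how="left")   # hypothetical join key
print(df.describe())
```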
Some of the approaches we have successfully used are:
Development on a local machine with an IDE or notebook.
Development on a local machine, with deployment and testing in a local container, and running in a cloud environment.
Using cloud-first solutions such as AWS Sagemaker or GCP Colab.
Using dashboarding tools such as Streamlit and Dash to prototype and share models with end users.
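As an example of the last approach, a tool like Streamlit lets a data scientist wrap a model in a simple interface for early user feedback. The sketch below assumes a previously trained regression model saved as model.joblib and made-up input fields; it is a prototyping pattern, not a production deployment.

```python
# streamlit_app.py - run with: streamlit run streamlit_app.py
import joblib
import pandas as pd
import streamlit as st

st.title("Price estimate prototype")            # hypothetical example model

model = joblib.load("model.joblib")             # assumes a previously trained model artefact

mileage = st.number_input("Mileage", min_value=0, value=50_000)
age_years = st.number_input("Vehicle age (years)", min_value=0, value=5)

if st.button("Estimate price"):
    features = pd.DataFrame([{"mileage": mileage, "age_years": age_years}])
    st.metric("Estimated price", f"£{model.predict(features)[0]:,.0f}")
```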
Local development using an IDE may lead to better structured code than with a notebook, but make sure that the data is adequately protected (data with PII should not be handled in this way), and that the dependencies needed to run the model are understood and captured.
Developing and operating an ML solution requires a cross-functional mix of skills, typically including:
Platform/Machine Learning engineer(s) to provide the environment to host the model.
Data engineers to create the production data pipelines to retrain the model.
Data scientists to create and amend the model.
Software engineers to integrate the model into business systems (e.g. a webpage calling a model hosted as a microservice).
MLOps is easier if everyone has an idea of the concerns of the others. Data Scientists are typically strong at mathematics and statistics, and may not have strong software development skills. They are focused on algorithm performance and accuracy metrics. The various engineering disciplines are more concerned about testing, configuration control, logging, modularisation and paths to production (to name a few).
It is helpful if the engineers can provide clear ways of working to the data scientists early in the project. This will make it easier for the data scientists to deliver their models. How do they want the model/algorithm code delivered (probably not as a notebook)? What coding standards should they adhere to? How do you want them to log? What tests do you expect? Create a simple document and spend a session taking them through the development process that you have chosen. Engineers should recognise that the most pressing concern for data scientists is prototyping, experimentation and algorithm performance evaluation.
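To make "what tests do you expect" concrete, the sketch below shows the kind of pytest checks engineers might ask for around a feature-engineering function. The feature_engineering module, the add_rolling_average function and its behaviour are hypothetical examples of code a data scientist might deliver.

```python
# test_feature_engineering.py - run with: pytest
import pandas as pd
import pytest

from feature_engineering import add_rolling_average   # hypothetical module delivered by the data scientist

def test_rolling_average_adds_expected_column():
    df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]})
    result = add_rolling_average(df, column="value", window=2)
    assert "value_rolling_mean" in result.columns
    assert result["value_rolling_mean"].iloc[-1] == pytest.approx(3.5)

def test_rolling_average_rejects_missing_column():
    with pytest.raises(KeyError):
        add_rolling_average(pd.DataFrame({"other": [1.0]}), column="value", window=2)
```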
When the team forms, recognise that it is one team and organise yourselves accordingly. Backlogs and stand-ups should be owned by and include the whole team.
I started as a data scientist but quickly realised that if I wanted my work to be used I would need to take more interest in how models are deployed and used in production, which has led me to move into data engineering and ML Operations, and now this has become my passion! There are many things that I have learned during this transition.
In general, models are developed by data scientists. They have the maths and stats skills to understand the data and figure out which algorithms to use, whilst the data engineers deploy the models. New features can get added by either of these groups.
In my experience, data scientists usually need to improve their software development practices. They need to become familiar with the separation of environments (e.g. development, staging, production) and how code is promoted between these environments. I’m not saying they should become devops experts, but algorithms are software and if the code is bad or if it can’t be understood then it can’t be deployed or improved. Try to get your code out of the notebook early, and don’t wait for perfection before thinking about deployment. The more you delay moving into production, the more you end up with a bunch of notebooks that you don’t understand. Right now I’m working with a great data scientist and she follows the best practice of developing the code in Jupyter Notebooks, and then extracts the key functionality into libraries which can be easily deployed.
For data engineers - find time to pair with data scientists and share best dev practices with them. Recognise that data science code is weird in many respects - lots of stuff is done with Data Frames or similar structures, and will look strange compared to traditional application programming. It will probably be an easier experience working with the data scientists if you understand that they will be familiar with the latest libraries and papers in Machine Learning, but not with the latest software dev practices. They should look to you to provide guidance on this - try to provide it in a way that recognises their expertise!
Matteo Guzzo, Data specialist
Equal Experts, EU
As the lead data scientist in a recent project, my role was to create an algorithm to estimate prices for used vehicles. There was an intense initial period where I had to understand the raw data, prototype the data pipelines and then create and evaluate different algorithms for pricing. It was a really intense time and my focus was very much on data cleaning, exploration and maths for the models.
We worked as a cross-functional team with a data engineer, UX designer and two user interface developers. We had shared stand-ups, and the data engineering, machine learning and user experience were worked on in parallel. I worked closely with our data engineer to develop the best way to deploy the ETL and model training scripts as data pipelines and APIs. He also created a great CI/CD environment and set up the formal ways of working in this environment, including how code changes in git should be handled and other coding practices to adopt. He paired with me on some of the initial deployments so I got up to speed quickly in creating production-ready code. As a data scientist I know there are 100 different ways someone can set up the build process - and I honestly don't have any opinion on which is the right way. I care about the performance of my model! I really appreciated working together on it - that initial pairing meant that we were able to bring all our work together very quickly and has supported the iteration of the tool since then.
Adam Fletcher, Data scientist
Equal Experts, UK