Provide an environment that allows data scientists to create and test models
Developing a machine learning model is a creative, experimental process. Data scientists need to explore the data and understand the features (fields) it contains. They may choose to do some feature engineering - processing those fields - perhaps creating aggregations such as averages over time, or combining different fields in ways they believe will produce a more powerful model. At the same time they will be considering the right algorithmic approach, selecting from their toolkit of classifiers, regressors, unsupervised methods and so on, and trying different combinations of features and algorithms against the datasets they have been provided with. They need a set of tools to explore the data, create the models and then evaluate their performance.
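As a rough illustration of that loop, the sketch below uses pandas and scikit-learn to engineer a few per-customer features and then compare two candidate classifiers on them. The file names, column names and choice of models are assumptions for illustration, not taken from a real project.

```python
# A minimal sketch of the explore / engineer / evaluate loop described above.
# The files and columns (customer_id, order_date, order_value, churned) are
# hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Feature engineering: aggregate raw order rows into per-customer features,
# e.g. averages over time and simple combinations of fields.
features = (
    orders
    .groupby("customer_id")
    .agg(
        avg_order_value=("order_value", "mean"),
        order_count=("order_value", "count"),
        days_active=("order_date", lambda d: (d.max() - d.min()).days),
    )
    .reset_index()
)
features["value_per_day"] = features["avg_order_value"] / features["days_active"].clip(lower=1)

labels = pd.read_csv("churn_labels.csv")  # customer_id, churned (0/1)
data = features.merge(labels, on="customer_id")
X, y = data.drop(columns=["customer_id", "churned"]), data["churned"]

# Try different algorithms from the toolkit and compare them on the same features.
for name, model in [
    ("logistic_regression", LogisticRegression(max_iter=1000)),
    ("random_forest", RandomForestClassifier(n_estimators=200)),
]:
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean ROC AUC = {scores.mean():.3f}")
```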
Ideally, this environment should:
Provide access to required historical data sources (e.g. through a data warehouse or similar).
Provide tools such as notebooks to view and process the data.
Allow them to add additional data sources of their own choosing (e.g. in the form of CSV files).
Allow them to utilise their own tooling where possible, e.g. non-standard Python libraries.
Make collaboration with other data scientists easy, e.g. by providing shared storage or feature stores (see the first sketch after this list).
Offer scalable resources depending on the size of the job (e.g. in AWS SageMaker you can quickly specify a small instance or a large GPU instance for deep learning; see the second sketch after this list).
Be able to surface the model for early feedback from users before full productionisation.
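On the collaboration point, one simple pattern is to write engineered features to a shared, versioned location so colleagues can reuse them without re-running the pipeline. The sketch below assumes pandas with s3fs available; the bucket path and file names are placeholders.

```python
# A minimal sketch of sharing engineered features via shared storage.
# Assumes pandas with s3fs installed; the S3 paths are placeholders.
import pandas as pd

features = pd.read_csv("engineered_features.csv")  # output of the feature engineering step

# Write to a shared, versioned location so colleagues can reuse the same features.
features.to_parquet("s3://team-feature-store/churn/features_v1.parquet", index=False)

# A colleague can load exactly the same feature set without re-running the pipeline.
shared = pd.read_parquet("s3://team-feature-store/churn/features_v1.parquet")
```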
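As a concrete illustration of sizing compute per job, the sketch below uses the SageMaker Python SDK; the execution role, script name, framework versions and S3 paths are placeholders rather than values from a real project.

```python
# A minimal sketch of choosing instance size per job with the SageMaker Python SDK.
# The role ARN, entry point script, versions and S3 paths are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    framework_version="2.1",
    py_version="py310",
    instance_count=1,
    # A small CPU instance for quick experiments; swap in a GPU instance
    # such as "ml.p3.2xlarge" when training a larger deep learning model.
    instance_type="ml.m5.xlarge",
)

estimator.fit({"training": "s3://my-bucket/training-data/"})
```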
Some of the approaches we have successfully used are:
Development on a local machine with an IDE or notebook.
Development on a local machine, with deployment and testing in a local container, then running in a cloud environment.
Using cloud-first solutions such as AWS SageMaker or Google Colab.
Using dashboarding tools such as Streamlit and Dash to prototype and share models with end users (a minimal Streamlit sketch follows this list).
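The sketch below shows the kind of lightweight Streamlit app we mean: a saved model wrapped in a few interactive inputs so end users can try it out before any production deployment. The model file and feature names are placeholder assumptions.

```python
# A minimal Streamlit sketch for putting a prototype model in front of users.
# The model file name and feature names are placeholder assumptions.
import pandas as pd
import streamlit as st
from joblib import load

model = load("model.joblib")  # a previously trained scikit-learn model

st.title("Churn risk prototype")
avg_order_value = st.slider("Average order value", 0.0, 500.0, 50.0)
order_count = st.number_input("Number of orders", min_value=0, value=5)

features = pd.DataFrame([{"avg_order_value": avg_order_value, "order_count": order_count}])
st.write("Predicted churn probability:", float(model.predict_proba(features)[0, 1]))
```

Running it with `streamlit run app.py` gives users an interactive view of the model's behaviour and surfaces feedback well before full productionisation.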
Local development using an IDE may lead to better-structured code than a notebook, but make sure that the data is adequately protected (data containing PII should not be handled on a local machine), and that the dependencies needed to run the model are understood and captured.