Measure and proactively evaluate quality of training data

ML models are only as good as the data they’re trained on.

In fact, the quality (and quantity) of training data is often a much bigger determinant of your ML project’s success than the sophistication of the model you choose. Or to put it another way: it often pays more to go and get better training data for a simple model than to spend time searching for a better model that only uses the data you already have.

To do this deliberately, you should be constantly evaluating the quality of your training data.

You should try to:

  • Identify and address class imbalance (i.e. find ‘categories’ that are underrepresented); see the first sketch after this list.

  • Actively create more training data if you need it (buy it, crowdsource it, or use techniques like image augmentation to derive more samples from the data you already have); an augmentation sketch follows the list.

  • Identify the statistical properties of variables in your data, and the correlations between them, so you can spot outliers and training samples that look wrong (sketched with pandas after this list).

  • Have processes (even manual random inspection!) that check for bad or mislabelled training samples. Visual inspection of samples by humans is a good, simple technique for visual and audio data; a small sampling sketch follows the list.

  • Verify that the distributions of variables in your training data accurately reflect real life. Depending on the nature of your modelling, it’s also useful to know when parts of your model rely on assumptions or beliefs (“priors”), for example the assumption that some variable follows a certain statistical distribution. Test these beliefs against reality regularly, because reality changes! (A distribution-drift check is sketched after this list.)

  • Find classes of input that your model does badly on (and whose poor performance might be hidden by good overall “evaluation scores” computed over the whole test set). Try to supplement your training data with more samples from these categories to improve performance; see the per-class report sketched after this list.

  • Ideally, you should also benchmark performance against your dataset rather than aim for metrics that are ‘as high as possible’. What is a reasonable expectation for accuracy at human level, or expert level? If human experts can only achieve 70% accuracy against the data you have, a model that achieves 75% accuracy is a terrific result! Quantitative benchmarks against your data let you know when to stop looking for better models and when to start shipping your product. (A small benchmark comparison is sketched below.)
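
To make the class-imbalance check concrete, here is a minimal sketch in Python; the label names and the 5% threshold are illustrative assumptions, not recommendations from the text above.

```python
from collections import Counter

def find_underrepresented_classes(labels, min_fraction=0.05):
    """Return the classes whose share of the dataset falls below min_fraction."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: count / total
            for cls, count in counts.items()
            if count / total < min_fraction}

# Example: 'refund' makes up only 3% of the labels, so it gets flagged.
labels = ["purchase"] * 70 + ["browse"] * 27 + ["refund"] * 3
print(find_underrepresented_classes(labels))  # {'refund': 0.03}
```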
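
For deriving extra samples from the data you already have, the sketch below augments images (represented as NumPy arrays) with simple flips and rotations; in a real project you might instead reach for a dedicated augmentation library.

```python
import numpy as np

def augment_image(image):
    """Derive extra training samples from one image via flips and rotations."""
    return [
        np.fliplr(image),      # horizontal flip
        np.flipud(image),      # vertical flip
        np.rot90(image, k=1),  # rotate 90 degrees
        np.rot90(image, k=2),  # rotate 180 degrees
    ]

# A random 32x32 RGB array stands in for a real training image.
image = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)
print(f"1 original image -> {len(augment_image(image))} augmented variants")
```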
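
For the statistical-properties point, one possible approach with pandas: per-variable summary statistics, pairwise correlations, and an interquartile-range rule to flag outlying rows. The columns and values are made up for illustration.

```python
import pandas as pd

# Toy numeric training data; the age of 95 is deliberately suspicious.
df = pd.DataFrame({
    "age":    [34, 29, 41, 38, 30, 27, 95],
    "income": [52_000, 48_000, 61_000, 58_000, 50_000, 45_000, 51_000],
})

print(df.describe())  # mean, std, quartiles, etc. for each variable
print(df.corr())      # pairwise correlations between variables

# Flag rows that fall outside 1.5 * IQR of any column.
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
mask = ((df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)).any(axis=1)
print(df[mask])       # the row with age 95 is flagged for review
```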
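
For manual random inspection, even a few lines can turn spot-checking into a repeatable process; the file names and labels below are invented purely for illustration.

```python
import random

random.seed(42)  # reproducible sample for a given review session

# Toy labelled dataset; in practice these would be real files or records.
dataset = [(f"clip_{i:04d}.wav", random.choice(["speech", "music"]))
           for i in range(5000)]

# Draw a small random sample for a human to check the labels by ear.
for filename, label in random.sample(dataset, k=20):
    print(f"{filename}\tlabelled as: {label}")
```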
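
For the distribution check, a two-sample Kolmogorov–Smirnov test is one way to ask whether the training data still matches what the model sees in production; the synthetic “drifted” data below exists only to show the mechanics.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# The training data assumed this variable was roughly normal around 35...
training_ages = rng.normal(loc=35, scale=5, size=5_000)

# ...but recent production data has drifted older.
production_ages = rng.normal(loc=42, scale=7, size=5_000)

result = ks_2samp(training_ages, production_ages)
if result.pvalue < 0.01:
    print(f"Distributions differ (KS statistic = {result.statistic:.3f}); "
          "the training data no longer reflects reality.")
```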
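
For finding classes the model does badly on, a per-class breakdown makes the problem visible even when overall accuracy looks healthy; the labels and predictions below are fabricated so the example is self-contained.

```python
from sklearn.metrics import classification_report

# Overall accuracy here is 90%, which sounds respectable...
y_true = ["cat"] * 45 + ["dog"] * 45 + ["fox"] * 10
y_pred = ["cat"] * 45 + ["dog"] * 43 + ["cat"] * 2 + ["dog"] * 8 + ["fox"] * 2

# ...but the per-class report shows recall on "fox" is only 0.20.
print(classification_report(y_true, y_pred))
```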
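
Finally, for benchmarking against human-level performance, the sketch below compares a model’s accuracy with a human expert’s accuracy on the same gold labels, reusing the 70%/75% figures from the last bullet; the synthetic labels exist only to make the example runnable.

```python
import numpy as np

rng = np.random.default_rng(1)

# Gold-standard labels, plus labels from a human expert and from the model.
gold = rng.integers(0, 2, size=1_000)
human_expert = np.where(rng.random(1_000) < 0.70, gold, 1 - gold)  # ~70% correct
model        = np.where(rng.random(1_000) < 0.75, gold, 1 - gold)  # ~75% correct

human_accuracy = (human_expert == gold).mean()
model_accuracy = (model == gold).mean()
print(f"human expert baseline: {human_accuracy:.1%}")
print(f"model:                 {model_accuracy:.1%}")
print(f"model beats the human benchmark: {model_accuracy > human_accuracy}")
```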
