What’s the best way of improving a machine learning model?
There are plenty of tricks in the field of Machine Learning (ML) for squeezing more performance out of a model, from regularizing the parameters to using ensembles. But arguably the simplest and most effective technique has to do with the data. No, not the distribution of classes or the quantity of examples (though these are important): the first thing to examine in the hunt for better performance is the quality of the labels.
Noisy or incorrect labels damage model performance like little else. On the flip side, improving these labels can make the difference between a weak model and a strong one.
That’s why at Henesis we invest great effort into understanding our data, visualising them, and tweaking and enriching them. We’ve even developed powerful in-house software for adding or fixing the labels of these data, both automatically and manually, on images and time series. The benefit has been huge: our models now predict far more effectively than if we had just used the original labels.