There’s no doubt that increases in computational power and model complexity continue to benefit machine learning (ML). But in spite of these advances, both the size and the quality of the training datasets fed to models have remained stagnant. This stagnation has capped ML’s ability to reach its promised heights in real-world use cases.
The importance of training data will be familiar to any data scientist who has worked to improve a model’s accuracy, and so will the frustration of being held back by the data itself. It doesn’t have to be this way. By providing your model with plenty of the highest-quality labeled training data, you can build and maintain an accurate model capable of high performance in the real world.
When building new, more complex ML models in an academic context, your energy goes into tweaking the model itself. But, as discussed in our previous post, if you carried that model through to production, you’d find that the carefully built model gets you only 5%¹ of the way there. A production-ready model needs to perform well on real-world data, and for that it needs to be trained on heaps of high-quality training data that closely reflects what it will face in production.
Seldom do publicly available datasets provide the size, the quality and, perhaps most importantly, the specificity necessary to ensure good performance. And when such a dataset does come close (ImageNet, for example, comprises over 14 million hand-annotated images), its very availability undercuts the competitive advantage that high-quality training data brings: a dataset everyone can use gives no one an edge.
What’s ideal is not just huge, high-quality training datasets, but huge, high-quality training datasets that are tailored to your use case. When training data is calibrated to reflect the real-world problem you are solving, your model will be better equipped to handle relevant edge cases.
Let’s be clear: better data does not always mean more data. In other words, adding more data to your training dataset will not necessarily result in a better model. The key to improving your model’s accuracy lies in reducing its bias and variance, and depending on which of the two is the problem, your training data will need different types of improvements. But first, what are bias and variance?
Bias is the systematic difference between a datapoint’s true value and the value your model predicts for it. The further apart these two values sit, on average, the higher your model’s bias. Generally, overly simple models are the ones susceptible to bias: a model suffering from high bias ignores important details lurking in the data and gets its predictions wrong in a consistent way.
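For the mathematically inclined, the standard statistical definition makes this precise: for a model $\hat{f}$ fit on a randomly drawn training set, the bias at a point $x$ is the gap between the model’s average prediction (averaged over the training sets it could have been fit on) and the true value $f(x)$:

$$\mathrm{Bias}\big[\hat{f}(x)\big] = \mathbb{E}\big[\hat{f}(x)\big] - f(x)$$

A high-bias model is wrong even after the luck of any particular training draw has been averaged away.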
Variance measures how much a model’s predictions swing depending on the particular data it was trained on. The telltale symptom is a gap between the performance a model demonstrates on its training dataset and on its test set: if a model performs well on its training set but displays inconsistent accuracy on test sets, it likely suffers from high variance. Overly complex models tend to show high variance. The model fits itself too tightly to the training data, which cripples its ability to generalize and leads to poor performance on any data that isn’t closely aligned with the training data.
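To see both failure modes side by side, here is a minimal runnable sketch; the synthetic dataset, the scikit-learn models, and the polynomial degrees are illustrative choices of ours, not a prescription. An overly simple and an overly complex model are fit to the same noisy data, and training error is compared with test error:

```python
# A minimal sketch of diagnosing bias vs. variance with scikit-learn.
# The synthetic dataset and polynomial degrees are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)  # noisy nonlinear target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree, label in [(1, "too simple, prone to high bias"),
                      (15, "too complex, prone to high variance")]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    # High bias: train and test error are both high.
    # High variance: train error is low, test error is much higher.
    print(f"degree={degree:2d} ({label}): "
          f"train MSE={train_err:.3f}, test MSE={test_err:.3f}")
```

The degree-1 model misses the curve entirely, so both errors come out high; the degree-15 model all but memorizes the training points, so its training error is tiny while its test error is typically much larger.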
We will be exploring bias and variance more deeply in future posts, so stay tuned.
If the key to an accurate model lies in minimizing both bias and variance, how can you improve your training data to meet this challenge?
If your model suffers from a bad case of high variance, the solution’s simple: feed it more training data. Your model is picking up a pattern, but the pattern it has latched onto is too specific and rigid. More data allows your model to identify the true underlying pattern and improve its accuracy across the board. Do note, however, that the new data still needs to be accurately labeled: think “garbage in, garbage out”.
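Before collecting more data, it’s worth checking that more data will actually help. A quick diagnostic is a learning curve: train on progressively larger slices of your dataset and watch the gap between training and validation performance. A minimal sketch with scikit-learn, where the digits dataset and random forest are stand-ins for your own data and model:

```python
# A minimal learning-curve sketch: if the validation score keeps climbing
# toward the training score as the training set grows, more data is likely
# to help. The dataset and model here are illustrative stand-ins.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(train_sizes, train_scores.mean(axis=1),
                     val_scores.mean(axis=1)):
    # A large but shrinking train/validation gap is the signature of high
    # variance that additional training data can still reduce.
    print(f"n={n:4d}: train={tr:.3f}, validation={va:.3f}")
```

If the validation score is still climbing toward the training score at the largest slice, more data is likely to keep paying off; if the two curves have already converged, your effort is better spent elsewhere.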
If it’s high bias that affects your model, the remedy is more refined. While a larger training dataset might prove effective, often you can only address the issue through improvements to your existing dataset: correcting mislabeled datapoints, labeling at a finer granularity, or enriching each example with more informative features and annotations, to name just a few.
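As a toy illustration of that last idea (enriching examples with a more informative feature), here is a sketch in which the target depends on two signals but the initial dataset records only one of them; every name and number below is invented for illustration:

```python
# A toy sketch, with every name and number invented for illustration:
# the target depends on two signals, but the initial dataset records only
# one of them, so a linear model underfits (high bias). Enriching each
# example with the second signal lowers train and test error together.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
y = 2.0 * x1 + 3.0 * x2 + rng.normal(scale=0.1, size=500)

for name, X in [("x1 only (impoverished dataset)", x1.reshape(-1, 1)),
                ("x1 + x2 (enriched dataset)", np.column_stack([x1, x2]))]:
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: train MSE={train_err:.3f}, test MSE={test_err:.3f}")
```

With only x1 recorded, the linear model underfits and both errors stay high; once x2 is added, both drop together, the signature of reduced bias.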
We’ll be exploring these and more methods, along with how to execute them, in a future post.
No matter what field your machine learning model is applied to, and no matter how general or specialized its task, the quality of your training data is the single most significant factor in raising your model’s accuracy to production-ready levels. As long as a further improvement in accuracy serves your users, any time you spend improving your dataset is time well spent.
If you want to refine the way you turn your raw data into highly accurate labeled training data, you can test one or more of our data labeling products for free using your own data.