Machine learning (ML) models for academic papers follow a set pattern: there is a fixed dataset, a model is trained, then you write a paper explaining the results. This is the end in academia, but just the beginning in the real world. This reality has profound impacts on the lifecycle of an ML model in production.
This is the sixth in a series of seven posts dissecting a 2015 research paper, Hidden Technical Debt in Machine Learning Systems, and its implications when using ML to solve real-world problems. Any block quotes throughout this piece come from this paper.
In this post, I look at how the fundamental paradigm shift I explored in the first post in this series—that the algorithms ML models learn are determined by input data, rather than by a programmer’s handwritten rules—has profound system-level performance effects in ML.
ML system updates often result in changes to the units of a dataset or the frequency at which the data is collected. If a system fails, there could even be missing input data.
These issues, while small in themselves, have a large impact on ML model performance, because the model learns its algorithms from this input data.
The first and easiest way to account for data changes is to create simple, deterministic data tests. Similar to how unit tests or integration tests monitor changes in code, data tests monitor changes in input data: looking for missing values, checking that the data uses the expected units, verifying the collection frequency, and so on.
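As a minimal sketch of what such tests might look like, the snippet below checks a hypothetical batch of data for missing values, physically plausible ranges (a crude proxy for correct units), and gaps in collection frequency. The column names and thresholds are placeholders, not part of any particular library.

```python
import pandas as pd

def run_data_tests(df: pd.DataFrame) -> list[str]:
    """Run simple, deterministic checks on a batch of input data.

    Returns a list of human-readable failure messages (empty if all checks pass).
    """
    failures = []

    # 1. Required columns exist and contain no missing values.
    for col in ["temperature_c", "purchase_amount"]:
        if col not in df.columns:
            failures.append(f"missing column: {col}")
        elif df[col].isna().any():
            failures.append(f"{col} has {int(df[col].isna().sum())} missing values")

    # 2. Values fall inside a physically plausible range. Temperatures recorded
    #    in Fahrenheit or Kelvin by mistake would land outside this window.
    if "temperature_c" in df.columns and not df["temperature_c"].between(-60, 60).all():
        failures.append("temperature_c outside the expected Celsius range")

    # 3. Data arrived at the expected frequency (here, no gap over five minutes).
    if "timestamp" in df.columns:
        gaps = pd.to_datetime(df["timestamp"]).sort_values().diff().dropna()
        if (gaps > pd.Timedelta(minutes=5)).any():
            failures.append("gap of more than 5 minutes between records")

    return failures

if __name__ == "__main__":
    batch = pd.DataFrame({
        "timestamp": pd.date_range("2023-01-01", periods=5, freq="min"),
        "temperature_c": [21.3, 22.1, None, 20.8, 150.0],  # one missing value, one suspicious unit
        "purchase_amount": [10.0, 12.5, 9.9, 11.2, 10.7],
    })
    for message in run_data_tests(batch):
        print("DATA TEST FAILED:", message)
```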
If you train a model on dataset A and test it on dataset B, and dataset B has the same input distribution as dataset A, the model's performance on the two will be very similar. If datasets A and B do not have the same distribution of input data, however, the model can give very different results.
It is essential to detect when the distribution of your input data changes. Such changes could occur if you start collecting data from a different geographical location, new social trends emerge, new customers start using your product, etc. If the input data changes for any reason, your ML model will perform differently than you expect.
It is essential to ensure your test data and production data have the same distribution of inputs.
Imagine you train your model on a training dataset (the purple distribution on the left plot of figure 1 below) that has a different distribution than your production data (the green distribution on the left plot of figure 1). The model learned from the production data (the green line on the right plot of figure 1) is different from the model learned from the training data (the purple line on the right plot of figure 1). Two different input distributions yield two different behaviors. Here, you think your model is performing like the purple line, but in reality it’s performing like the green line. In the real world, this poses a significant problem.
In this situation, you should retrain your model using the production data distribution (the green distribution). But as a first step, it is important to detect this dataset difference. This is called detecting covariate shift.
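One simple way to detect it is sketched below: run a two-sample Kolmogorov–Smirnov test on each feature, comparing the training data against a recent window of production data. The significance threshold is an illustrative choice, and other two-sample tests or distance measures would work just as well.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_covariate_shift(train_x: np.ndarray, prod_x: np.ndarray,
                           p_threshold: float = 0.01) -> list[int]:
    """Flag features whose training and production distributions differ.

    Runs a two-sample Kolmogorov-Smirnov test per feature and returns the
    indices of features where the test rejects the hypothesis that both
    samples come from the same distribution.
    """
    shifted = []
    for j in range(train_x.shape[1]):
        _, p_value = ks_2samp(train_x[:, j], prod_x[:, j])
        if p_value < p_threshold:
            shifted.append(j)
    return shifted

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.normal(loc=0.0, scale=1.0, size=(5000, 3))
    prod = train.copy()
    prod[:, 1] += 1.5  # simulate a shift in the second feature only
    print("shifted feature indices:", detect_covariate_shift(train, prod))
```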
Use covariate shift adaptation [1]. This is an efficient ML technique that reweights or resamples datasets so that their distributions are again equivalent. When you detect a covariate shift in your data, you can apply covariate shift adaptation to your training and test data so that they again match the distribution of inputs in the real world.
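The reference covers a whole family of methods; the sketch below shows one common, simplified flavour of the idea rather than the specific estimators from the book: estimate importance weights with a domain classifier (a logistic regression that distinguishes training rows from production rows) and pass those weights to the model when refitting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def covariate_shift_weights(train_x: np.ndarray, prod_x: np.ndarray) -> np.ndarray:
    """Estimate importance weights w(x) that approximate p_prod(x) / p_train(x).

    A domain classifier is trained to separate training rows (label 0) from
    production rows (label 1); its predicted odds approximate the density ratio.
    """
    X = np.vstack([train_x, prod_x])
    y = np.concatenate([np.zeros(len(train_x)), np.ones(len(prod_x))])

    domain_clf = LogisticRegression(max_iter=1000).fit(X, y)
    p_prod = domain_clf.predict_proba(train_x)[:, 1]      # P(row came from production | x)
    weights = p_prod / np.clip(1.0 - p_prod, 1e-6, None)  # odds approximate the density ratio

    # Correct for unequal sample sizes and normalise to mean 1 for stability.
    weights *= len(train_x) / len(prod_x)
    return weights / weights.mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.normal(0.0, 1.0, size=(2000, 2))
    prod = rng.normal(0.7, 1.0, size=(2000, 2))  # production inputs have drifted
    w = covariate_shift_weights(train, prod)
    # Pass `w` as `sample_weight` when re-fitting your model on the training data,
    # e.g. model.fit(train, labels, sample_weight=w).
    print("weight range:", round(float(w.min()), 3), "to", round(float(w.max()), 3))
```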
In a system that is working as intended, it should usually be the case that the distribution of predicted labels is equal to the distribution of observed labels.
This means that if the test set that you used to assess the performance of your classifier yielded 30% of predictions ending up in class a and 70% in class b, then, when you run the model in production, you would expect to see a prediction of class a 30% of the time, and a prediction of class b 70% of the time as well.
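A minimal sketch of this check, assuming you log both the model's predictions and the labels that are eventually observed, might look like the following; the function name and report format are purely illustrative.

```python
import numpy as np

def prediction_bias(predicted: np.ndarray, observed: np.ndarray) -> dict:
    """Compare how often each class is predicted vs. how often it is observed."""
    report = {}
    for cls in np.union1d(predicted, observed):
        pred_rate = float(np.mean(predicted == cls))
        obs_rate = float(np.mean(observed == cls))
        report[str(cls)] = {
            "predicted": pred_rate,
            "observed": obs_rate,
            "bias": pred_rate - obs_rate,
        }
    return report

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 30% of predictions land in class "a" and 70% in class "b", but the
    # observed labels have drifted to roughly a 50/50 split.
    preds = rng.choice(["a", "b"], p=[0.3, 0.7], size=10_000)
    obs = rng.choice(["a", "b"], p=[0.5, 0.5], size=10_000)
    for cls, stats in prediction_bias(preds, obs).items():
        print(cls, stats)
```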
If your predicted labels do not match your observed labels, it usually indicates a problem that requires attention. For example, the behavior of the world may have changed, so that training distributions drawn from historical data no longer reflect current reality.
Detect prediction shift using one of these two methods:
Slicing prediction bias by various dimensions isolates issues quickly and can also be used for automated alerting.
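As an illustration, here is what slicing might look like with a pandas DataFrame of logged predictions and observed outcomes. The column names, the hypothetical country slicing dimension, and the five-percentage-point alert threshold are all placeholders you would adapt to your own system.

```python
import pandas as pd

def sliced_prediction_bias(df: pd.DataFrame, slice_col: str,
                           threshold: float = 0.05) -> pd.DataFrame:
    """Compute prediction bias per slice and flag slices that exceed a threshold.

    Expects 0/1 columns `predicted` and `observed`, plus a slicing column
    such as country, device type, or customer segment.
    """
    report = df.groupby(slice_col).agg(
        predicted_rate=("predicted", "mean"),
        observed_rate=("observed", "mean"),
        n=("predicted", "size"),
    )
    report["bias"] = report["predicted_rate"] - report["observed_rate"]
    report["alert"] = report["bias"].abs() > threshold
    return report

if __name__ == "__main__":
    logs = pd.DataFrame({
        "country":   ["US"] * 4 + ["DE"] * 4,
        "predicted": [1, 0, 1, 0, 1, 1, 1, 0],
        "observed":  [1, 0, 1, 0, 0, 0, 1, 0],
    })
    report = sliced_prediction_bias(logs, "country")
    print(report)
    if report["alert"].any():
        flagged = list(report.index[report["alert"]])
        print("ALERT: prediction bias exceeds threshold for:", flagged)
```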
Detecting prediction shift is very simple and in practice extremely useful.
Shifting input data can quickly lead an ML model astray. Learning how to monitor for changes in your input data is an essential part of maintaining an accurate system in production.
The next post in this series continues to explore the effects of messy real-world data on ML systems in production. I’m going to be looking at calibrating ML models automatically, setting prediction limits, and central monitoring and alert systems.
[1] M. Sugiyama and M. Kawanabe, Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation, The MIT Press, 2012.