Implementing machine learning (ML) in the real world can quickly create a complex web of interrelated systems. Any change you make, even a simple, seemingly obvious improvement, can spiral out of control through a chain of unintended consequences. Being aware of how ML models interact can go a long way toward avoiding the second costly surprise of machine learning: changing anything changes everything.
This is the third in a series of posts dissecting a 2015 research paper, Hidden Technical Debt in Machine Learning Systems, and its implications when using ML to solve real-world problems. Any block quotes throughout this piece are from this paper.
Our last post in this series explored abstraction and how machine learning’s leaky abstraction demands a deeper understanding of a number of principles to avoid costly surprises.
The first principle we will discuss is a common surprise and something important to understand: no inputs into an ML model are independent. ML models entangle and mix inputs. If you change anything in the input, you change the whole system. This is known as the CACE (pronounced “cake”) principle: changing anything changes everything in ML systems.
For example, suppose an ML model has inputs x1, x2, and x3. If we exchange input x2 with x4 and retrain the ML model (either in batch mode or online), we might expect the model to only update a small part of itself relating to x2. But no… if we do this, the entire model changes. Now suppose instead of exchanging features, we add the feature x4 to x1, x2, and x3 and retrain. We may expect the original part of the model to remain fixed, while the new part of the model changes. But, again, no… The entire model changes.
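The effect of adding x4 can be reproduced with a small sketch. Everything here is an assumption made for illustration: the data is synthetic, x4 is constructed to be correlated with x2, and the "model" is an ordinary least squares fit in NumPy rather than any particular production system.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Three original inputs; x4 is a hypothetical new feature correlated with x2.
x1, x2, x3 = rng.normal(size=(3, n))
x4 = 0.8 * x2 + 0.2 * rng.normal(size=n)
y = 1.0 * x1 + 2.0 * x2 - 1.0 * x3 + 1.5 * x4 + 0.1 * rng.normal(size=n)

def fit(*features):
    """Ordinary least squares fit; returns one weight per feature."""
    X = np.column_stack(features)
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights

w_before = fit(x1, x2, x3)       # model trained on x1, x2, x3
w_after = fit(x1, x2, x3, x4)    # same data, with x4 added

# Compare the weights of the three original inputs before and after.
shift = np.abs(w_after[:3] - w_before)
print("before:", w_before.round(3))
print("after: ", w_after.round(3))
print("shift: ", shift.round(3))
```

In this setup the weight on x2 moves the most, because the first model had been using x2 as a stand-in for the missing x4. The weights on x1 and x3 typically move a little as well, since sample correlations between features are rarely exactly zero.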
This can be dangerous. You might believe you are improving an input in isolation when you are not. For example, suppose an engineer comes up with a better estimate of location from web cookies. Brilliant. Surely, this will improve the classifier. [buzzer sound] Not necessarily. Because you changed a single input, you need to retrain the model, and retraining produces a completely new model whose behavior may differ in ways you did not anticipate.
One possible mitigation strategy is to isolate models and serve ensembles. This approach is useful in situations in which sub-problems decompose naturally [into subsystems]… However… Relying on the combination creates a strong entanglement: improving an individual component model may actually make the system accuracy worse if the remaining errors are more strongly correlated with the other components.
In effect, the CACE entanglement just gets moved to the ensemble and you are back where you started.
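The quoted warning about correlated errors can be illustrated with a toy ensemble. This is a minimal sketch with hand-picked, hypothetical error values, chosen so that the old component A cancels the errors of component B while the "improved" A errs in the same direction as B.

```python
import numpy as np

truth = np.array([10.0, 20.0, 30.0, 40.0])

# Hypothetical per-example errors of two component models.
err_b = np.array([2.0, -2.0, 2.0, -2.0])
err_a_old = -err_b          # A's errors cancel B's exactly
err_a_new = 0.5 * err_b     # A is now more accurate, but errs in B's direction

def rmse(errors):
    """Root mean squared error of a vector of errors."""
    return float(np.sqrt(np.mean(errors ** 2)))

def ensemble_rmse(err_a):
    # The ensemble averages the two predictions, so its error is the mean error.
    pred = ((truth + err_a) + (truth + err_b)) / 2
    return rmse(pred - truth)

print("A alone:  old", rmse(err_a_old), "new", rmse(err_a_new))
print("ensemble: old", ensemble_rmse(err_a_old), "new", ensemble_rmse(err_a_new))
```

Component A on its own gets better, yet the ensemble gets worse, because the errors that remain in A now line up with the errors in B instead of offsetting them.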
The best solution is to monitor for input and output changes and retrain the models automatically when the input or output distributions change. We will cover this in detail in a later blog post.
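As a rough idea of what such monitoring might look like, the sketch below compares a reference sample of one input (taken at training time) with a recent sample using the population stability index (PSI). The function names, the choice of PSI, and the 0.2 threshold are assumptions for illustration; other drift statistics would serve the same purpose.

```python
import numpy as np

def population_stability_index(reference, current, bins=10):
    """PSI between two 1-D samples, using quantile bins of the reference sample."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    # Interior edges only, so values outside the reference range fall in the end bins.
    ref_idx = np.searchsorted(edges[1:-1], reference)
    cur_idx = np.searchsorted(edges[1:-1], current)
    ref_frac = np.bincount(ref_idx, minlength=bins) / len(reference)
    cur_frac = np.bincount(cur_idx, minlength=bins) / len(current)
    # Clip to avoid log(0) when a bin is empty.
    ref_frac = np.clip(ref_frac, 1e-6, None)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

def needs_retraining(reference, current, threshold=0.2):
    # threshold=0.2 is a commonly quoted rule of thumb, not a universal constant.
    return population_stability_index(reference, current) > threshold

rng = np.random.default_rng(0)
training_inputs = rng.normal(0.0, 1.0, size=5000)
same_inputs = rng.normal(0.0, 1.0, size=5000)      # no drift
shifted_inputs = rng.normal(1.0, 1.0, size=5000)   # mean has moved

print(needs_retraining(training_inputs, same_inputs))
print(needs_retraining(training_inputs, shifted_inputs))
```

The same check can be run on the model's outputs, and a positive result can trigger an automated retraining job instead of a print statement.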
There are often situations in which model Ma for problem A exists, but a solution for a slightly different problem A′ is required. In this case, it can be tempting to learn a model M′a that takes Ma as input and learns a small correction as a fast way to solve the problem.
Imagine you already have a cat classifier and you want to classify dogs. It would seem your existing classifier has already done a lot of the heavy lifting: figuring out that cats stand on four legs, have two eyes, ears, etc. It seems intuitive to leverage that work and train a small new model on the output of your cat classifier and just make small, dog-specific modifications.
The problem with this approach is that it makes the system highly unstable and significantly more difficult to make changes to in the future. This is the problem common to any model trained on the output of another model: model Mb has learned the details, subtleties, and flaws of model Ma. Improvements made to Ma will have a detrimental impact on Mb if it is not retrained.
Even if you do retrain Mb and all the other models trained on its outputs, it’s difficult to assess whether any subsequent improvement in the output results from the changes made to Ma or from the retraining of Mb.
Once in place, [these cascades of models] can create an improvement deadlock, as improving the accuracy of any individual component actually leads to system-level detriments.
This is as though making a small change to a web page, such as adjusting the font size, required you to rewrite the entire site’s code from scratch.
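The cascade problem described above can be made concrete with a toy example. In this sketch, which uses synthetic data and hypothetical stand-ins for Ma and Mb, the original Ma has a constant bias, Mb learns a correction on top of Ma's output, and Ma is then improved without retraining Mb.

```python
import numpy as np

rng = np.random.default_rng(0)
truth = rng.uniform(0.0, 10.0, size=200)

def model_a_v1(t):
    # Hypothetical original Ma: systematically overestimates by 3.
    return t + 3.0

def model_a_v2(t):
    # Hypothetical improved Ma: the bias is gone.
    return t

# Mb is a linear correction fitted on Ma v1's outputs.
slope, intercept = np.polyfit(model_a_v1(truth), truth, deg=1)

def model_b(a_output):
    return slope * a_output + intercept

def mae(pred):
    """Mean absolute error against the ground truth."""
    return float(np.mean(np.abs(pred - truth)))

print("Ma v1 alone:", mae(model_a_v1(truth)))
print("Ma v2 alone:", mae(model_a_v2(truth)))
print("Mb on Ma v1:", mae(model_b(model_a_v1(truth))))
print("Mb on Ma v2:", mae(model_b(model_a_v2(truth))))
```

Mb has learned to subtract Ma's old bias, so when Ma stops making that mistake, Mb "corrects" an error that no longer exists and the system output gets worse.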
For a system to remain nimble, it needs to retain as much abstraction as possible. Isolated improvements should improve the system in a predictable way. It should be easy to swap out and update pieces of the system without worrying about detrimental side effects.
There are two related solutions to this problem.
The first, and often the simplest, is to retrain your model Ma on the original input data (your cat data) plus additional data (your dog data) and features to distinguish the new use case.
The second, which is more powerful but also slightly more technical, is to create a new model Mb using a technique called transfer learning. There are several ways to do transfer learning. One method, called fine-tuning, is to use the parameters of the original (pre-trained) model Ma as a starting point for Mb and continue training on the new data you want your model to learn. Another common method, called feature extraction, is to freeze a copy of Ma, use it as a feature extractor for Mb, and train only the new parts of Mb on the new data.
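The first transfer learning method described above, starting Mb from Ma's parameters, can be sketched with a tiny linear model trained by gradient descent. The tasks, weights, and helper names are all hypothetical; real transfer learning usually involves deep networks and a framework, but the mechanics of copying parameters and continuing training are the same.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(true_w, n):
    """Synthetic regression task with known weights."""
    X = rng.normal(size=(n, len(true_w)))
    return X, X @ true_w

def train(X, y, w_init, steps, lr=0.05):
    """Plain gradient descent on mean squared error."""
    w = w_init.copy()   # copy, so the starting model is left untouched
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

# Task A (the original problem) and task A' (a slightly different one).
w_true_a = np.array([2.0, -1.0, 0.5, 3.0])
w_true_b = w_true_a + np.array([0.2, 0.0, -0.1, 0.1])

X_a, y_a = make_task(w_true_a, n=1000)
X_b, y_b = make_task(w_true_b, n=50)    # much less data for the new task

w_a = train(X_a, y_a, np.zeros(4), steps=500)   # model Ma

# Same small training budget for Mb, two different starting points.
w_b_scratch = train(X_b, y_b, np.zeros(4), steps=10)
w_b_warm = train(X_b, y_b, w_a, steps=10)

print("from scratch:", mse(X_b, y_b, w_b_scratch))
print("warm start:  ", mse(X_b, y_b, w_b_warm))
```

Because Mb holds its own copy of the parameters, later changes to Ma do not silently change Mb, which is the property the cascade approach lacks.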
Oftentimes, a prediction from a machine learning model Ma is made widely accessible, either at runtime or by writing to files or logs that may later be consumed by other systems. Without access controls, some of these consumers may be undeclared, silently using the output of a given model as an input to another system.
Undeclared consumers are particularly dangerous because these dependencies are hidden.
Undeclared consumers are expensive at best and dangerous at worst, because they create a hidden tight coupling of model Ma to other parts of the stack. Changes to Ma will very likely impact these other parts, potentially in ways that are unintended, poorly understood, and detrimental.
When trying to make improvements to Ma, there will be many hidden, unexpected effects. These effects make it difficult and expensive to understand why isolated improvements to Ma are detrimental to system-level performance, the performance that matters to the end user.
This can also lead to feedback loops in which model Mb affects Ma, which in turn affects Mb, and so on. We’ll be exploring feedback loops in detail in the next post in this series.
The way to prevent undeclared consumers is to create access controls and service-level agreements (SLAs) between the ML model and consumers of that model. This enforces an explicit declaration and contract between model and consumer, which removes hidden dependencies. For example, the output of an ML model can be served only to consumers that present a key issued to them when they registered, so that every consumer of the output is known.
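One way such an access control might look is sketched below using only the Python standard library. The `PredictionGateway` class, the `make_token` helper, and the consumer names are hypothetical and exist only for this illustration; a production system would more likely rely on an existing API gateway or identity service.

```python
import hashlib
import hmac
import secrets

def make_token(key: bytes, consumer: str) -> bytes:
    """Token a consumer presents: an HMAC of its own name under its issued key."""
    return hmac.new(key, consumer.encode(), hashlib.sha256).digest()

class PredictionGateway:
    """Hypothetical gateway that serves model output only to declared consumers."""

    def __init__(self, model):
        self._model = model
        self._keys = {}   # consumer name -> secret key

    def register(self, consumer: str) -> bytes:
        """Declare a consumer and issue it a key."""
        key = secrets.token_bytes(32)
        self._keys[consumer] = key
        return key

    def consumers(self):
        """Everyone who should be consulted before the model changes."""
        return sorted(self._keys)

    def predict(self, consumer: str, token: bytes, features):
        key = self._keys.get(consumer)
        if key is None or not hmac.compare_digest(token, make_token(key, consumer)):
            raise PermissionError(f"undeclared consumer: {consumer!r}")
        return self._model(features)

# Stand-in model: any callable taking features and returning a prediction.
gateway = PredictionGateway(model=lambda features: sum(features) > 1.0)
ranking_key = gateway.register("ranking-service")
token = make_token(ranking_key, "ranking-service")

print(gateway.predict("ranking-service", token, [0.7, 0.6]))
print(gateway.consumers())
```

The useful by-product is the `consumers()` list: before changing Ma, you can see exactly which downstream systems depend on it.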
We’ve explored input entanglement, as well as the dangers of a model being trained on another model’s output, whether intentionally or not. Just being aware of these potentially costly surprises already goes a long way to making sure you implement ML more efficiently in real-world scenarios.
In the next post in this series, we’re going to look at data dependencies, unnecessary input data, and direct and hidden feedback loops. All of these are issues that can greatly impact the performance of an ML model in production in the real world.