7 costly surprises of machine learning: part eight
I’ve already explored how the complexity of machine learning (ML) systems can create hidden dangers in production, but there are still two more areas where concealed complexity can come back to bite.
This is the eighth in a series of posts dissecting a 2015 research paper, Hidden Technical Debt in Machine Learning Systems, and its implications for using ML to solve real-world problems. Any block quotes throughout this piece come from that paper.
Last week, I concluded my look at the dangers of messy real-world input data. This week, I present the final two costly surprises: configuration debt and R&D debt (covering reproducibility and experimentation).
What is configuration debt?
The configuration of ML systems is commonly considered an afterthought by both academics and engineers. Worse still, testing and validation of configuration are sometimes viewed as unimportant. In reality, the amount of configuration code in a mature ML system can far exceed the amount of traditional code. It’s in this code that our next costly surprise awaits.
Production ML systems typically have a large range of potential configuration combinations: feature selection, data selection, pre- and post-processing, validation methods, and so on. Each line of configuration code can introduce large, system-level errors if not properly handled.
Consider this scenario:
- A bug is discovered in feature A, which takes a week to fix
- Feature B gets pushed into production but is not available on data prior to its launch date
- For privacy reasons, feature C needs to be modified
- Feature D is in use in North America but is not available in Europe for another month
- Feature Z cannot be used on mobile devices because of a memory constraint caused by a lookup-table dependency
- In production, feature Q and feature R cannot be used together because of a latency constraint
Only six situations are presented here, yet things are already quite messy. Typical ML systems have orders of magnitude more configuration options, which are difficult to modify correctly and whose consequences are hard to reason about. To make matters worse, small configuration mistakes are costly, leading to serious losses of time and money, wasted computing resources, and production errors.
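One way to keep constraints like these from becoming tribal knowledge is to encode them as data and check configurations against them automatically. Here is a minimal sketch, using hypothetical feature names and constraints modeled on the scenario above:

```python
# Which features each deployment target may NOT use, and why (hypothetical).
UNAVAILABLE = {
    "europe": {"feature_d"},   # not yet launched in Europe
    "mobile": {"feature_z"},   # lookup-table memory constraint
}

# Sets of features that cannot be enabled together (e.g. a latency budget).
MUTUALLY_EXCLUSIVE = [{"feature_q", "feature_r"}]

def validate_config(enabled_features, target):
    """Return a list of human-readable violations for this configuration."""
    errors = []
    blocked = UNAVAILABLE.get(target, set()) & set(enabled_features)
    for f in sorted(blocked):
        errors.append(f"{f} is unavailable on target '{target}'")
    for pair in MUTUALLY_EXCLUSIVE:
        if pair <= set(enabled_features):
            errors.append(f"{' and '.join(sorted(pair))} cannot be used together")
    return errors

print(validate_config(["feature_d", "feature_q", "feature_r"], "europe"))
```

Running such checks in a pre-deployment test catches these system-level mistakes before they reach production.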
A good solution involves a set of principles that any configuration system should follow:
- It should be easy to specify a configuration as a small change from a previous configuration
- It should be hard to make manual errors, omissions, or oversights
- It should be easy to see, visually, the difference in configuration between two models
- It should be easy to automatically assert and verify basic facts about the configuration: number of features used, transitive closure of data dependencies, etc.
- It should be possible to detect unused or redundant settings
- Configurations should undergo a full code review and be checked into a repository
- All configurations should be unit tested
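To make the first few principles concrete, here is a minimal sketch, assuming configurations are flat dictionaries (the feature names and settings are invented): deriving a new config as a small change from a previous one, and seeing the difference between two configs at a glance.

```python
BASE_CONFIG = {
    "features": ["clicks", "dwell_time"],   # hypothetical feature names
    "model": "logistic_regression",
    "learning_rate": 0.01,
}

def derive(base, **overrides):
    """Specify a new configuration as a small change from a previous one."""
    return {**base, **overrides}

def diff(old, new):
    """Return the keys whose values differ between two configurations."""
    keys = set(old) | set(new)
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}

new_config = derive(BASE_CONFIG, learning_rate=0.001)
print(diff(BASE_CONFIG, new_config))   # {'learning_rate': (0.01, 0.001)}

# Automatically assert basic facts about the configuration.
assert len(new_config["features"]) == 2
```

Because each configuration is ordinary data, the same machinery supports code review, unit testing, and detection of unused settings.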
What is reproducibility debt?
The exponential growth of technology is made possible by people being able to build off the work of others. This fact has led to the rapid progress of computing, the internet, and other information technologies.
As mentioned in the first post in this series, ML forms leaky abstractions and weak contracts, making it difficult to leverage the work of others.
An example: you train an ML model on data from a database, run it on a test example, and get an output of 0.8. You are satisfied with the result. Two months later, a member of your team picks up the project to improve on your results. She runs the same ML model and gets an output of 0.1 on the same test example. She’s using the same code and the same environment, but the model is different. It turns out the data in the database changed, so she must retrain the model without being able to leverage the work put into the original. The result is a large amount of time wasted redoing other people’s work, and fixing this is a great way to drastically increase the efficiency of engineers.
If you want to use or reproduce someone’s software, all you need is the code and the environment in which it was run. If you download a website’s repository from GitHub, you can run the website on your local computer and it will behave exactly the same. But to reproduce ML models, you need the input data in addition to the code and environment.
Without this additional information, the system is not reproducible, and you would have to train a model from scratch. For ML experiments in general, it’s advisable to save your code, inputs, outputs, and environment for every single project.
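A cheap first step toward this is fingerprinting each training run, so that later you can tell whether the code, the data, or the environment changed. This is a minimal sketch; the file contents, data rows, and environment dictionary are illustrative assumptions:

```python
import hashlib
import json
import sys

def fingerprint(code_text, data_rows, env=None):
    """Hash the code, training data, and environment that produced a model."""
    env = env or {"python": sys.version.split()[0]}
    payload = json.dumps(
        {"code": code_text, "data": data_rows, "env": env},
        sort_keys=True,  # deterministic serialization -> deterministic hash
    )
    return hashlib.sha256(payload.encode()).hexdigest()

rows_v1 = [[1.0, 0.5, 1], [0.2, 0.9, 0]]
rows_v2 = [[1.0, 0.5, 1], [0.2, 0.9, 1]]   # one label silently changed
env = {"python": "3.11"}

# Same code and environment, different data -> different fingerprint,
# so the original model cannot be reproduced from today's database.
print(fingerprint("train.py contents", rows_v1, env)
      != fingerprint("train.py contents", rows_v2, env))   # True
```

Storing this fingerprint alongside each trained model turns "the database changed" from a two-month-later surprise into an immediate, checkable mismatch.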
What is experimental debt?
Developing an ML model typically takes hundreds, if not thousands, of iterations to reach a usable result. An ML practitioner needs to try out different input features, model types, hyperparameters, and so on to discover the best model.
Typically, results are not kept in a central location, either across experiments or across the team.
For example, maybe Bob forgot to save the results of experiment 1 but remembers the accuracy was 0.8. In experiment 2, he saves the result (0.33 accuracy) to a log file on his local computer. Some results from an A/B test were saved to an SQL database. Meanwhile, Mary is working on a similar problem and saves her results on her own computer.
In this typical situation, it’s difficult to determine the best model. For example, maybe you want the highest-recall model for Europe where the model size is under 3 MB and the precision is greater than 0.9. In practice, without being able to run complex queries over experiments, this is a difficult and time-consuming endeavor, since you need to amalgamate the data from all these systems and format it consistently.
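If all results live in one queryable store instead of scattered log files and spreadsheets, that question becomes a one-line query. A minimal sketch with an in-memory SQLite table (the schema, experiment names, and numbers are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE experiments (
    name TEXT, region TEXT, recall REAL, prec REAL, size_mb REAL)""")
conn.executemany(
    "INSERT INTO experiments VALUES (?, ?, ?, ?, ?)",
    [
        ("exp1", "europe", 0.72, 0.91, 2.5),
        ("exp2", "europe", 0.80, 0.85, 2.0),   # precision too low
        ("exp3", "europe", 0.75, 0.93, 5.0),   # model too large
        ("exp4", "na",     0.90, 0.95, 1.0),   # wrong region
    ],
)

# Highest-recall model for Europe, size under 3 MB, precision above 0.9.
row = conn.execute("""
    SELECT name, recall FROM experiments
    WHERE region = 'europe' AND size_mb < 3 AND prec > 0.9
    ORDER BY recall DESC LIMIT 1""").fetchone()
print(row)   # ('exp1', 0.72)
```

The point is not SQLite specifically, but that experiment metadata kept as structured, shared rows can answer questions that no amount of grepping through local log files can.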
In addition, the metric that determines the “best” model changes over time. Perhaps when the model was first designed the optimization metric was customer retention, but later down the line customer satisfaction becomes the target. Typically, in this case you would need to retrain the model, optimizing for the new metric. This is expensive and tedious.
You need an experimentation platform that creates a global shared context for tracking experimental results, within which you can search and discover results and launch complex queries to gain insight into your models.
These final two surprises conclude my seven costly surprises of ML. Over the course of eight posts, we’ve gone from establishing the differences between ML and traditional, rule-based software, through levels of abstraction, to exploring the many ways input data and hidden code complexity can derail ML systems in production.
In next week’s final post, I’m going to explore what these surprises mean for ML in the real world and offer some concluding thoughts on how best to move forward with this exciting and powerful new technology.