Nov 25, 2019

SUMMARY

Have you tried using ML for real-world problems and found it to be unreliable compared to software? Found it was powerful in some cases but then failed on seemingly obvious ones? Often this is because the weak abstractions offered by ML are unfamiliar and differ greatly from the strong abstractions offered by software. ML demands a deeper understanding if you’re to harness its power.

This is the second post in a series in which we dissect the 2015 research paper, Hidden Technical Debt in Machine Learning Systems. We use it as a launching point to go into detail about how to make ML work in real-world applications.

In the first post in this series, we saw that in traditional software programmers manually write code to create an algorithm, but in ML the algorithm is programmed from labeled data by an optimization algorithm. In traditional software the programmer dictates the quality of the algorithm, but in ML the labeled data and the quality of the optimization dictate the quality of the algorithm.

We are used to software and have an intuition of its strengths and weaknesses. The same is not yet broadly true for ML. The main reason is that software allows you to have strong abstractions, and ML currently does not.

Abstraction is the process of using something independent of its attributes or internals. You can safely use a bolt without knowing it’s made of alloy steel and features rolled threads, assuming you just adhere to its service-level agreements (SLAs), e.g., that it holds a maximum weight of 1 ton.

Abstraction is essential to building complex systems. Without it, we would not have computers, spaceships, or even mathematics. Most importantly, it’s key to the exponential progress that has shaped our society. But why?

Abstraction lets you leverage the work of others and design in terms of SLAs, rather than underlying details. When using a calculator you don’t need to prove all the theorems of arithmetic or understand how a computer chip works; you simply type in the numbers and press equals. What goes on behind the scenes is irrelevant to you.

Abstraction means you don’t have to start from scratch. Instead, when using abstractions you start where other people left off. This can be represented by the following relation:

$$y_t \propto y_{t-1}$$

Here we can see that the current state of the system $y_t$ is proportional to the previous state $y_{t-1}$. You may have noticed that this is the discrete counterpart of the differential equation for exponential growth. Because of abstractions, you get exponential progress.
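To make the link to exponential growth explicit, here is a short worked step. The constants $k$ and $r$ and the starting state $y_0$ are our own notation, introduced only for illustration, and the sketch assumes the proportionality constant stays fixed over time.

```latex
y_t = k\,y_{t-1}
\;\Longrightarrow\;
y_t = k^{t}\,y_0
\qquad\text{and, in continuous time,}\qquad
\frac{dy}{dt} = r\,y
\;\Longrightarrow\;
y(t) = y_0\,e^{rt}
```

With $k > 1$, each step builds on everything that came before it, so progress compounds rather than accumulating at a fixed rate.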

Without abstraction, systems would be too complex to design from scratch. If you wanted to own a car in a world without abstractions, you would literally need to “reinvent the wheel”.

Not all abstractions are perfect. Certain abstractions leak, which means you cannot completely trust the abstraction. For example, you can’t simply drive your car without any knowledge of its underlying details. You need to know properties of the engine, such as that in the cold it takes longer to warm up, and properties of the tires, such as that they behave differently when driving in rain or snow. Therefore, even though it is an abstraction, it’s a leaky abstraction. You need to get a driver’s license and understand some internal details about the car to use it. Self-driving cars create a stronger abstraction, but even this is not perfect.

Compare this to mathematics, which does offer perfect abstractions. You can use theorems without revisiting the proof. Having perfect abstraction allows you to design with contracts. Knowing the specifications of your building block is enough — the internal details become irrelevant.

The key point here is the stronger your abstraction, the more complex the system you can build. Since software has very strong abstractions, it has allowed us to build very complex software systems. But because ML is a weak abstraction, it has drastically limited the scale and complexity of what we can build.

Machine learning is a weak abstraction because it relies on statistics to generate its algorithm. That is why in forms such as deep learning and XGBoost, it’s called statistics-based machine learning.

The essence of why statistical machine learning is a weak abstraction is as follows:

- The statistics (mean, standard deviation, etc.) of your data depend on how it’s distributed. E.g., if you increase the number of large values in your data, the mean increases.
- Changing the distribution of your data changes the statistics of your data.
- Since ML learns its algorithm based on the statistics of labeled training data from a particular distribution, it can only be reliably applied to data that comes from the same distribution.

If you change the distribution of your data, you change the statistics. Therefore, the algorithm your ML model learned is no longer valid. So you can’t just use an ML model without understanding the internals of how the data you are applying it to is distributed. We’re left with a weak abstraction.
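As a minimal sketch of the first two points, the snippet below compares the statistics of two small samples. Both samples are made-up numbers chosen for illustration: they cover the same range, but the second is skewed towards large values.

```python
from statistics import mean, stdev

# Hypothetical training sample: values spread evenly between 1 and 10.
train = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Hypothetical production sample: same range, but skewed towards large values.
production = [1, 5, 8, 9, 9, 10, 10, 10, 10, 10]

train_mean = mean(train)
production_mean = mean(production)

print(f"train:      mean={train_mean:.2f}  stdev={stdev(train):.2f}")
print(f"production: mean={production_mean:.2f}  stdev={stdev(production):.2f}")
```

Nothing about the individual values is unusual in the second sample; only their distribution has changed, and the mean and standard deviation move with it.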

Traditional software does not suffer from this phenomenon, because it relies on binary logic rather than statistics.

For example, imagine a software engineer creates a clever heap sort algorithm that sorts numbers from smallest to largest. This algorithm would work regardless of the distribution of the data. If there were only 1s and 10s, it would sort them. If the data were all the integers from 1–100, it would sort them. If there were only prime numbers, it would sort them.
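To illustrate, here is a small heap sort built on Python’s standard `heapq` module. It is a sketch rather than the engineer’s hypothetical implementation, and the three input lists are our own examples of the cases described above.

```python
import heapq


def heap_sort(values):
    """Sort numbers from smallest to largest using a binary heap."""
    heap = list(values)   # copy so the caller's list is left untouched
    heapq.heapify(heap)   # rearrange the copy into a min-heap in place
    # Repeatedly pop the smallest remaining element.
    return [heapq.heappop(heap) for _ in range(len(heap))]


only_ones_and_tens = [10, 1, 10, 1, 1, 10]
one_to_hundred = list(range(100, 0, -1))   # 100, 99, ..., 1
only_primes = [13, 2, 11, 7, 3, 5]

for data in (only_ones_and_tens, one_to_hundred, only_primes):
    print(heap_sort(data) == sorted(data))
```

The algorithm never inspects how the values are distributed, so the same code is correct for all three inputs.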

The same is not true for ML. If a data scientist created a sorting algorithm based on labeled data of integers evenly distributed from 1–100, it would work great if we only used it for input data that had a uniform distribution of integers from 1–100, but would fail miserably if we only fed it prime numbers, since the data distribution has changed.
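The following toy sketch shows one way such a learned sorter could behave. It is our own construction, not a real technique anyone should use: a least-squares line is fitted to map each integer from 1–100 to its relative position in the sorted output, and `learned_sort` then drops every value straight into the slot the line predicts, with no fallback. All names here are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


# Labeled training data: each integer 1-100 paired with its relative
# position (0.0 = first, 1.0 = last) in the sorted output.
train_values = list(range(1, 101))
train_positions = [(v - 1) / 99 for v in train_values]
SLOPE, INTERCEPT = fit_line(train_values, train_positions)


def learned_sort(values):
    """Place each value directly into the slot the model predicts for it."""
    n = len(values)
    out = [None] * n
    for v in values:
        position = SLOPE * v + INTERCEPT    # learned relative position
        slot = round(position * (n - 1))
        slot = min(max(slot, 0), n - 1)     # clamp out-of-range guesses
        out[slot] = v                       # collisions overwrite earlier values
    return out


in_distribution = list(range(100, 0, -1))   # every integer 1-100, reversed
primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

print(learned_sort(in_distribution) == sorted(in_distribution))
print(learned_sort(primes) == sorted(primes))
```

On the input that matches the training distribution, every value lands in its correct slot. On the primes, the model still assumes values are spread evenly over 1–100, so several small primes are sent to the same slot and the output is no longer a sorted copy of the input.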

This is what we mean by ML being a weak abstraction and software being a strong abstraction.

This seems like a grim view of ML. Is ML really so unreliable? Can it promise us anything?

The answer to both of these is yes. Because of its leaky abstraction, ML is a big headache for people without the proper solution, but there are in fact promises that ML can make. It just requires a new way of thinking.

What does it mean when a data scientist claims their dog vs cat ML model accuracy is 96%?

As you may now have guessed, it doesn’t mean that no matter what images you give it, it will be 96% accurate in classifying the images, as you would expect if I said my software algorithm is 96% reliable, or if I said my airplane bolt can handle 90 tons.

Saying my dog vs cat classifier is 96% accurate means something very specific. It means over a particular labeled dataset, which we typically call a test (or holdout) set (let’s say it comprises 100 data points), we predicted 96 out of 100 correctly. We mean nothing more and nothing less than this.
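In code, that claim is nothing more than a count over one fixed dataset. The sketch below uses a made-up test set of 100 labels and a made-up set of predictions to reproduce the 96-out-of-100 example.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that exactly match the labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical test set of 100 images: 50 dogs followed by 50 cats.
labels = ["dog"] * 50 + ["cat"] * 50

# Hypothetical model output: 4 of the dogs are mistaken for cats.
predictions = ["dog"] * 46 + ["cat"] * 4 + ["cat"] * 50

print(f"accuracy: {accuracy(predictions, labels):.0%}")
```

The function knows nothing about dogs, cats, or images in general. It only reports how well the predictions matched these particular 100 labels.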

This brings us to the process every machine learning model goes through to become production-ready. You first take your data and separate it into different buckets: a training bucket, a validation bucket, and a test bucket.

- **Training bucket:** use this to train your ML model
- **Validation bucket:** use this to tune the parameters of your model
- **Test bucket:** use this just once to give you the most reliable estimate of the model’s accuracy

Once you train your model on the training data, you then tune hyperparameters on the validation data, before making predictions on the test set without looking at the answers. Once you’ve done that, you can compare the labeled answers with the ML model’s answers and find, for example, that 960 predictions matched the labels and 40 didn’t. In this case you would say your model is 96% accurate.
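A minimal sketch of the bucketing step is shown below. The 80/10/10 split, the fixed seed, and the function name `split_buckets` are our assumptions for illustration; the right proportions depend on how much labeled data you have.

```python
import random


def split_buckets(examples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle once, then cut into training, validation, and test buckets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)   # seeded for a reproducible split
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # whatever remains
    return train, validation, test


examples = list(range(10_000))   # stand-in for 10,000 labeled examples
train, validation, test = split_buckets(examples)
print(len(train), len(validation), len(test))
```

Shuffling before cutting means all three buckets are drawn from the same distribution, which is what makes the test bucket a fair estimate for data that looks like the rest of the dataset.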

What this means is that the 96% accuracy number is only valid for a specific test set (and any other test sets that have an identical distribution). This accuracy guarantee is lost if you run your model over another test set with a different distribution. This is a very important point — and often overlooked. If your input data in the test bucket closely resembles the input data in production (which it should), you can expect similar results in production, but they will never be identical.
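To see how an accuracy figure can evaporate under a different distribution, consider this deliberately simple sketch. It classifies animals by a single feature, body weight, using a learned cut-off. The weights, the single-feature model, and the function names are all hypothetical and chosen only to keep the example small.

```python
def fit_threshold(weights_kg, labels):
    """Learn a cut-off halfway between the mean cat and mean dog weight."""
    cats = [w for w, y in zip(weights_kg, labels) if y == "cat"]
    dogs = [w for w, y in zip(weights_kg, labels) if y == "dog"]
    return (sum(cats) / len(cats) + sum(dogs) / len(dogs)) / 2


def predict(weight_kg, threshold):
    """Anything heavier than the cut-off is called a dog."""
    return "dog" if weight_kg > threshold else "cat"


def accuracy(weights_kg, labels, threshold):
    hits = sum(predict(w, threshold) == y for w, y in zip(weights_kg, labels))
    return hits / len(labels)


# Hypothetical training data: house cats and medium-to-large dogs.
train_w = [3.5, 4.0, 4.5, 5.0, 18.0, 22.0, 25.0, 30.0]
train_y = ["cat"] * 4 + ["dog"] * 4
threshold = fit_threshold(train_w, train_y)

# Test set drawn from the same kind of animals as the training data.
same_w = [3.8, 4.2, 5.5, 20.0, 27.0]
same_y = ["cat", "cat", "cat", "dog", "dog"]

# Shifted test set: the dogs are now toy breeds, lighter than many cats.
shift_w = [3.8, 4.2, 5.5, 2.5, 3.0]
shift_y = ["cat", "cat", "cat", "dog", "dog"]

print(accuracy(same_w, same_y, threshold))
print(accuracy(shift_w, shift_y, threshold))
```

The model and its threshold are identical in both evaluations. Only the test distribution differs, and the score on the shifted set drops because every small dog falls on the cat side of the cut-off.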

In essence, if an image classifier’s training data doesn’t resemble the images in your collection, its results are likely to be poor, even if it promises 96% accuracy. Put another way, an image classifier trained on Renaissance art will be useless at categorizing dogs. Hmmm… Caravaggio? No. Pomeranian.

Having a great ML model depends upon the data distribution. This is why having a large, diverse, clean and relevant labeled dataset is essential to make machine learning work on real problems. It’s also the reason we built **super.AI**. If you are looking to get machine learning to work for you on real problems, we can help you train your model on a data distribution that matches that of your production system. We adapt as your production system changes, always ensuring that your data is high quality, varied, and clean.

The leaky abstractions that ML entails complicate the system to the point that a deeper understanding is essential to avoid the surprises that lie in wait. In the next post in this series, I’m going to examine three complications along with their possible solutions.