There is a crisis in machine learning that is preventing the field from progressing as fast as it could. It stems from a broader predicament surrounding reproducibility that impacts scientific research in general. A Nature survey of 1,500 scientists revealed that 70% of researchers have tried and failed to reproduce another scientist’s experiments, and over 50% have failed to reproduce their own work. Reproducibility, also called replicability, is a core principle of the scientific method that helps ensure the results of a study aren’t a one-off occurrence, but actually represent a replicable observation.
In computer science, reproducibility has a narrower definition: Any results should be documented by making all data and code available so that the computations can be executed again with the same results. Unfortunately, artificial intelligence (AI) and machine learning (ML) are off to a rocky start when it comes to transparency and reproducibility. For example, take this response published in Nature by 31 scientists who are highly critical of a study from Google Health that documented successful trials of an AI system that detects signs of breast cancer.
The skeptical scientists claim the Google study offered far too little detail about how the AI model was built and tested, and went so far as to say it was merely an advertisement for proprietary technology. Without adequate information about how a given model was created, it is nearly impossible for the scientific community to review and reproduce its results. This is contributing to a growing perception that transparency is lacking in artificial intelligence, exacerbating trust issues between humans and AI systems.
To maintain forward momentum and succeed with artificial intelligence, it will be essential to address replicability and transparency issues in the field. This article explains the impact of the reproducibility crisis on AI, as well as how a new version of GitHub built specifically for machine learning could help solve it.
GitHub is a cloud-based service for developing and managing code. The platform is used for software version control, which helps developers track changes to code throughout the development lifecycle. This makes it possible to safely branch and merge projects and ensure code is reproducible, working the same way regardless of who is running it. Because AI and ML applications are written in code, GitHub was the natural choice for managing them. Unfortunately, a number of differences between AI and more traditional software projects make GitHub a poor fit for artificial intelligence, contributing to the reproducibility crisis in machine learning.
Traditional software algorithms are created by developers taking ideas out of their heads and writing them as code in a deterministic, mathematical, Turing-complete language. This makes software highly replicable—all that is needed to reproduce a given piece of software is its code and the libraries used for task optimization.
Machine learning algorithms are different because they aren’t created from the minds of developers, but instead inferred from data. This means that if the data changes, the machine learning algorithm changes, even if the code and the operating environment variables recorded in traditional software development remain constant. This is the heart of the problem with using GitHub for AI: Even if you track the code and libraries used to develop an artificial intelligence algorithm, you can’t reproduce it, because the result depends on the data, not just the code.
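This data dependence can be seen in a minimal sketch (a hypothetical toy model, using NumPy’s least-squares fit): the training code is held completely fixed, yet two snapshots of the “same” dataset yield two different models.

```python
import numpy as np

# Hypothetical illustration: the same training code produces different
# models when the data changes, even though nothing in the code changes.

def train_linear_model(X, y):
    """Fit y ~ X @ w by ordinary least squares (the 'code' stays fixed)."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# Two versions of the "same" dataset: the labels drift between snapshots.
y_v1 = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)
y_v2 = X @ np.array([1.0, 2.0, 3.5]) + rng.normal(scale=0.1, size=100)

w_v1 = train_linear_model(X, y_v1)
w_v2 = train_linear_model(X, y_v2)

# Identical code, identical environment -- different learned parameters.
print(np.allclose(w_v1, w_v2))  # False
```

Committing this script to GitHub captures everything about the code and nothing about which of the two label snapshots produced the model you shipped.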
It isn’t just the inability to track changes in data that makes using GitHub for AI problematic; traditional software and AI also depend on completely different data types. Software is written in code, and code is expressed as text. By nature, text files are not very large. Conversely, artificial intelligence relies on unstructured data, such as audio, images, and video, which are far bigger than text files and therefore present additional data tracking and management challenges.
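Tools that extend version control to large files typically sidestep this by versioning a dataset through its content hash, keeping only a tiny text pointer under Git. A minimal sketch of that idea, with hypothetical helper names (not an API of GitHub or any specific tool):

```python
import hashlib
from pathlib import Path

# Minimal sketch (hypothetical helpers): version a large binary file by
# its content hash, storing only a small pointer in version control.

def hash_file(path, chunk_size=1 << 20):
    """Stream the file so arbitrarily large blobs never load into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def snapshot(data_path, store_dir):
    """Copy the blob into a content-addressed store; return its pointer."""
    digest = hash_file(data_path)
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    target = store / digest
    if not target.exists():
        target.write_bytes(Path(data_path).read_bytes())
    # The short digest string is what a Git-like system would track.
    return digest
```

If the data changes, the digest changes, so a one-line pointer file in Git records exactly which dataset version trained a given model.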
The process by which data from multiple sources is combined into a single data store is called extract, transform, and load (ETL). This is a general process for replicating data from source systems to target systems, and it makes it possible for different types of data to work together. To extract, transform, and load data for use in AI application development, data scientists and engineers need data versioning, data lineage, the ability to handle large files, and a way to manage the scripts and libraries used for data processing.
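The three ETL stages can be sketched in a few lines (hypothetical sources and schema, purely for illustration): records are pulled from two differently shaped sources, normalized into one schema, and loaded into a single store.

```python
# Minimal ETL sketch (hypothetical sources and schema): extract records
# from two sources, transform them into one schema, load into one store.

def extract():
    csv_rows = [{"id": "1", "label": "cat"}]      # e.g. from a CSV dump
    api_rows = [{"uid": 2, "annotation": "dog"}]  # e.g. from a REST API
    return csv_rows, api_rows

def transform(csv_rows, api_rows):
    """Normalize both source schemas into one unified record shape."""
    unified = []
    for r in csv_rows:
        unified.append({"id": int(r["id"]), "label": r["label"]})
    for r in api_rows:
        unified.append({"id": r["uid"], "label": r["annotation"]})
    return unified

def load(rows, store):
    """Write the unified records into the target data store."""
    for row in rows:
        store[row["id"]] = row["label"]

store = {}
load(transform(*extract()), store)
print(store)  # {1: 'cat', 2: 'dog'}
```

Reproducing the resulting dataset requires versioning the transform script itself, which is exactly the lineage information the paragraph above calls for.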
Some emerging solutions to this problem are discussed later in the article, but it is important to note that this functionality is not currently built into the core of GitHub, making it impossible to properly manage the data that informs machine learning algorithms on the platform.
These issues with AI replicability and using GitHub for ML projects extend beyond just the inability to track changes in data and manage large, unstructured datasets. Even if the code, libraries, and data used to develop an artificial intelligence algorithm remain constant, it still wouldn’t be possible to replicate the same results using the same AI system because of variability in model parameters.
As mentioned before, machine learning algorithms are informed by data. However, this isn’t the only factor that influences the system. Parameters are other inputs that contribute to how a given algorithm functions. They come in two types: hyperparameters and plain parameters. Hyperparameters can be thought of as high-level controls for the learning process that influence the resulting parameters of a given model. After ML model training is complete, the parameters are what represent the model itself. Hyperparameters, although used by the learning algorithm during training, are not part of the resulting model.
By definition, hyperparameters are external to an ML model and their values cannot be estimated from data. Changes to hyperparameters result in changes to the exact algorithm that the machine learning model ultimately learns. If the code is the design for how to build a human brain, the hyperparameters and parameters are how to build your exact brain. This is important because the same code base used to train a model can generate hundreds or thousands of different models, each with its own parameters.
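A concrete sketch of the distinction, using a toy polynomial fit (hypothetical data, with NumPy’s `polyfit`): the degree is a hyperparameter chosen before training and never learned from the data, while the fitted coefficients are the parameters that constitute the model itself.

```python
import numpy as np

# Sketch: polynomial degree = hyperparameter (set before training);
# fitted coefficients = parameters (learned, and ARE the model).

rng = np.random.default_rng(42)
x = np.linspace(-1, 1, 50)
y = 0.5 * x**2 + rng.normal(scale=0.05, size=50)  # toy dataset

for degree in (1, 2, 3):                 # hyperparameter sweep
    coeffs = np.polyfit(x, y, degree)    # parameters learned from data
    print(degree, len(coeffs))           # same data, different model shapes
```

Same code, same data: three hyperparameter settings produce three structurally different models, none of which is recoverable from the code alone.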
When testing machine learning models, it is important to track experimental results. These results help determine which model is the best fit for production, and, unsurprisingly, GitHub wasn’t designed to record these details. Although it is possible to build a custom workaround, this solution doesn’t scale and is inaccessible to many developers due to time and resource constraints.
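The kind of tracking described above can be sketched in miniature (hypothetical file layout and metric names): each run appends one JSON line pairing its hyperparameters with its results, so the best candidate for production can be queried later.

```python
import json
import time
from pathlib import Path

# Minimal experiment-tracking sketch (hypothetical layout): append one
# JSON line per run, recording hyperparameters and results side by side.

def log_run(log_path, hyperparams, metrics):
    record = {"timestamp": time.time(),
              "hyperparams": hyperparams,
              "metrics": metrics}
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")

def best_run(log_path, metric="accuracy"):
    """Return the logged run with the highest value for the given metric."""
    lines = Path(log_path).read_text().splitlines()
    runs = [json.loads(line) for line in lines]
    return max(runs, key=lambda r: r["metrics"][metric])

log_run("runs.jsonl", {"lr": 0.1, "epochs": 5}, {"accuracy": 0.81})
log_run("runs.jsonl", {"lr": 0.01, "epochs": 5}, {"accuracy": 0.87})
print(best_run("runs.jsonl")["hyperparams"]["lr"])  # 0.01
```

Dedicated experiment trackers do this at scale; the point here is only that this record lives entirely outside what Git commits capture.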
Managing a machine learning model also involves code review and version tracking, which is where GitHub excels. Although GitHub tracks code and environment variables very well, machine learning introduces the need to track data, parameters, metadata, experimental results, and much more. The Git platform was not built to accommodate this level of sophistication but, fortunately, there are some emerging solutions that attempt to overcome the limitations of GitHub for AI and ML.
There is no single alternative to GitHub that offers a comprehensive solution for managing AI and ML projects. Ideally, a GitHub specifically tailored for machine learning will become available to data scientists and engineers operating in this space. Until then, there are a number of solutions that each address some of the issues mentioned above.
At super.AI, our mission is to automate boring work so that people can be more human. We strive to make artificial intelligence available to everyone with both the technology we build and the resources we create. If you’re interested in learning more about artificial intelligence, check out the following resources: