TL;DR Continuous integration for ML is of great practical importance, but doing it properly requires careful thinking and research.

To learn more, check our publications: [SysML 2019] and [KDD 2020].
This work was conducted in collaboration with Bolin Ding from Alibaba, and researchers from Microsoft including Wentao Wu and Matteo Interlandi.

Machine learning (ML) has seen tremendous success in the past decade, due not only to the high-quality predictions it produces, but also to its ability to shrink thousands of lines of production code by orders of magnitude. This trend of replacing production code with trained models brings new challenges when it comes to guaranteeing the stability and quality of the production environment. It is not uncommon for companies to retrain models repeatedly, in cycles of weeks, days or even hours, in order to get up-to-date and more accurate models. Consequently, evaluating these models systematically and without error can be quite challenging.

Fortunately, traditional software engineering has dealt with similar problems for years and has a whole sub-area called software testing focusing on all aspects of quality assurance. Major advances in this area brought us continuous integration (CI), a development methodology that, among other things, includes an automated execution of a battery of software tests. This CI methodology proved itself essential for enabling shorter development cycles and faster detection of software bugs.

However, if we take the most straightforward way to apply these standard software testing practices to systems containing ML components, we will inevitably get ourselves into trouble. In this post, we take a closer look at two key challenges that arise when testing machine learning models. We discuss our proposed approaches to tackle them in a way that is practical while at the same time following important theoretical principles.

Testing Conventional Software vs Testing Machine Learning Models

The main challenges we present here revolve around testing ML models. Before going into details we should first review how CI typically works in a conventional software development setting.

Testing Conventional Software

Testing conventional software starts with the definition of a set of tests. These tests can be of different types depending on the portion of the code or the specific functionality that they target (e.g. unit tests, integration tests). Each test encodes a set of conditions that must hold for the test to pass. These conditions aim to represent certain invariants of the system's functionality.

from primes import is_prime

def test_prime_basic():
  # Test some positive examples.
  for n in [2, 3, 5, 7, 11]:
    assert is_prime(n)

  # Test some negative examples (note that 1 is not a prime).
  for n in [1, 4, 6, 8, 9, 10]:
    assert not is_prime(n)
Example test case for a function that checks if a number is prime.

We would like to draw attention to two properties of conventional software tests which, although relatively obvious, pose a challenge in the context of machine learning:

  1. Test outcomes are deterministic: Even if the test contains random elements, most well-designed tests construct highly controlled environments and have deterministic outcomes. Good tests are made to not have false positives or false negatives.
  2. Tests are defined transparently: Tests do not need to be hidden from the developers. In fact, the developers themselves usually write their own tests. If a test uncovers a bug, the developer will often inspect the test code in order to understand the test condition, and, more importantly, the underlying test invariant. Knowing this invariant is helpful when trying to effectively resolve a bug in the code.

Testing Machine Learning Models

The goal of testing ML models is to measure their ability to generalize, that is, to make correct predictions on new data never seen by the trained model. This is achieved through the use of a separate test dataset, which is the first key component specific to ML testing. This dataset should be representative of the data distribution that the model is trying to capture, hence it should embody the key invariants of the problem the model is trying to solve.

Given a test dataset, we run inference on the ML model in order to get predictions. We then evaluate these predictions using some scoring metric, which measures a useful property of the model. For example, if a model is supposed to classify images of cars, the metric may be the accuracy of the model's predictions. Given the measured score, we can construct a test condition that checks, for example, if the score is above a certain threshold, or if it is better than the score of some older model that we use as a baseline.

from pandas import read_csv
from pickle import load

def test_model():

  # Load test dataset.
  X = read_csv("data/test/features.csv").to_numpy()
  y = read_csv("data/test/labels.csv").to_numpy()
  # Load the model we want to test (pickle's load expects an open binary file).
  with open("model.pkl", "rb") as f:
    model = load(f)
  score = model.score(X, y)
  # Test condition.
  assert(score > 0.85)
Example test case for a pickled ML model.

Regardless of the manner in which an ML test condition is constructed, it has two properties that stand in contrast to conventional software testing:

  1. Test outcomes are non-deterministic: ML tests are inherently random. Test datasets are samples of a data distribution. Hence, test outcomes are random variables. As such, they are subject to small statistical errors directly related to the size of the test dataset. These errors must be controlled for.
  2. Tests must be hidden: As mentioned above, in order to test whether the model can generalize, in principle, no information about the test dataset should reach the model during iterative training. If this requirement is breached, it will likely lead to overfitting. Hence, developers must not be exposed to any information about the test dataset.

Even well-intentioned developers are dealing with ML models, which are essentially black boxes producing random results. These statistical methods can be quite sensitive to non-obvious factors, which are especially easy to miss for developers who are not experts in machine learning.

As a result of the above mentioned properties of ML tests, integrating them alongside regular software tests and simply applying existing CI workflows to trained models poses several technical challenges. We cover two of them in the following section.

What are the challenges?

Let us examine a representative CI workflow featuring ML models. We roughly divide it into four stages: develop, build, test and deploy. These stages correspond to regular software integration and deployment phases. We assume that at each stage standard tasks for the overall software project are performed along with ML-specific actions.

Let's focus our attention on the testing stage. Apart from running regular test suites, this stage consists of testing the ML model based on some pre-defined test conditions. As mentioned before, each test condition is determined by the test dataset used and the evaluation metric computed by running inference on the dataset using the model under test. For example, if we want to check if the model's accuracy score is higher than the score obtained by the previously deployed model, we would construct the simple condition new_score > old_score. The figure below illustrates our workflow.

Challenge 1: Comparing scores needs to include error bars

Let's imagine a test condition encoded as model_score > 0.8, and a measured model score of 0.82. Normally, we would directly conclude that the test has passed. However, how far can we trust this score, knowing that the test outcome is subject to randomness? Maybe, by repeating the same test 100 times on 100 different test sets, we would discover that the mean score is actually 0.79. It is entirely possible that our original test set was not large enough to guarantee an accurate estimate and we were fooled by randomness.
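To make the effect concrete, here is a small simulation sketch. It is not taken from our system: the function names, the number of trials and the two test set sizes are assumptions chosen for illustration. It models a classifier whose true accuracy is 0.79 and counts how often its measured score still clears the 0.8 threshold.

```python
import random

def measured_accuracy(rng, true_accuracy, n_samples):
    """Score on one randomly drawn test set: each prediction is correct
    independently with probability true_accuracy."""
    correct = sum(rng.random() < true_accuracy for _ in range(n_samples))
    return correct / n_samples

def pass_rate(true_accuracy, threshold, n_samples, n_trials=300, seed=0):
    """Fraction of independent test sets on which `score > threshold` holds."""
    rng = random.Random(seed)
    passes = sum(
        measured_accuracy(rng, true_accuracy, n_samples) > threshold
        for _ in range(n_trials)
    )
    return passes / n_trials

# The model is truly below the threshold, so every pass is a wrong decision.
for n in (100, 5000):
    rate = pass_rate(true_accuracy=0.79, threshold=0.8, n_samples=n)
    print(f"test set size {n}: passes {rate:.0%} of the time")
```

With the small test set the sub-threshold model should pass in a sizeable share of the trials, while the larger test set makes such wrong passes much rarer.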

This challenge arises from the very nature of testing machine learning models, as mentioned above. Therefore, a principled treatment of this score mandates that we represent it as a random variable. To avoid the pitfall of blindly trusting the measurement, we need to take care of two aspects:

  1. We need to know the size of error bars: Intuitively, the larger the test set, the better our score estimate is going to be. Better estimates are equivalent to tighter error bars. The question remains, can we provide a rigorous relationship between the size of the test set and the size of error bars?
  2. We need to decide what to do if we overlap with the error bars: In our example above, suppose the measured model score is 0.82 and we decide to fix the error bars at 0.05. Whenever the measurement is within 0.05 of 0.8 (the target value we compare against), any decision we make can be wrong. This is an unavoidable aspect, well known from statistics. These errors are categorized as Type I and Type II errors, which correspond to false positive and false negative decisions, respectively. We can only be free from one of the two. Therefore, we have to make this choice in advance as part of our test system's configuration.

Challenge 2: We have to prevent overfitting to the test set

Let's focus now on a different test condition: new_score > old_score. It compares the score of the new submitted model to the score of an older model previously deployed. Let's imagine that a team of developers is continuously training newer and better models and committing them to the project repository.

Assuming that the test is executed after each commit and the results are shown in the dashboard, developers can check whether the changes made in the latest model managed to beat the older one, and iterate based on that knowledge. If the test failed, they can take it as a signal that the direction they've taken in their latest update was not fruitful and that they should explore other directions to improve the model further. If the test passes, they may conclude that they are on the right track and may keep improving the model to get even better scores. Whatever the outcome is, the development team will likely continue this iterative process in a well intentioned effort to increase the quality of their final product.

In the short term, there seems to be no harm in what is going on here. However, the development team is effectively using the signals received from the testing system to guide their search for better models. As a result, in the long term, the development team will inevitably start fine-tuning their model to perform extremely well on the test set. The issue is that this test set is merely a sample of the real-world data the model is supposed to handle. As a result, a model that fits well on this sample will not necessarily do well on real data. This phenomenon of overfitting is well known to machine learning experts.

We simulate the previously mentioned iterative process in the figure above. We perform automatic model selection and hyper-parameter tuning, while at the same time directly using the test outcome to decide whether our model changes are effective. We can see on the left that, if left unchecked, the estimated accuracy based on a static test set can noticeably diverge from the true accuracy.
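A toy version of this effect can be reproduced in a few lines. The sketch below is not the simulation behind our figure: the data, the random linear "models" and all parameter values are assumptions made for illustration. Labels are coin flips, so no model can truly exceed 50% accuracy, yet keeping whichever candidate scores best on a reused test set makes the reported score creep upwards.

```python
import random

def make_dataset(rng, n, dim):
    """Random features with coin-flip labels: true accuracy is 50% for any model."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    y = [rng.randint(0, 1) for _ in range(n)]
    return X, y

def accuracy(weights, X, y):
    """Accuracy of the linear classifier that predicts 1 when w . x > 0."""
    correct = 0
    for features, label in zip(X, y):
        activation = sum(w * f for w, f in zip(weights, features))
        correct += int((activation > 0) == (label == 1))
    return correct / len(y)

def simulate_adaptive_search(n_test=200, n_fresh=20000, n_commits=500,
                             dim=10, seed=0):
    rng = random.Random(seed)
    X_test, y_test = make_dataset(rng, n_test, dim)
    X_fresh, y_fresh = make_dataset(rng, n_fresh, dim)

    best_weights, best_score = None, -1.0
    for _ in range(n_commits):
        candidate = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        score = accuracy(candidate, X_test, y_test)
        # The developer keeps a change only if "new_score > old_score".
        if score > best_score:
            best_weights, best_score = candidate, score

    # Score reported by the reused test set vs. score on data never used before.
    return best_score, accuracy(best_weights, X_fresh, y_fresh)

test_acc, fresh_acc = simulate_adaptive_search()
print(f"accuracy on reused test set: {test_acc:.3f}")
print(f"accuracy on fresh data:      {fresh_acc:.3f}")
```

The gap between the two printed numbers is purely an artifact of selecting on the same 200 test samples over and over.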

How do we approach these challenges?

Error bars as an integral part of test conditions

To approach the first challenge, we realise that our test conditions cannot consist only of variables representing point estimates of model accuracy scores. We have to embrace the fact that: (1) these estimates are random variables subject to noise; and (2) whatever margins we define, we can never be 100% certain about the true outcome. An example of a test condition that adopts these notions is:

new_score > 0.8 +/- 0.05, err_prob < 0.001

The condition states that the estimated test score must be above a given threshold of \(0.8\). Furthermore, it encodes the requirement that the estimated score, although noisy, diverges by more than \(0.05\) points from the true value with a probability below \(0.001\). In the next section we will see how this probabilistic requirement is directly linked to a minimum number of required test samples.

Let's assume that we have access to a large enough test set to meet the error bar requirements. If we then measure a model score of \(0.9\), this condition immediately passes. However, if we measure a model score of \(0.82\), it is unclear whether the test condition should pass or fail. Notice that, given our test condition, we can only guarantee with \(99.9\%\) probability that the true score lies within \(0.05\) of the measurement, that is, between \(0.77\) and \(0.87\). Therefore, it is impossible to determine if it surely lies above \(0.8\), as required by the test condition.

This is where the error mode comes into the picture. The error mode is an additional parameter that can be set to either "False Positive Free" or "False Negative Free". It expresses the notion that we can only ever be free of either false positives or false negatives (also called Type I and Type II statistical errors), but not both. Given our choice of error mode, whenever the measured score falls within the error bars around the threshold, the test condition is passed if we want to be "False Negative Free", or rejected if we want to be "False Positive Free".
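The decision rule described above can be sketched as follows. This is a minimal illustration rather than the interface of our system: the function name `evaluate_condition` and the two mode constants are hypothetical.

```python
FP_FREE = "False Positive Free"
FN_FREE = "False Negative Free"

def evaluate_condition(score, threshold, epsilon, mode):
    """Decide `score > threshold +/- epsilon` under the chosen error mode.

    Returns True if the test passes and False if it is rejected.
    """
    if mode not in (FP_FREE, FN_FREE):
        raise ValueError(f"unknown error mode: {mode!r}")
    if score - epsilon > threshold:
        return True   # even the lower error bar clears the threshold
    if score + epsilon < threshold:
        return False  # even the upper error bar stays below the threshold
    # Inside the error bars: passing can only cause false positives,
    # rejecting can only cause false negatives.
    return mode == FN_FREE

for score in (0.90, 0.82, 0.70):
    for mode in (FP_FREE, FN_FREE):
        outcome = "pass" if evaluate_condition(score, 0.8, 0.05, mode) else "fail"
        print(f"score={score:.2f}, mode={mode}: {outcome}")
```

Only the middle score of 0.82 is affected by the choice of mode, since the other two scores lie outside the error bars around the threshold.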

Computing the dataset size as a function of error bars and the number of runs

Solving the second challenge is a bit more complex. We describe the key ideas here and ask the interested reader to check out our papers for more details. Let us first assume that we will only do a single test run. We focus on the simple test condition from above: new_score > 0.8 +/- 0.05, err_prob < 0.001.

How large does our test dataset need to be to ensure a reliable evaluation of the test condition? For every data sample in our test set, we compare the model prediction with the true label. The outcome of this comparison can be represented as a binary random variable \(X_i \sim \mathcal{D}\). The accuracy over the whole test dataset of size \(N\) is then computed as: \[ \bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i. \]

There exists a well-established relationship in probability theory between the value of \(\bar{X}\), its true expected value, the dataset size \(N\), the error bar size \( \epsilon\) and the error probability \( \delta\). It is called Hoeffding's inequality: \[ \delta = \mathbb{P}( \bar{X} - \mathbb{E}(\bar{X} ) \geq \epsilon ) \leq \exp ( -2 N \epsilon^2 ). \]
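As a worked example, we can require the right-hand side of the inequality to be at most the tolerated error probability and solve for \(N\). The numbers plugged in are those of the example condition (\( \epsilon = 0.05\), \( \delta = 0.001\)). This uses the one-sided form exactly as written above; a two-sided bound would put an extra factor of \(2\) inside the logarithm.

\[ \exp ( -2 N \epsilon^2 ) \leq \delta \iff N \geq \frac{\ln ( 1 / \delta )}{2 \epsilon^2} = \frac{\ln ( 1000 )}{2 \cdot 0.05^2} \approx \frac{6.908}{0.005} \approx 1381.6. \]

Under these assumptions, roughly \(1382\) labeled test samples suffice for a single evaluation of the condition.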

By applying Hoeffding's inequality we are able to give statistical guarantees on the outcome of the simplest test condition. We extend this approach to slightly more complex test conditions allowing multiplications by constants, expressions with multiple variables and conjunctions of multiple test conditions. For details we refer the reader to our publications.

Now we look at the scenario where we want to run the test condition \(H\) times on the same test data. We have to distinguish two settings depending on whether we reveal the test outcome to the developer or not. Not revealing any outcome is referred to as the non-adaptive scenario. In this case the submitted models cannot depend on earlier test outcomes, so we can simply apply the union bound over the \(H\) evaluations and multiply the bound on the error probability \( \delta\) above by \(H\).

Alternatively, if we reveal the test outcome to the developer, we need to take a pessimistic approach and treat the developer as a potential adversarial player aiming at overfitting to the test set rather than generalizing to the underlying probability distribution. Under this assumption, the developer can be described as a process providing each newly committed model as a function of the knowledge gained from the past test outcomes. Since there are \(2^H\) such possible outcomes before the test data is renewed, we can simply enforce the union bound over these \(2^H\) possibilities to get the number of samples needed.
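The two union bounds translate into sample sizes as in the sketch below. It covers only the simplest test condition and uses the one-sided Hoeffding bound given earlier; the function name `required_samples` and the choice of 32 runs are assumptions made for illustration, and the bounds in our papers may be tighter.

```python
import math

def required_samples(epsilon, delta, n_runs=1, adaptive=False):
    """Smallest N for which n_runs evaluations of a single accuracy score all
    stay within epsilon of the truth, except with probability at most delta."""
    if not (0 < epsilon < 1 and 0 < delta < 1 and n_runs >= 1):
        raise ValueError("need 0 < epsilon, delta < 1 and n_runs >= 1")
    # Logarithm of the number of events covered by the union bound:
    # H evaluations if outcomes stay hidden, 2^H outcome sequences otherwise.
    # Working in log space avoids overflow for large H.
    log_events = n_runs * math.log(2) if adaptive else math.log(n_runs)
    return math.ceil((log_events + math.log(1 / delta)) / (2 * epsilon ** 2))

for adaptive in (False, True):
    n = required_samples(0.05, 0.001, n_runs=32, adaptive=adaptive)
    setting = "adaptive" if adaptive else "non-adaptive"
    print(f"{setting}: {n} test samples for 32 runs")
```

In the non-adaptive setting the sample size grows only logarithmically with the number of runs, whereas revealing the outcomes makes it grow linearly.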

A System for Testing ML Models

To round everything up, we implement an ML testing system that uses the aforementioned methods. Our system has two key functionalities:

  1. Managing Test Data: Firstly, this involves computing the number of test runs given the size of a test dataset according to the methods we have described earlier. Secondly, it counts the number of test runs performed and swaps out the used up test dataset with a fresh one after the maximum number of re-uses is reached.
  2. Evaluating Test Conditions: This mainly involves taking the measured model accuracy and using it to determine if the test condition should pass or fail.
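The test data bookkeeping in the first item can be sketched as a small counter. This is an illustration under assumptions, not the code of our system: the names `max_adaptive_runs` and `DatasetBudget` are hypothetical, and the budget is derived from the simple one-sided Hoeffding bound combined with the union bound over \(2^H\) outcomes for the adaptive setting.

```python
import math

def max_adaptive_runs(n_samples, epsilon, delta):
    """Largest H such that 2^H * exp(-2 * N * epsilon^2) <= delta.

    Returns 0 if the test set is too small for even a single run.
    """
    exponent = 2 * n_samples * epsilon ** 2 + math.log(delta)
    return max(0, math.floor(exponent / math.log(2)))

class DatasetBudget:
    """Counts test runs on the current test set and flags when to swap it."""

    def __init__(self, n_samples, epsilon, delta):
        self.epsilon = epsilon
        self.delta = delta
        self.swap_in(n_samples)

    def swap_in(self, n_samples):
        """Register a fresh test set of the given size and reset the counter."""
        self.max_runs = max_adaptive_runs(n_samples, self.epsilon, self.delta)
        self.runs_used = 0

    @property
    def exhausted(self):
        return self.runs_used >= self.max_runs

    def record_run(self):
        """Call once per CI run that evaluates a model on the test set."""
        if self.exhausted:
            raise RuntimeError("test set used up; a fresh one is needed")
        self.runs_used += 1

budget = DatasetBudget(n_samples=6000, epsilon=0.05, delta=0.001)
print(f"allowed runs on this test set: {budget.max_runs}")
```

A CI job would call `record_run()` before each evaluation and ask for a fresh test set via `swap_in()` once `exhausted` becomes true.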

The system is built to be versatile with respect to where the test data and logging metadata are stored, and how ML models are represented. The workflow we prescribe is given below. We assume two user roles: the developer who implements and commits models, and the manager who defines the test conditions and is responsible for providing fresh test data. Each time a new model is committed to the repository, this triggers a CI run, which will first execute the test and then evaluate the test condition in the manner we described above. The data manager component is responsible for swapping the current test set with a fresh one once the allowed number of test runs has been reached.

What are the limitations?

We see the proposed approach as a first step towards core principles on how to conduct systematic ML model testing in a statistically sound manner within established CI workflows. Nevertheless, we realize that our work suffers from a couple of limitations, which directly open up some further research questions.

Firstly, the current implementation and theoretical guarantees only hold for evaluating the accuracy of a classifier. Extending this work beyond classification models and to other metrics represents an interesting opportunity for further research.

Secondly, when calculating the bounds and the number of samples needed, we focus on a worst-case scenario where the developer represents an adversarial player aiming at overfitting to the hidden test set. Luckily, this is rarely the case in real life. Modelling developers in a more realistic but still statistically sound way could offer the possibility to further reduce the number of test samples needed (or increase the number of supported evaluations) and thus make the proposed approach and system more suitable for a larger target audience in the future.


  • Renggli, Cedric, et al. "Continuous integration of machine learning models with ease.ml/ci: Towards a rigorous yet practical treatment." SysML (2019). [Link to the paper]
  • Karlaš, Bojan, et al. "Building continuous integration services for machine learning." SIGKDD (2020). [Link to the paper]