Today's article comes from the Array journal. The authors are Guo et al., from the University of Osaka, in Japan. In this paper, they're trying to teach machine learning models to find and test the boundaries of an input space.
DOI: 10.1016/j.array.2025.100496
Let's say you're building a simple web app. A user can sign up or log in, do some stuff, and log off. Nothing fancy.
Right now you're working on the signup form. When a user chooses their password, you need to validate it (on the backend) against your company's requirements. It must be at least a certain minimum length, not exceed a maximum length, contain certain types of characters, avoid other disallowed character types, avoid repeating or sequential characters, etc.
Since you're doing TDD/BDD, you need tests for all these cases. Too few characters? That's a case. Too many characters? Another case. Empty string? Starts with a space? Non-UTF-8 characters? Non-ASCII characters? Case after case after case. And a good portion of your effort is spent flushing out what we call "boundary conditions". That is: testing what happens just short of a numerical or conceptual threshold, testing what happens exactly at that threshold, and then testing what happens past it.
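To make that concrete, here's what boundary-focused tests for just the length rule might look like in Python with pytest. The validate_password stub and the 8/64 length limits are hypothetical stand-ins; the point is the just-short / exactly-at / just-past pattern.

```python
import pytest

MIN_LEN, MAX_LEN = 8, 64  # assumed policy values, purely for illustration

def validate_password(pw: str) -> bool:
    # Stand-in implementation so the tests below are runnable;
    # a real validator would also check character classes, repetition, etc.
    return MIN_LEN <= len(pw) <= MAX_LEN

@pytest.mark.parametrize("pw,expected", [
    ("a" * (MIN_LEN - 1), False),  # just short of the minimum
    ("a" * MIN_LEN,       True),   # exactly at the minimum
    ("a" * (MIN_LEN + 1), True),   # just past the minimum
    ("a" * (MAX_LEN - 1), True),   # just short of the maximum
    ("a" * MAX_LEN,       True),   # exactly at the maximum
    ("a" * (MAX_LEN + 1), False),  # just past the maximum
    ("",                  False),  # empty string
])
def test_length_boundaries(pw, expected):
    assert validate_password(pw) == expected
```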
These boundary conditions are where a lot of software tends to break. So, many testing tools allow you to focus on something called "boundary value analysis". That is: methodically and rigorously testing the edges of your input space. That's great! But here's the problem: as systems grow more complex, manually identifying these boundaries becomes anything but trivial.
And that's where this paper comes in. In it, the authors are trying to teach machine learning models to automatically find and test these boundary regions. Their system combines neural networks with Monte Carlo sampling to generate test inputs that are more likely to expose bugs than traditional approaches would. On today's episode, we'll walk through how they trained their models to recognize when inputs cross boundaries, how they sampled test inputs from boundary-rich regions, and how they automatically generated training data using code coverage tools. Let's dive in.
To set the stage, we first need to understand why boundary testing is deceptively difficult to do correctly. It doesn't sound hard because in simple programs, boundaries are easy to spot. If you, for example, have a function that checks whether a number is positive, you'd test with values like -1, 0, and 1. Clear boundaries, easy to test.
But consider a program that decides whether to approve a loan. It looks at credit score, income, debt-to-income ratio, employment history, and so on. The threshold for approval, in this case, isn't a simple line. It's a multidimensional surface. Crossing from one credit score to another doesn't mean much on its own. The approval decision looks at all the variables at once, at how they relate and how they interact with each other. So how would you test a persona that is 'just below' the approval threshold? Or exactly at it? Or just past it? What do those concepts mean when the input space is multidimensional?
Traditionally, boundary value analysis relies on one of two strategies: a black-box approach, where the boundaries are derived from the specification, or a white-box approach, where they're derived from the conditions in the source code.
The issue is: both of these approaches hit walls when you're dealing with high complexity. Specifications are often incomplete or written in natural language that's ambiguous about edge cases. Source code analysis becomes computationally intractable when you're dealing with so many interdependent variables and conditions. And unlike black-box testing where boundaries are derived from the specs, white-box testing has to find feasible input values that satisfy the given conditions. For example: if a program contains a condition where two variables must sum to at least twelve, the boundary can't be determined by either variable alone. And finding the actual input values that lie on the boundary is a search problem.
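To see why, here's a toy Python version of that sum-to-twelve condition. The approve function is made up for illustration; the point is that the boundary is the line x + y = 12, so no threshold on either variable alone captures it.

```python
# Toy version of the "two variables must sum to at least twelve" branch.
def approve(x: int, y: int) -> bool:
    return x + y >= 12

# (0, 12), (6, 6), and (12, 0) all sit exactly on the boundary, and moving
# a single variable by one unit can flip the outcome:
assert approve(6, 6) is True    # exactly on the boundary
assert approve(6, 5) is False   # one step into the "reject" partition
assert approve(5, 7) is True    # same first variable, different outcome
```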
But this is where the authors saw an opportunity for machine learning to step in and automate things. The core idea is that you should treat boundary detection as a pattern recognition problem. Instead of trying to derive boundaries from specifications or from code, they trained models to recognize when two inputs are likely to produce different behaviors.
It works by taking pairs of inputs and feeding them to the program under test. It monitors execution using a tool called Gcov, which tracks which branches of code get executed. If two inputs follow the same execution path, they're in the same behavioral partition. If they follow different paths, there's a boundary between them. They formalize this using an ML pipeline that outputs the probability that two inputs belong to different equivalence partitions. And this pipeline has two components: a similarity function that compares execution traces, and a neural network classifier that learns to predict, from the input values alone, whether a pair crosses a partition boundary.
They represent execution traces as binary vectors where each element indicates whether a particular branch was taken. The similarity function then measures how different two execution paths are by comparing these vectors element by element. This creates a mathematical foundation for automatically determining when inputs cross behavioral boundaries.
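As a rough illustration (not the paper's exact similarity function), here's what that representation and an element-by-element comparison could look like in Python. The hard-coded traces stand in for what Gcov would report, and the distance here is simply the fraction of branches on which two runs disagree.

```python
import numpy as np

# Hypothetical branch-coverage traces: element i is 1 if branch i was taken.
# In the paper these come from running the program under Gcov.
trace_a = np.array([1, 0, 1, 1, 0, 0])
trace_b = np.array([1, 0, 0, 1, 1, 0])

def trace_distance(t1: np.ndarray, t2: np.ndarray) -> float:
    """Fraction of branches on which two execution traces disagree.

    0.0 means identical paths (same partition); anything greater suggests
    a boundary lies somewhere between the two inputs."""
    return float(np.mean(t1 != t2))

print(trace_distance(trace_a, trace_a))  # 0.0   -> same partition
print(trace_distance(trace_a, trace_b))  # ~0.33 -> likely different partitions
```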
Let's dig into the architecture of the model a little more. It's a binary classifier built from fully connected layers (an input layer, two hidden layers, and an output layer) and trained with binary cross-entropy loss. The hidden layers use ReLU (rectified linear unit) activation, which is a common choice that helps networks learn complex patterns, while the output layer uses sigmoid activation to produce probability values between zero and one. The input layer's size is twice the number of the program's input parameters, since it takes a concatenated pair. So if you're testing a program with five input parameters, the network needs to handle ten total inputs.
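Here's what that could look like as a PyTorch sketch. The layer structure, activations, and loss follow the description above, but the hidden-layer widths, the optimizer, and the toy training batch are my own assumptions (and the paper isn't necessarily implemented in PyTorch).

```python
import torch
import torch.nn as nn

N_PROGRAM_INPUTS = 5  # e.g. a program with five input parameters

model = nn.Sequential(
    nn.Linear(2 * N_PROGRAM_INPUTS, 64),  # input layer takes a concatenated pair
    nn.ReLU(),
    nn.Linear(64, 64),                    # two hidden layers with ReLU
    nn.ReLU(),
    nn.Linear(64, 1),
    nn.Sigmoid(),                         # probability the pair crosses a boundary
)

loss_fn = nn.BCELoss()  # binary cross-entropy, as described above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One hypothetical training step: x is a batch of concatenated input pairs,
# y is 1.0 when coverage showed the pair taking different paths, else 0.0.
x = torch.rand(32, 2 * N_PROGRAM_INPUTS)
y = torch.randint(0, 2, (32, 1)).float()
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```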
But recognizing boundaries is only half the battle. Remember, the real goal is generating test inputs that are likely to find bugs. This is where the second part comes in: using the boundary predictions to guide test generation. They do this through a family of algorithms called Markov chain Monte Carlo (MCMC) sampling. This is a method for generating samples from complex probability distributions that you can't directly compute. It simulates a random walk where each step is influenced by how "interesting" the current location is according to some target distribution. The challenge lies in defining what makes a region "interesting" for boundary testing. That is: determining how to quantify boundary density. The authors define two target densities: roughly speaking, pointDensity scores an individual input by how likely small perturbations around it are to cross a boundary, while pairDensity scores a pair of inputs directly by how likely a boundary is to separate them.
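To give a feel for what such target densities might look like, here are two rough Python sketches. These are not the paper's formulas: model_prob stands in for the trained classifier, and the perturbation radius and distance discount are assumptions of mine.

```python
import numpy as np

def point_density(x, model_prob, eps=0.1, n_probes=8):
    """Score a single input by how often small random perturbations around it
    are predicted to cross a boundary (a pointDensity-style score)."""
    x = np.asarray(x, dtype=float)
    probes = x + np.random.uniform(-eps, eps, size=(n_probes, x.size))
    return float(np.mean([model_prob(x, p) for p in probes]))

def pair_density(x1, x2, model_prob, scale=1.0):
    """Score a pair of inputs by the predicted boundary-crossing probability,
    discounted by how far apart the pair is (a pairDensity-style score)."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    return model_prob(x1, x2) * np.exp(-np.linalg.norm(x1 - x2) / scale)
```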
The pairDensity approach addresses a limitation of pointDensity when dealing with large input spaces. In high-dimensional problems, it becomes increasingly difficult to find individual points that are close to boundaries through small perturbations. By working with pairs directly, the system can search over a larger space while still maintaining the relationship between distance and boundary likelihood. As for the actual MCMC implementations, they use two samplers: Metropolis-Hastings, which proposes random steps and accepts or rejects them based on the ratio of target densities, and Langevin Monte Carlo, which additionally uses gradient information to steer proposals toward high-density regions.
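And here's a bare-bones Metropolis-Hastings random walk over a density like the pointDensity sketch above (the step size and Gaussian proposal are assumptions). The key idea: moves into denser, more boundary-rich regions are always accepted, while moves into sparser regions are accepted only sometimes, so the walk drifts toward boundaries without getting stuck.

```python
import numpy as np

def metropolis_hastings(density, x0, n_steps=1000, step_size=0.2, seed=0):
    """Random-walk Metropolis-Hastings: `density` is any non-negative score,
    e.g. lambda x: point_density(x, model_prob) from the sketch above."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    p_x = density(x)
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.normal(0.0, step_size, size=x.shape)  # random step
        p_prop = density(proposal)
        # Accept with probability min(1, p_prop / p_x).
        if p_x == 0 or rng.random() < min(1.0, p_prop / p_x):
            x, p_x = proposal, p_prop
        samples.append(x.copy())
    return np.array(samples)
```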
So that's how their system works. But how well does it work?
To find out, they tested seven different applications, from a simple date calculator to an aircraft collision avoidance system. Complexity ranged from simple functions with a few dozen lines of code to complex systems with hundreds of lines and up to twelve input dimensions. They compared their system against several baselines, seeding six different types of faults into the programs: off-by-one bugs, logical operator replacement, scalar variable replacement, relational operator replacement, arithmetic operator replacement, and constant replacement.
In the end, they did well! Their system was competitive with, and often superior to, many of the existing approaches. It outperformed manual boundary analysis in five of eight programs and beat concolic testing in four of eight. The pointDensity approach consistently achieved higher fault detection rates than pairDensity, likely because it can guarantee test inputs within a specific distance of boundaries rather than just predicting that boundaries exist nearby. The Langevin Monte Carlo algorithms took longer than Metropolis-Hastings due to the gradient computations they require, but showed superior convergence to boundary-rich regions in the visual analysis. This suggests that the extra computational cost may be justified when high-quality boundary identification is critical.
So what can we learn from this study? Namely, that there is a feasible path towards automated testing that can adapt to complex software without requiring extensive human analysis of specs or source code. If you want to see the experimental setup, the neural network architectures, the MCMC parameters, or the mutation testing methodology, I'd highly recommend downloading the paper. The authors also include statistical significance tests, and ablation studies showing which components contribute most to the system's effectiveness.