Today's article comes from MDPI's Journal of Software. The authors are Bagheri et al. from the University of Szeged in Hungary. The paper debuts a new method for detecting security vulnerabilities in a Python codebase. According to the authors, their approach attained the highest precision of any ML-based vulnerability detector to date. Is their claim valid, or is it possible that their model was overfitted to the training data? Let's see.
First, we need to step back in time for a moment. In 2017, Vaswani et al. published a seminal paper that would send shockwaves through the A.I. community: “Attention Is All You Need”. In it, the authors described a novel concept: the Transformer. Transformers (and their building blocks, called “transformer blocks”) were a new type of neural-network architecture based on the concept of self-attention. Self-attention allowed for two key benefits:
1. A model could relate any token in a sequence to any other token, no matter how far apart they were, capturing long-range dependencies that recurrent networks struggled with.
2. Because there was no recurrence, the computation could be heavily parallelized, making it practical to train much larger models on much more data.
In other words, transformers made large language models possible and set the stage for what would become foundation models, ChatGPT, and arguably the whole AI boom of the last few years.
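For reference, the core operation behind self-attention is the scaled dot-product attention from the Vaswani paper, where Q, K, and V are the query, key, and value matrices computed from the input tokens and d_k is the key dimension:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
\]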
I’m telling you this because the paper we’re about to dive into builds on top of the concept of a transformer, utilizing an even newer concept: a Conformer. You see, after a few years of working with transformers, researchers started to realize that they had a few limitations. Namely, transformers were great at determining long-range relationships between tokens, but they struggled with short-range relationships, or “local dependencies”. So the conformer was born: it keeps the transformer’s self-attention but adds convolutions to capture local patterns. Thus the name: “con” from convolution and “former” from transformer. Conformers are designed to be great at both the big picture and the tiny details.
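To make that a little more concrete, here is a minimal, illustrative sketch of a conformer-style block in Keras: a self-attention sub-block for long-range relationships, plus a 1-D convolution sub-block for local patterns. This is my own simplification with arbitrary layer sizes, not the architecture from the paper:

```python
# Minimal conformer-style block: self-attention (global context)
# plus a 1-D convolution module (local patterns).
# Illustrative only -- not the architecture from the paper.
import tensorflow as tf
from tensorflow.keras import layers

def conformer_block(x, d_model=128, num_heads=4, kernel_size=7):
    # Self-attention sub-block: long-range token-to-token relationships
    attn = layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=d_model // num_heads
    )(x, x)
    x = layers.LayerNormalization()(x + attn)

    # Convolution sub-block: short-range, local dependencies
    conv = layers.Conv1D(d_model, kernel_size, padding="same", activation="relu")(x)
    conv = layers.Conv1D(d_model, 1)(conv)  # pointwise projection back to d_model
    x = layers.LayerNormalization()(x + conv)

    # Feed-forward sub-block
    ff = layers.Dense(4 * d_model, activation="relu")(x)
    ff = layers.Dense(d_model)(ff)
    return layers.LayerNormalization()(x + ff)

# Example: token-embedding input of shape (sequence_length, d_model)
inputs = tf.keras.Input(shape=(256, 128))
outputs = conformer_block(inputs)
model = tf.keras.Model(inputs, outputs)
model.summary()
```

A real conformer block is more elaborate (it typically uses depthwise convolutions and paired feed-forward modules), but the point is the same: attention handles the big picture, convolution handles the local details.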
Now that you have that context, let's get into today’s research. In this paper, the authors train a conformer model that can scan a Python codebase and surface any security vulnerabilities that may be present. Their theory is that robust vulnerability detection takes two different skills:
1. Understanding the broad, long-range structure of the code, i.e., how pieces of a program that are far apart relate to one another.
2. Spotting fine-grained local patterns, i.e., the small details within a few lines of code where a vulnerability actually lives.
They argue that a conformer model is uniquely well suited to this task, so that’s what they set out to build.
Vulnerability analysis and detection come in many forms: there are manual processes, semi-automated processes, and fully automated processes. There is static analysis (which reviews code without running it), and dynamic analysis (which actually runs the code and observes its behavior). A robust vulnerability-prevention regime includes all of these types of analysis together. It’s important to note that this research focuses on fully automated static analysis only. So it is not, and isn’t intended to be, a full suite of solutions for vulnerability prevention. It’s just a new solution for that single component of what should be a more holistic program.
There are already many ML models that perform fully automated static analysis of codebases, so the authors decided to make theirs different in a number of key ways:
Here are the steps they took to actually do all of that. First, they scraped GitHub, picking their codebases carefully and labeling the commits. Then they preprocessed the data, built their abstract syntax trees (ASTs), control-flow graphs (CFGs), and data-flow graphs (DFGs), and embedded the text of the codebases into their training dataset using CSE (Code Sequence Embedding). Then, finally, they pulled down several security-vulnerability datasets to train their LLM. They trained their conformer on an HPE supercomputer called Komondor. Komondor was unveiled in early 2023 and, at the time of its release, was the most powerful supercomputer Hungary had ever built. The researchers were lucky enough to get access to it, and they trained using TensorFlow and Keras.
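To give you a feel for one of those intermediate representations, here is a tiny example using Python's built-in ast module to parse source code into an abstract syntax tree; control-flow and data-flow graphs are built on top of structural information like this. This is just an illustration of what an AST is, not the authors' preprocessing pipeline:

```python
# A small illustration of what an AST looks like, using Python's built-in
# ast module. (Not the authors' actual preprocessing pipeline.)
import ast

source = """
def add(a, b):
    result = a + b
    return result
"""

tree = ast.parse(source)

# Walk the tree and print each node type -- a vulnerability detector
# would extract structural features from nodes like these.
for node in ast.walk(tree):
    print(type(node).__name__)
```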
When the training was over, they ended up with a system they called VulDetective (presumably short for Vulnerability Detective), and they evaluated it on four main metrics: accuracy, precision, recall, and F1 score.
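If you need a refresher on those metrics, here is how they are conventionally computed from true/false positives and negatives. These are textbook definitions, not code from the paper:

```python
# Standard definitions of the four metrics, computed from a confusion matrix.
# (Textbook formulas, not code from the paper.)
def classification_metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Hypothetical example: 990 vulnerable samples correctly flagged, 10 clean
# samples wrongly flagged, 980 clean samples correctly passed, 5 missed.
print(classification_metrics(tp=990, fp=10, tn=980, fn=5))
```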
First, they broke the results out into different vulnerability categories: SQL injection, XSS, command injection, XSRF, remote code execution, and path disclosure. For every single one of these categories, the model's accuracy was above 99%, precision was above 97%, recall was above 99%, and F1 was above 98%.
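To give a feel for what the first category looks like in practice, here is a classic toy example of the kind of pattern a detector like this is supposed to flag: a SQL query built with string formatting versus a parameterized query. This is my own illustration, not a sample from the paper's dataset:

```python
# Toy example of SQL injection -- my own illustration, not from the paper.
import sqlite3

def find_user_vulnerable(conn, username):
    # Vulnerable: user input is interpolated directly into the SQL string,
    # so input like "x' OR '1'='1" changes the meaning of the query.
    query = "SELECT * FROM users WHERE name = '%s'" % username
    return conn.execute(query).fetchall()

def find_user_safe(conn, username):
    # Safe: a parameterized query keeps the input as data, not SQL.
    return conn.execute(
        "SELECT * FROM users WHERE name = ?", (username,)
    ).fetchall()

# Quick demo with an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")
print(find_user_safe(conn, "alice"))
```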
Next, they compared these results to other methods (from previous studies) that either used the same data-mining methods or the same database as VulDetective. Their model significantly outperformed all of them: CNN, CodeBERT, GraphCodeBERT, CuBERT, SELFATT, Devign, VulDeePecker, DeepVulSeeker, FUNDED, and Code2Vec. And it really wasn’t even close. By comparison, it was rare for any other method to reach greater than 90% on any of the four metrics. Virtually every other method had strengths in some areas and significant weaknesses in others. And none of them had anything approaching VulDetective’s performance on any of the metrics.
Next, they did an ablation study. This means they decomposed their system into its constituent parts, removed one part at a time, and re-ran the tests. For example, they individually (at different times) removed the AST structures, then the DFGs, then the CFGs, then the LLM, and so on. Here’s the issue: when they removed any individual component, the accuracy, precision, recall, and F1 scores all dropped dramatically, down into the 60% to low-70% range. So, to repeat that: with the entire system in place, VulDetective was able to get 98-99% scores on all key metrics. But when any individual piece of the system was missing, it lost roughly a third of its performance.
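As an aside, an ablation study conceptually boils down to a loop like the sketch below: build and evaluate the full system, then rebuild and re-evaluate it with one component removed at a time. The component names and the build/evaluate functions here are placeholders of mine, not the paper's code:

```python
# Conceptual sketch of an ablation study -- component names and the
# build/evaluate functions are placeholders, not the paper's actual code.
COMPONENTS = ["AST", "CFG", "DFG", "LLM_embeddings"]

def run_ablation(build_model, evaluate, full_components=COMPONENTS):
    results = {}
    # Baseline: the full system with every component enabled.
    results["full"] = evaluate(build_model(full_components))
    # Remove one component at a time and re-run the evaluation.
    for removed in full_components:
        remaining = [c for c in full_components if c != removed]
        results[f"without_{removed}"] = evaluate(build_model(remaining))
    return results
```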
Now, is it possible that these authors have stumbled on the exact right recipe, the exact mixture of components that when combined become significantly more than the sum of their parts? Certainly, that’s possible. And is it possible that every ingredient is so critical that the system legitimately loses a third of its accuracy if any component is removed? Yes. All of this is possible.
But, is it also possible that the fully-combined system was overfitted to the data, and the ablated versions were not? In my opinion, that is also a possibility.
How can we know? Well, the hallmark of an overfitted model is that it doesn’t generalize and doesn’t transfer. So I think replication will be the key here. If the methods and components they used to build this model are truly the new standard by which automated static analysis should be performed, then their results will be consistently replicable by many others over time.
I haven’t replicated their experiment, so I can’t say, but I’d encourage you to do so if you’re interested in this topic. If you want to do that, or you just want to read more about all the moving pieces they put together or how they cleaned and labeled their data, please do download the paper. When we look back a few years from now, this might turn out to be a landmark study. It might even be as impactful as the Vaswani study we talked about at the beginning. Or it might turn out to be a total dud. Time will certainly tell.