Today's article comes from MDPI's Journal of Software. The authors are Bagheri et al. from the University of Szeged in Hungary. The paper debuts a new method for detecting security vulnerabilities in a Python codebase. According to the authors, their approach attained the highest precision of any ML-based vulnerability detector to date. Is their claim valid, or is it possible that their model was overfitted to the training data? Let's see.
First, we need to step back in time for a moment. In 2017, Vaswani et al. published a seminal paper that would send shockwaves through the A.I. community: “Attention Is All You Need”. In it, the authors described a novel concept: the Transformer. Transformers (and their building blocks, called “transformer blocks”) were a new type of neural-network architecture based on the concept of self-attention. Self-attention allowed for two key benefits:
1. A model could relate any token in a sequence to any other token, no matter how far apart they were, capturing long-range dependencies that recurrent networks struggled with.
2. Because there was no recurrence, the computation could be heavily parallelized, making it practical to train much larger models on much more data.
In other words, transformers made large language models possible and set the stage for what would become foundation models, ChatGPT, and arguably the whole AI boom of the last few years.
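For reference, the core operation behind self-attention is the scaled dot-product attention from the Vaswani paper, where Q, K, and V are the query, key, and value matrices computed from the input tokens and d_k is the key dimension:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
\]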
I’m telling you this because the paper we’re about to dive into builds on top of the concept of a transformer, utilizing an even newer concept: a Conformer. You see, after a few years of working with transformers, researchers started to realize that they had a few limitations. Namely, transformers were great at determining long-range relationships between tokens, but they struggled with short-range relationships, or “local dependencies”. So the conformer was born: it keeps the transformer’s self-attention but adds convolutions to capture local patterns. Thus the name: “con” from convolution and “former” from transformer. Conformers are designed to be great at both the big picture and the tiny details.
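To make that a little more concrete, here is a minimal, illustrative sketch of a conformer-style block in Keras: a self-attention sub-block for long-range relationships, plus a 1-D convolution sub-block for local patterns. This is my own simplification with arbitrary layer sizes, not the architecture from the paper:

```python
# Minimal conformer-style block: self-attention (global context)
# plus a 1-D convolution module (local patterns).
# Illustrative only -- not the architecture from the paper.
import tensorflow as tf
from tensorflow.keras import layers

def conformer_block(x, d_model=128, num_heads=4, kernel_size=7):
    # Self-attention sub-block: long-range token-to-token relationships
    attn = layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=d_model // num_heads
    )(x, x)
    x = layers.LayerNormalization()(x + attn)

    # Convolution sub-block: short-range, local dependencies
    conv = layers.Conv1D(d_model, kernel_size, padding="same", activation="relu")(x)
    conv = layers.Conv1D(d_model, 1)(conv)  # pointwise projection back to d_model
    x = layers.LayerNormalization()(x + conv)

    # Feed-forward sub-block
    ff = layers.Dense(4 * d_model, activation="relu")(x)
    ff = layers.Dense(d_model)(ff)
    return layers.LayerNormalization()(x + ff)

# Example: token-embedding input of shape (sequence_length, d_model)
inputs = tf.keras.Input(shape=(256, 128))
outputs = conformer_block(inputs)
model = tf.keras.Model(inputs, outputs)
model.summary()
```

A real conformer block is more elaborate (it typically uses depthwise convolutions and paired feed-forward modules), but the point is the same: attention handles the big picture, convolution handles the local details.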
Now that you have that context, let's get into today’s research. In this paper, the authors train a conformer model that can scan a Python codebase and surface any security vulnerabilities that may be present. Their theory is that robust vulnerability detection takes two different skills:
1. Understanding the broad, long-range structure of the code, i.e., how pieces of a program that are far apart relate to one another.
2. Spotting fine-grained local patterns, i.e., the small details within a few lines of code where a vulnerability actually lives.
They argue that a conformer model is uniquely well suited to this task, so that’s what they set out to build.
Vulnerability analysis and detection come in many forms: there are manual processes, semi-automated processes, and fully automated processes. There is static analysis (which reviews code without running it), and dynamic analysis (which actually runs the code and observes its behavior). A robust vulnerability-prevention regime includes all of these types of analysis together. It’s important to note that this research focuses on fully automated static analysis only. So it is not, and isn’t intended to be, a full suite of solutions for vulnerability prevention. It’s just a new solution for that single component of what should be a more holistic program.
There are already many ML models that perform fully automated static analysis of codebases, so the authors decided to make theirs different in a number of key ways:
Here are the steps they took to actually do all of that. First, they scraped GitHub, picking their codebases carefully and labeling the commits. Then they preprocessed the data, built their abstract syntax trees (ASTs), control-flow graphs (CFGs), and data-flow graphs (DFGs), and embedded the text of the codebases into their training dataset using CSE (Code Sequence Embedding). Then, finally, they pulled down several security-vulnerability datasets to train their LLM. They trained their conformer on an HPE supercomputer called Komondor. Komondor was unveiled in early 2023 and, at the time of its release, was the most powerful supercomputer Hungary had ever built. The researchers were lucky enough to get access to it, and they trained using TensorFlow and Keras.
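To give you a feel for one of those intermediate representations, here is a tiny example using Python's built-in ast module to parse source code into an abstract syntax tree; control-flow and data-flow graphs are built on top of structural information like this. This is just an illustration of what an AST is, not the authors' preprocessing pipeline:

```python
# A small illustration of what an AST looks like, using Python's built-in
# ast module. (Not the authors' actual preprocessing pipeline.)
import ast

source = """
def add(a, b):
    result = a + b
    return result
"""

tree = ast.parse(source)

# Walk the tree and print each node type -- a vulnerability detector
# would extract structural features from nodes like these.
for node in ast.walk(tree):
    print(type(node).__name__)
```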
When the training was over, they ended up with a system they called VulDetective (presumably short for Vulnerability Detective), and they evaluated it on four main metrics: accuracy, precision, recall, and F1 score.
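If you need a refresher on those metrics, here is how they are conventionally computed from true/false positives and negatives. These are textbook definitions, not code from the paper:

```python
# Standard definitions of the four metrics, computed from a confusion matrix.
# (Textbook formulas, not code from the paper.)
def classification_metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Hypothetical example: 990 vulnerable samples correctly flagged, 10 clean
# samples wrongly flagged, 980 clean samples correctly passed, 5 missed.
print(classification_metrics(tp=990, fp=10, tn=980, fn=5))
```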
First, they broke the results out into different vulnerability categories: SQL injection, XSS, command injection, XSRF, remote code execution, and path disclosure. For every single one of these categories, the model's accuracy was above 99%, precision was above 97%, recall was above 99%, and F1 was above 98%.
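To give a feel for what the first category looks like in practice, here is a classic toy example of the kind of pattern a detector like this is supposed to flag: a SQL query built with string formatting versus a parameterized query. This is my own illustration, not a sample from the paper's dataset:

```python
# Toy example of SQL injection -- my own illustration, not from the paper.
import sqlite3

def find_user_vulnerable(conn, username):
    # Vulnerable: user input is interpolated directly into the SQL string,
    # so input like "x' OR '1'='1" changes the meaning of the query.
    query = "SELECT * FROM users WHERE name = '%s'" % username
    return conn.execute(query).fetchall()

def find_user_safe(conn, username):
    # Safe: a parameterized query keeps the input as data, not SQL.
    return conn.execute(
        "SELECT * FROM users WHERE name = ?", (username,)
    ).fetchall()

# Quick demo with an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")
print(find_user_safe(conn, "alice"))
```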
Next, they compared these results to other methods (from previous studies) that either used the same data-mining methods or the same database as VulDetective. Their model significantly outperformed all of them: CNN, CodeBERT, GraphCodeBERT, CuBERT, SELFATT, Devign, VulDeePecker, DeepVulSeeker, FUNDED, and Code2Vec. And it really wasn’t even close. By comparison, it was rare for any other method to reach greater than 90% on any of the four metrics. Virtually every other method had strengths in some areas and significant weaknesses in others. And none of them had anything approaching VulDetective’s performance on any of the metrics.
Next, they did an ablation study. This means they decomposed their system into its constituent parts, removed one part at a time, and re-ran the tests. For example, they individually (at different times) removed the AST structures, then the DFGs, then the CFGs, then the LLM, and so on. Here’s the issue: when they removed any individual component, the accuracy, precision, recall, and F1 scores all dropped dramatically, down into the 60% to low-70% range. So, to repeat that: with the entire system in place, VulDetective was able to get 98-99% scores on all key metrics. But when any individual piece of the system was missing, it lost roughly a third of its performance.
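As an aside, an ablation study conceptually boils down to a loop like the sketch below: build and evaluate the full system, then rebuild and re-evaluate it with one component removed at a time. The component names and the build/evaluate functions here are placeholders of mine, not the paper's code:

```python
# Conceptual sketch of an ablation study -- component names and the
# build/evaluate functions are placeholders, not the paper's actual code.
COMPONENTS = ["AST", "CFG", "DFG", "LLM_embeddings"]

def run_ablation(build_model, evaluate, full_components=COMPONENTS):
    results = {}
    # Baseline: the full system with every component enabled.
    results["full"] = evaluate(build_model(full_components))
    # Remove one component at a time and re-run the evaluation.
    for removed in full_components:
        remaining = [c for c in full_components if c != removed]
        results[f"without_{removed}"] = evaluate(build_model(remaining))
    return results
```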
Now, is it possible that these authors have stumbled on the exact right recipe, the exact mixture of components that when combined become significantly more than the sum of their parts? Certainly, that’s possible. And is it possible that every ingredient is so critical that the system legitimately loses a third of its accuracy if any component is removed? Yes. All of this is possible.
But, is it also possible that the fully-combined system was overfitted to the data, and the ablated versions were not? In my opinion, that is also a possibility.
How can we know? Well, the hallmark of an overfitted model is that it doesn’t generalize and doesn’t transfer. So I think replication will be the key here. If the methods and components they used to build this model are truly the new standard by which automated static analysis should be performed, then their results will be consistently replicable by many others over time.
I haven’t replicated their experiment, so I can’t say, but I’d encourage you to do so if you’re interested in this topic. If you want to do that, or you just want to read more about all the moving pieces they put together or how they cleaned and labeled their data, please do download the paper. When we look back a few years from now, this might turn out to be a landmark study. It might even be as impactful as the Vaswani study we talked about at the beginning. Or it might turn out to be a total dud. Time will certainly tell.