Today's article comes from the journal PeerJ Computer Science. The authors are Yun et al., from the Harbin Institute of Technology in China. In this paper they showcase a new strategy for solving real-world SAT problems at scale.
Look, I'll be straight with you: even if you're a software engineer, you're probably never going to interact directly with a SAT solver. But...odds are pretty good that you're using tools that rely on them every single day. The Electronic Design Automation tools that help design the processors in your laptop? SAT solvers. Model checking systems that verify your code doesn't have race conditions? SAT solvers. Even some AI planning systems are built on SAT solvers. They're everywhere.
And here's the thing: SAT solvers (that is, Boolean Satisfiability Problem solvers) are getting slower, at least in absolute terms. Why? Because of the insane problems we're now asking them to solve. As integrated circuits get more complex and software systems get bigger, these solvers have to chew through exponentially larger search spaces, including problems we would have previously considered "intractable".
So when a team of researchers pokes their head up and says they've figured out how to make SAT solvers faster, we take notice. Today we're looking at one of those types of papers. The authors think they've found a clever way to speed things up, and they're taking inspiration from an unlikely source: neural networks. Specifically, they're using Self-Organizing Maps (SOMs) to partition SAT problems before solving them. Divide and conquer. Break a big problem into smaller chunks, solve those chunks separately, then combine the results.
On today's episode, we'll see if their approach actually works. But before we jump in, we need to talk about what a SAT problem actually is, why these problems are so important, and why they're so hard.
The Boolean Satisfiability Problem sounds deceptively simple:
Given a logical formula with variables that can be either true or false, is there some assignment of those variables that makes the entire formula true?
Take this expression:
(A OR B) AND (NOT A OR C)
Can you find values for A, B, and C that make this whole thing true? Take a moment and try to think through it, but remember: each variable can only be TRUE or FALSE, and the value you choose has to hold everywhere that variable appears in the formula. Meaning: if you set A to TRUE in the first set of parentheses, A is also TRUE in the second set (where it appears as NOT A). If you want extra credit, don't just try to solve this, but pay attention to how your brain is solving it. There's a good chance your brain is brute-forcing through different options, flipping variables from true to false and seeing what happens.
Okay let's look at a solution:
If you set A to FALSE (which makes NOT A equal TRUE), B to TRUE, and C to TRUE, then you get...
(FALSE OR TRUE) AND (TRUE OR TRUE)
...which equals...
TRUE AND TRUE
...which equals...
TRUE
So yes, this formula is satisfiable.
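If you'd rather see that brute-force instinct written down, here's a minimal sketch in Python. It simply tries every combination of TRUE and FALSE, which is roughly the approach your brain was reaching for; note that it stops at the first satisfying assignment it finds, which may not be the same one we just walked through.

```python
from itertools import product

def formula(a, b, c):
    # (A OR B) AND (NOT A OR C)
    return (a or b) and ((not a) or c)

# Try every combination of TRUE/FALSE for the three variables.
for a, b, c in product([False, True], repeat=3):
    if formula(a, b, c):
        print(f"Satisfiable: A={a}, B={b}, C={c}")
        break
else:
    print("Unsatisfiable")
```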
Now imagine doing that same process but with thousands or millions of variables and clauses. That's what industrial SAT problems look like. And they're not contrived; they normally come from real-world systems: circuit designs, software verification tasks, cryptographic analysis, etc. And as you can probably tell from what your brain just tried to do, the search space grows exponentially with the number of variables. In other words: if we had added a variable D to that formula we just looked at, we would have doubled the number of assignments to check.
Take one of these problems up to 20 variables and you'll have over a million possible combinations to check. Infeasible to do on a piece of paper, but no sweat for a modern computer. But if you keep going, keep adding variables, it gets way worse. With a few hundred variables, you're looking at more combinations than there are atoms in the observable universe.
So how can we solve something like that? How is that even possible?
Well, modern SAT solvers don't just brute-force through every possibility the way our brains probably would. They use algorithms with names like DPLL, CDCL, and Look-Ahead. These techniques prune the search space by learning from conflicts and making educated guesses about which variables to try first. That being said, even with all these optimizations, large industrial problems can still take hours or days to solve (depending on the hardware, of course).
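To make that a little more concrete, here's a toy DPLL-style solver in Python. This is a teaching sketch, not anything close to a modern engine: it only does basic simplification, unit propagation, and backtracking, and the clause encoding (positive integers for variables, negative for negations) is just a common convention, not anything specific to this paper.

```python
def dpll(clauses, assignment=None):
    """Toy DPLL. Clauses are lists of ints: 3 means x3, -3 means NOT x3."""
    if assignment is None:
        assignment = {}

    # Simplify: drop clauses that are already satisfied, strip falsified literals.
    simplified = []
    for clause in clauses:
        if any(assignment.get(abs(lit)) == (lit > 0) for lit in clause):
            continue  # clause is satisfied by the current assignment
        remaining = [lit for lit in clause if abs(lit) not in assignment]
        if not remaining:
            return None  # clause is falsified -> conflict, backtrack
        simplified.append(remaining)

    if not simplified:
        return assignment  # every clause satisfied

    # Unit propagation: a single-literal clause forces that variable's value.
    for clause in simplified:
        if len(clause) == 1:
            lit = clause[0]
            return dpll(simplified, {**assignment, abs(lit): lit > 0})

    # Otherwise branch: pick a variable and try both values.
    lit = simplified[0][0]
    for value in (True, False):
        result = dpll(simplified, {**assignment, abs(lit): value})
        if result is not None:
            return result
    return None

# (A OR B) AND (NOT A OR C), with A=1, B=2, C=3:
print(dpll([[1, 2], [-1, 3]]))  # prints a satisfying assignment, e.g. {1: True, 3: True}
```

Modern CDCL solvers layer conflict-driven clause learning, smarter branching heuristics, and restarts on top of this basic skeleton, and that's where most of the practical speed comes from.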
Okay, here's where today's paper comes in. The authors noticed something interesting: unlike the contrived example I gave you, real-world SAT problems aren't just random collections of variables and clauses. They have structure.
Imagine you're a registrar at a university. Your job is to build the class schedule. That doesn't just mean tossing classes onto a calendar; it means making sure professors aren't assigned to teach two courses at the same time, that rooms aren't double-booked, and that the courses required for each major are offered multiple times a week. You also have to handle rules like "introductory chemistry and its lab section must be scheduled on the same day," or "upper-level electives can't overlap with their prerequisite courses," and you may even have to work around personal constraints like certain professors only teaching in the mornings. Out of all the possible ways to arrange things, you need to find one that satisfies every single requirement at once. This search for a valid arrangement (balancing people, times, rooms, and special rules) is exactly the kind of real-world problem that gets encoded as a SAT instance.
So how is that different from a random made-up SAT example? The difference is structure. In the scheduling problem, the constraints aren't wired together at random: the chemistry department's rules barely touch the history department's, and the morning-only professors only interact with the time slots they could actually teach. Variables naturally fall into clusters that interact heavily with each other and only loosely with everything else.
The authors reasoned that if you could identify these clusters and solve them separately, you might just dramatically reduce the overall solving time. If you have a problem with thousands of variables that grows exponentially in solving time, and you can split it into smaller problems, your total time becomes the sum of a few small exponential functions instead of one giant one. Since each piece is significantly smaller than the original, this can be a huge improvement. To put it more simply: don't do something that takes 2^9 seconds to complete if you can opt for three things that take 2^3 seconds each instead. The former is 512 seconds, the latter 24. A big savings, and the gap only widens as the problems get bigger.
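If you want to see how quickly that gap explodes, here's a quick back-of-the-envelope calculation in Python. It assumes a worst-case 2^n checks for n variables and an idealized split into fully independent pieces; real partitions share variables across the cut, so the savings are never quite this clean, but the shape of the curve is the point.

```python
# Worst-case 2**n assignments for n variables, versus splitting the problem
# into k idealized, fully independent pieces of n/k variables each.
n, k = 90, 3
whole = 2 ** n
split = k * 2 ** (n // k)
print(f"one problem with {n} variables:        {whole:.2e} assignments")
print(f"{k} sub-problems with {n // k} variables each: {split:.2e} assignments")
```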
But this technique only works if you can find and identify those clusters. How do you actually do that? Traditionally, you have to manually tune similarity thresholds and other parameters for each type of problem. That's not scalable, and it doesn't work well when you don't know the structure of your problem ahead of time.

This is where the Self-Organizing Map, or SOM, comes in. A SOM is a type of neural network trained with an unsupervised learning algorithm. This means you don't need to tell it what the "right" answer looks like; you just feed it data and it figures out patterns on its own. Imagine a 2D grid of neurons, like a sheet of graph paper. Each neuron has a feature vector that gets updated during training. When you show the network a data point, it finds the neuron whose feature vector is most similar to that data point. That neuron "wins" and gets to update its features to become even more similar to the input. But here's the clever part: neighboring neurons also get to update their features, just to a lesser degree.
Over time, this creates a map where similar inputs cluster together in the same region.
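Here's a minimal SOM training loop in Python (using numpy) to make that "winner and its neighbors get nudged" idea concrete. The grid size, learning rate, and decay schedule below are illustrative defaults I've chosen for the sketch, not the configuration from the paper.

```python
import numpy as np

def train_som(data, grid=(10, 10), epochs=100, lr0=0.5, sigma0=3.0, seed=0):
    """Minimal Self-Organizing Map: returns a (rows, cols, dims) grid of feature vectors."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    dims = data.shape[1]
    weights = rng.random((rows, cols, dims))

    # Each neuron's (row, col) coordinate, used for neighborhood distances on the grid.
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

    for epoch in range(epochs):
        # Learning rate and neighborhood radius both shrink over time.
        lr = lr0 * np.exp(-epoch / epochs)
        sigma = sigma0 * np.exp(-epoch / epochs)
        for x in rng.permutation(data):
            # 1. Find the best-matching unit (the "winner") for this input.
            dists = np.linalg.norm(weights - x, axis=-1)
            winner = np.unravel_index(np.argmin(dists), dists.shape)
            # 2. Nudge the winner AND its grid neighbors toward the input,
            #    with influence falling off with distance on the grid.
            grid_dist = np.linalg.norm(coords - np.array(winner), axis=-1)
            influence = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))
            weights += lr * influence[..., None] * (x - weights)
    return weights
```

The key line is the last one inside the loop: every neuron moves toward the input, but the amount it moves falls off with its distance from the winner on the grid, which is exactly what makes similar inputs settle into the same neighborhood.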
So how are SOMs integrated here exactly?
Well, when faced with a problem that needs to be divided into manageable pieces, the authors start by mapping every clause in the SAT instance into a high-dimensional space where each variable corresponds to a dimension. Then they feed those points into the SOM. As the network trains, clauses that share many variables gravitate toward the same region of the grid, while unrelated clauses drift apart. What emerges is a topological map where neighborhoods correspond to clusters of clauses that are tightly connected internally but only loosely linked to other neighborhoods. Once these clusters are identified, the solver can carve the SAT instance along those boundaries, producing two or more smaller problems instead of one massive monolith. The important thing is that the SOM removes the need for hand-tuned thresholds or domain-specific rules. It exposes the natural modularity in the data automatically. After clustering, the solver applies a partitioning strategy and then runs sub-solvers on each piece, sharing learned clauses between them to avoid redundant work.
In effect, the SOM acts as the lens that reveals hidden structure, while the partition-and-solve framework exploits that structure to cut down solving time. The result is a system that simply scales better. Not by inventing a new solver core, but by reshaping the problem itself into a form that existing solvers can tackle far more efficiently.
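To make the pipeline tangible, here's a hedged sketch of what that clause-to-vector step and the clustering might look like, reusing the `train_som` function from the sketch above. The 0/1 "which variables appear in this clause" encoding, the tiny grid, and grouping clauses by their winning neuron are my simplifications for illustration; the paper's actual featurization, partitioning strategy, and clause-sharing machinery are more involved.

```python
import numpy as np

def clause_vectors(clauses, num_vars):
    """Encode each clause as a 0/1 vector over the variables it mentions."""
    vectors = np.zeros((len(clauses), num_vars))
    for i, clause in enumerate(clauses):
        for lit in clause:
            vectors[i, abs(lit) - 1] = 1.0
    return vectors

def partition_clauses(clauses, num_vars, grid=(2, 2)):
    """Group clauses by the SOM neuron that wins them.

    A real system would go further: merge neighboring neurons into regions,
    cut along the weakest links, and hand each region to a sub-solver.
    """
    data = clause_vectors(clauses, num_vars)
    weights = train_som(data, grid=grid, sigma0=1.0)  # train_som from the sketch above
    groups = {}
    for clause, x in zip(clauses, data):
        dists = np.linalg.norm(weights - x, axis=-1)
        winner = np.unravel_index(np.argmin(dists), dists.shape)
        groups.setdefault(winner, []).append(clause)
    return list(groups.values())

# Two loosely connected "modules": clauses over x1..x3 and clauses over x4..x6.
clauses = [[1, 2], [-1, 3], [2, -3], [4, 5], [-4, 6], [5, -6]]
for i, part in enumerate(partition_clauses(clauses, num_vars=6)):
    print(f"sub-problem {i}: {part}")
```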
So how well does this all actually work? They tested their approach on three benchmark suites: SATLIB, SAT Competition 2013, and SAT Competition 2023. They used three different underlying solvers: MiniSat, Glucose, and CaDiCaL, which represent different generations of solver technology, from older to current.
In the end, the results were pretty impressive, at least for large industrial problems. On those benchmarks, their method achieved winning rates between 69% and 81%. In other words, it was faster than calling the solver directly on that percentage of problems. When it did win, the speedups averaged between 42% and 64%. And as you might expect, their method performed terribly on randomly generated problems. The winning rates dropped to 8%-16%, and sometimes it was actually slower than the direct approach. This makes perfect sense. Random problems don't have the modular structure that makes partitioning effective. If variables are connected to each other completely at random, then any partition you create will have a large cut size, and you won't get the exponential savings from divide-and-conquer. Instead, you're just adding overhead that serves no purpose. While this sounds damning, it actually confirms the authors' core hypothesis: naturally generated problems aren't randomly connected, and that structure gives us an opportunity to optimize the solving process in ways that aren't otherwise possible.
More broadly, this paper demonstrates how machine learning techniques can enhance classical algorithms rather than replace them entirely. This hybrid approach seems (at least to me) much more promising than trying to train an end-to-end neural network to solve SAT problems directly. It piggy-backs on techniques that are already working and gives them a leg-up, rather than trying to replace them.
If you want to go deeper (into the parameter selection process, or their benchmark setup, or their results), then I'd highly recommend that you download the paper. It has everything you'd need to implement the algorithm yourself, or recreate their tests and analysis.