
Multifidelity Kolmogorov-Arnold networks

Today's article comes from the journal Machine Learning: Science and Technology. The authors are Howard et al., from Pacific Northwest National Laboratory in Washington state. In this paper, they showcase a framework that lets you train KANs more effectively when high-quality data is scarce, by combining it with abundant lower-fidelity data.

DOI: 10.1088/2632-2153/adf702


Andrey Kolmogorov was one of the giants of 20th-century mathematics. Among other things, he helped shape probability theory into the rigorous field we know today. Then, in the late 1950s, he turned his attention to a deceptively simple but deep problem: how to represent multivariable functions. Around the same time, his student Vladimir Arnold was beginning to make a name for himself (he would later become famous for his contributions to dynamical systems and geometry). Together, the two uncovered something remarkable.

First, Kolmogorov proved that any continuous function of several variables can be represented as a finite sum of continuous functions of just one variable, combined through addition and composition. Two years later, Arnold sharpened and clarified this result, showing in more detail how such decompositions could actually be constructed. Their work became known as the Kolmogorov-Arnold representation theorem.
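
In its modern form, the theorem says that any continuous function f on the n-dimensional unit cube can be written as

\[
f(x_1, \dots, x_n) = \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right),
\]

where every outer function Φ_q and inner function φ_{q,p} is a continuous function of a single variable. All of the multivariate behavior lives in how these univariate pieces are added together and composed.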

For decades, the theorem lived mostly in the realm of pure mathematics. That changed in recent years, when researchers in machine learning realized that this old theorem could serve as the blueprint for a new kind of neural network. Neural networks, at their core, are universal function approximators. Mash this idea up with the Kolmogorov-Arnold representation theorem and the design practically suggests itself: instead of relying on fixed activation functions at each node, why not build networks where the connections themselves carry learnable univariate functions? That's the idea behind Kolmogorov-Arnold Networks, or KANs. Each connection is parameterized with splines and a base function, which allows the network to adapt its building blocks during training rather than forcing every neuron to behave in the same rigid way.
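
To make "learnable functions on the edges" concrete, here is a minimal sketch of a single edge activation in the style described above: a fixed base function plus a learnable B-spline, each with its own weight. The class name, grid size, spline degree, and initialization below are illustrative choices for the sketch, not the authors' implementation.

```python
import numpy as np
from scipy.interpolate import BSpline

def silu(x):
    # SiLU base function: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

class KANEdge:
    """One learnable edge activation: phi(x) = w_base * silu(x) + w_spline * spline(x).

    The spline is a cubic B-spline on a fixed grid; its coefficients, together with
    w_base and w_spline, are the trainable parameters of this edge. (Illustrative
    sketch: grid size, degree, and initialization are arbitrary here.)
    """

    def __init__(self, grid_min=-1.0, grid_max=1.0, num_intervals=5, degree=3, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        # Clamped knot vector so the spline is well defined on [grid_min, grid_max].
        inner = np.linspace(grid_min, grid_max, num_intervals + 1)
        self.knots = np.concatenate([[grid_min] * degree, inner, [grid_max] * degree])
        self.degree = degree
        n_coef = len(self.knots) - degree - 1
        self.coef = rng.normal(scale=0.1, size=n_coef)  # trainable spline coefficients
        self.w_base = 1.0    # trainable weight on the base function
        self.w_spline = 1.0  # trainable weight on the spline

    def __call__(self, x):
        spline = BSpline(self.knots, self.coef, self.degree, extrapolate=True)
        return self.w_base * silu(x) + self.w_spline * spline(x)

# A KAN layer sums edge activations into each output node: y_j = sum_i phi_{j,i}(x_i).
# Toy 2-input, 1-output layer:
edges = [KANEdge(rng=np.random.default_rng(seed)) for seed in (0, 1)]
x = np.array([0.3, -0.7])
y = sum(edge(xi) for edge, xi in zip(edges, x))
print(y)
```

During training, the spline coefficients and the two weights on every edge are updated by gradient descent, which is what lets each connection adapt its own shape.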

The beauty of this approach is twofold. First, KANs are usually more interpretable than traditional multilayer perceptrons (MLPs), because the functions they learn along each connection can often be inspected and understood directly. This means you get insights into the structure of the underlying data. Second, they can often achieve competitive performance with smaller networks than their MLP counterparts, since their architecture is better aligned with the mathematics of function decomposition. This makes them attractive in scenarios where you want not just a "black box" predictor, but also a model that offers some explanatory power about how inputs combine to produce outputs.

That all being said, KANs are not a silver bullet. Just like other neural networks, their performance depends heavily on the quality of the training data. And here lies a practical challenge: generating or collecting good quality training data, especially high-fidelity scientific data, can be costly and time-consuming. For this reason, KANs are particularly well suited to applications in scientific machine learning where interpretability is valuable, model sizes must remain modest, and where you can carefully curate or augment datasets. But they can struggle in cases where data is sparse, noisy, or highly irregular.

And that's where this paper comes in. In it, the authors develop something called MFKANs: Multifidelity Kolmogorov-Arnold Networks. They are built specifically to combine large amounts of cheap, lower-quality data with small amounts of expensive, high-quality data, so that you can train KANs more effectively. On today's episode, we'll explore how MFKANs work, and walk through more of the mathematical foundations that make this possible. Let's dive in.

An MFKAN consists of three distinct KAN blocks working in concert.

  • The first block, called the low-fidelity KAN, learns to model the relationship between inputs and outputs using only the abundant low-fidelity data. This network is pretrained as a standard single-fidelity KAN and its weights are frozen during the subsequent training phases. Think of this as your rough draft writer who's really good at capturing the general shape of what you want to say.
  • The second and third blocks work together to learn the relationship between the low-fidelity model's predictions and the high-fidelity ground truth. The high-fidelity prediction is formed as a weighted combination of two components. The linear KAN learns the linear correlation between the input variables, the low-fidelity prediction, and the high-fidelity data. The nonlinear KAN learns a nonlinear correction term to account for complex relationships that can't be captured linearly.

The final high-fidelity prediction is a weighted average of the linear and nonlinear network outputs. The weighting parameter is trainable, which means the network learns how much to rely on linear versus nonlinear correlations. This is clever because it lets the system automatically figure out whether the relationship between your 'cheap' and your 'expensive' data is mostly linear with small corrections, or fundamentally nonlinear.
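
Putting the blocks together, the forward pass can be sketched as follows. The three callables stand in for the low-fidelity, linear, and nonlinear KAN blocks (built from stacked edge activations like the ones above, or taken from any KAN library), and the way alpha enters the weighted average is one plausible parameterization; the paper defines the exact form, so every name and value here is illustrative.

```python
import numpy as np

def mfkan_predict(x, kan_lf, kan_linear, kan_nl, alpha):
    """Multifidelity prediction (sketch of the scheme described above).

    kan_lf     : low-fidelity KAN, pretrained on abundant low-fidelity data and frozen.
    kan_linear : block capturing the linear correlation between [x, y_lf] and the
                 high-fidelity target.
    kan_nl     : block learning a nonlinear correction on the same inputs.
    alpha      : trainable scalar that balances the two branches.
    """
    y_lf = kan_lf(x)                                 # step 1: cheap low-fidelity prediction
    z = np.concatenate([x, np.atleast_1d(y_lf)])     # step 2: pass inputs AND y_lf forward
    y_lin = kan_linear(z)                            # linear correlation term
    y_nl = kan_nl(z)                                 # nonlinear correction term
    return (1.0 - alpha) * y_lin + alpha * y_nl      # step 3: learned weighted average

# Toy usage with stand-in callables (a real run would use trained KAN blocks):
kan_lf = lambda x: np.sin(x).sum()
kan_linear = lambda z: 1.8 * z[-1]                   # mostly rescales the LF prediction
kan_nl = lambda z: 0.05 * np.tanh(z[:-1]).sum()      # small nonlinear correction
print(mfkan_predict(np.array([0.2, 0.4]), kan_lf, kan_linear, kan_nl, alpha=0.1))
```

Because the low-fidelity block is frozen after pretraining, only the linear block, the nonlinear block, and the mixing parameter are updated against the sparse high-fidelity data.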

But the authors don't just minimize prediction error. They also include two regularization terms. The first encourages the method to capture as much of the correlation as possible linearly before resorting to nonlinear corrections. The second penalizes the magnitude of the nonlinear KAN's parameters to prevent overfitting when high-fidelity data is sparse. This is like having guardrails that prevent the system from getting too fancy when it doesn't need to be.
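
In code, that objective takes roughly the following shape. The penalty forms and weights (lambda_alpha and lambda_nl are names invented for this sketch) are stand-ins for the paper's regularization terms; the point is a data-fit term plus a penalty that discourages leaning on the nonlinear branch and a penalty that shrinks the nonlinear KAN's parameters.

```python
import numpy as np

def mfkan_loss(y_pred_hf, y_true_hf, alpha, nl_params,
               lambda_alpha=1e-2, lambda_nl=1e-4):
    """Sketch of the training objective described above (illustrative weights).

    - data-fit term: mean squared error on the sparse high-fidelity points
    - first penalty: keeps the mixing weight on the nonlinear branch small, so the
      model prefers the linear correlation whenever it suffices
    - second penalty: L2 penalty on the nonlinear KAN's parameters, limiting
      overfitting when high-fidelity data is scarce
    """
    mse = np.mean((y_pred_hf - y_true_hf) ** 2)
    linear_preference = lambda_alpha * alpha ** 2
    nl_magnitude = lambda_nl * sum(np.sum(p ** 2) for p in nl_params)
    return mse + linear_preference + nl_magnitude
```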

To evaluate their design, the authors worked through a progression of test cases that gradually increased in complexity. They began with a function that had a sudden jump, the kind of discontinuity that normally trips up neural networks. Training only on sparse high-fidelity samples failed, because the network could memorize those few points but had no way to capture the sharp transition in between. By contrast, the multifidelity model leveraged the abundant low-fidelity data to map out the overall structure, then used its linear block to learn that the high-fidelity function was essentially a scaled version of the low-fidelity one, and finally applied the nonlinear block to make small adjustments.

This division of labor allowed the network to capture the discontinuity far more accurately. In the next benchmark, the authors tackled a nonlinear relationship where the low-fidelity data followed a simple oscillation while the high-fidelity data involved a more complex distortion of that wave. Here the multifidelity approach showed how important regularization is: without it, the nonlinear block would overfit the few high-fidelity points and create noisy wiggles between them. With the right penalty in place, the model resisted overfitting and produced a smooth, reliable approximation. These tests demonstrated the two central strengths of this approach: the ability to generalize from very limited high-fidelity data, and the ability to balance linear and nonlinear corrections in a stable way.

The later experiments pushed these ideas into higher dimensions and more realistic applications. A two-dimensional test showed that the method could stitch together dense low-fidelity grids with sparse, irregular high-fidelity points, capturing structure that a high-fidelity-only model missed. In four dimensions, the setup added noise to both data sources to mimic measurement uncertainty, and once again the multifidelity model held up while the high-fidelity-only version collapsed...a reminder that real scientific data is never clean. A physics test went further, replacing data with governing equations. By letting the low-fidelity physics guide the training, the multifidelity model managed to handle a harder, high-frequency case that a single physics-informed network could not. And more applied benchmarks reinforced this. The most striking result came from fluid dynamics: after supplying high-fidelity snapshots for only the first half of a vortex-shedding simulation, the multifidelity model continued to make accurate predictions for later times by leaning on the ongoing low-fidelity stream, while the high-fidelity-only model quickly broke down. The common thread across all these cases is that multifidelity KANs deliver accurate, stable, and interpretable results precisely when high-quality data is scarce or expensive. Which is to say, in the situations that often matter most.

Still, the approach has limitations worth noting. It assumes meaningful correlation between low- and high-fidelity data sources. If your low-fidelity model is completely unrelated to the high-fidelity ground truth, this method won't help. The additional hyperparameters require tuning, and the optimal settings can be problem-dependent. There's also the question of computational overhead. You're training three networks instead of one, after all.

If you want to dive deeper into the mathematical derivations, read the experimental protocols, or review the hyperparameter settings, you should definitely download the paper. The appendix contains both the implementation details and links to a GitHub repository with working code samples.