Free Sample Episode

Beyond scaling curves: internal dynamics of neural networks through the NTK lens

Today's article comes from the journal Machine Learning: Science and Technology. The authors are Nikolaou et al., from the University of Stuttgart in Germany. In this paper they use the NTK (Neural Tangent Kernel) as a lens to study what actually happens inside neural networks as we scale them up.

DOI: 10.1088/2632-2153/ae4442

Download the Audio (Right-click, Save-As)

In 1883, a Danish mathematician named Jørgen Pedersen Gram introduced a new kind of matrix. He was studying problems in linear algebra and geometry, particularly how to determine whether a set of vectors is independent. And he showed that if you take the pairwise inner products of a set of vectors and arrange them into a matrix, you can use that matrix's determinant to analyze properties like linear independence and volume.

Let's break that down a bit.

Think of a vector as an arrow pointing in some direction. The inner product (also called the dot product) is simply a way to measure how strongly two arrows point in the same direction.

  • If two arrows point in almost the same direction, the value is large.
  • If they point in different directions, the value is smaller.
  • If they are perpendicular, the value is zero.
  • And if they point in opposite directions, the value is negative.

Now imagine you have many arrows, and you create a table where you compare every arrow with every other arrow and record how similar their directions are. What you're picturing in your head right now is a Gram matrix (named after Jørgen). You take a bunch of vectors, you compare each one with every other one, and you record how similar they are in a grid.
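To make that concrete, here's a tiny sketch in Python. The vectors are made up for illustration, not anything from the paper; the point is that comparing every arrow with every other arrow is just a matrix product:

```python
import numpy as np

# Toy illustration: entry (i, j) of a Gram matrix is the
# inner product of vectors i and j.
vectors = np.array([
    [1.0, 0.0],   # points along the x-axis
    [0.9, 0.1],   # nearly the same direction
    [0.0, 1.0],   # perpendicular to the first
])

gram = vectors @ vectors.T   # compare every vector with every other one
print(gram)
# [[1.   0.9  0.  ]
#  [0.9  0.82 0.1 ]
#  [0.   0.1  1.  ]]
```

Notice that the diagonal holds each vector's similarity with itself, and the large off-diagonal entry flags the two nearly-parallel arrows.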

And this works for any kind of vectors you want to compare. They could be something simple like velocity or force, or they could be something more complex. These vectors could even be neural network gradients.

As a reminder: a neural network has lots of parameters, things like weights and biases. You can think of those parameters as a long list of numbers. That list defines a single point in a multidimensional space, where each axis corresponds to one parameter. When you're training a model, your optimizer is computing the gradient of the loss with respect to all those parameters (all those axes). That gradient is simply a collection of numbers, one per parameter (one per axis in that space). Each component of the gradient tells you how the loss would change if you moved along that parameter's axis. And when you put all of them together, the gradient is essentially an arrow that points in the direction the optimizer should move the model to reduce the loss. In other words: it's a vector. It might have millions or billions of dimensions, but it's still just a vector like any other. An arrow that tells the training process which direction to step next.
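If it helps to see that, here's a toy sketch: a hypothetical four-parameter "model" (nothing to do with the paper) whose loss gradient we estimate by finite differences. The result is just one long vector, one entry per parameter:

```python
import numpy as np

# Toy example: the gradient of a loss over all parameters
# is a single vector with one entry per parameter axis.

def loss(params):
    w = params
    return (2.0 * w[0] + w[1] - 1.0) ** 2 + (w[2] - w[3]) ** 2

def numerical_gradient(f, params, eps=1e-6):
    """Finite-difference estimate: nudge each parameter's axis in turn."""
    grad = np.zeros_like(params)
    for i in range(len(params)):
        bumped = params.copy()
        bumped[i] += eps
        grad[i] = (f(bumped) - f(params)) / eps
    return grad

params = np.array([0.5, -0.3, 1.2, 0.7])
g = numerical_gradient(loss, params)
print(g.shape)   # (4,) -- a single arrow in 4-dimensional parameter space
```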

Now what would it mean to combine two gradients into a Gram Matrix? And why would you want to?

Well, remember: when you're training a model, every single example in your training set produces a gradient. Its own little update vector that nudges the model in a specific direction through parameter space. So imagine taking two different examples from your training set and comparing their gradients.

  • If the inner product of those gradients is large, it means both examples are trying to push the model in roughly the same direction. Learning from one will tend to help the model understand the other.
  • If the value is near zero, the two examples are mostly unrelated, and their updates are effectively independent.
  • And if the value is negative, the two examples are pulling the model in conflicting directions.

Now scale that idea up. Take the gradients produced by many inputs, compare every pair of them, and store those inner products in a Gram matrix. Do that, and you've created what we call an NTK: a Neural Tangent Kernel. It's a Gram matrix like any other, just built from gradients. And since those gradients determine how the model changes during training, the NTK effectively captures how the network learns, and how different training examples interact during that learning process.

In reality the NTK is a bit more complicated than that. Technically it's built from the gradients of the model's output with respect to the parameters, not the gradients of the loss. But conceptually the idea is the same.
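Here's a rough sketch of that computation. The little tanh network, the sizes, and the analytic gradients are all my own toy assumptions, not the paper's setup; the point is the shape of the procedure: one output-gradient vector per example, stacked into a matrix, whose Gram matrix is the empirical NTK:

```python
import numpy as np

# Toy empirical NTK: stack one output-gradient row per input
# into a matrix G, so that G @ G.T is the NTK Gram matrix.

rng = np.random.default_rng(0)
d, h, n = 5, 16, 8                        # input dim, hidden width, examples
W = rng.normal(size=(h, d)) / np.sqrt(d)  # hidden-layer weights
v = rng.normal(size=h) / np.sqrt(h)       # output weights
X = rng.normal(size=(n, d))               # n input examples

def output_gradient(x):
    """Gradient of f(x) = v . tanh(W x) w.r.t. all parameters, flattened."""
    a = np.tanh(W @ x)                      # hidden activations
    dW = np.outer(v * (1.0 - a**2), x)      # df/dW via the chain rule
    return np.concatenate([dW.ravel(), a])  # df/dv = a

G = np.stack([output_gradient(x) for x in X])  # one gradient row per example
ntk = G @ G.T                                  # pairwise inner products
print(ntk.shape)                               # (8, 8): one entry per pair
```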

Now why did I just tell you all of that? Because the NTK takes center stage in today's paper. In it, the authors use the NTK as a lens to study what actually happens inside neural networks as we scale them up: as we increase the number of parameters (weights and biases), and as we increase the amount of data in their training sets. The authors want to know how that scaling affects a model's internal learning dynamics. And they also want to know whether the scaling laws we often see in performance metrics are actually driven by the same underlying mechanisms. Let's walk through their methodology and see what they found.

First, a little more background. About a decade ago, practitioners and researchers alike began noticing a remarkably consistent pattern. As neural networks grew larger, and as the datasets used to train them expanded, their performance improvements followed smooth and predictable trajectories. If you plotted a model's error against the amount of compute used to train it, the size of the dataset, or the number of parameters in the model, the resulting curves tended to follow a simple mathematical form: a power law. As you scale up, performance improves steadily, but with diminishing returns. Doubling the model size or the dataset doesn't double the improvement, but it does reliably push the error lower by a predictable amount. This was first noticed in language models, and then in computer vision, reinforcement learning, and beyond. What made this discovery powerful was that these curves held over many orders of magnitude: models ranging from millions to hundreds of billions of parameters appeared to follow the same basic pattern.

Because of this regularity, these "scaling laws" became a practical tool for engineering large models. If the relationship between compute and performance follows a definable curve, then it's pretty straightforward to estimate how much additional compute, data, or model capacity is needed to reach a particular target level of performance. And today this insight plays a major role in how large AI systems are designed. When a team is planning a new model they'll usually run a small-scale experiment, fit a scaling curve, and then extrapolate what will happen at much larger scales.
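As a sketch of what that workflow looks like (with made-up numbers, not anyone's real measurements), you can fit a simple three-parameter power law to a handful of pilot runs and then extrapolate:

```python
import numpy as np
from scipy.optimize import curve_fit

# Fit L(N) = a * N**(-alpha) + c to a few small pilot runs,
# then extrapolate to a much larger model size.

def power_law(n, a, alpha, c):
    return a * n ** (-alpha) + c

sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8])        # pilot model sizes
losses = np.array([4.81, 4.34, 3.89, 3.53, 3.19])  # hypothetical test losses

(a, alpha, c), _ = curve_fit(power_law, sizes, losses, p0=(20.0, 0.1, 1.0))
print(f"fitted exponent alpha = {alpha:.3f}")
print(f"predicted loss at 1e10 params: {power_law(1e10, a, alpha, c):.2f}")
```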

So what's the issue? Well, while these scaling laws are extraordinarily useful from an engineering perspective, they leave a deeper scientific question unanswered: why is this happening? The curves describe the outcome of scaling, but they don't explain the mechanism. They only tell us that performance improves predictably with additional resources. They don't tell us what changes inside these networks as those resources increase, or why those improvements emerge at all.

The authors of this paper are looking for an answer. They want to know what actually changes inside the network to make the improvement possible. And they want to know whether the kind of scaling that comes from adding parameters is roughly equivalent to the kind that comes from adding data. And this is where the NTK comes in.

Rather than attempting to track the full, high-dimensional learning process of a network, the authors reduce that complexity to two coarse summary statistics derived from the NTK. You can think of them as global observables that describe the behavior of the training dynamics in aggregate.

  • The trace captures the overall magnitude of the learning signal, essentially summarizing how strongly the different learning modes contribute to parameter updates over time.
  • The effective rank, by contrast, describes the structure of that signal. It estimates how many independent directions in the learning process are actually being used to fit the task. A high effective rank suggests the model is spreading its learning across many distinct modes, while a lower value indicates that training is concentrating on a smaller set of dominant directions. (Both quantities are sketched in code right after this list.)
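Here's a minimal sketch of both observables. I'm assuming the entropy-based definition of effective rank (the exponential of the entropy of the normalized eigenvalue spectrum); the paper may use a different variant, so treat the exact formula as an assumption:

```python
import numpy as np

def ntk_trace(ntk):
    """Overall magnitude of the learning signal: the sum of eigenvalues."""
    return float(np.trace(ntk))

def effective_rank(ntk, eps=1e-12):
    """How many independent directions the learning signal really uses."""
    eigvals = np.linalg.eigvalsh(ntk)       # NTK is symmetric and PSD
    eigvals = np.clip(eigvals, 0.0, None)   # clamp tiny negative round-off
    p = eigvals / (eigvals.sum() + eps)     # treat spectrum as a distribution
    entropy = -np.sum(p * np.log(p + eps))
    return float(np.exp(entropy))           # ranges from 1 up to n

# e.g. effective_rank(G @ G.T) on the toy NTK from earlier gives a value <= 8
```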

By tracking how these two quantities evolve during training, the authors are able to observe how the internal structure of learning changes as models and datasets are scaled. And that's just what they did, through two experiments.

  • In the first, they scaled the model size while holding the dataset fixed. They trained a sequence of neural networks of increasing width on small subsets of standard vision benchmarks.
  • In the second experiment, they did the opposite: they kept the architecture small and fixed, but progressively increased the amount of training data available to the model.

In both cases they measured test loss, along with the two observables I just outlined (trace and effective rank). So what happened?

Well, as expected, both regimes reproduced the standard scaling behavior: performance improved smoothly as either model size or dataset size increased. But when the authors examined the NTK measurements, the two experiments diverged dramatically. As model width grew, the effective rank of the NTK increased, indicating that wider networks distribute learning across a larger number of independent directions. When additional data was introduced to a small model, however, the effective rank decreased, implying that the network concentrates its learning on a smaller set of dominant modes that capture the most important shared structure in the dataset. At the same time, their analysis revealed that model scaling primarily affects the initial scale of the learning signal, whereas data scaling primarily affects how that signal evolves over the course of training.

So in the end, you have two scaling regimes that produce nearly identical improvements in loss, but are driven by fundamentally different internal dynamics. And the authors were only able to figure that out because of the power of the Gram matrix.