Free Sample Episode

Novel Loss Functions for Improved Data Visualization in t-SNE

Today's article comes from the journal Machine Learning and Knowledge Extraction. The authors are Nassar et al., from Hamad Bin Khalifa University in Qatar. In this paper, they evaluate two replacements for KL divergence within t-SNE: Max-Flipped KL Divergence (KLmax) and KL-Wasserstein Loss.

DOI: 10.3390/make8020047

Download the Audio (Right-click, Save-As)

If you're working with high-dimensional data, you might find yourself wanting to visualize it: to project it down into two dimensions so you can get a better understanding of what's going on. Think scatter plots of customer segments, maps of gene expressions, or clusters of documents. t-SNE, the t-distributed Stochastic Neighbor Embedding, is one of the most widely used tools for doing that kind of thing. It takes your enormous, complex dataset and squishes it down to a 2D representation, keeping similar data points close together and dissimilar ones further apart.

It does this by building a kind of probability map. It checks how close every pair of points is in the original high-dimensional space and then tries to recreate that same set of similarity relationships in 2D instead, projecting and compressing the full structure down to an X and Y for each point.
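To make that concrete, here's a minimal sketch (not the authors' code) of the two similarity maps t-SNE compares: Gaussian affinities in the original space and Student-t affinities in the 2D layout. It's deliberately simplified: real t-SNE calibrates a per-point bandwidth to a target perplexity and symmetrizes the conditional probabilities, which we skip here.

```python
import numpy as np

def high_dim_affinities(X, sigma=1.0):
    # Gaussian similarities in the original space. Simplification: one fixed
    # bandwidth; real t-SNE tunes a per-point sigma to a target perplexity.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)              # a point is not its own neighbor
    return P / P.sum()                    # normalize into a probability map

def low_dim_affinities(Y):
    # Student-t similarities in the 2D embedding; the heavy tails let
    # dissimilar points sit far apart without an enormous penalty.
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    Q = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

X = np.random.randn(100, 50)              # 100 points in 50 dimensions
Y = np.random.randn(100, 2)               # a candidate 2D layout
P, Q = high_dim_affinities(X), low_dim_affinities(Y)
```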

But, in order to check how well that projection worked, it needs to measure the "gap" between the original layout (the high dimensional one) and the compressed layout (the two dimensional one). By "gap" we mean the discrepancy between the pairwise similarity distributions in the original high-dimensional space and those induced by the low-dimensional embedding. There are a number of ways to do this. The one that t-SNE uses is called the Kullback-Leibler (KL) divergence. The lower the KL divergence, the more the two-dimensional picture resembles the original structure. And t-SNE uses that measurement as a target for optimization. In iterative rounds of gradient descent, it nudges the points around in 2D space to minimize the mismatches between the two distributions. And it keeps doing this until it converges to a layout where nearby relationships are preserved as well as possible.
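As a rough illustration of that loop (a sketch under simplifying assumptions, not the reference implementation), here is the KL objective and the classic t-SNE gradient step applied to a random starting layout, using a stand-in matrix for the high-dimensional affinities P:

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    # The "gap" t-SNE minimizes: sum over pairs of P * log(P / Q).
    return np.sum(P * np.log((P + eps) / (Q + eps)))

rng = np.random.default_rng(0)
n = 100
P = rng.random((n, n)); P = P + P.T        # stand-in high-dimensional affinities
np.fill_diagonal(P, 0.0); P /= P.sum()
Y = rng.normal(scale=1e-2, size=(n, 2))    # random initial 2D layout

for step in range(500):
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq); np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()                    # current low-dimensional affinities
    W = (P - Q) * inv                      # mismatch, weighted by the t-kernel
    grad = 4.0 * (np.diag(W.sum(axis=1)) - W) @ Y   # standard t-SNE gradient
    Y -= 100.0 * grad                      # plain descent; real t-SNE adds momentum

print(kl_divergence(P, Q))                 # lower = layout matches P better
```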

Generally speaking, this works pretty well. t-SNE produces nice, clean cluster plots with tight groupings and clear separation. You've probably seen them many times without knowing it. Colorful clouds of dots, each color representing a class or category, separated into distinct islands, usually on a white background. More often than not, t-SNE is the engine underneath.

But there's an issue. t-SNE is hyper-focused on getting only the local relationships right. It puts enormous weight on making sure that points that are close in the original space stay close in the embedding. And to do that, it is willing to sacrifice the accuracy of the longer-distance relationships. So the distances between far-apart clusters become unreliable. While the spacing within groups might look good, the spacing between them can stretch, collapse, or rearrange arbitrarily. Two clusters that look far apart might actually be very similar, and two clusters that appear close might have no meaningful relationship at all. The map looks clean, but the geometry is misleading. What you end up with is not really a faithful picture of the full structure, it's a collection of well-formed islands in a malformed archipelago.

And there's another subtler issue as well. KL divergence is asymmetric. So it's far more punishing when it accidentally places two points too far apart than when it places them too close together. To put it another way: it hates when it misses a neighbor, but it doesn't mind having a false one. This imbalance pushes t-SNE toward tight compact clusters, which look great but, again, can be misleading.
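A tiny, made-up numeric check makes the asymmetry visible. The numbers below are illustrative, not from the paper:

```python
import numpy as np

p_neighbor, p_stranger = 0.2, 0.001   # "true" similarities: a close pair vs. a distant pair
q_far, q_near = 0.001, 0.2            # what a bad 2D layout might claim

# Missed neighbor: truly similar points placed far apart in the embedding.
missed = p_neighbor * np.log(p_neighbor / q_far)    # ~ +1.06, a large penalty
# False neighbor: dissimilar points placed close together in the embedding.
false = p_stranger * np.log(p_stranger / q_near)    # ~ -0.005, barely registers

# The full KL sum is always non-negative, but the missed-neighbor term is
# roughly 200x larger in magnitude, so optimization fights it much harder.
print(missed, false)
```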

So what can we do about it? Well, that's where today's paper comes in. In it, the authors are proposing (and evaluating) two replacements for KL-divergence (within t-SNE). On today's episode we'll break down the architecture of each contender, and see how the authors evaluated their performance. Then we'll find out which one came out on top, and walk through the reasons why. Let's dive in.

The first of the two methods is called Max-Flipped KL Divergence, or KLmax. Instead of directly comparing the raw similarity probabilities between two points in the original and embedded spaces, KLmax asks a slightly different question: how far is each similarity value from the maximum similarity in that "row"? And by "row" we mean all the similarity scores from that point to every other point. So to put it another way: for each point, you look across all of its pairwise similarities and find the largest one, and then rather than looking at each probability value in absolute terms, you look at how far that value sits from the highest one you found for that point. Why? Well, remember: maximum similarity means more closeness, not less. Less distance, not more. In the original high-dimensional space, every data point has a most-similar neighbor: the point that is closest to it. That will be the one with the highest similarity score. KLmax wants to make sure that this closest neighbor is preserved when you compress everything down. But it doesn't just want the raw numbers to match; it also wants the relationships between high-similarity and lower-similarity points to be preserved. So it measures divergence based on how similarity values differ from the strongest neighbor, rather than comparing the raw values alone. First it computes a modified version of the similarity distribution for each point: it measures how far each similarity score sits from the highest score in that neighborhood, and it does this in both the original space and the embedded space. Then it runs the standard divergence calculation on those transformed, rank-aware distributions and adds it to the original objective as a second weighted component. The gradient update that comes out of this combined objective carries information about both the raw similarity values and their relative structure. And this is what produces sharper cluster boundaries and less within-cluster spread in the final embedding.
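Here's one way the idea could look in code. To be clear, this is a hedged sketch based on the description above, not the authors' implementation: the exact transform, normalization, and weighting in the paper may differ.

```python
import numpy as np

def max_flipped(S, eps=1e-12):
    # For each point (row), measure how far every similarity sits from that
    # row's maximum, then renormalize the row into a distribution again.
    flipped = S.max(axis=1, keepdims=True) - S
    return flipped / (flipped.sum(axis=1, keepdims=True) + eps)

def kl(P, Q, eps=1e-12):
    return np.sum(P * np.log((P + eps) / (Q + eps)))

def klmax_objective(P, Q, weight=0.5):
    # Standard KL on the raw affinities, plus a weighted divergence on the
    # rank-aware, max-flipped versions of both distributions.
    return kl(P, Q) + weight * kl(max_flipped(P), max_flipped(Q))

# Usage with stand-in affinity matrices:
rng = np.random.default_rng(0)
P = rng.random((50, 50)); np.fill_diagonal(P, 0.0); P /= P.sum()
Q = rng.random((50, 50)); np.fill_diagonal(Q, 0.0); Q /= Q.sum()
print(klmax_objective(P, Q))
```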

The second method is the KL-Wasserstein Loss. As it sounds, it's a combination of the standard KL divergence and the Wasserstein distance. Wasserstein distance comes from optimal transport (OT) theory. Imagine you have two piles of sand and you want to reshape one pile to look like the other. The Wasserstein distance measures the minimum amount of work required to move that sand around, accounting for how far each grain needs to travel and in what direction. In practice it's just a geometry-aware way of measuring the difference between two distributions, and it's very useful in fields like logistics and image generation. What makes it attractive here is that it's sensitive to global structure in a way that KL divergence isn't. It doesn't just care about which points are close, it also cares about how entire groups of points are arranged relative to one another. That means it captures not just local neighborhoods, but the overall layout of the data. If one cluster should sit between two others, or if two groups should be far apart, that information is reflected in the transport cost. When you combine these two (KL divergence and Wasserstein distance) into a single loss function, you get KL-Wasserstein: a hybrid objective that gets both signals at once.

  • The KL component handles local neighborhood relationships, making sure that nearby points stay nearby.
  • The Wasserstein component handles the global geometry, making sure that clusters are correctly positioned relative to each other in the overall layout.
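To make the sand-pile intuition concrete, here's a toy one-dimensional example (illustrative only; in the paper the distance is applied to the similarity structures of the high-dimensional data and the embedding, not to raw 1D samples):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=1000)   # one pile of sand
b = rng.normal(loc=0.0, scale=1.0, size=1000)   # same shape, same place
c = rng.normal(loc=5.0, scale=1.0, size=1000)   # same shape, shifted far away

print(wasserstein_distance(a, b))   # ~0: almost nothing needs to move
print(wasserstein_distance(a, c))   # ~5: every grain travels about five units
```

The hybrid loss then adds a term like this to the usual KL objective with a mixing weight, so the optimizer feels both signals at once.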

And these two forces interact during optimization. Early in training, the Wasserstein term tends to dominate. It's pulling points toward globally coherent positions, getting the big picture right first. And as the optimization progresses and clusters begin to form, the KL term takes over, refining the local details and tightening the neighborhoods.

That being said, computing the Wasserstein distance exactly at every training step is prohibitively expensive, so the authors approximate it on small batches of points at a time rather than the full dataset. But since the approximation only returns a scalar value (the distance itself, rather than a gradient), they can't directly backpropagate through it. So they estimate the gradient instead, nudging the embedded points slightly in random directions and observing how the distance changes. It's noisier than computing an exact gradient analytically, but it's tractable and it's effective enough to meaningfully improve the embeddings.
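Here's a hedged sketch of what that kind of estimate can look like: compare a batch's pairwise-distance distributions in the two spaces, then probe the scalar distance with random perturbations. The batching scheme, the distance proxy, and the step sizes below are illustrative, not the authors' exact procedure.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def batch_wasserstein(X_batch, Y_batch):
    # Cheap proxy: compare the distributions of pairwise distances within the
    # batch, in the original space vs. the embedding.
    dx = np.linalg.norm(X_batch[:, None] - X_batch[None, :], axis=-1).ravel()
    dy = np.linalg.norm(Y_batch[:, None] - Y_batch[None, :], axis=-1).ravel()
    return wasserstein_distance(dx, dy)

def estimated_gradient(X_batch, Y_batch, n_probes=10, step=1e-3, rng=None):
    # The distance is a scalar, so nudge the embedded points in random
    # directions and average the observed change: a finite-difference,
    # SPSA-style estimate that is noisier than an analytic gradient.
    rng = rng or np.random.default_rng()
    base = batch_wasserstein(X_batch, Y_batch)
    grad = np.zeros_like(Y_batch)
    for _ in range(n_probes):
        direction = rng.standard_normal(Y_batch.shape)
        perturbed = batch_wasserstein(X_batch, Y_batch + step * direction)
        grad += (perturbed - base) / step * direction
    return grad / n_probes

rng = np.random.default_rng(0)
Xb = rng.normal(size=(64, 50))            # a batch in the original space
Yb = rng.normal(size=(64, 2))             # its current 2D positions
Yb -= 10.0 * estimated_gradient(Xb, Yb, rng=rng)   # one noisy descent step
```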

So how well do these two replacements actually work? To find out, the authors tested them against standard t-SNE across five different datasets and scored the resulting embeddings on F1-score, Normalized Mutual Information, and the Davies-Bouldin Index. In the end, for most datasets and metrics, both of the replacements outperformed the original. And that improvement was not confined to a single regime. On discrete classification the modified objectives produced embeddings whose cluster assignments aligned more closely with the ground-truth labels. And on more geometrically structured datasets they improved the internal organization of the embedding itself. KLmax was especially effective when the main failure mode was poor ranking among close neighbors. KL-Wasserstein was effective when preserving the broader spatial organization of the clusters mattered most. So there's not necessarily a single winner among the two replacements, just two options to choose from: KLmax when the local ranking among neighbors is the main concern, and KL-Wasserstein when preserving the global layout of the clusters matters more.
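For reference, here's roughly how an embedding can be scored with the standard scikit-learn versions of those metrics. The exact evaluation protocol (in particular, how the F1-score is obtained from an embedding) is described in the paper; this sketch uses a k-NN classifier and k-means as stand-ins, on random placeholder data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (davies_bouldin_score, f1_score,
                             normalized_mutual_info_score)
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
Y = rng.normal(size=(300, 2))                  # stand-in 2D embedding
labels = rng.integers(0, 3, size=300)          # stand-in ground-truth classes

# F1: how well a simple classifier can separate the classes in the embedding.
pred = cross_val_predict(KNeighborsClassifier(n_neighbors=10), Y, labels, cv=5)
print("F1:", f1_score(labels, pred, average="macro"))

# NMI: agreement between k-means clusters of the embedding and the true labels.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Y)
print("NMI:", normalized_mutual_info_score(labels, clusters))

# Davies-Bouldin: compactness vs. separation of the classes in the embedding
# (lower is better).
print("DBI:", davies_bouldin_score(Y, labels))
```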

So what can we learn from this paper?

  • Narrowly, that standard t-SNE's behavior is not some fixed property of dimensionality reduction itself, but a consequence of the particular divergence it optimizes. So swapping in better-structured objectives can materially improve both neighbor preservation and embedding geometry.
  • But more broadly, this study shows us that visualization methods are never really neutral. The loss function decides what kinds of errors matter. This is a reminder that a 2D embedding is not just a compressed version of the data. It is the output of a very specific geometric preference baked into an objective. Change the objective, and you change what the map is faithful to.

If you want to go deeper, make sure you download the paper. The authors include the full gradient derivations, the statistical significance tests, and the implementation details for the Wasserstein gradient estimation and the parameter sensitivity analysis. If you work with high-dimensional data, it's definitely worth the read.