
Random k conditional nearest neighbor for high-dimensional data

Today’s article comes from the journal PeerJ Computer Science. The authors are Lu et al., from the University of Western Ontario, in Canada. In this paper, they introduce a new variant of the k-Nearest Neighbors algorithm. That gives us an excuse to do a thorough deep-dive on kNN and all of its associated concepts and key terms. Let’s jump into it.

DOI: 10.7717/peerj-cs.2497


Imagine that you’re a Software Engineer at a music streaming company. An app like Spotify, or Pandora, or YouTube Music. Your job is to keep people on the app; keep them listening. And the way you do that is by feeding them more music that they like. When the playlist or the song that they’re listening to comes to an end, you’ve got one chance to serve up something that they’ll enjoy. So, how do you do that? How do you make sure that the next thing that pops up in the player is going to resonate with them?

Well you’ve got three main options:

  1. Serve them something they’ve already liked before.
  2. Serve them something that is similar to other songs they’ve liked before.
  3. Serve them something that is frequently enjoyed by other users who are similar to that user. (This is the classic Amazon upsell strategy “Customers who bought this item also bought…”)

All three of these options work, but option #1 only gets you so far. You can’t play that card too many times in a row. So, the bulk of your engineering lift is going to be on options 2 and 3, which are actually rooted in the same technical problem: how do you go into your database and find something that is similar to something else? In one case, we’re trying to find songs that are similar to other songs, and in the other case we’re trying to find people who are similar to a given person. But on a technical level, this is really all the same thing: it is the process of measuring distances in feature space.

Today’s paper is about advancements in one of the most popular techniques for that: an algorithm called k-Nearest Neighbors. If you’ve listened to some of our Machine-Learning episodes, you’ve undoubtedly heard that term before. k-Nearest Neighbors (or kNN for short) is everywhere, and is used for everything. Why? Because it’s simple and it’s effective. It’s not necessarily always the best tool for the job (and later we’re going to see how these authors are trying to modify it to be better at more things), but it is pretty good at a lot of use-cases. That’s why you’ll see kNN used in ecology, in linguistics, in sports analytics, and even in archaeology. So before we get into this paper, let’s talk about the kNN algorithm for a while. How does it work? What do “distances in feature space” even mean? And why is this algorithm so good at calculating them?

We’re going to start by talking about features and dimensions. If we’re constructing the similarity-search from option #2 above (finding songs that are similar to each other), then we’ve got a lot of aspects of each song that we could compare to each other. We could find songs with similar BPMs, or songs that are in the same key, or songs by the same artist, or the same producer, or from the same region, or time period, or genre. All of these aspects are what we call “features” of the dataset. But some features are useful, and some are not. We call the latter group non-informative or irrelevant features. Generally, you’d try to avoid using any irrelevant features in your analysis. In our example those could be things like the color of the album art, or the artist’s royalty rate, or the song’s unique ID in the database. Looking at those kinds of things is probably not going to help you find similar songs. So you’d want to ignore those, take only the relevant features, and then run analysis on them.

The batch of features you choose to use is collectively referred to as your “dimensions”. k-Nearest Neighbors is an algorithm that simply finds the data points (i.e., other songs) that are physically closest to your data point (your song) in the multi-dimensional space created by the features you’re looking at. Imagine a 3D graph, with X, Y and Z axes.

  • The X axis is the song’s BPM.
  • The Y axis is the key that the song is in (encoded as a numeric value).
  • The Z axis is the year the song was recorded.

Each song in your database will have a value on all three axes. So you pull your songs out of the database, plot them in this three-dimensional space, and see which other songs are physically closest to your song. In other words, you check which “k” number of songs are the “nearest neighbors” to your “query vector” (your song). Those are the k-Nearest Neighbors. The straight-line distance being measured here is called the Euclidean distance. It’s incredibly difficult for our brains to picture what anything more than a 3-dimensional space looks like, so measuring distance in 10 dimensions, or a hundred, or a thousand dimensions, sounds impossible. But Euclidean distance is the tool that generalizes our everyday notion of distance to an arbitrary number of dimensions.
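To make that concrete, here’s a minimal sketch of the idea in Python. None of this comes from the paper; the songs, feature values, and the choice of exactly these three features are invented for illustration:

```python
import math

# Each song is a point in 3-D feature space: (BPM, key encoded as a number, year).
# These songs and values are made up purely for illustration.
songs = {
    "Song A": (120, 5, 1998),
    "Song B": (124, 5, 2001),
    "Song C": (80, 2, 1965),
    "Song D": (122, 7, 1997),
}

def euclidean_distance(p, q):
    """Straight-line distance between two points, in any number of dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_nearest_neighbors(query, points, k):
    """Return the k points that sit closest to the query vector."""
    ranked = sorted(points.items(), key=lambda item: euclidean_distance(query, item[1]))
    return ranked[:k]

# The query vector: the song that just finished playing.
query_song = (121, 6, 1999)
print(k_nearest_neighbors(query_song, songs, k=2))
```

One practical wrinkle: in a real system you’d rescale each feature first. Otherwise a 30-year gap on the year axis would dwarf a 4-BPM gap on the tempo axis, and the “nearest” neighbors would be decided almost entirely by the year.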

When we say that a dataset has “high dimensionality” we’re saying that this analysis is looking at a large basket of features. This is also called a large “feature space”. When we’re adding more features to the basket (to the feature space), we are effectively adding more dimensions to the analysis. In many ways, that’s a good thing. Looking at more dimensions gives a more robust and nuanced landscape. It enables us to detect connections and similarities that we might have otherwise missed. But there’s a big problem with adding more features. It’s called the “curse of dimensionality”.

The curse of dimensionality refers to the challenges that arise as the number of dimensions increases. In high-dimensional spaces, data points become more spread out, and the distances between them lose meaning, because all the points tend to appear similarly far apart. This makes algorithms like kNN much less effective: the Euclidean distances they calculate no longer produce a meaningful ranking of who is “near” and who is “far”.

But hold on…why is that? Why would adding more dimensions make points further apart? Why would Euclidean distances necessarily be higher as dimensionality increases?

Imagine you have a 1-dimensional space. It’s one axis: a line with two points on it. Let’s say point A has a value of 5, and point B has a value of 10. The distance between them is 5. Now let’s call that original line the X-axis, and let’s add a second axis: the Y-axis. Now we’re in a 2-dimensional space. If point A and point B share the exact same value for Y, then the distance between these points has not grown. But if their values for Y are even slightly different from each other, then these two points are now further apart in this 2-dimensional space than they were in the 1-dimensional space. Why? Think back to when you learned the Pythagorean Theorem in grade school.

  • Think of the distance between these two points on the X axis as the base of a right triangle.
  • The distance between them on the Y axis is the height of that right triangle.
  • The actual total distance between them is now the hypotenuse. And as you’ll remember, in a right triangle the hypotenuse is always longer than either of the other two sides.

So to recap: if we add a dimension for which the data points hold slightly different values, the points will end up being further apart in terms of Euclidean distance. This is true as you go from 1 dimension to 2, or 2 to 3, or from 1000 to 1001. Every new dimension increases the distance between the data points whenever those points don’t have a common value for that new dimension. So as the dimensions rack up, all the points get further and further apart from each other. The result is that both the absolute and relative distances between these points lose their meaning, and that makes the kNN algorithm less and less useful.
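You can watch both of those effects happen with a few lines of toy Python (again, this is just an illustration, not anything from the paper):

```python
import math, random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# 1) Adding a dimension where the points differ pushes them further apart.
print(euclidean([5], [10]))        # 5.0  -- the 1-D example from above
print(euclidean([5, 3], [10, 4]))  # sqrt(5^2 + 1^2), about 5.10 -- the hypotenuse

# 2) As dimensions pile up, every point ends up roughly as far from the
#    query as every other point, so "nearest" stops meaning much.
random.seed(0)
for dims in (2, 10, 1000):
    points = [[random.random() for _ in range(dims)] for _ in range(200)]
    query = [random.random() for _ in range(dims)]
    dists = [euclidean(query, p) for p in points]
    print(dims, round(min(dists) / max(dists), 2))  # ratio creeps toward 1.0
```

The first two lines show the hypotenuse effect; the loop shows the curse of dimensionality kicking in, as the gap between the nearest and farthest neighbor shrinks away.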

In this paper, the authors are creating a variant of kNN, called RkCNN (the random k conditional nearest neighbor), which they hope will help alleviate two issues:

  1. The curse of dimensionality.
  2. The problems that arise when irrelevant features are included in the analysis.

Why the latter? Because sometimes it’s just not obvious which features/dimensions are relevant, so for many use cases a person will just include them all. But having these irrelevant features adds noise, on top of the problems already created by high dimensionality. That’s why the authors designed a solution to help alleviate both of these issues.

Here’s how their algorithm works: it builds on top of an existing variant called kCNN. This is not to be confused with CNN models (Convolutional Neural Networks); those are a totally different thing that just happens to share part of the acronym. A kCNN is a normal k-Nearest Neighbors algorithm, plus the introduction of class-conditional probabilities.

The missing context here is that kNN is often used as part of a classifier. If you’re trying to figure out what class your query vector belongs to, you use kNN to look up the nearest neighbors, then you look at what classes those neighbors have, and those labels tell you what class your query vector most likely belongs to. Back to the music example: you’re trying to classify the genre of a new song. You query it on the dimensions we talked about earlier, and you see that the majority of the nearest neighbors (the nearest songs in the database) are classified as Jazz. So you classify this song as Jazz. Sometimes each neighbor’s vote counts equally, and it’s a straight majority vote. Other times it’s a weighted vote based on how close the neighbor is to your query vector (closer neighbors get more votes). Either way, the purpose is the same: classify the data point based on how the other points around it are classed.

In kCNN, class-conditional probabilities mean that instead of just counting votes, the algorithm calculates the probability of the query vector belonging to each class, based on the patterns and distributions within the nearby neighbors. So instead of saying “3 neighbors voted for class Z”, it says, “Given these neighbors, the chance of this point being class Z is 80%.” To put it another way, traditional kNN voting is about tallying, while class-conditional probabilities are about estimating likelihoods. That’s kCNN. And in this paper, the authors take it another step further: they introduce RkCNN, and the R stands for Random.
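Before we get to the Random part, here’s a quick sketch of that vote-versus-probability distinction. The neighbor labels and distances are made up, and the inverse-distance weighting is just one illustrative way to turn a neighborhood into probabilities; the paper defines its own class-conditional formula:

```python
from collections import Counter, defaultdict

# neighbors: list of (class_label, distance_to_query) pairs,
# as returned by a nearest-neighbor lookup. Values are invented.
neighbors = [("Jazz", 0.4), ("Jazz", 0.9), ("Rock", 1.1), ("Jazz", 1.6), ("Rock", 1.8)]

# Plain kNN: straight majority vote.
votes = Counter(label for label, _ in neighbors)
print(votes.most_common(1))  # [('Jazz', 3)]

# Conditional-probability flavor: turn the neighborhood into a probability
# for each class (here, weighting by inverse distance -- an illustrative
# choice, not the paper's exact formula).
weights = defaultdict(float)
for label, dist in neighbors:
    weights[label] += 1.0 / dist
total = sum(weights.values())
probs = {label: w / total for label, w in weights.items()}
print(probs)  # e.g. {'Jazz': 0.74, 'Rock': 0.26} -- "chance this song is Jazz: ~74%"
```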

In an RkCNN, the algorithm randomly selects subsets of features, over and over again: i.e., it picks these 3 features, then these 5, then these 2, then these 20, and so on. Each time it chooses a basket, it trains a new classifier (a model) on that randomly thrown-together basket. At the end, all of these little models are combined into a big ensemble classifier, which means that when it comes time to classify an input, their outputs collectively decide the result. But each model in the ensemble doesn’t hold the same weight. The weights are based on how well the model scores on a metric called the separation score. Higher separation means the model’s vote counts for more.

So, what is separation? It’s a measure of how well the selected features in a subset distinguish between classes. It’s not a measure of how mutually exclusive the features/dimensions are in general, but of how separate the classes are within that feature space. To put it another way, imagine each data point (each song from our earlier example) as a star floating around in the universe. Trying to find that song’s class is like trying to figure out which galaxy that star belongs to. If each galaxy is composed of stars that are tightly clustered together, and there is a bunch of empty space between one galaxy and the next, then identifying the star’s home galaxy becomes really easy. It’s readily apparent which galaxy the star belongs to, because the galaxies are visually “separate”. In technical terms, “separation” means comparing the variance within each class (how tightly clustered the stars are within a galaxy) to the variance between each class’s center (how far apart the galaxies are). In an RkCNN, different random baskets of features will present the classes as different configurations of stars in the universe, and the separation score quantifies how clearly the galaxies in that universe are defined. A higher separation score for a basket means that basket gets a higher voting weight in the ensemble.
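Here’s a rough sketch of those two moving parts in Python. The variance ratio below is my loose reading of the within-versus-between idea from the galaxy analogy, not the authors’ exact formula (that’s in the paper), and the data is made up:

```python
import random
from statistics import mean, pvariance

def separation_score(X, y, features):
    """Rough 'between-class vs. within-class' spread for one feature basket."""
    score = 0.0
    for f in features:
        values_by_class = {}
        for row, label in zip(X, y):
            values_by_class.setdefault(label, []).append(row[f])
        centers = [mean(vals) for vals in values_by_class.values()]
        between = pvariance(centers)  # how far apart the galaxy centers are
        within = mean(pvariance(vals) for vals in values_by_class.values())  # how tight each galaxy is
        score += between / (within + 1e-9)
    return score / len(features)

def random_feature_baskets(n_features, n_baskets, seed=0):
    """Draw random baskets of random sizes, one per base classifier."""
    rng = random.Random(seed)
    for _ in range(n_baskets):
        size = rng.randint(2, max(2, n_features // 2))
        yield rng.sample(range(n_features), size)

# Tiny made-up demo: feature 0 separates the classes cleanly; feature 1 is noise.
X = [(0.1, 5.0), (0.2, 9.0), (0.15, 2.0), (0.9, 6.0), (1.0, 1.0), (0.95, 8.0)]
y = ["Jazz", "Jazz", "Jazz", "Rock", "Rock", "Rock"]
print(separation_score(X, y, [0]))  # high: the classes form tight, distant "galaxies"
print(separation_score(X, y, [1]))  # low: the classes overlap completely
# In the full ensemble, each random basket would get a voting weight like this.
```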

But why do the authors think an ensemble would be better? Why wouldn’t they just create all the models, and then choose the single model with the highest separation score? That’s because of something called the bias-variance tradeoff. When you have a smaller basket of features (fewer dimensions), you can focus more clearly on the most relevant patterns, because the noise and irrelevant information is reduced. So in the end, you can generalize better than you could with a larger set. We call this having lower “bias”, because your analysis isn’t being biased by the presence of irrelevant or misleading features. But with that lower number of features, the model is sensitive to minor changes in the data, which can throw off the classification and lead to higher variance. So it’s a tradeoff:

  • Fewer features equals less bias, but more variance.
  • More features means less variance, but more bias.

So…pick your poison. But wait: in this paper, the authors argue that the ensemble approach means they no longer have to make that tradeoff at all. In this ensemble method, the weighted averaging of all the smaller classifiers gives you the best of both worlds:

  • Low bias, because each of the small classifiers is built on a small group of features.
  • But also low variance, because averaging them all together smooths out each model’s individual sensitivity (a quick simulation of that effect follows this list).
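The variance-reduction half of that claim is easy to see with a toy simulation. Nothing here comes from the paper, and it assumes the little models’ errors are independent, which a real ensemble only approximates:

```python
import random
from statistics import pstdev

random.seed(1)

def one_small_classifier():
    """Stand-in for a single small model: the right answer (0.7) plus noise."""
    return 0.7 + random.gauss(0, 0.2)

# Spread of one model's output vs. the average of 50 models, over many trials.
singles = [one_small_classifier() for _ in range(1000)]
ensembles = [sum(one_small_classifier() for _ in range(50)) / 50 for _ in range(1000)]

print(round(pstdev(singles), 3))    # roughly 0.2  -- one small model: high variance
print(round(pstdev(ensembles), 3))  # roughly 0.03 -- same center, far less variance
```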

So in this ensemble, bias holds steady, but variance reduces. Or at least, that was their theory. The question is: did it work? Spoiler alert: yes, it did.

To evaluate the system, the researchers conducted a series of experiments on both synthetic (simulated) datasets and real-world datasets, comparing RkCNN’s performance against other nearest-neighbor-based methods and some other common algorithms. These included:

  • Plain vanilla kNN
  • kCNN
  • Random kNN (RkNN)
  • Ensemble of subsets of kNN (ESkNN)
  • Random forests (RF)
  • Support vector machines (SVM)

They looked at misclassification rates and the Matthews correlation coefficient (MCC). The results demonstrated that RkCNN consistently outperformed competing methods, particularly on noisy, high-dimensional datasets. While standard kNN and kCNN were highly sensitive to irrelevant features, RkCNN’s ability to isolate informative feature subsets led to lower misclassification rates across all tested scenarios. The use of separation scores to weight feature subsets proved critical, as subsets with higher separation scores contributed more effectively to the ensemble, ensuring that noisy or irrelevant subsets were downweighted or excluded entirely.

Despite these successes, the researchers acknowledged a number of challenges. Chief among them was the computational complexity of the method, which increased with the size of the dataset and the number of random subsets that were generated. Although the separation score calculation was relatively efficient, the ensemble nature of RkCNN means training many models, and as the datasets grew, this became more and more computationally intensive. Looking ahead, the researchers proposed several directions for improving the system: from optimizing the parameter selection process, to parallelizing the generation and evaluation of subsets, to developing approximations for the separation score. So I’d expect to see a lot more work published around this new variant in the future. If you’d like to go deeper on their algorithm, read their pseudocode, review their formulas or diagrams, or get access to the datasets they used in their evaluation, then I’d highly recommend that you download the PDF.