Variational Autoencoders-Based Algorithm for Multi-Criteria Recommendation Systems


Let’s say you’re planning a vacation with your family.

Got your PTO approved? Check.
Found a pet-sitter for the time you’re gone? Check.
Passports? Check.
Plane tickets? Check.

Now all you need to do is book the rooms. But you have a better idea: you’ll book on Airbnb instead! You go to the site, enter the location, enter your dates, adjust the filters, and see a multitude of options appear on the map in front of you. If you’re like most people, you’re going to assume two things:

The properties you’re seeing on the map are there because they match your filters.
The properties you’re seeing are all the options. The complete set of all the properties that match your filters in that area for those dates.

On both counts, you’re probably wrong. While it might seem like a straightforward search-and-filter experience, the search results you get on Airbnb are actually provided by a recommendation algorithm. They’re not trying to show you the exhaustive set of matches for your query; they’re trying to show you the results that you are most likely to book. And this includes listings that don’t meet your criteria, but that they think you’re likely to book anyway. To quote from their own Help Center article (linked above):

If there aren’t enough high quality listings available that match a guest’s search criteria, we may show other listings that we think might appeal to the guest, even if they do not meet all of the guest’s criteria. [Emphasis mine]

Notice the careful use of the term “high quality” there. They’re not saying that they’ll supplement-in results when you’ve exhausted all the properties that match your filters. The opposite. They’re saying they’ll supplement-in results when you’ve run through all the properties they consider “high quality” for you at that moment…in whatever way they are choosing to define high-quality at the time.

Bold, right? I mean, it’s kind of astounding. You’re setting filters to tell them exactly what you want, and they’re coming back and saying “That’s not what you want, THIS is what you want instead”. How can they be so confident? How can they be so sure that they know (perhaps even better than you do) what you would, and would not, enjoy? Well, today’s paper gives us a clue about that. A clue that might just show us how and why they can be so supremely confident in their own ability to recommend properties to you. It all has to do with their rating system.

Think back to the last time you booked a stay. After you checked out, you were prompted to leave a review, and there’s a good chance you did. And that was not a simple thumbs-up / thumbs-down. You were probably prompted to rate the booking on cleanliness, accuracy, communication, check-in experience, location, and value for money, each with its own separate rating. After that, they probably asked you to verify the availability of certain features on the property. And that’s not all: you were then prompted for multiple free-text responses as well (the private feedback and the public feedback). In these paragraphs you could describe, in detail, exactly how you felt about the place you just stayed.

You probably thought that you were writing these detailed ratings for the community. You thought you wrote that tome about the dirty bathtub to help other future travelers. But that’s only partially true. In reality, all those stars and rating numbers you chose, and the text fields you filled in, were about more than that. When you leave a rating and review, you’re not just telling the public about that property, you’re telling Airbnb what you like, what you love, what you appreciate, and what you don’t. And that info is being used to power Airbnb’s recommendation algorithm. It influences which listings get highlighted in search results and on maps, and it influences which supplemental results get pulled in.

This kind of system is called an MCRS: a Multi-Criteria Recommender System. Anytime you’re asked to rate a product or service on more than one criterion, there’s a good chance you’re feeding an MCRS. You’ll see these prompts pop up after your Uber Eats delivery, or after you finish a Udemy course, or buy something on Amazon. And for good reason. They’re incredibly powerful tools.

And that’s what the authors are talking about in this paper. They argue that compared to a simple binary rating (like a thumbs-up/thumbs-down), an MCRS gives you the ability to make much stronger recommendations, with many fewer ratings. It also lets you capture nuanced user preferences with greater accuracy, and adapt recommendations dynamically based on evolving user behavior in ways that wouldn’t be possible otherwise.

The question is, after you’ve prompted the user to give you this detailed multi-dimensional feedback, how can you build a recommendation engine around it? It’s not like dealing with binary ratings for single parameters. Simple linear correlations (“people who liked this are likely to like that”) aren’t going to cut it when you’re dealing with data this complicated. And that is why the authors are proposing that you build a model around a Variational Autoencoder (VAE). In this paper they did just that, and then they benchmarked it against a number of other MCRS options. Let’s take a look at how their VAE works, and then we’ll see how it stacks up against the competition.

A VAE is a deep learning model designed to learn compact representations of complex data while preserving its underlying structure. Traditional autoencoders simply compress input data into a lower-dimensional space and then attempt to reconstruct it. VAEs go further. They incorporate a probabilistic component that allows for more flexible and expressive representations. This makes them particularly well-suited for applications where data exhibits variability and uncertainty. Instead of mapping inputs to a single fixed latent representation, a VAE learns a probability distribution over possible representations, enabling it to generate new data points and make more nuanced predictions.

Its architecture consists of two primary components: an encoder and a decoder, connected through a latent space.

The encoder takes high-dimensional input data (such as a user’s multi-criteria ratings) and compresses it into a lower-dimensional representation known as the latent vector. But, instead of producing a single deterministic vector, the encoder learns to generate two separate vectors: one representing the mean and the other representing the variance of a probability distribution (in practice, implementations often output the log of the variance). These parameters define a multivariate distribution from which the final latent representation is sampled. This stochastic sampling process helps keep the model from overfitting to specific training data points, pushing it instead to capture the underlying structure of the data.
The decoder then takes a sampled point from this latent distribution and reconstructs the original input as accurately as possible. By pushing the latent representations toward a smooth probability distribution, the VAE introduces regularization that discourages the model from memorizing specific data points, which tends to improve generalization. This is particularly useful in recommendation systems, where user preferences are not absolute but exist within a spectrum of possible choices.
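To make the sampling and regularization steps a little more concrete, here is a minimal sketch in plain Python. It is not the authors’ implementation: it assumes the common setup in which the encoder outputs a mean and a log-variance per latent dimension, the prior is a standard normal, and reconstruction quality is measured with squared error. The function names and the example numbers are made up for illustration.

```python
import math
import random


def sample_latent(mean, log_var, rng):
    """Reparameterized draw: z = mean + sigma * eps, with eps ~ N(0, 1) per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def kl_to_standard_normal(mean, log_var):
    """KL(q || p) for q = N(mean, diag(exp(log_var))) and p = N(0, I), summed over dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mean, log_var))


def vae_loss(x, x_hat, mean, log_var):
    """Squared reconstruction error plus the KL regularizer (assumed loss, see lead-in)."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return reconstruction + kl_to_standard_normal(mean, log_var)


# Hypothetical encoder output for one user: a 2-dimensional latent distribution.
mean, log_var = [0.3, -1.2], [-0.5, 0.1]
z = sample_latent(mean, log_var, random.Random(0))  # the seed fixes this particular draw
```

The KL term is zero only when the encoder’s distribution is exactly the standard normal, so minimizing it alongside the reconstruction error is what pulls the latent space toward the smooth shape described above.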

When applied to MCRS, VAEs give you a significant advantage over traditional methods. Conventional collaborative filtering systems rely on explicit or implicit relationships between users and items. Because of this, they often struggle with sparsity issues when users have not rated enough items for the system to make accurate recommendations. A VAE can mitigate this limitation by inferring user preferences even from limited data. It does this by leveraging the probabilistic nature of the latent space to fill in gaps in a meaningful way.
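One common way to handle those gaps in practice, sketched below, is to score the model only on the ratings a user actually gave and then read its reconstruction for the missing ones. This is an assumption for illustration, not something this summary confirms about the paper’s method; the helper names are hypothetical, and `None` is used here to mark a missing rating.

```python
def masked_squared_error(ratings, reconstruction):
    """Mean squared error over observed entries only; None marks a missing rating."""
    observed = [(r, p) for r, p in zip(ratings, reconstruction) if r is not None]
    if not observed:
        return 0.0
    return sum((r - p) ** 2 for r, p in observed) / len(observed)


def fill_missing(ratings, reconstruction):
    """Keep observed ratings; use the model's reconstruction where a rating is missing."""
    return [p if r is None else r for r, p in zip(ratings, reconstruction)]


# Made-up example: the user rated two of three criteria.
ratings = [4.0, None, 2.0]
reconstruction = [3.0, 5.0, 2.0]  # hypothetical decoder output

error = masked_squared_error(ratings, reconstruction)
completed = fill_missing(ratings, reconstruction)
```

Here the missing middle rating contributes nothing to the error, and the completed vector takes the decoder’s value for it while leaving the two observed ratings untouched.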

Another major benefit of VAEs is their ability to capture complex, nonlinear relationships between different criteria. They can uncover dependencies that are difficult to express with explicit rules or linear models. This enables them to generate more personalized and context-aware recommendations that better reflect real user behavior.

The authors built a VAE specifically for MCRS, and then tested it on the Yahoo! Movies dataset. The dataset contains multi-criteria ratings for films based on acting, story, visuals, and direction. It includes both individual aspect ratings and an overall rating for each user-item interaction. To evaluate the recommender, the authors used Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).

MAE measures the average absolute deviation between predicted and actual ratings.
RMSE squares each error before averaging and then takes the square root, so it penalizes larger errors more heavily. That makes it a useful indicator of whether a model is occasionally far off the mark.
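To see the difference between the two metrics, here is a small sketch in plain Python. The ratings are made up for illustration and are not taken from the paper.

```python
import math


def mae(actual, predicted):
    """Mean absolute error: the average size of the prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: squaring gives large errors extra weight."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Both prediction sets have the same total absolute error (3),
# but the second concentrates it in one large miss.
actual = [4.0, 3.0, 5.0]
spread_out = [3.0, 2.0, 4.0]    # off by 1 on every item
one_big_miss = [4.0, 3.0, 2.0]  # off by 3 on a single item

print(mae(actual, spread_out), rmse(actual, spread_out))
print(mae(actual, one_big_miss), rmse(actual, one_big_miss))
```

MAE comes out to 1.0 for both sets of predictions, while RMSE rises from 1.0 to about 1.73 for the set with the single large miss. That gap is what "penalizes larger errors more heavily" means in practice.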

But they didn’t just run their VAE through the benchmark; they also ran several other recommendation approaches: some deep learning models, a comparator, an ANN, and traditional collaborative filtering methods. Each of these models was trained and tested on the same dataset, to allow for a direct performance comparison. So how did the VAE do? Very well!

The results showed that it achieved the lowest MAE and RMSE scores (which is good...lower is better). But, that doesn’t mean it’s necessarily a practical option. The probabilistic nature of VAEs provides flexibility, but it also introduces significant challenges. Training a VAE is computationally expensive, requiring careful tuning of hyperparameters to prevent instability in the latent space. And unlike traditional collaborative filtering, which provides clear, interpretable recommendations, VAEs operate as a black box. This makes it difficult to explain why certain items are suggested and others are not. Additionally, VAEs can sometimes oversimplify user preferences. While they smooth over missing data effectively, they risk generating recommendations that align statistically but fail to capture more nuanced, individual preferences. The model’s reliance on learning a structured latent representation also means it may struggle when user feedback is extremely sparse, potentially hallucinating ratings in a way that reduces trust in the recommendations. All in all, while the system they built here demonstrates clear advantages over other methods, it needs to be used with caution. Its success is highly dependent on the quality and distribution of input data, and its increased computational cost and lack of interpretability are serious trade-offs that need to be considered before you deploy something like this.

That being said, if you’d like to dive into the pseudocode for their algorithms, review the results tables, or learn more about the structure of the dataset they used, I’d highly recommend that you download the paper.