Query-Adaptive Hybrid Search

Imagine that you're working on a site like WebMD. And just now, as happens millions of times a day, a user has come to your site and entered their symptoms into your search box. They want answers. They want to know what's causing their back pain, or why their throat is sore, or why their hip is making a popping sound all of a sudden. Lucky for them, you're sitting on a corpus of millions of articles. Everything from strep throat to snake bites. There's a really good chance that you've got an article that will help them, somewhere in that pile. The only problem is finding it. Once they click search, you've only got a few moments to make up your mind. A few milliseconds to traverse your database, match their query against your documents, take the top N results, rank them, and return that list.

The traditional approach to this problem is something called BM25. It's a keyword-matching algorithm that scores documents based on how frequently the words from the query appear inside them, adjusted for the length of the document and the overall rarity of those terms. The name itself, which stands for "Best Matching, version 25", is a historical artifact. It was one variant in a sequence of models (like BM11, BM15, and others) that progressively refined a probabilistic ranking approach. BM25 is simply the version that proved most stable and effective in practice, and that's why it became the standard.
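To make that scoring rule concrete, here is a minimal sketch of BM25 in plain Python. It assumes a common textbook form of the formula (with the smoothed IDF used by several search libraries) and the conventional defaults k1 = 1.5 and b = 0.75. The function name, the whitespace tokenization, and the toy documents are all illustrative, not something taken from the paper.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query (BM25 sketch)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # no lexical overlap, no credit
            # Rarer terms get a larger inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates (k1) and is length-normalized (b).
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

# Toy corpus (made up for illustration).
docs = [
    "strep throat causes a sore throat and fever".split(),
    "snake bites need urgent care".split(),
    "lower back pain is often caused by muscle strain".split(),
]
scores = bm25_scores("sore throat".split(), docs)
```

Only the first document shares any words with the query, so it is the only one that gets a non-zero score. That is exactly the behavior the next part of the episode picks apart.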

But BM25 has a fundamental limitation: it only cares about the words themselves. If you search for "automobile" and the relevant document uses the word "car", BM25 gives that document no credit for the match. It's matching strings, not meaning. In the field, this is known as the "vocabulary mismatch" problem. A newer strategy called "dense retrieval" was developed largely in response to this issue. Instead of matching keywords, it encodes both the query and every document in the corpus into vectors. These are high-dimensional numerical representations that live in a shared latent space. The idea is that semantically similar things end up close to each other in that space. So "automobile" and "car" end up close to one another, even though they share only a few letters and don't match as keywords. With this type of system, you'd receive a query, encode it, and then find the documents whose vectors are geometrically closest to it (typically measured by cosine similarity). It works, and it's become widely used for a range of tasks, particularly short, conversational queries, where the semantic relationship between the query and the answer is more important than exact lexical overlap. But, to be clear, neither of these approaches is universally better than the other. BM25 is still quite good on certain types of queries, particularly when the documents are long or when the query contains rare technical terms that a neural model might not generalize well from.
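As a rough illustration of the dense side, the sketch below ranks documents by cosine similarity to a query vector. The three-dimensional vectors are hand-made stand-ins for real encoder output (a real system would use hundreds of dimensions and an approximate nearest-neighbor index), and the names `cosine` and `dense_top_k` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def dense_top_k(query_vec, doc_vecs, k=2):
    """Return the k documents whose vectors are closest to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]

# Hand-made toy embeddings (assumption: stand-ins for real encoder output).
doc_vecs = {
    "car_review": [0.9, 0.1, 0.0],
    "strep_throat": [0.0, 0.2, 0.9],
    "truck_guide": [0.7, 0.6, 0.1],
}
query_vec = [1.0, 0.2, 0.0]  # pretend this is the embedding of "automobile"
top = dense_top_k(query_vec, doc_vecs)
```

Here the "car_review" document comes out on top even though the query string never appears in it, because the comparison happens between vectors rather than between words.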

This motivated the development of a concept called "hybrid retrieval". These are systems that combine sparse, lexical scores from BM25 with dense, semantic scores from a neural encoder, and blend them together into a single ranking. In practice, you:

1. Compute both scores for every candidate document.
2. Normalize them to a shared scale.
3. Take a weighted sum.
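Those three steps can be sketched in a few lines. This is a minimal illustration, assuming min-max normalization for the shared scale and the convention that the weight `alpha` multiplies the dense score while `1 - alpha` multiplies the sparse one. The function names and the scores are made up.

```python
def min_max(scores):
    """Rescale a {doc: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}  # all scores equal, no signal
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_rank(sparse, dense, alpha):
    """Blend normalized sparse and dense scores with a single weight."""
    sparse_n, dense_n = min_max(sparse), min_max(dense)
    fused = {
        doc: alpha * dense_n[doc] + (1 - alpha) * sparse_n[doc]
        for doc in sparse
    }
    return sorted(fused, key=fused.get, reverse=True)

# Made-up scores for three candidate documents.
sparse = {"a": 12.4, "b": 3.1, "c": 7.8}   # BM25-style, unbounded
dense = {"a": 0.41, "b": 0.83, "c": 0.62}  # cosine-style, bounded

lexical_only = hybrid_rank(sparse, dense, alpha=0.0)
semantic_only = hybrid_rank(sparse, dense, alpha=1.0)
```

With `alpha=0.0` the ranking follows BM25 alone, and with `alpha=1.0` it follows the dense scores alone. Everything in between is a compromise, and the weight you pick decides which system wins the disagreements.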

The result is a system that often outperforms either component in isolation. But there's still an issue. You see, most hybrid retrieval systems blend the two scores using a fixed weight, called Alpha. It's a single ratio that applies to every query that hits the system. Why? Because the assumption is that every query benefits from the same balance of semantic and lexical signals.

The authors of today's paper argue that that assumption is wrong.

A short, natural language question like "Why does my neck feel weird in the morning?" is going to benefit enormously from dense semantic retrieval, because the answer might use entirely different vocabulary than the question.
But a query packed with precise technical jargon is one where BM25 is extremely well suited.

And, the authors argue, there's no reason to apply the same blending ratio to both. In fact, they argue, that fixed ratio is a structural bottleneck. It works on average, but it's wrong for a large portion of the queries that actually hit your system. So in this paper, they're introducing an alternative: an adaptive hybrid retrieval framework that replaces the static mixing weight with a module that reads each incoming query up front and predicts the right Alpha to use dynamically. They call it QDAP: the Query-Driven Alpha Prediction module. On today's episode we'll dive into the architecture of their solution and see how it works.

Before we get into the mechanics of QDAP, we need to wrap our heads around why score normalization and fusion are hard in the first place. When you combine a BM25 score with a cosine similarity score, you're trying to combine numbers that live in entirely different numerical spaces. BM25 scores are unbounded and vary wildly depending on corpus size and document length. Cosine similarity, on the other hand, lives in a tightly constrained range, between -1 and 1. If you just add them together with a coefficient, the BM25 number will completely dominate. Not because it's more informative, but just because it's usually so much larger. So, before you can blend them meaningfully, you need to normalize both scores to a common range. The authors look at two approaches to this.

The first is Reciprocal Rank Fusion, which throws away the raw scores entirely and just uses the rank position of each document within each retrieval system's results. This is simple and robust, but you lose the distributional information about how confident each system was.
The second approach, and the one they ultimately chose, is min-max normalization. It rescales each system's scores relative to the highest and lowest scores it produced for that particular query. This preserves the shape of the score distribution, and with both scores living on the same scale, you can just take the weighted sum.
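To see what "losing the distributional information" means in practice, here is a small comparison sketch. It assumes the usual Reciprocal Rank Fusion formula, where each document gets 1 / (k + rank) from every ranking it appears in, with the commonly used constant k = 60. The two score sets are invented: one retriever is very sure about its top hit, the other barely separates its candidates.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists using only rank positions (1-based)."""
    fused = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
    return fused

def to_ranking(scores):
    """Order documents from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)

def min_max(scores):
    """Rescale a {doc: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

# Invented scores: same ordering, very different confidence.
confident = {"a": 0.95, "b": 0.40, "c": 0.38}
hesitant = {"a": 0.52, "b": 0.51, "c": 0.50}

# RRF sees the same ranked list in both cases.
rrf_confident = reciprocal_rank_fusion([to_ranking(confident)])
rrf_hesitant = reciprocal_rank_fusion([to_ranking(hesitant)])

# Min-max keeps where "b" sits between the best and worst score.
b_confident = min_max(confident)["b"]
b_hesitant = min_max(hesitant)["b"]
```

RRF produces identical output for both retrievers, because all it ever sees is the order a, b, c. Min-max normalization still tells you that "b" is nearly tied with the worst candidate in one case and sits halfway up in the other.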

The question QDAP is trying to answer is: what should those weights be? Given a query, how much should the final score depend on the dense signal versus the sparse signal? Their big idea is that the query embedding itself (the vector representation that the dense encoder produces as part of its normal retrieval operation) already contains the information needed to answer that question. A query that's abstract and semantically rich will look different in the embedding space than a query that is full of specific technical terms. The structure and nature of the query is encoded in that vector, and a prediction module sitting on top of that embedding should be able to learn to recognize which types of queries need more BM25 and which need more neural retrieval.

They implement this with two versions of the same idea. QDAP-S is the lightweight variant. It takes the embedding that the dense model already computed for retrieval and runs a small prediction head on top of it. No extra encoding step, very little added latency. QDAP-L is the full-capacity version. It runs a second encoder in parallel to produce a richer representation of the query before making the same prediction. That improves accuracy, but costs more memory and compute. So QDAP-S ends up being cheaper and faster, while QDAP-L ends up being more precise.
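Here is a toy sketch of what a lightweight head in the spirit of QDAP-S could look like: one linear layer followed by a softmax over a grid of candidate weights. Everything specific here is an assumption made for illustration, including the single-layer design, the 11-point grid, the hand-set parameters (a real head would be trained), and reading off a single Alpha as the expected value of the distribution. The episode doesn't spell out those details.

```python
import math

# Assumed grid of candidate blending weights: 0.0, 0.1, ..., 1.0.
ALPHA_BINS = [i / 10 for i in range(11)]

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_alpha_distribution(query_embedding, weights, bias):
    """Linear layer + softmax: one probability per candidate weight."""
    logits = [
        sum(w * x for w, x in zip(row, query_embedding)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)

def expected_alpha(probs):
    """Collapse the distribution to one weight (assumed read-out rule)."""
    return sum(p * a for p, a in zip(probs, ALPHA_BINS))

# Hand-set toy parameters for a 2-d "embedding": dimension 0 pushes
# toward dense retrieval, dimension 1 pushes toward BM25.
weights = [[(a - 0.5) * 4, -(a - 0.5) * 4] for a in ALPHA_BINS]
bias = [0.0] * len(ALPHA_BINS)

conversational = predict_alpha_distribution([1.0, 0.0], weights, bias)
jargon_heavy = predict_alpha_distribution([0.0, 1.0], weights, bias)
alpha_conversational = expected_alpha(conversational)
alpha_jargon = expected_alpha(jargon_heavy)
```

The point of the sketch is the data flow: the embedding that was already computed for retrieval goes in, a distribution over blending weights comes out, and no second encoding pass is needed.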

Both variants share the same output and training setup. Instead of predicting a single weight, the output is a distribution over possible weights. That reflects the fact that for some queries, there isn't a single sharp optimum. Several nearby weightings may perform similarly, and the model is trained to capture that shape rather than collapse it to a point. To do this, the authors precompute retrieval performance across the full range of weights for each query, and that performance curve becomes the target for both QDAP-S and QDAP-L. And both variants are also trained with the same loss function: a combination of cross-entropy and Wasserstein distance.
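The training target and loss can be sketched as follows. This is a hedged illustration, not the authors' exact formulation: turning the performance curve into a distribution with a temperature softmax, the balance factor `lam`, and all function names are assumptions. The one-dimensional Wasserstein distance is computed the standard way for two distributions on the same evenly spaced grid, as the sum of absolute differences between their cumulative sums (in units of one grid step).

```python
import math

def target_from_curve(metric_per_alpha, temperature=0.1):
    """Turn a per-weight performance curve into a target distribution.
    (Assumed conversion: temperature softmax over the metric values.)"""
    m = max(metric_per_alpha)
    exps = [math.exp((x - m) / temperature) for x in metric_per_alpha]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def wasserstein_1d(target, predicted):
    """1-D Wasserstein distance on a shared, evenly spaced grid."""
    dist, cdf_t, cdf_p = 0.0, 0.0, 0.0
    for t, p in zip(target, predicted):
        cdf_t += t
        cdf_p += p
        dist += abs(cdf_t - cdf_p)
    return dist

def qdap_loss(target, predicted, lam=1.0):
    """Combined loss; `lam` is a hypothetical balance factor."""
    return cross_entropy(target, predicted) + lam * wasserstein_1d(target, predicted)

# Invented retrieval quality at alpha = 0.0, 0.2, 0.4, 0.6, 0.8, 1.0.
curve = [0.30, 0.42, 0.55, 0.61, 0.60, 0.48]
target_curve = target_from_curve(curve)

# Two wrong predictions: one off by a single bin, one far away.
target = [0.0, 0.1, 0.8, 0.1, 0.0]
near_miss = [0.0, 0.8, 0.1, 0.1, 0.0]
far_miss = [0.8, 0.0, 0.1, 0.1, 0.0]
near_cost = wasserstein_1d(target, near_miss)
far_cost = wasserstein_1d(target, far_miss)
```

One plausible reason to pair the two terms: the Wasserstein part knows that the candidate weights are ordered, so a prediction that lands one bin away from the best weight is penalized less than one that lands at the opposite end of the range.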

But they don't stop there. They also modify the training regime a bit. The standard approach to training dense retrieval models is contrastive learning: you give the encoder a query and a relevant document, and then a collection of irrelevant documents, and you train it to push the query embedding close to the relevant one and far from the irrelevant ones. But that objective is defined for the dense model in isolation, with no knowledge of what the sparse retriever already gets right. In a hybrid system, this creates a mismatch. If the sparse retriever already handles a given document well, it doesn't matter much whether the dense model finds that document hard or easy. The hybrid system will get it right regardless. What actually matters, for the performance of the hybrid as a whole, is whether the dense model learns to handle the cases where BM25 fails.

To address this, the authors introduce what they call antagonist negative sampling. First, they filter the training set to keep only the query-document pairs where BM25 struggles. And second, within those filtered training instances, they select hard negatives that are ranked highly by both systems simultaneously. These are documents that the dense model finds plausible and that BM25 also finds plausible, but that aren't actually the correct answer. By forcing the dense encoder to distinguish the true positive from these jointly-confusing negatives, the training process directly teaches it to patch the failure modes of the full system, not just its own.
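The two-step selection can be sketched like this. It is a loose illustration under stated assumptions: "BM25 struggles" is interpreted as the correct document falling outside BM25's top `fail_rank` results, and "ranked highly by both systems" as appearing in both top-`top_k` lists, ordered by the worse of the two ranks. The thresholds, the function name, and the document IDs are hypothetical. The paper's pseudocode is the place to look for the real procedure.

```python
def select_antagonist_negatives(positive, bm25_ranking, dense_ranking,
                                fail_rank=10, top_k=20, n_negatives=2):
    """Pick hard negatives that both retrievers rank highly.

    Returns None when BM25 already handles the query (instance is
    filtered out), otherwise a list of negative document IDs.
    """
    # Step 1: keep only instances where BM25 struggles with the positive.
    if positive in bm25_ranking[:fail_rank]:
        return None
    # Step 2: find wrong documents near the top of BOTH rankings.
    bm25_pos = {doc: r for r, doc in enumerate(bm25_ranking[:top_k])}
    dense_pos = {doc: r for r, doc in enumerate(dense_ranking[:top_k])}
    shared = [d for d in bm25_pos if d in dense_pos and d != positive]
    # Sort by the worse of the two ranks, so a document only counts as
    # "highly ranked" if both systems place it near the top.
    shared.sort(key=lambda d: max(bm25_pos[d], dense_pos[d]))
    return shared[:n_negatives]

# Invented rankings for one training query whose correct answer is "d4".
bm25_ranking = ["d3", "d7", "d1", "d9", "d4"]
dense_ranking = ["d7", "d4", "d3", "d2", "d1"]

negatives = select_antagonist_negatives(
    "d4", bm25_ranking, dense_ranking, fail_rank=3, top_k=4
)
skipped = select_antagonist_negatives(
    "d3", bm25_ranking, dense_ranking, fail_rank=3, top_k=4
)
```

In the toy data, "d7" and "d3" are the documents both systems like but that aren't the answer, so they become the negatives. The second call returns `None` because BM25 already ranks that positive first, so the instance would be dropped from training.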

If you want to go deeper, make sure you download the full paper. The authors provide much more on the training procedure itself, the full pseudocode for both the weight predictor and the antagonist sampling algorithm, their evaluation procedure and a per-language breakdown of their results.