:::tip 🎮 Interactive Playground Visualize this concept: Try the ANN Algorithms demo on the EngineersOfAI Playground - no code required. :::
Hybrid Search - Dense and Sparse Retrieval
The Frustrating RAG Problem
The customer success manager files a bug: "Users searching for 'ISO 27001 certification requirements' are getting results about general information security. But we have an entire document section on ISO 27001. The system is broken." The engineer pulls the logs. The query is being processed correctly - it gets embedded, sent to the vector database, and returns 10 results. But the ISO 27001 document is not among them.
The problem is not that the system is broken. The problem is that pure dense retrieval finds semantically related content, and "information security best practices" is semantically related to "ISO 27001" from the embedding model's perspective. But the user typed a specific standard number. They need exact-match retrieval for that identifier, not semantic neighborhood traversal.
BM25 would have solved this immediately - "ISO 27001" in the query exactly matches "ISO 27001" in the document, and BM25 would rank that document first. But the team had replaced their Elasticsearch-based BM25 search with a vector database six months ago, and now they could not do exact term matching.
This is the hybrid search problem. Dense retrieval is excellent for semantic matching but blind to exact identifiers, abbreviations, product names, and technical terms that the embedding model has not learned to cluster correctly. Sparse retrieval is excellent for exact matching but misses paraphrases, synonyms, and contextual relevance. Combining them captures both strengths.
Why This Exists - The Complementarity Problem
Dense embeddings and sparse retrieval fail in complementary and predictable ways.
Dense embeddings fail when:
- The query contains specific identifiers the model has not learned (model numbers, SKUs, technical codes)
- The query is short and ambiguous out of context ("ISO 27001" has high specificity but the embedding may not capture it)
- The relevant document uses very different vocabulary from the query
- New terminology has emerged after the model's training cutoff
BM25 fails when:
- The query and document use different words for the same concept ("automobile" vs "car")
- The relevant document is short or the vocabulary is sparse
- The query is conceptual ("explain the tradeoffs of normalization")
- Cross-lingual retrieval is needed
These failure modes are largely independent, which means combining them is highly effective. Hybrid search consistently outperforms either method alone on standard retrieval benchmarks like BEIR (Thakur et al., 2021), with average improvement of 3–8 points in NDCG@10.
BM25: Why It Still Matters
BM25 (Best Matching 25, Robertson and Zaragoza, 1994) is a bag-of-words term weighting function that improves on raw TF-IDF by penalizing document length and saturating term frequency:
Where:
- penalizes common terms
- (typically 1.2–2.0) controls term frequency saturation
- (typically 0.75) controls document length normalization
- is document length, is average corpus document length
BM25 represents queries and documents as sparse vectors (one dimension per vocabulary term, value = BM25 weight). The dot product of sparse query and document vectors gives the BM25 score.
from rank_bm25 import BM25Okapi
import numpy as np
from typing import List
class BM25Retriever:
def __init__(self, documents: List[str]):
# Tokenize (simple whitespace split; use a proper tokenizer in production)
self.tokenized_corpus = [doc.lower().split() for doc in documents]
self.bm25 = BM25Okapi(self.tokenized_corpus)
self.documents = documents
def search(self, query: str, top_k: int = 10) -> List[tuple]:
"""Returns list of (doc_index, score) sorted by score descending."""
tokenized_query = query.lower().split()
scores = self.bm25.get_scores(tokenized_query)
top_k_indices = np.argsort(scores)[::-1][:top_k]
return [(int(idx), float(scores[idx])) for idx in top_k_indices]
Dense Retrieval
Dense retrieval converts both queries and documents to embeddings and finds the nearest neighbors. It captures semantic similarity that BM25 misses.
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss
class DenseRetriever:
def __init__(self, model_name: str = "BAAI/bge-large-en-v1.5"):
self.model = SentenceTransformer(model_name)
self.index = None
self.doc_embeddings = None
def build_index(self, documents: List[str]) -> None:
"""Embed documents and build HNSW index."""
print(f"Embedding {len(documents)} documents...")
self.doc_embeddings = self.model.encode(
documents,
batch_size=256,
normalize_embeddings=True,
show_progress_bar=True,
).astype(np.float32)
d = self.doc_embeddings.shape[1]
self.index = faiss.IndexHNSWFlat(d, 16)
self.index.add(self.doc_embeddings)
def search(self, query: str, top_k: int = 10) -> List[tuple]:
"""Returns list of (doc_index, cosine_similarity) sorted descending."""
q_emb = self.model.encode([query], normalize_embeddings=True).astype(np.float32)
# HNSW returns L2 distance; for normalized vectors, L2 = 2(1 - cosine)
distances, indices = self.index.search(q_emb, top_k)
# Convert L2 to cosine similarity
cosine_scores = 1 - distances[0] / 2
return [(int(idx), float(score)) for idx, score in zip(indices[0], cosine_scores)]
Reciprocal Rank Fusion
The key challenge in hybrid search is combining BM25 and dense retrieval results, which have incompatible score scales. BM25 scores are in the range 0–20+ and depend on corpus statistics. Cosine similarities are in the range 0–1. You cannot simply add them.
Reciprocal Rank Fusion (RRF, Cormack et al., 2009) sidesteps this problem entirely by working with ranks rather than scores:
Where is the set of rankings (BM25 ranking, dense ranking), is document 's rank in ranking , and (typically 60) prevents top-ranked documents from dominating too much.
RRF is robust, requires no tuning, and works well across diverse retrieval scenarios. A document that ranks first in BM25 and first in dense retrieval will have a very high RRF score. A document that ranks 3rd in BM25 and 50th in dense retrieval will still score reasonably - it is clearly relevant to the keyword component.
from typing import Dict, List, Tuple
def reciprocal_rank_fusion(
ranked_lists: List[List[Tuple[int, float]]],
k: int = 60,
top_k: int = 10,
) -> List[Tuple[int, float]]:
"""
Fuse multiple ranked lists using Reciprocal Rank Fusion.
Args:
ranked_lists: List of ranked lists, each being [(doc_id, score), ...]
sorted by score descending. Scores are ignored - only ranks used.
k: RRF constant (default 60 from original paper)
top_k: Number of results to return
Returns:
Fused ranked list of (doc_id, rrf_score) sorted by rrf_score descending
"""
doc_scores: Dict[int, float] = {}
for ranked_list in ranked_lists:
for rank, (doc_id, _) in enumerate(ranked_list):
# rank is 0-indexed; +1 to make 1-indexed, then +k as denominator
rrf_contribution = 1.0 / (k + rank + 1)
doc_scores[doc_id] = doc_scores.get(doc_id, 0.0) + rrf_contribution
# Sort by RRF score descending
fused = sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)
return fused[:top_k]
class HybridRetriever:
"""
Combine BM25 and dense retrieval with RRF fusion.
"""
def __init__(
self,
documents: List[str],
embedding_model: str = "BAAI/bge-large-en-v1.5",
):
self.documents = documents
self.bm25 = BM25Retriever(documents)
self.dense = DenseRetriever(model_name=embedding_model)
self.dense.build_index(documents)
def search(
self,
query: str,
top_k: int = 10,
k_candidates: int = 50,
rrf_k: int = 60,
) -> List[dict]:
"""
Hybrid search: retrieve top-k candidates from both BM25 and dense,
then fuse with RRF.
k_candidates should be > top_k to ensure good coverage before fusion.
"""
bm25_results = self.bm25.search(query, top_k=k_candidates)
dense_results = self.dense.search(query, top_k=k_candidates)
fused = reciprocal_rank_fusion(
[bm25_results, dense_results],
k=rrf_k,
top_k=top_k,
)
return [
{
"rank": rank + 1,
"doc_id": doc_id,
"rrf_score": rrf_score,
"content": self.documents[doc_id][:200],
}
for rank, (doc_id, rrf_score) in enumerate(fused)
]
SPLADE: Learned Sparse Models
SPLADE (SParse Lexical AnD Expansion, Formal et al., 2021) trains a neural model to produce sparse embeddings - vectors where most dimensions are zero, but nonzero dimensions correspond to vocabulary terms, like BM25, but with neural-learned weights rather than TF-IDF statistics.
SPLADE's key advantage over BM25: it performs query expansion. If you search for "car," SPLADE's sparse vector may also activate "automobile," "vehicle," "motor vehicle" - terms that would not be in the BM25 query vocabulary. This gives SPLADE much of BM25's precision on exact matches while also capturing some semantic expansion.
SPLADE sparse vectors can be used directly with vector databases that support sparse vectors (Qdrant, Weaviate, Pinecone), enabling hybrid search within a single vector database query rather than requiring separate BM25 and dense retrieval systems.
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
import numpy as np
from typing import Dict
class SPLADEEncoder:
"""
SPLADE encoder: produces sparse vectors suitable for hybrid search.
"""
def __init__(self, model_name: str = "naver/splade-cocondenser-ensemble-distil"):
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForMaskedLM.from_pretrained(model_name)
self.model.eval()
def encode_to_sparse(self, text: str) -> Dict[int, float]:
"""
Returns a sparse vector as {token_id: weight}.
Only nonzero entries are returned (typically 20-200 out of 30522 vocab tokens).
"""
tokens = self.tokenizer(
text,
return_tensors="pt",
truncation=True,
max_length=512,
)
with torch.no_grad():
outputs = self.model(**tokens)
# SPLADE activation: ReLU(log(1 + logits)) max-pooled over token dimension
logits = outputs.logits # (1, seq_len, vocab_size)
activations = torch.log(1 + torch.relu(logits))
sparse_vector = activations.max(dim=1).values.squeeze(0) # (vocab_size,)
# Return only nonzero entries as dict
nonzero_mask = sparse_vector > 0
indices = nonzero_mask.nonzero(as_tuple=True)[0].cpu().numpy()
values = sparse_vector[nonzero_mask].cpu().numpy()
return {int(idx): float(val) for idx, val in zip(indices, values)}
def decode_sparse_vector(self, sparse_vec: Dict[int, float], top_k: int = 20) -> Dict[str, float]:
"""Decode token IDs back to human-readable terms for debugging."""
sorted_terms = sorted(sparse_vec.items(), key=lambda x: x[1], reverse=True)[:top_k]
return {
self.tokenizer.decode([token_id]): weight
for token_id, weight in sorted_terms
}
Tuning the Alpha Parameter
When using a weighted combination (instead of RRF), the alpha parameter controls the balance between BM25 and dense:
Both scores must be normalized to [0,1] before weighting. This requires collecting score ranges across the corpus, which makes it brittle. RRF is more robust because it is range-free.
When to use alpha tuning instead of RRF:
- You have labeled relevance data and can optimize alpha via grid search
- Your queries have predictable structure (all keyword queries vs all semantic queries)
- You want explicit control over the keyword-semantic balance for different query types
def normalize_scores(scores: List[Tuple[int, float]]) -> List[Tuple[int, float]]:
"""Min-max normalize scores to [0, 1]."""
if not scores:
return scores
values = [s for _, s in scores]
min_val, max_val = min(values), max(values)
if max_val == min_val:
return [(doc_id, 1.0) for doc_id, _ in scores]
return [
(doc_id, (score - min_val) / (max_val - min_val))
for doc_id, score in scores
]
def alpha_weighted_fusion(
bm25_results: List[Tuple[int, float]],
dense_results: List[Tuple[int, float]],
alpha: float = 0.7, # higher = more semantic weight
top_k: int = 10,
) -> List[Tuple[int, float]]:
"""
Weighted combination of normalized BM25 and dense scores.
Requires score normalization - less robust than RRF but more tunable.
"""
bm25_norm = dict(normalize_scores(bm25_results))
dense_norm = dict(normalize_scores(dense_results))
all_doc_ids = set(bm25_norm) | set(dense_norm)
combined = {}
for doc_id in all_doc_ids:
bm25_score = bm25_norm.get(doc_id, 0.0)
dense_score = dense_norm.get(doc_id, 0.0)
combined[doc_id] = alpha * dense_score + (1 - alpha) * bm25_score
sorted_results = sorted(combined.items(), key=lambda x: x[1], reverse=True)
return sorted_results[:top_k]
When Hybrid Beats Pure Dense
Hybrid search adds latency and complexity. It is worth adding when:
-
Technical identifiers matter: Product codes, standard numbers, model names, error codes, API endpoint names. Dense search will find semantically related results; BM25 finds the exact match.
-
Short unique queries: One or two rare terms where the embedding model generalizes too broadly.
-
Enterprise knowledge bases: Legal documents, compliance guides, technical specifications - all have precise terminology that must be matched exactly.
-
Multilingual corpora: BM25 handles cross-lingual exact match (proper nouns) while dense handles semantic cross-lingual matching.
-
Measured regression on production queries: If you have labeled query-document pairs and pure dense retrieval misses identifiable patterns, that is your empirical signal to add BM25.
Production Engineering Notes
Run A/B test before full rollout. Deploy hybrid search to 10% of traffic and compare NDCG@10, click-through rate, or downstream task performance against pure dense for the control group. Hybrid is not universally better - for heavily semantic queries (question answering, concept explanation), it can add noise from exact-match BM25 on irrelevant terms.
BM25 infrastructure is not free. BM25 requires maintaining an inverted index. You can run it via Elasticsearch/OpenSearch, or use Qdrant/Weaviate's native sparse vector support to keep everything in one system. The native sparse vector approach is simpler operationally but requires that your vector database supports it.
Tune k_candidates before RRF. RRF only fuses what you retrieved. If the relevant document is ranked 60th by BM25 but 1st by dense, you need k_candidates >= 60 in your BM25 retrieval to ensure it gets included in the fusion. Set k_candidates = 3–5× your final top_k to ensure adequate coverage from both systems.
Common Mistakes
:::danger Normalizing BM25 scores with global min-max from the corpus BM25 scores change as the corpus grows. A document with score 8.5 at corpus size 10K will have a different normalized value at corpus size 1M as the IDF values shift. If you use alpha-weighted fusion with score normalization, you must recompute normalization parameters periodically. RRF avoids this problem entirely by ignoring raw scores. :::
:::warning Adding hybrid search to fix a chunking problem If your RAG pipeline is missing documents because chunks are too large (diluting the signal) or too small (losing context), fixing the chunking strategy will improve results more than adding BM25. Hybrid search fixes the vocabulary gap problem, not the chunking problem. Diagnose which failure mode you have before adding complexity. :::
:::tip Check BEIR benchmark for your domain's category The BEIR benchmark covers 18 retrieval datasets across different domains (biomedical, legal, news, financial, scientific). Find the category closest to your domain and use it to predict whether dense, BM25, or hybrid will perform best. In BEIR results, hybrid almost always wins on technical/domain-specific corpora, while pure dense is competitive on general knowledge corpora. :::
Interview Questions
Q1: Why does pure dense vector search sometimes miss exact keyword matches? Give a concrete example.
Dense embeddings project text into a continuous vector space where semantically similar texts cluster together. The embedding model learns neighborhood relationships from training data. For well-represented terms, this works beautifully. For rare identifiers - "CVE-2024-1234," "ISO 27001," "RFC 8446" - the model either maps them to their semantic neighborhood (security vulnerabilities, information security, TLS) or maps them to a generic region of the space because it saw these terms infrequently during training. BM25 has no such problem: "CVE-2024-1234" in the query exactly matches "CVE-2024-1234" in the document with a high IDF weight because it is a rare term.
Q2: Explain Reciprocal Rank Fusion. Why use it instead of a weighted sum of scores?
RRF converts both ranked lists to a common rank space, computes for each document in each list, and sums across lists. This sidesteps score normalization entirely - BM25 scores (0–20+) and cosine similarities (0–1) are incompatible units that cannot be meaningfully weighted without normalization. Normalization requires knowing the score distribution across the corpus, which changes as documents are added. RRF is robust, requires no calibration, works well across different retrieval system combinations, and the default from the original paper works across most scenarios without tuning.
Q3: What is SPLADE and how does it differ from BM25?
Both BM25 and SPLADE produce sparse vectors where dimensions correspond to vocabulary terms. BM25's term weights are computed using statistical formulas (TF, IDF, document length). SPLADE's term weights are produced by a fine-tuned masked language model - the model learns to activate relevant terms, including terms that are semantically related to the input (query expansion). For the query "electric vehicle charging," SPLADE might activate "EV," "battery," "plug," "charger" even if those exact terms are not in the input. BM25 only activates exactly the tokens in the input. SPLADE combines BM25's precision with some of dense retrieval's semantic generalization.
Q4: You are building a hybrid search system and need to decide between RRF and alpha-weighted fusion. Which do you choose and why?
RRF unless you have labeled relevance data to optimize alpha and a stable query distribution. RRF requires no calibration, handles score distribution changes as the corpus evolves, works well with default k=60, and has been empirically validated across diverse retrieval tasks. Alpha-weighted fusion requires: normalized scores (which change as corpus grows), careful calibration of alpha (typically 0.5–0.8, but domain-dependent), and re-calibration when query distribution changes. The only clear win for alpha-weighted fusion is when you have a labeled evaluation set, stable query distribution, and want to maximize performance on that specific distribution.
Q5: Your hybrid search recall@10 is 0.91 but pure dense is 0.89. Is it worth the added complexity?
Depends on what "worth it" means. A 2-point recall improvement is meaningful at scale - if you process 10 million queries per day, that is 200,000 fewer missed retrievals. But the operational cost matters: hybrid search doubles or triples the retrieval latency (two parallel retrievals + fusion), requires maintaining BM25 infrastructure (or sparse vector support), and complicates the retrieval pipeline. My recommendation: deploy hybrid search if (a) the 2-point improvement is statistically significant on your production query distribution, not just a test set; (b) the latency overhead is within budget; (c) you have identified specific failure modes of pure dense (e.g., missed technical identifiers) that hybrid actually fixes. A 2-point recall improvement due to random noise in the test set is not worth it.
