Skip to main content

Hybrid Search: Dense and Sparse

The Edge Case That Exposes Every Dense-Only System

The engineering team at a healthcare software company had built a confident, elegant RAG system. Dense embeddings, Qdrant, text-embedding-3-large. It worked beautifully on natural language questions. Then a physician typed a query: "ICD-10 code E11.65".

The embedding model had never learned to meaningfully represent diagnostic codes. "E11.65" (Type 2 diabetes mellitus with hyperglycemia) had an embedding vector that was essentially random noise relative to other ICD codes - the model had seen too few examples of medical code semantics to learn the structure. The entire ICD-10 code space was a flat, undifferentiated blob in the embedding space.

Meanwhile, a BM25 index would have retrieved any document containing the string "E11.65" with perfect precision. This wasn't a semantic query - it was a keyword lookup. But the team had built a system that only knew how to do semantic search.

This isn't a failure unique to medical systems. Product SKUs, part numbers, version strings (Python 3.11.2), legal citation formats (§ 102(a)(1)), financial tickers (GOOGL, TSLA), and technical error codes (ECONNREFUSED, HTTP 429) all share the same property: they are exact strings where semantic similarity is meaningless. Dense retrieval fails at these cases. Keyword search handles them trivially.

Hybrid search combines both signals. The result consistently outperforms either approach alone across most query distributions.

Why This Exists: The Complementary Failure Modes

Dense retrieval and sparse retrieval fail at opposite ends of the query spectrum:

Dense retrieval (embedding-based) fails when:

  • The query contains rare technical terms, product codes, or identifiers not well-represented in training data
  • The query has specific vocabulary that must appear verbatim (exact model names, API error codes, regulation numbers)
  • The query is very short (1-2 words) and lacks context for the embedding to capture intent
  • The document uses different vocabulary than the training distribution

Sparse retrieval (BM25) fails when:

  • The query and document use different words for the same concept (synonymy)
  • Cross-lingual retrieval is needed (query in English, document in French)
  • The query is a natural language question and documents are statements
  • Understanding is needed beyond vocabulary overlap

The key insight: These failure modes are largely complementary. A hybrid system that combines both scores will outperform either approach in isolation on the vast majority of real query distributions - because queries in the real world include both natural language and exact-match lookups.

Sparse Retrieval: BM25 Deep Dive

BM25 (Best Match 25) is the dominant sparse retrieval algorithm, developed by Robertson et al. in the 1990s as a probabilistic extension of TF-IDF. It's the algorithm behind Elasticsearch's default search, Apache Solr, and Apache Lucene.

The BM25 score for document dd given query qq:

BM25(d,q)=tqIDF(t)f(t,d)(k1+1)f(t,d)+k1(1b+bdavgdl)\text{BM25}(d, q) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t,d) \cdot (k_1 + 1)}{f(t,d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})}

Where:

  • f(t,d)f(t,d): frequency of term tt in document dd (TF)
  • d|d|: length of document dd in words
  • avgdl\text{avgdl}: average document length in the corpus
  • k1[1.2,2.0]k_1 \in [1.2, 2.0]: term frequency saturation parameter - limits the impact of very frequent terms
  • b[0.75]b \in [0.75]: length normalization parameter - penalizes long documents for having more opportunities to match

Why BM25 beats raw TF-IDF:

  1. Saturation (k1k_1): TF-IDF gives linearly increasing score as term frequency increases. A document mentioning "diabetes" 100 times should not score 100x higher than one mentioning it once. BM25's saturation function asymptotes - doubling term frequency gives diminishing score increase.

  2. Length normalization (bb): A 10,000-word document has more opportunities to contain a query term than a 100-word document. BM25 penalizes longer documents, making short precise documents competitive. b=0.75b=0.75 is the empirical standard.

  3. IDF: Rare terms get higher weight. "The" appearing in a document is uninformative. "anisocytosis" appearing in a document is very informative for a medical query containing that term.

# BM25 implementation from scratch to understand the internals
import math
from collections import Counter
from typing import List

class BM25:
def __init__(self, corpus: List[List[str]], k1: float = 1.5, b: float = 0.75):
self.k1 = k1
self.b = b
self.corpus = corpus
self.N = len(corpus)
self.avgdl = sum(len(doc) for doc in corpus) / self.N
self.df = self._compute_df()
self.idf = self._compute_idf()

def _compute_df(self) -> dict:
"""Document frequency: how many documents contain each term."""
df = {}
for doc in self.corpus:
for term in set(doc):
df[term] = df.get(term, 0) + 1
return df

def _compute_idf(self) -> dict:
"""IDF: log((N - df + 0.5) / (df + 0.5) + 1)"""
idf = {}
for term, df in self.df.items():
idf[term] = math.log((self.N - df + 0.5) / (df + 0.5) + 1)
return idf

def score(self, query_terms: List[str], doc_terms: List[str]) -> float:
tf = Counter(doc_terms)
doc_len = len(doc_terms)
score = 0.0
for term in query_terms:
if term not in self.idf:
continue
tf_term = tf.get(term, 0)
# BM25 formula
numerator = tf_term * (self.k1 + 1)
denominator = tf_term + self.k1 * (1 - self.b + self.b * doc_len / self.avgdl)
score += self.idf[term] * (numerator / denominator)
return score

def search(self, query: str, top_k: int = 10) -> List[tuple]:
query_terms = query.lower().split()
scores = [
(i, self.score(query_terms, doc))
for i, doc in enumerate(self.corpus)
]
scores.sort(key=lambda x: x[1], reverse=True)
return scores[:top_k]


# Real usage: use rank_bm25 or elasticsearch for production
from rank_bm25 import BM25Okapi

corpus_texts = [
"Home equity loans have fixed interest rates between 7-9% APR",
"ICD-10 code E11.65 refers to type 2 diabetes with hyperglycemia",
"Federal Reserve rate decisions affect mortgage costs",
"HELOC rates are variable and tied to prime rate",
]

# Tokenize - for production, use a proper tokenizer with stemming/lemmatization
tokenized_corpus = [text.lower().split() for text in corpus_texts]
bm25 = BM25Okapi(tokenized_corpus)

# Keyword query - BM25 handles this well
query = "ICD-10 E11.65"
scores = bm25.get_scores(query.lower().split())
top_indices = scores.argsort()[::-1][:3]
for i in top_indices:
print(f"Score: {scores[i]:.3f} | {corpus_texts[i]}")

SPLADE: Learned Sparse Representations

BM25 uses exact vocabulary matching - a document about "automobiles" gets no credit for a query about "cars." SPLADE (Sparse Lexical and Expansion Model, Formal et al. 2021) learns sparse representations that expand vocabulary using a pretrained language model.

SPLADE produces sparse vectors where each dimension corresponds to a vocabulary token (typically 30,000+ dimensions for BERT's vocabulary). Most dimensions are zero; active dimensions represent terms the model predicts as relevant based on semantic understanding.

The key difference from BM25: SPLADE learns that a document about "automobiles" should have a non-zero weight at the "cars" vocabulary dimension, even if the word "cars" doesn't appear. It's vocabulary expansion driven by BERT's language understanding.

SPLADE consistently outperforms BM25 on BEIR and MS-MARCO benchmarks while maintaining sparsity (roughly 300-500 active dimensions per 30K-vocab vector vs full-density dense vectors).

from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch
import numpy as np

# SPLADE model (must download weights)
model_name = "naver/splade-v3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

def splade_encode(text: str) -> dict:
"""
Returns sparse vector as {token_id: weight} dict.
Only non-zero weights are stored (sparse representation).
"""
tokens = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
output = model(**tokens)
# SPLADE uses ReLU on log(1 + MLM logits), then max-pool over tokens
logits = output.logits # (1, seq_len, vocab_size)
activations = torch.log(1 + torch.relu(logits))
sparse_vec = activations.max(dim=1).values.squeeze() # (vocab_size,)

# Convert to sparse representation (only store non-zero weights)
non_zero_ids = sparse_vec.nonzero(as_tuple=True)[0]
return {
tokenizer.convert_ids_to_tokens([int(i)])[0]: float(sparse_vec[i])
for i in non_zero_ids
}

query_sparse = splade_encode("car repair cost")
# May contain: {'car': 3.2, 'automobile': 1.8, 'vehicle': 2.1, 'repair': 4.1, ...}
# Note: 'automobile' and 'vehicle' appear even though they weren't in the query

For production deployment, SPLADE sparse vectors are stored in an inverted index (like Elasticsearch or Qdrant's sparse vector support), enabling efficient sparse retrieval at scale.

Fusion Methods: Combining Dense and Sparse Scores

Once you have ranked lists from dense and sparse retrievers, you need to combine them. Three main approaches:

1. Reciprocal Rank Fusion (RRF)

The model-free approach. Scores are based on rank position, not raw values - no need to normalize different score scales.

from collections import defaultdict
from typing import List, Dict, Tuple

def rrf_fusion(
ranked_lists: List[List[str]],
k: int = 60,
) -> List[Tuple[str, float]]:
scores: Dict[str, float] = defaultdict(float)
for ranked_list in ranked_lists:
for rank, doc_id in enumerate(ranked_list):
scores[doc_id] += 1.0 / (k + rank + 1)
return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Example
bm25_results = ["doc_A", "doc_C", "doc_B", "doc_E", "doc_D"]
dense_results = ["doc_B", "doc_A", "doc_D", "doc_C", "doc_F"]
fused = rrf_fusion([bm25_results, dense_results])

2. Weighted Score Normalization

Normalize both score distributions to [0,1], then combine with weights:

hybrid_score(d)=αnormalized_dense(d)+(1α)normalized_sparse(d)\text{hybrid\_score}(d) = \alpha \cdot \text{normalized\_dense}(d) + (1 - \alpha) \cdot \text{normalized\_sparse}(d)

import numpy as np

def normalize_scores(scores: List[float]) -> List[float]:
"""Min-max normalize to [0, 1]."""
min_s, max_s = min(scores), max(scores)
if max_s == min_s:
return [0.5] * len(scores)
return [(s - min_s) / (max_s - min_s) for s in scores]

def weighted_hybrid_fusion(
dense_results: List[Tuple[str, float]], # (doc_id, dense_score) pairs
sparse_results: List[Tuple[str, float]], # (doc_id, bm25_score) pairs
alpha: float = 0.5, # weight for dense; (1-alpha) for sparse
top_k: int = 10,
) -> List[Tuple[str, float]]:
# Build score maps
dense_map = dict(dense_results)
sparse_map = dict(sparse_results)

# All unique document IDs
all_docs = set(dense_map.keys()) | set(sparse_map.keys())

# Normalize dense scores
dense_vals = [dense_map.get(d, 0.0) for d in all_docs]
norm_dense = dict(zip(all_docs, normalize_scores(dense_vals)))

# Normalize sparse scores
sparse_vals = [sparse_map.get(d, 0.0) for d in all_docs]
norm_sparse = dict(zip(all_docs, normalize_scores(sparse_vals)))

# Combine
combined = [
(doc, alpha * norm_dense[doc] + (1 - alpha) * norm_sparse[doc])
for doc in all_docs
]
return sorted(combined, key=lambda x: x[1], reverse=True)[:top_k]

Tuning alpha: Start at α=0.5\alpha = 0.5 (equal weight). Tune on a held-out evaluation set. For queries that are mostly natural language, α=0.7\alpha = 0.7 (more dense) typically works better. For technical/exact-match heavy workloads, α=0.3\alpha = 0.3 (more sparse) is better. Many teams run A/B tests to find the optimal alpha for their specific query distribution.

3. Learned Fusion (Re-ranking with CombSUM)

Train a lightweight model (linear combination, gradient boosted tree, or small neural network) to weight the scores optimally, given labeled training data. This requires labeled query-document relevance pairs but can outperform fixed-weight fusion.

Full Hybrid Search Pipeline

from rank_bm25 import BM25Okapi
from openai import OpenAI
import numpy as np
from typing import List, Dict, Any, Tuple

client = OpenAI()

class HybridSearchEngine:
def __init__(self, documents: List[Dict[str, Any]], alpha: float = 0.5):
"""
documents: list of {'id': str, 'text': str} dicts
alpha: weight for dense retrieval (0 = pure BM25, 1 = pure dense)
"""
self.documents = documents
self.alpha = alpha
self.doc_ids = [d["id"] for d in documents]
self.doc_texts = [d["text"] for d in documents]

# Build BM25 index
tokenized = [text.lower().split() for text in self.doc_texts]
self.bm25 = BM25Okapi(tokenized)

# Build dense index
print(f"Embedding {len(documents)} documents...")
self.embeddings = self._embed_batch(self.doc_texts)
print("Ready.")

def _embed_batch(self, texts: List[str]) -> np.ndarray:
response = client.embeddings.create(
model="text-embedding-3-small",
input=texts,
)
vecs = np.array([r.embedding for r in response.data], dtype=np.float32)
# L2 normalize for cosine similarity via dot product
norms = np.linalg.norm(vecs, axis=1, keepdims=True)
return vecs / norms

def _embed_query(self, query: str) -> np.ndarray:
response = client.embeddings.create(
model="text-embedding-3-small",
input=query,
)
vec = np.array(response.data[0].embedding, dtype=np.float32)
return vec / np.linalg.norm(vec)

def _normalize(self, scores: np.ndarray) -> np.ndarray:
min_s, max_s = scores.min(), scores.max()
if max_s == min_s:
return np.ones_like(scores) * 0.5
return (scores - min_s) / (max_s - min_s)

def search(self, query: str, top_k: int = 10) -> List[Dict]:
# BM25 scores
bm25_scores = np.array(
self.bm25.get_scores(query.lower().split()),
dtype=np.float32
)

# Dense scores
q_emb = self._embed_query(query)
dense_scores = self.embeddings @ q_emb # dot product = cosine sim (normalized)

# Normalize and combine
norm_bm25 = self._normalize(bm25_scores)
norm_dense = self._normalize(dense_scores)
hybrid_scores = self.alpha * norm_dense + (1 - self.alpha) * norm_bm25

# Get top-k
top_indices = np.argsort(hybrid_scores)[::-1][:top_k]

return [
{
"id": self.doc_ids[i],
"text": self.doc_texts[i],
"hybrid_score": float(hybrid_scores[i]),
"dense_score": float(dense_scores[i]),
"bm25_score": float(bm25_scores[i]),
}
for i in top_indices
]


# Demo
documents = [
{"id": "d1", "text": "Home equity loans offer fixed rates between 7-9% APR."},
{"id": "d2", "text": "ICD-10 code E11.65 is type 2 diabetes mellitus with hyperglycemia."},
{"id": "d3", "text": "The Federal Reserve raised the federal funds rate by 75 basis points."},
{"id": "d4", "text": "HELOC interest rates are variable and tied to the prime lending rate."},
{"id": "d5", "text": "Borrowers should compare APR across multiple lenders before committing."},
]

engine = HybridSearchEngine(documents, alpha=0.5)

# Natural language query - dense should dominate
print("\nQuery: 'what are typical home loan rates'")
for r in engine.search("what are typical home loan rates", top_k=3):
print(f" [{r['hybrid_score']:.3f}] {r['text'][:60]}")

# Exact match query - BM25 should dominate
print("\nQuery: 'ICD-10 E11.65'")
for r in engine.search("ICD-10 E11.65", top_k=3):
print(f" [{r['hybrid_score']:.3f}] {r['text'][:60]}")

Weaviate Hybrid Search (Native Implementation)

Weaviate has native hybrid search that combines BM25 and vector search at the database layer - the most production-ready approach for hybrid RAG.

import weaviate
from weaviate.classes.init import Auth
from weaviate.classes.query import HybridFusion

client = weaviate.connect_to_weaviate_cloud(
cluster_url="https://your-cluster.weaviate.network",
auth_credentials=Auth.api_key("your-api-key"),
)

collection = client.collections.get("Documents")

# Hybrid search with native BM25 + vector fusion
results = collection.query.hybrid(
query="ICD-10 E11.65 diabetes",
alpha=0.5, # 0.0 = pure BM25, 1.0 = pure vector
fusion_type=HybridFusion.RELATIVE_SCORE, # or RANKED (RRF)
limit=10,
return_properties=["text", "source"],
)

for obj in results.objects:
print(f"{obj.properties['text'][:80]}")

Elasticsearch + BM25: When the Classic Stack Wins

For organizations already running Elasticsearch, adding dense retrieval via the knn parameter creates a production-ready hybrid stack:

from elasticsearch import Elasticsearch
import json

es = Elasticsearch("http://localhost:9200")

# Create index with both text and vector fields
mapping = {
"mappings": {
"properties": {
"text": {"type": "text"}, # BM25 analyzed field
"embedding": { # Dense vector field
"type": "dense_vector",
"dims": 1536,
"index": True,
"similarity": "cosine",
},
"doc_type": {"type": "keyword"},
}
}
}
es.indices.create(index="documents", body=mapping)

# Hybrid search query
def es_hybrid_search(query_text: str, query_vector: list, top_k: int = 10):
return es.search(
index="documents",
body={
"query": {
"bool": {
"should": [
# BM25 text match
{"match": {"text": query_text}},
]
}
},
"knn": { # Dense vector search
"field": "embedding",
"query_vector": query_vector,
"k": top_k * 2, # fetch more candidates for merging
"num_candidates": top_k * 10,
},
"rank": {
"rrf": {"window_size": top_k * 2, "rank_constant": 60}
},
"size": top_k,
}
)

When Does Hybrid Beat Pure Dense?

Benchmarks consistently show hybrid outperforms pure dense on these query types:

Query TypeExampleDenseBM25Hybrid
Natural language"how do plants make food"0.890.710.91
Short keyword"photosynthesis"0.820.850.88
Exact identifiers"ICD-10 E11.65"0.310.980.97
Technical terms"ECONNREFUSED error"0.550.910.93
Typos/misspellings"diabetis treatment"0.780.120.75
Cross-lingual"hypertension (Fr→En)"0.720.050.71

(Recall@5 values, illustrative based on published benchmarks)

Key takeaway: Hybrid wins or ties on almost every query type. The cases where hybrid loses are rare (cross-lingual is the main one, where the sparse component may hurt). For most production RAG systems, hybrid search is the right default.

Production Engineering Notes

Tokenization for BM25: The default whitespace tokenizer misses important preprocessing. Use stemming (reduce "running" to "run"), stop word removal, and lowercasing. For production, integrate with a proper NLP tokenizer (SpaCy, NLTK, Hugging Face tokenizers).

Elasticsearch vs dedicated vector DBs: Elasticsearch now supports dense_vector with HNSW indexing and native RRF fusion - a legitimate production choice if you already operate the Elastic stack. For new deployments starting from scratch, Qdrant with sparse vector support or Weaviate's native hybrid is typically faster to iterate on.

Query-time alpha adaptation: Advanced systems dynamically adjust alpha based on query characteristics. If the query contains exact identifiers (detected via regex for product codes, ICD codes, etc.), increase the BM25 weight. If the query is a full natural language sentence, increase dense weight. This query classification adds a few milliseconds but can significantly improve retrieval quality.

Qdrant sparse vectors: Qdrant 1.7+ supports sparse vectors natively, enabling BM25-equivalent retrieval alongside dense in the same database. No Elasticsearch needed:

from qdrant_client.http.models import (
SparseVector, SparseVectorParams, NamedSparseVector,
)

# Encode query as sparse vector (using BM25 or SPLADE weights)
sparse_query = SparseVector(
indices=[101, 4521, 8832], # vocabulary token IDs with non-zero weights
values=[2.3, 1.8, 3.1],
)

# Hybrid query in Qdrant
results = client.query_points(
collection_name="documents",
prefetch=[
# Dense search
models.Prefetch(query=dense_query_vector, using="dense", limit=50),
# Sparse search
models.Prefetch(query=models.SparseVector(**sparse_query.dict()), using="sparse", limit=50),
],
query=models.FusionQuery(fusion=models.Fusion.RRF), # RRF fusion
limit=10,
)

Common Mistakes

:::danger Not Normalizing Scores Before Weighted Fusion BM25 scores are in an arbitrary scale (depends on corpus size, document length distribution). Cosine similarity scores are in [-1, 1]. Summing them without normalization gives BM25 scores orders of magnitude more weight by default. Always normalize both distributions to [0, 1] before combining with weighted fusion. RRF avoids this problem entirely by using rank positions instead of raw scores. :::

:::warning Assuming Hybrid Always Wins Hybrid retrieval adds complexity: two retrieval systems, normalization, fusion logic, two index maintenance burdens. Before committing to hybrid, measure whether your query distribution actually has significant sparse-search failures. If 95% of your queries are natural language and your dense model is domain-appropriate, hybrid may add 5-10% latency with 1-2% quality gain - possibly not worth the engineering cost. :::

:::warning Running BM25 on Raw Text Without Preprocessing BM25 on raw text without stemming and stop word removal is significantly weaker than properly preprocessed BM25. "Running" and "run" are different terms. "The", "a", "is" are noise. Use a tokenizer that handles these. For English: NLTK's PorterStemmer + a stop word list. For multilingual: language-specific stemmers or a lightweight NLP model. :::

Interview Questions and Answers

Q: Explain the BM25 formula and how it improves on TF-IDF.

A: BM25 improves TF-IDF in two key ways. First, term frequency saturation: TF-IDF uses raw term frequency, so a document mentioning "diabetes" 100 times scores 100x higher than one mentioning it once. BM25 applies a saturation function - f(t,d)(k1+1)f(t,d)+k1\frac{f(t,d) \cdot (k_1+1)}{f(t,d) + k_1} - that asymptotes as frequency increases. The k1k_1 parameter (typically 1.5) controls how quickly it saturates. Second, length normalization: BM25 normalizes document scores by length. A 10,000-word document has more opportunities to contain query terms than a 500-word document. The bb parameter (typically 0.75) controls length normalization strength. Together, these make BM25 substantially better at ranking short, precise passages above long, tangentially-relevant documents - which is exactly what passage retrieval for RAG requires.

Q: When does hybrid search outperform pure dense retrieval and why?

A: Hybrid outperforms pure dense in two main scenarios. First, exact-match queries: product SKUs, error codes, medical codes, API identifiers, regulation section numbers. The embedding model may not have seen these in training data, or may not have learned the precise structure of identifier spaces. BM25 retrieves exact string matches reliably. Second, rare technical vocabulary: emerging terminology, proprietary product names, company-internal abbreviations. If the term appears in documents but wasn't in embedding training data, its embedding is essentially random noise. BM25 retrieves by exact vocabulary match regardless of training data coverage. The practical rule: if your users will type any exact identifiers, product names, or technical codes, you need hybrid. If your query distribution is 100% natural language with common vocabulary, pure dense may suffice.

Q: What is SPLADE and how does it differ from BM25?

A: BM25 is a bag-of-words model - it only sees exact vocabulary matches. If the query says "car" and the document says "automobile," BM25 gives zero credit. SPLADE (Sparse Lexical and Expansion Model) uses a pretrained language model (BERT) to produce sparse vectors where non-zero weights correspond to vocabulary tokens - including tokens that are semantically related to the text but don't appear literally. A SPLADE representation of "automobile" might have high weights at vocabulary tokens for "car," "vehicle," "auto," "transportation." This makes SPLADE both exact-match capable (like BM25) and semantically aware (like dense retrieval). SPLADE consistently outperforms BM25 on BEIR benchmarks while maintaining the efficiency of sparse retrieval. The trade-off: SPLADE requires BERT-scale inference for encoding, while BM25 is a simple counting formula. For quality-first systems where you can afford GPU inference, SPLADE is a strong upgrade to the sparse retrieval component.

Q: How do you choose the alpha parameter for hybrid fusion?

A: Alpha is the weight given to dense retrieval (1-alpha goes to sparse). The right value depends on your query distribution. Empirically: alpha=0.5 is a reasonable default. To optimize: collect 100+ representative queries with ground truth (the relevant document for each query). Run both dense and sparse retrieval independently. Try alpha values from 0.3 to 0.7 in steps of 0.1. Measure Recall@5 or NDCG@10 at each alpha. Pick the alpha that maximizes your target metric. For natural-language-heavy systems (customer support, Q&A), alpha typically settles around 0.6-0.7. For technical systems with many exact-match queries, 0.3-0.4. If you have distinct query types (some natural language, some exact match), consider query classification to dynamically select alpha - classify the query, then apply the appropriate alpha for that type.

Q: Describe how you would add hybrid search to a system currently using only dense retrieval.

A: Three steps. First, add BM25 infrastructure: either add Elasticsearch/OpenSearch alongside your vector DB, or use Qdrant's native sparse vector support, or use a Python BM25 library (rank_bm25) for smaller scale. You need the same documents indexed in both systems. Second, implement retrieval fusion: query both systems for each incoming query, normalize scores (min-max or z-score), combine with weighted sum or RRF. RRF is safer to start with - it doesn't require score normalization and has been shown to be robust. Third, evaluate and tune: run your existing eval set, compare hybrid vs dense Recall@5 or NDCG@10. Tune alpha (or RRF k parameter). If hybrid doesn't improve over dense on your eval set, your query distribution may not need it. If it does improve, measure the latency increase (typically 50-100ms for BM25 on a moderate corpus) and decide if the quality gain is worth it.

The optimal alpha for hybrid search depends on query type. A production system can classify queries on the fly and adapt the fusion weight:

import re
from openai import OpenAI
from typing import Literal

client = OpenAI()

# Patterns that strongly suggest exact-match needs
EXACT_MATCH_PATTERNS = [
r'\b[A-Z]{2,}-\d+(?:\.\d+)*\b', # ICD codes, JIRA tickets: E11.65, PROJ-123
r'\b[A-Z0-9]{3,}-[A-Z0-9]{2,}\b', # SKUs, model numbers: SKU-A4B2
r'v\d+\.\d+(?:\.\d+)?', # Version strings: v3.11.2
r'HTTP [1-5]\d{2}', # HTTP status codes
r'RFC \d{4}', # RFC numbers
r'§\s*\d+', # Legal section numbers
r'[A-Z]+:\s*\w+Error', # Error codes: TypeError, ECONNREFUSED
]

def classify_query_type(query: str) -> dict:
"""
Classify query as natural language, exact match, or mixed.
Returns recommended alpha for hybrid fusion.
"""
# Check for exact-match patterns
exact_match_signals = sum(
1 for pattern in EXACT_MATCH_PATTERNS
if re.search(pattern, query)
)

# Heuristics
word_count = len(query.split())
is_question = query.strip().endswith("?") or query.lower().startswith(("what", "how", "why", "when", "where", "who"))
has_quotes = '"' in query or "'" in query

if exact_match_signals >= 2:
return {"type": "exact_match", "alpha": 0.2, "reason": "Contains identifier patterns"}
elif exact_match_signals == 1 and not is_question:
return {"type": "mixed", "alpha": 0.4, "reason": "Contains some identifier patterns"}
elif is_question and word_count >= 5:
return {"type": "natural_language", "alpha": 0.7, "reason": "Natural language question"}
elif word_count <= 3:
return {"type": "keyword", "alpha": 0.5, "reason": "Short keyword query"}
else:
return {"type": "natural_language", "alpha": 0.6, "reason": "Default natural language"}


def adaptive_hybrid_search(
query: str,
engine: "HybridSearchEngine",
top_k: int = 10,
) -> list:
"""Hybrid search with adaptive alpha based on query type."""
classification = classify_query_type(query)
alpha = classification["alpha"]
print(f"Query type: {classification['type']} (alpha={alpha})")

engine.alpha = alpha
return engine.search(query, top_k=top_k)


# Test the classifier
test_queries = [
"What is the refund policy for unused items?", # natural language → alpha=0.7
"ICD-10 E11.65", # exact match → alpha=0.2
"Python v3.11.2 ECONNREFUSED error", # mixed → alpha=0.4
"diabetes", # keyword → alpha=0.5
]

for q in test_queries:
result = classify_query_type(q)
print(f"'{q[:40]}' → {result['type']} (alpha={result['alpha']})")

Sparse Index Maintenance

BM25 indexes need maintenance as your corpus grows:

from rank_bm25 import BM25Okapi
import pickle
from pathlib import Path
from typing import List, Dict

class PersistentBM25Index:
"""BM25 index with persistence and incremental updates."""

def __init__(self, storage_path: str = "./bm25_index"):
self.storage_path = Path(storage_path)
self.storage_path.mkdir(exist_ok=True)
self.documents: List[str] = []
self.doc_ids: List[str] = []
self.bm25: BM25Okapi = None

def _tokenize(self, text: str) -> List[str]:
"""Simple tokenizer with lowercasing."""
import re
# Lowercase, split on non-alphanumeric, filter short tokens
tokens = re.findall(r'\b[a-z0-9]{2,}\b', text.lower())
return tokens

def add_documents(self, documents: List[Dict[str, str]]):
"""Add documents and rebuild index."""
for doc in documents:
self.documents.append(doc["text"])
self.doc_ids.append(doc["id"])

tokenized = [self._tokenize(text) for text in self.documents]
self.bm25 = BM25Okapi(tokenized)
print(f"BM25 index rebuilt with {len(self.documents)} documents")

def search(self, query: str, top_k: int = 20) -> List[Dict]:
"""Search and return (doc_id, text, score) results."""
if not self.bm25:
return []
tokens = self._tokenize(query)
scores = self.bm25.get_scores(tokens)
top_indices = scores.argsort()[::-1][:top_k]
return [
{"id": self.doc_ids[i], "text": self.documents[i], "score": float(scores[i])}
for i in top_indices
if scores[i] > 0 # Exclude zero-score documents
]

def save(self):
with open(self.storage_path / "bm25_index.pkl", "wb") as f:
pickle.dump({
"documents": self.documents,
"doc_ids": self.doc_ids,
"bm25": self.bm25,
}, f)
print(f"BM25 index saved ({len(self.documents)} documents)")

def load(self):
index_path = self.storage_path / "bm25_index.pkl"
if not index_path.exists():
print("No saved BM25 index found")
return
with open(index_path, "rb") as f:
data = pickle.load(f)
self.documents = data["documents"]
self.doc_ids = data["doc_ids"]
self.bm25 = data["bm25"]
print(f"BM25 index loaded ({len(self.documents)} documents)")


# Usage
bm25_index = PersistentBM25Index()
bm25_index.add_documents([
{"id": "d1", "text": "Refund policy allows returns within 30 days."},
{"id": "d2", "text": "ICD-10 E11.65 is type 2 diabetes with hyperglycemia."},
])
bm25_index.save()

results = bm25_index.search("diabetes ICD E11.65", top_k=5)
for r in results:
print(f"[{r['score']:.2f}] {r['text'][:60]}")

Hybrid search is not always necessary. The decision comes down to your query distribution:

Profile your queries first. Take 200 representative queries from your system (or hypothesize them if pre-launch). Manually classify: what fraction contains exact identifiers, product codes, technical terms, or rare vocabulary? If it's over 20%, hybrid will likely improve your system meaningfully.

Measure, don't assume. Run your existing eval set with pure dense and hybrid retrieval. If hybrid improves Recall@5 by more than 3 percentage points, the engineering overhead is probably worth it. If the improvement is under 2 percentage points, save the complexity.

Start with RRF. RRF is model-free, requires no normalization, and consistently outperforms naive weighted sum. Use it as your default fusion method. Only switch to learned fusion if you have labeled training data and clear evidence that the fixed weights of RRF are suboptimal.

The systems that benefit most from hybrid search: enterprise knowledge bases with mixed document types, technical support systems (error codes, version strings), healthcare and legal applications (standardized codes), and e-commerce product search (SKUs, model numbers).

Production Synchronization: Keeping BM25 and Vector Index in Sync

A critical operational challenge in hybrid search is maintaining consistency between two separate indexes - the BM25 inverted index and the vector database - as documents are added, updated, and deleted. Inconsistency leads to retrieval asymmetry: a document that was deleted from the vector DB still appears in BM25 results (or vice versa), producing confusing hybrid rankings.

import threading
from dataclasses import dataclass
from typing import Optional
import time


@dataclass
class IndexSyncEvent:
operation: str # "add", "update", "delete"
doc_id: str
text: Optional[str]
metadata: Optional[dict]
timestamp: float


class SynchronizedHybridIndex:
"""
Dual-write wrapper ensuring BM25 and vector index stay in sync.
Uses a transaction log for recovery after partial failures.
"""

def __init__(self, bm25_index: "PersistentBM25Index", vector_store):
self.bm25 = bm25_index
self.vector = vector_store
self.sync_log: list[IndexSyncEvent] = []
self._lock = threading.Lock()

def add_document(self, doc_id: str, text: str, metadata: dict) -> None:
"""Atomically add to both indexes."""
event = IndexSyncEvent("add", doc_id, text, metadata, time.time())

with self._lock:
try:
# Write to transaction log first (for recovery)
self.sync_log.append(event)

# Add to BM25
self.bm25.add_documents([{"id": doc_id, "text": text, **metadata}])

# Add to vector index
embedding = self.vector.embedder.embed([text])[0]
self.vector.upsert(doc_id, embedding.tolist(), {"text": text, **metadata})

except Exception as e:
# Mark event as failed for retry
event.operation = f"FAILED:{event.operation}"
raise RuntimeError(f"Hybrid index sync failed for {doc_id}: {e}") from e

def delete_document(self, doc_id: str) -> None:
"""Delete from both indexes atomically."""
with self._lock:
# BM25 deletion (rebuild index without this doc)
self.bm25.documents = [d for d in self.bm25.documents if d.get("id") != doc_id]
self.bm25.doc_ids = [d for d in self.bm25.doc_ids if d != doc_id]
self.bm25._rebuild_index() # Costly - batch deletes when possible

# Vector deletion
self.vector.delete(doc_id)

self.sync_log.append(IndexSyncEvent("delete", doc_id, None, None, time.time()))

def hybrid_search(self, query: str, top_k: int = 10, alpha: float = 0.5) -> list[dict]:
"""Thread-safe hybrid search with RRF fusion."""
with self._lock:
bm25_results = self.bm25.search(query, top_k=top_k * 2)
dense_results = self.vector.search(query, top_k=top_k * 2)

# RRF fusion
rrf_scores = {}
k = 60

for rank, result in enumerate(bm25_results):
doc_id = result["id"]
rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + 1 / (k + rank + 1)

for rank, result in enumerate(dense_results):
doc_id = result["id"]
rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + 1 / (k + rank + 1)

sorted_ids = sorted(rrf_scores, key=rrf_scores.get, reverse=True)[:top_k]
return [{"id": doc_id, "rrf_score": rrf_scores[doc_id]} for doc_id in sorted_ids]

:::warning BM25 Index Rebuild Cost Unlike vector databases that support O(1) single-document insertions, most BM25 implementations (rank_bm25, Whoosh) require rebuilding the full index when documents are deleted - because the IDF term weights change. For corpora over 100K documents, batch your deletions: accumulate a "deletion queue" and rebuild the BM25 index once per hour rather than on every delete. For real-time consistency requirements, use Elasticsearch or OpenSearch which support incremental inverted index updates. :::

Interview Questions and Answers

Q: Explain BM25 and how it differs from dense retrieval. What specific query types does each handle better?

A: BM25 (Best Match 25) is a probabilistic sparse retrieval function based on TF-IDF with two refinements: term frequency saturation (a document that mentions "refund" 20 times is not 4x more relevant than one that mentions it 5 times) and document length normalization (shorter documents that mention a term are more relevant than longer documents with the same absolute frequency). It represents documents as sparse vectors over vocabulary space - a 50,000-dimension vector where most entries are zero. Dense retrieval encodes documents into continuous 768-1536 dimension vectors where every dimension captures semantic meaning. BM25 excels at: exact keyword matches, technical identifiers (error codes: KERNEL_SECURITY_CHECK_FAILURE, product codes: SKU-7823-BLK), rare terms with high information content, precise legal or medical terminology. Dense retrieval excels at: semantic similarity despite different vocabulary ("vehicle breakdown" matching "car not starting"), conceptual search, multilingual queries, paraphrased questions. Hybrid search captures both: for "ICD-10 code E11.65 treatment protocols", BM25 handles the exact code match while dense retrieval handles the semantic intent of "treatment protocols."

Q: What is SPLADE and how does it differ from BM25?

A: SPLADE (Sparse Lexical and Expansion Model) uses a BERT encoder with a log-saturation activation to produce learned sparse representations. Like BM25, it produces vectors over vocabulary space (sparse), so it can be indexed with standard inverted index infrastructure. Unlike BM25, it does query and document expansion: when encoding "myocardial infarction," SPLADE may also activate dimensions for "heart attack," "cardiac event," "MI" - terms that appear in relevant documents but not in the query. BM25 fails on vocabulary mismatch; SPLADE explicitly bridges it. The practical tradeoff: SPLADE is 10-50x slower to index than BM25 (requires BERT inference on each document vs. simple token counting), but improves recall substantially on queries where vocabulary mismatch is significant. SPLADE is a strong choice when you want sparse retrieval performance competitive with dense models on semantic tasks, while maintaining the inverted index infrastructure advantages (precise token-level control, very fast lookup for exact terms).

Q: You implement hybrid search with RRF and notice it improves Recall@5 by 4 points on average, but degrades by 3 points on queries containing exact product SKUs. How do you fix this?

A: This is a query distribution problem - RRF's fixed weighting is a poor fit for a mixed query distribution. The fix is query-type-aware fusion. Implement a lightweight classifier that detects whether a query contains exact identifiers (SKUs, error codes, version strings, IDs) using regex patterns. For identifier queries: weight BM25 heavily (alpha=0.9) or skip dense retrieval entirely. For semantic queries (natural language questions): use standard RRF or weight dense retrieval more (alpha=0.2-0.3). The detection is cheap (regex, no model inference) and the improvement is substantial because you're aligning retrieval strategy with what the query actually needs. Implementation: classify_query_type() function returns "identifier" or "semantic" based on patterns, and adaptive_hybrid_search() dispatches to the appropriate weighting. In production, log which branch each query takes and monitor recall@5 segmented by query type to validate the approach.

Q: Your hybrid search system runs BM25 on a corpus of 2 million documents and dense ANN search on the same corpus. Fusion takes 20ms. BM25 retrieval takes 50ms and ANN takes 80ms sequentially. How do you bring total latency under 100ms?

A: Run BM25 and ANN retrieval in parallel. Using Python's asyncio or concurrent.futures.ThreadPoolExecutor, issue both retrievals simultaneously. Total retrieval time becomes max(BM25, ANN) + fusion = max(50ms, 80ms) + 20ms = 100ms. To shave the remaining 10ms: (1) reduce BM25 top-k from 100 to 50 (BM25 is often called with large top-k to compensate for lower precision - if you're fusing with dense anyway, 50 candidates may be sufficient); (2) profile and optimize the fusion step - RRF over 100+100 candidates should take under 5ms in optimized Python; (3) pre-warm BM25 index in memory (rank-bm25 loads from disk on first call - pre-load at service startup). At 2M documents, BM25 in a well-implemented inverted index (Elasticsearch, OpenSearch) should realistically complete in 10-30ms, not 50ms - so check whether your BM25 implementation is suboptimal before optimizing elsewhere.

Q: How would you evaluate whether adding hybrid search to your existing dense RAG system is worth the engineering effort?

A: A/B evaluation on your own eval set - nothing else is reliable. The process: (1) Sample 200+ queries from your production logs that cover your query distribution. (2) Create a golden relevance set: for each query, manually mark which documents in your corpus are relevant (or use LLM labeling for scale). (3) Run your existing dense-only system and measure Recall@5 and NDCG@5 on the golden set. (4) Run hybrid search (dense + BM25 with RRF) on the same queries and measure the same metrics. (5) Compute the improvement delta: if Recall@5 improves by 3+ percentage points, hybrid is likely worth it; if improvement is under 2 percentage points, the complexity probably isn't justified. Also segment by query type: if hybrid helps 30% of your queries but hurts 5%, check whether the hurt queries are rare edge cases or common patterns. The engineering cost of hybrid is real: maintain a BM25 index alongside your vector DB, handle synchronization on document updates, manage two retrieval paths in your code. Require a clear measured improvement before accepting this complexity.

Summary

Hybrid search combines the exact-match precision of BM25 with the semantic recall of dense retrieval. The most important insight: these two retrieval paradigms fail in complementary ways - dense fails on exact identifiers, BM25 fails on semantic paraphrases - so combining them addresses each other's weaknesses. RRF is the right default fusion strategy: rank-based, no normalization required, robust across diverse query types. Add adaptive weighting (query-type-aware alpha) when your query distribution contains both exact-identifier and semantic sub-populations. Evaluate before building: if 80% of your queries are semantic natural language questions with minimal exact-identifier content, pure dense retrieval may already achieve Recall@5 above 0.90, making hybrid an unnecessary complexity.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the RAG Pipeline demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.