What is hybrid search?

How to combine BM25 sparse retrieval with dense vector search using Reciprocal Rank Fusion, and how to apply cross-encoder reranking for precision that neither method achieves alone.

Hybrid Search and Reranking

Reading time: 45–55 minutes

Interview relevance: Very High - hybrid search and reranking are the distinguishing features of production-grade RAG systems versus toy demos

Target roles: AI Engineer, ML Engineer, Backend Engineer (AI), Search Engineer, MLOps Engineer

The Code Search That Broke at Midnight

The team had built a RAG system for their internal developer platform. It was excellent. Engineers could ask "How does our authentication system work?" and get a clear, well-cited answer synthesized from their documentation, architecture decision records, and code comments. The demo went well. The team shipped it.

One week later, a senior engineer filed a high-priority bug: "The search is useless for code lookups." She had been trying to find all call sites of a function called validate_jwt_token. She tried: "find validate_jwt_token usage", "where is validate_jwt_token called", "validate_jwt_token function references". Every query returned documentation about JWT authentication in general - high-quality semantic answers to conceptual questions she had not asked. Zero results showed the actual function or its call sites.

The root cause was architectural: the system used pure dense retrieval. Embedding models excel at semantic similarity - they understand that "car" and "automobile" mean the same thing, that "authentication" and "login" are related concepts. But they struggle with exact string matching. The string validate_jwt_token is not semantically similar to anything. It is a specific identifier. No amount of training on English text teaches an embedding model to prefer chunks containing validate_jwt_token over chunks discussing JWT validation in general.

Meanwhile, BM25 - the traditional keyword-based search algorithm - would have solved this trivially. BM25 looks for exact term matches and their frequency. A chunk containing validate_jwt_token three times scores extremely high for the query "validate_jwt_token". It is blind to semantic meaning, but perfect for exact identifier lookup.

The solution was hybrid search: run BM25 (sparse retrieval) and dense retrieval in parallel, then combine their results using Reciprocal Rank Fusion (RRF). For conceptual questions ("how does authentication work?"), the dense ranking carries most of the useful signal. For exact identifier queries ("find validate_jwt_token"), the BM25 ranking does. RRF does not know which retriever is right for a given query: it treats both rankings equally, so a document surfaces if it ranks highly in either list and rises to the top if it ranks well in both. Neither retriever is universally better - but their combination is almost always better than either alone.

Then, to push quality further, the team added a cross-encoder reranker: take the top-30 hybrid results, run each (query, document) pair through a cross-encoder model that attends to both simultaneously, and return the top-5 reranked results. The reranker had no speed advantage (it reads every document carefully), but it had accuracy the retrieval stage could never match. Final system: hybrid retrieval for broad recall, cross-encoder reranking for precision. Production retrieval accuracy improved 23%.

This lesson builds that system from scratch.

Why Hybrid Search Exists

The Two Failure Modes of Pure Retrieval

Every retrieval system has characteristic failure modes. Understanding them is the key to knowing when to use hybrid search.

Dense retrieval failures (when embeddings let you down):

Exact identifier queries: validate_jwt_token, ERRCODE_0x4A3F, PR-2847, SKU-78291-B
Rare proper nouns: obscure product names, person names, company names not well-represented in training data
Precise numerical values: "find all records where timeout is 30000ms" - embeddings conflate 30000 and 3000
Negation: "documents that do NOT mention rate limiting" - embeddings struggle with negation
Code syntax: specific API signatures, SQL queries, configuration snippets

Sparse retrieval failures (when BM25 lets you down):

Vocabulary mismatch: user asks about "car maintenance", document discusses "vehicle upkeep"
Paraphrase queries: "how to improve model performance" vs. "techniques for boosting accuracy"
Conceptual questions: "explain the philosophy behind our API design" - no single keyword captures this
Cross-lingual queries: user asks in English, document is in Spanish
Implicit context: "fix the same problem we had last quarter" - no keywords to match

Neither retriever is universally better. The winning strategy is to use both and intelligently merge their results.

The Vocabulary Mismatch Problem

The vocabulary mismatch problem is worth understanding in depth, because it motivated the shift from pure sparse retrieval (which dominated search for 30 years) to the dense retrieval era.

Consider a knowledge base about vehicle maintenance. A user asks: "How do I change the oil in my car?" The relevant document says: "Engine lubricant replacement procedure for automobiles."

BM25 word overlap:

"change" → not in doc
"oil" → not in doc (doc says "lubricant")
"car" → not in doc (doc says "automobile")
Overlap: 0 matching terms → BM25 score: 0

The document is perfectly relevant. BM25 gives it a score of zero because the user used different words than the document.
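The zero-overlap result is easy to check by hand. The following is a minimal sketch, assuming naive lowercase word tokenization; the `tokens` helper is illustrative and not part of the lesson's code.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

query = "How do I change the oil in my car?"
document = "Engine lubricant replacement procedure for automobiles."

# Terms that appear in both the query and the document
shared = tokens(query) & tokens(document)
print(shared)  # empty set: no query term contributes to the BM25 score
```

With no shared terms, every term in the BM25 sum has a frequency of zero, so the total is zero regardless of the IDF weights.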

Dense retrieval solves this: the embedding for "change oil in car" is geometrically close to "engine lubricant replacement for automobile" because the training data contained both phrases and taught the model they mean the same thing.

But when a developer searches for validate_jwt_token, the dense retrieval model has never seen this specific identifier in training. It produces an embedding that is vaguely "authentication-related" - and retrieves generic authentication documentation instead of the exact call sites.

Hybrid search solves both problems simultaneously.

BM25: A Deep Dive

BM25 (Best Matching 25) is a probabilistic relevance ranking function that has been the backbone of search engines since the 1990s. It improves upon the earlier TF-IDF approach with two key innovations: term-frequency saturation and document-length normalization.

TF-IDF Background

TF-IDF (Term Frequency-Inverse Document Frequency) scores a document for a query term as:

$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$

Where:

$\text{TF}(t, d)$ = how many times term $t$ appears in document $d$ (favors documents that mention the term often)
$\text{IDF}(t) = \log\frac{N}{df(t)}$ = log of total documents $N$ divided by documents containing $t$ (penalizes common terms like "the")

TF-IDF has two known problems that BM25 fixes:

TF is unbounded: a document mentioning "python" 100 times scores 10x higher than one mentioning it 10 times. But those documents are probably equally relevant - diminishing returns matter.
Document length normalization is missing: a long document naturally contains more term occurrences. A 10,000-word document containing "python" 10 times is less relevant than a 500-word document containing it 10 times.
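Both problems can be seen directly in the arithmetic. This is a small sketch using raw-count TF and the $\log\frac{N}{df(t)}$ IDF defined above; the corpus statistics (1,000 documents, 50 containing "python") are made-up numbers for illustration.

```python
import math

def tf_idf(term_count: int, n_docs: int, doc_freq: int) -> float:
    """Raw-count TF-IDF: TF(t, d) * log(N / df(t))."""
    return term_count * math.log(n_docs / doc_freq)

# Hypothetical corpus statistics: 1,000 documents, "python" appears in 50
N, DF = 1000, 50

# Problem 1: the score grows linearly with term frequency
score_10 = tf_idf(10, N, DF)
score_100 = tf_idf(100, N, DF)
print(round(score_100 / score_10, 2))  # 10.0

# Problem 2: document length never enters the formula, so a 500-word and a
# 10,000-word document with 10 occurrences each receive the same score
short_doc = tf_idf(10, N, DF)
long_doc = tf_idf(10, N, DF)
print(short_doc == long_doc)  # True
```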

BM25 Formula

$\text{BM25}(d, Q) = \sum_{t \in Q} \text{IDF}(t) \cdot \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}$

Let us parse each component:

$\text{IDF}(t)$ : Inverse document frequency - penalizes common words. "The" appears in every document, so it gets near-zero weight. "validate_jwt_token" appears in few documents, so it gets high weight.

$f(t, d)$ : Raw term frequency - how many times term $t$ appears in document $d$ .

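Saturation factor ($k_1$): The fraction $\frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \cdot (\ldots)}$ grows with term frequency but can never exceed $k_1 + 1$. The $k_1$ parameter controls how quickly it saturates. With $k_1 = 1.5$ and an average-length document, a single occurrence already earns about 41% of the score of 100 occurrences (1.0 versus roughly 2.46) - diminishing returns. At $k_1 = 0$ , term frequency is completely ignored (binary presence/absence). As $k_1 \to \infty$ , the factor grows linearly with term frequency, as in raw TF-IDF. Typical range: 1.2–2.0.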

Length normalization term: $\left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)$ normalizes by document length. $|d|$ is the length of document $d$ . $\text{avgdl}$ is the average document length in the corpus. The $b$ parameter controls how much length normalization is applied. $b = 1$ : full normalization. $b = 0$ : no normalization. Typical value: $b = 0.75$ .
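To make the saturation and length-normalization behaviour concrete, here is a small sketch of the term-frequency part of the formula on its own (everything except IDF). The function name `bm25_tf` and the document lengths are assumptions chosen for illustration.

```python
def bm25_tf(f: float, k1: float = 1.5, b: float = 0.75,
            doc_len: float = 100, avgdl: float = 100) -> float:
    """Term-frequency part of BM25, i.e. the formula without the IDF factor."""
    length_norm = 1 - b + b * (doc_len / avgdl)
    return f * (k1 + 1) / (f + k1 * length_norm)

# Saturation: for an average-length document the value climbs toward
# k1 + 1 = 2.5 but never reaches it
for f in (1, 2, 5, 10, 100):
    print(f, round(bm25_tf(f), 3))

# Length normalization: same term count (10), different document lengths
short = bm25_tf(10, doc_len=50)     # half the average length
long_ = bm25_tf(10, doc_len=2000)   # twenty times the average length
print(round(short, 3), round(long_, 3))

# With b = 0 the document length has no effect at all
print(bm25_tf(10, b=0, doc_len=50) == bm25_tf(10, b=0, doc_len=2000))  # True
```

Going from 10 to 100 occurrences adds far less than going from 1 to 10, and the long document scores well below the short one for the same term count.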

When BM25 Shines

BM25 is the superior choice when:

Exact term queries: function names, error codes, product IDs, model numbers
Technical documentation search: users know the exact terminology
Short, specific queries: "ConnectionTimeout exception" - the user already knows the term they need
When users cannot paraphrase: searching logs, monitoring data, configuration files

Python BM25 from Scratch

import math
from collections import Counter, defaultdict

class BM25:
    """
    BM25 implementation from scratch.
    Shows exactly how the formula works.
    In production: use rank_bm25 (pip install rank-bm25).
    """

    def __init__(self, corpus: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1 = k1
        self.b = b
        self.corpus = corpus
        self.n = len(corpus)  # Number of documents

        # Tokenize all documents
        self.tokenized_corpus = [self._tokenize(doc) for doc in corpus]

        # Per-document term counts (doc_freqs) and lengths (doc_lengths);
        # df counts how many documents contain each term
        self.idf: dict[str, float] = {}
        self.doc_freqs: list[dict[str, int]] = []
        self.doc_lengths: list[int] = []

        df = defaultdict(int)  # Document frequency per term

        for tokens in self.tokenized_corpus:
            token_freq = Counter(tokens)
            self.doc_freqs.append(dict(token_freq))
            self.doc_lengths.append(len(tokens))
            for token in token_freq:
                df[token] += 1

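        # Guard against an empty corpus to avoid division by zero
        self.avgdl = sum(self.doc_lengths) / self.n if self.n else 0.0

        # Compute IDF for each term.
        # Smoothed variant of the Robertson-Sparck Jones IDF:
        # log(1 + (N - df + 0.5) / (df + 0.5)), which stays non-negative
        # even for terms that appear in most documents.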
        for term, freq in df.items():
            self.idf[term] = math.log(
                (self.n - freq + 0.5) / (freq + 0.5) + 1
            )

    def _tokenize(self, text: str) -> list[str]:
        """
        Simple whitespace + lowercase tokenization.
        In production: use NLTK or spaCy for stemming, stopword removal.
        """
        return text.lower().split()

    def score(self, query: str, doc_idx: int) -> float:
        """
        Compute BM25 score for a (query, document) pair.

        This is the core formula: sum over query terms of
        IDF * (f * (k1 + 1)) / (f + k1 * (1 - b + b * dl / avgdl))
        """
        query_tokens = self._tokenize(query)
        doc_freq = self.doc_freqs[doc_idx]
        doc_len = self.doc_lengths[doc_idx]

        score = 0.0
        for token in query_tokens:
            if token not in self.idf:
                continue  # Term not in corpus - contributes 0

            idf = self.idf[token]
            f = doc_freq.get(token, 0)  # Term frequency in this document

            # BM25 term frequency saturation
            numerator = f * (self.k1 + 1)
            denominator = f + self.k1 * (
                1 - self.b + self.b * (doc_len / self.avgdl)
            )

            score += idf * (numerator / denominator)

        return score

    def retrieve(self, query: str, k: int = 10) -> list[tuple[int, float]]:
        """
        Retrieve top-k documents for a query.
        Returns list of (doc_idx, score) pairs sorted by score descending.
        """
        scores = [(i, self.score(query, i)) for i in range(self.n)]
        scores.sort(key=lambda x: x[1], reverse=True)
        # Filter out zero-score results (no term overlap)
        return [(idx, s) for idx, s in scores[:k] if s > 0]

Dense Retrieval: The Complementary Strength

Dense retrieval encodes both queries and documents into the same embedding space and finds nearest neighbors. The key distinction from sparse retrieval: it captures semantic meaning rather than lexical overlap.

Bi-encoder vs. Cross-encoder

Bi-encoder (what embedding models use): The query and document are encoded independently into fixed vectors. Similarity is computed as the dot product or cosine similarity of these two vectors.

query  → encoder → q_vec
doc    → encoder → d_vec
score  = dot_product(q_vec, d_vec)

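Speed: O(n) for a brute-force scan once all document vectors are pre-computed (just one dot product per document), and sublinear with an approximate nearest neighbor (ANN) index such as HNSW. This is why you can search millions of documents in milliseconds - the document embeddings are pre-computed and indexed, and only the query is encoded at search time.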
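The bi-encoder flow can be sketched without a model. The vectors below are hand-written toy values standing in for encoder outputs (a real model would produce hundreds of dimensions), and the document IDs are made up for illustration.

```python
import math

def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# Documents are encoded once, offline (toy vectors, not real embeddings)
doc_vectors = {
    "oil_change_guide": normalize([0.9, 0.1, 0.0]),
    "jwt_auth_overview": normalize([0.0, 0.2, 0.9]),
    "tire_rotation": normalize([0.7, 0.6, 0.1]),
}

# At query time only the query is encoded
query_vector = normalize([1.0, 0.2, 0.0])

# Brute-force search: one dot product per document
ranked = sorted(
    ((doc_id, dot(query_vector, vec)) for doc_id, vec in doc_vectors.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

The query never sees the document text, only its vector, which is what makes pre-computation possible.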

Cross-encoder: The query and document are concatenated and run through a transformer together. Full attention between every query token and every document token.

[query, doc] → transformer → relevance_score

Quality: dramatically better than bi-encoder because it can attend to the specific query terms in the context of the specific document. "Is mention of X relevant to asking about Y?" - the cross-encoder can reason about this. The bi-encoder cannot.

Speed: O(n × model_inference_time) per query. If each inference takes 10ms and you have 1M documents, that is 10,000 seconds. Cross-encoders cannot scale to retrieval - they are only feasible for reranking a small set (10–50 candidates).
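The cost argument is plain arithmetic. A quick sketch, assuming sequential inference at the 10ms per pair quoted above and ignoring batching or GPU parallelism:

```python
def rerank_seconds(num_docs: int, ms_per_pair: float = 10.0) -> float:
    """Sequential cross-encoder time: one forward pass per (query, doc) pair."""
    return num_docs * ms_per_pair / 1000.0

full_corpus = rerank_seconds(1_000_000)  # score every document in the corpus
candidates = rerank_seconds(30)          # score only 30 retrieved candidates

print(f"{full_corpus:,.0f} s (~{full_corpus / 3600:.1f} h)")  # 10,000 s (~2.8 h)
print(f"{candidates:.1f} s")                                  # 0.3 s
```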

The BEIR Benchmark

BEIR (Benchmarking Information Retrieval) is the standard evaluation suite for retrieval models. It tests models across 18 diverse datasets - news, scientific papers, biomedical, code, FAQ, etc.

Key finding: no single retrieval model wins across all domains. Dense models win on paraphrase-heavy tasks. Sparse models win on exact-match tasks. Hybrid models tend to win on average across all tasks. This is exactly why hybrid search became the production standard.

Fusion Algorithms

When you have two ranked lists - one from BM25, one from dense retrieval - how do you combine them into a single ranking?

Reciprocal Rank Fusion (RRF)

RRF was introduced by Cormack, Clarke, and Buettcher in 2009 and has remained the dominant fusion method because of its simplicity, robustness, and empirical effectiveness.

The RRF score for a document $d$ across rankings $R$ is:

$\text{RRF}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}_r(d)}$

Where:

$k$ = smoothing constant (default: 60)
$\text{rank}_r(d)$ = position of document $d$ in ranking $r$ (1-indexed)

If a document does not appear in a ranking, it contributes 0 for that ranking.

Why k=60? The constant $k$ prevents documents ranked first from dominating too strongly. With $k=60$ , the difference between rank 1 and rank 2 is only $\frac{1}{61} - \frac{1}{62} \approx 0.00026$ . Documents that appear in multiple rankings benefit from additive fusion, which creates strong signal for documents that rank well in both retrievers.
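The effect of $k$ is easiest to see by comparing it with a much smaller constant. The sketch below uses $k=1$ purely as a contrast; it is not a recommended setting.

```python
def rrf_term(rank: int, k: int = 60) -> float:
    """Contribution of one ranking to a document's RRF score."""
    return 1.0 / (k + rank)

# How much more is rank 1 worth than rank 2 or rank 10?
for k in (1, 60):
    top, second, tenth = rrf_term(1, k), rrf_term(2, k), rrf_term(10, k)
    print(f"k={k}: rank1/rank2 = {top / second:.3f}, rank1/rank10 = {top / tenth:.3f}")

# With k=60, agreement beats a single top placement: rank 10 in both lists
# outscores rank 1 in only one list
both_lists_rank_10 = 2 * rrf_term(10)
one_list_rank_1 = rrf_term(1)
print(both_lists_rank_10 > one_list_rank_1)  # True
```

With $k=1$ the top result is worth several times a rank-10 result; with $k=60$ the gap shrinks to about 15%, so consensus across retrievers matters more than any single first place.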

Why RRF beats weighted score averaging: Raw scores from BM25 and dense retrieval live on completely different scales. BM25 scores might range from 0 to 50; cosine similarity scores range from -1 to 1. Naively adding them together would let the larger-scale scorer dominate. RRF avoids this entirely by working only with ranks, not scores. Rank 1 in BM25 is worth the same as Rank 1 in dense retrieval, regardless of what the actual scores were.
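A toy comparison shows the scale problem. The raw scores below are invented for illustration; only the difference in magnitude between the two retrievers matters.

```python
# Hypothetical raw scores for the same three documents
bm25_scores = {"doc_a": 42.0, "doc_b": 3.0, "doc_c": 0.5}
cosine_scores = {"doc_a": 0.31, "doc_b": 0.92, "doc_c": 0.35}

# Naive sum: BM25's larger scale decides the order almost on its own
naive = {d: bm25_scores[d] + cosine_scores[d] for d in bm25_scores}
naive_order = sorted(naive, key=naive.get, reverse=True)

def ranks(scores: dict[str, float]) -> dict[str, int]:
    """Map each doc_id to its 1-indexed rank by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {doc_id: i for i, doc_id in enumerate(ordered, 1)}

# RRF: only the rank positions are used, so both retrievers count equally
bm25_ranks, dense_ranks = ranks(bm25_scores), ranks(cosine_scores)
rrf = {d: 1 / (60 + bm25_ranks[d]) + 1 / (60 + dense_ranks[d]) for d in bm25_scores}
rrf_order = sorted(rrf, key=rrf.get, reverse=True)

print(naive_order)  # ['doc_a', 'doc_b', 'doc_c']
print(rrf_order)    # ['doc_b', 'doc_a', 'doc_c']
```

The naive sum simply reproduces the BM25 order. RRF promotes doc_b, which is second in BM25 and first in dense retrieval.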

from collections import defaultdict
from dataclasses import dataclass

@dataclass
class RankedResult:
    doc_id: str
    score: float
    rank: int

def rrf_fusion(
    ranked_lists: list[list[RankedResult]],
    k: int = 60,
) -> list[RankedResult]:
    """
    Reciprocal Rank Fusion over multiple ranked lists.

    Args:
        ranked_lists: Each inner list is a ranked list of results
                      (e.g., [bm25_results, dense_results])
        k: RRF smoothing constant (default 60 per original paper)

    Returns:
        Merged and re-ranked list of results.
    """
    rrf_scores: dict[str, float] = defaultdict(float)

    for ranked_list in ranked_lists:
        for result in ranked_list:
            # Each document contributes 1 / (k + rank) to its RRF score
            rrf_scores[result.doc_id] += 1.0 / (k + result.rank)

    # Sort by RRF score descending
    sorted_docs = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)

    return [
        RankedResult(doc_id=doc_id, score=score, rank=rank)
        for rank, (doc_id, score) in enumerate(sorted_docs, 1)
    ]

RRF example with three documents:

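| Document | BM25 rank | Dense rank | BM25 contribution | Dense contribution | RRF score |
|----------|-----------|------------|-------------------|--------------------|-----------|
| Doc A | 1 | 5 | 1/61 = 0.01639 | 1/65 = 0.01538 | 0.03177 |
| Doc B | 3 | 1 | 1/63 = 0.01587 | 1/61 = 0.01639 | 0.03226 |
| Doc C | 2 | - | 1/62 = 0.01613 | 0 | 0.01613 |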

Doc B wins (rank 1 in dense + rank 3 in BM25), edging out Doc A (rank 1 in BM25 + rank 5 in dense). Doc C, despite rank 2 in BM25, loses because it did not appear in dense retrieval at all.
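The example can be reproduced with a few lines that work on ranks alone. This is a standalone sketch using plain dictionaries rather than the `RankedResult` class; the last printed digit may differ from the table, which adds already-rounded contributions.

```python
from collections import defaultdict

K = 60
# 1-indexed ranks from the example; Doc C is absent from the dense list
bm25_ranks = {"Doc A": 1, "Doc C": 2, "Doc B": 3}
dense_ranks = {"Doc B": 1, "Doc A": 5}

scores: dict[str, float] = defaultdict(float)
for ranking in (bm25_ranks, dense_ranks):
    for doc_id, rank in ranking.items():
        # A document missing from a ranking simply adds nothing for it
        scores[doc_id] += 1.0 / (K + rank)

order = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for doc_id, score in order:
    print(doc_id, f"{score:.5f}")
```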

Weighted Fusion

When you have domain knowledge that one retriever is more reliable for your use case, use weighted fusion:

def weighted_fusion(
    dense_results: list[RankedResult],
    sparse_results: list[RankedResult],
    dense_weight: float = 0.7,
    sparse_weight: float = 0.3,
) -> list[RankedResult]:
    """
    Weighted combination of normalized scores.
    Use when you know which retriever is more reliable for your domain.
    """
    # Normalize scores to [0, 1] range for each retriever
    def normalize(results: list[RankedResult]) -> dict[str, float]:
        if not results:
            return {}
        min_s = min(r.score for r in results)
        max_s = max(r.score for r in results)
        span = max_s - min_s
        if span < 1e-10:
            return {r.doc_id: 1.0 for r in results}
        return {r.doc_id: (r.score - min_s) / span for r in results}

    dense_norm  = normalize(dense_results)
    sparse_norm = normalize(sparse_results)

    all_doc_ids = set(dense_norm) | set(sparse_norm)
    combined: dict[str, float] = {}

    for doc_id in all_doc_ids:
        d_score = dense_norm.get(doc_id, 0.0)
        s_score = sparse_norm.get(doc_id, 0.0)
        combined[doc_id] = dense_weight * d_score + sparse_weight * s_score

    sorted_docs = sorted(combined.items(), key=lambda x: x[1], reverse=True)
    return [
        RankedResult(doc_id=doc_id, score=score, rank=rank)
        for rank, (doc_id, score) in enumerate(sorted_docs, 1)
    ]

When to use weighted fusion over RRF: Use weighted fusion when you have empirical evidence (from your evaluation set) that one retriever is consistently better for your domain. For code search: weight BM25 higher. For cross-lingual search: weight dense higher. When unsure, prefer RRF - it is more robust to score scale differences.
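One property of min-max normalization is worth seeing on real numbers: the lowest-scored result in each list is mapped to 0, the same value an absent document gets. The sketch below uses invented scores and a simplified `fuse` helper (not the lesson's `weighted_fusion`) to show that, and how the weights change the winner.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; a constant list maps to all 1.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi - lo < 1e-10:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

# Hypothetical raw scores from each retriever
bm25 = min_max({"doc_a": 12.0, "doc_b": 8.0, "doc_c": 4.0})
dense = min_max({"doc_b": 0.80, "doc_d": 0.70})
print(bm25)   # doc_c is retrieved but normalizes to 0.0
print(dense)  # doc_d is retrieved but normalizes to 0.0

def fuse(dense_w: float, sparse_w: float) -> list[str]:
    """Weighted sum of normalized scores; ties broken by doc_id."""
    ids = set(bm25) | set(dense)
    combined = {
        d: dense_w * dense.get(d, 0.0) + sparse_w * bm25.get(d, 0.0) for d in ids
    }
    return sorted(combined, key=lambda d: (-combined[d], d))

print(fuse(0.7, 0.3))  # dense-heavy weights put doc_b first
print(fuse(0.3, 0.7))  # sparse-heavy weights put doc_a first
```

Because the weights directly decide the winner, they are best tuned on an evaluation set rather than guessed.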

Cross-Encoder Reranking

Retrieval and ranking are fundamentally different problems. Retrieval must be fast - ideally sublinear in corpus size, via an inverted index or an approximate nearest neighbor index - because it searches millions of documents. Ranking can be slow - O(m) model inferences for the m retrieved candidates - because it only processes tens to hundreds of documents.

Cross-encoder rerankers exploit this separation: use a fast retrieval method (hybrid search) to find m=30–100 candidates, then use a high-quality but slow cross-encoder to precisely rank the top candidates.

How Cross-Encoders Work

A bi-encoder embedding model processes query and document separately, losing the opportunity for query terms to attend to document terms. A cross-encoder processes them together:

Input: [CLS] query tokens [SEP] document tokens [SEP]
Output: a single relevance score (a classification head on the [CLS] representation, often passed through a sigmoid)

Because every query token can attend to every document token through the transformer's attention mechanism, the cross-encoder can reason about nuanced relevance. "Is this document's mention of 'timeout' specifically the kind of timeout the user is asking about?" The bi-encoder cannot answer this. The cross-encoder can.

Reranking Options

Cohere Rerank: The most commonly used managed reranker. rerank-english-v3.0 is their English model, rerank-multilingual-v3.0 for multilingual. Simple API: pass the query and a list of documents, get back ranked results with relevance scores. Charged per search unit; Cohere's pricing documentation defines one search unit as a single query with up to 100 documents to rerank, so check the current pricing page before estimating cost.

ms-marco-MiniLM cross-encoders: Open-source models fine-tuned on the MS MARCO passage retrieval dataset. Available on HuggingFace. The cross-encoder/ms-marco-MiniLM-L-6-v2 model is an excellent balance of speed and quality. Self-hosted on CPU or GPU.

BGE Reranker: BAAI's bge-reranker-large is competitive with Cohere Rerank for many English retrieval tasks. Open-source, self-hostable.

The Two-Stage Pipeline

Retrieve N=30 candidates via hybrid search (fast)
    ↓
Rerank N=30 with cross-encoder (slow but only N documents)
    ↓
Return top k=5 to Claude for generation

This two-stage approach is the state of the art for production RAG retrieval quality. The intuition: hybrid search is good at "probably relevant" (high recall), cross-encoder reranking is good at "definitely relevant" (high precision).
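The control flow of the two stages is short enough to sketch end to end. In the sketch below, `toy_retrieve` and `toy_score` are deliberately crude stand-ins for hybrid search and a cross-encoder, and `CORPUS` is made up; only `two_stage_search` reflects the pattern described above.

```python
from typing import Callable

def two_stage_search(
    query: str,
    retrieve: Callable[[str, int], list[str]],
    score_pair: Callable[[str, str], float],
    n_candidates: int = 30,
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Retrieve n_candidates cheaply, then rerank them with a pairwise scorer."""
    candidates = retrieve(query, n_candidates)
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

CORPUS = [
    "JWT authentication overview for the platform",
    "validate_jwt_token is called from the auth middleware",
    "Rate limiting configuration guide",
    "How validate_jwt_token handles expired tokens",
]

def toy_retrieve(query: str, n: int) -> list[str]:
    """Stand-in for hybrid search: keep documents sharing any query word."""
    words = set(query.lower().split())
    hits = [doc for doc in CORPUS if words & set(doc.lower().split())]
    return hits[:n]

def toy_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: fraction of query words found in doc."""
    words = query.lower().split()
    doc_words = set(doc.lower().split())
    return sum(w in doc_words for w in words) / len(words)

results = two_stage_search(
    "validate_jwt_token expired", toy_retrieve, toy_score, top_k=2
)
for doc, score in results:
    print(round(score, 2), doc)
```

The expensive scorer only ever sees the candidates returned by the first stage, so its cost is bounded by `n_candidates` rather than by corpus size.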

Practical parameters:

Retrieve N=20–50 from hybrid search (higher N = better recall but slower reranking)
Rerank all N, take top 5 for generation
Measure end-to-end latency: hybrid retrieval (10–50ms) + reranking (50–500ms) + generation (1000–3000ms)
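Summing the ranges above gives a rough latency budget. This is simple arithmetic on the quoted figures, not a measurement; actual numbers depend on hardware, model, and N.

```python
# Latency ranges in milliseconds, taken from the figures above
STAGES = {
    "hybrid_retrieval": (10, 50),
    "reranking": (50, 500),
    "generation": (1000, 3000),
}

best = sum(lo for lo, _ in STAGES.values())
worst = sum(hi for _, hi in STAGES.values())
print(f"end-to-end: {best}-{worst} ms")  # end-to-end: 1060-3550 ms

# Share of the worst-case budget taken by each stage
for name, (lo, hi) in STAGES.items():
    print(f"{name}: up to {hi / worst:.0%} of the worst case")
```

Generation dominates the budget, which is why adding a reranker of a few hundred milliseconds is usually an acceptable trade for better precision.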

Architecture and Diagrams

[Diagram: Dense vs. Sparse Retrieval Failure Modes]

[Diagram: Hybrid Search + Reranking Pipeline]

[Diagram: RRF Fusion Algorithm Visualization]

Production Code

The following implementation covers BM25 from scratch, hybrid search with RRF, adaptive weight detection, cross-encoder reranking, and a complete RAG pipeline with Claude generation.

"""
Hybrid Search and Reranking for Production RAG
===============================================
Covers:
  - BM25Retriever: full BM25 implementation from scratch
  - DenseRetriever: cosine similarity with mocked embeddings
  - rrf_fusion(): Reciprocal Rank Fusion
  - WeightedFusion: configurable weighted combination
  - HybridSearcher: adaptive BM25 + dense combination
  - CrossEncoderReranker: interface with mock implementation
  - HybridRAGPipeline: retrieve → rerank → generate with Claude

Install: pip install anthropic numpy
Production deps: pip install rank-bm25 sentence-transformers cohere voyageai
"""

import anthropic
import math
import numpy as np
import time
import re
import logging
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from typing import Optional, Callable

logger = logging.getLogger(__name__)


# ─────────────────────────────────────────────
# Data Structures
# ─────────────────────────────────────────────

@dataclass
class Document:
    """A document with text and optional metadata."""
    doc_id: str
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class SearchResult:
    """Unified result type for both sparse and dense retrievers."""
    doc_id: str
    text: str
    score: float
    rank: int
    retriever: str        # "bm25", "dense", "hybrid_rrf", "hybrid_weighted", "reranked"
    metadata: dict = field(default_factory=dict)


# ─────────────────────────────────────────────
# BM25 Retriever
# ─────────────────────────────────────────────

class BM25Retriever:
    """
    Full BM25 implementation from scratch.

    In production: use rank_bm25 (pip install rank-bm25):
        from rank_bm25 import BM25Okapi
        bm25 = BM25Okapi([doc.text.split() for doc in corpus])
        scores = bm25.get_scores(query.split())
    """

    def __init__(
        self,
        documents: list[Document],
        k1: float = 1.5,
        b: float = 0.75,
    ):
        self.documents = documents
        self.k1 = k1
        self.b = b
        self.n = len(documents)

        # Tokenize
        self.tokenized = [self._tokenize(doc.text) for doc in documents]
        self.doc_lengths = [len(tokens) for tokens in self.tokenized]
        self.avgdl = sum(self.doc_lengths) / self.n if self.n > 0 else 0

        # Build inverted index and compute IDF
        self.doc_freqs: list[dict[str, int]] = [Counter(t) for t in self.tokenized]
        df: dict[str, int] = defaultdict(int)
        for token_counts in self.doc_freqs:
            for token in token_counts:
                df[token] += 1

        # Robertson-Sparck Jones IDF
        self.idf: dict[str, float] = {
            term: math.log((self.n - freq + 0.5) / (freq + 0.5) + 1)
            for term, freq in df.items()
        }

    def _tokenize(self, text: str) -> list[str]:
        """Lowercase, then extract runs of word characters (letters, digits, underscore)."""
        # \w includes "_", so identifiers like validate_jwt_token stay one token
        tokens = re.findall(r'\w+', text.lower())
        return tokens

    def score(self, query: str, doc_idx: int) -> float:
        """BM25 score for a query-document pair."""
        query_tokens = self._tokenize(query)
        doc_freq = self.doc_freqs[doc_idx]
        doc_len  = self.doc_lengths[doc_idx]

        score = 0.0
        for token in query_tokens:
            if token not in self.idf:
                continue
            f = doc_freq.get(token, 0)
            if f == 0:
                continue

            idf = self.idf[token]
            # BM25 term frequency saturation + length normalization
            tf_normalized = (f * (self.k1 + 1)) / (
                f + self.k1 * (1 - self.b + self.b * doc_len / self.avgdl)
            )
            score += idf * tf_normalized

        return score

    def retrieve(self, query: str, k: int = 10) -> list[SearchResult]:
        """Retrieve top-k documents by BM25 score."""
        scored = [
            (i, self.score(query, i))
            for i in range(self.n)
        ]
        # Sort by score descending, filter zero scores
        scored = [(i, s) for i, s in scored if s > 0]
        scored.sort(key=lambda x: x[1], reverse=True)

        results = []
        for rank, (idx, score) in enumerate(scored[:k], 1):
            results.append(SearchResult(
                doc_id=self.documents[idx].doc_id,
                text=self.documents[idx].text,
                score=score,
                rank=rank,
                retriever="bm25",
                metadata=self.documents[idx].metadata,
            ))
        return results


# ─────────────────────────────────────────────
# Dense Retriever
# ─────────────────────────────────────────────

class DenseRetriever:
    """
    Dense retrieval using pre-computed embeddings.

    In production: replace _mock_embed with your actual embedding client:
        import voyageai
        client = voyageai.Client(api_key=VOYAGE_API_KEY)
        embeddings = client.embed(texts, model="voyage-3").embeddings
    """

    def __init__(
        self,
        documents: list[Document],
        embed_fn: Callable[[list[str]], list[list[float]]],
    ):
        self.documents = documents
        self.embed_fn = embed_fn

        # Pre-compute and normalize document embeddings once, then stack them
        # into an (n, d) matrix so each query is a single matrix-vector product
        logger.info("Pre-computing embeddings for %d documents...", len(documents))
        raw_embeddings = embed_fn([doc.text for doc in documents])
        self.embeddings = [self._normalize(e) for e in raw_embeddings]
        self.matrix = np.stack(self.embeddings, axis=0) if self.embeddings else None

    def _normalize(self, embedding: list[float]) -> np.ndarray:
        arr = np.array(embedding, dtype=np.float32)
        norm = np.linalg.norm(arr)
        return arr / norm if norm > 1e-10 else arr

    def retrieve(self, query: str, k: int = 10) -> list[SearchResult]:
        """Retrieve top-k documents by cosine similarity."""
        # Nothing to return for an empty index; k <= 0 would also make the
        # [-k:] slice below return every document instead of none
        if self.matrix is None or k <= 0:
            return []

        query_embedding = self._normalize(self.embed_fn([query])[0])

        # Vectorized dot product (cosine similarity on normalized vectors)
        scores = self.matrix @ query_embedding  # (n,)

        # argpartition selects the top-k in O(n); only those k are then sorted
        k_actual = min(k, len(self.documents))
        top_indices = np.argpartition(scores, -k_actual)[-k_actual:]
        top_indices = top_indices[np.argsort(scores[top_indices])[::-1]]

        results = []
        for rank, idx in enumerate(top_indices, 1):
            results.append(SearchResult(
                doc_id=self.documents[idx].doc_id,
                text=self.documents[idx].text,
                score=float(scores[idx]),
                rank=rank,
                retriever="dense",
                metadata=self.documents[idx].metadata,
            ))
        return results


# ─────────────────────────────────────────────
# RRF Fusion
# ─────────────────────────────────────────────

def rrf_fusion(
    *ranked_lists: list[SearchResult],
    k: int = 60,
    max_results: Optional[int] = None,
) -> list[SearchResult]:
    """
    Reciprocal Rank Fusion over multiple ranked lists.

    Formula: score(doc) = sum over lists of 1 / (k + rank_in_list)

    Args:
        *ranked_lists: Variable number of ranked result lists
        k: RRF smoothing constant (default 60 per Cormack et al. 2009)
        max_results: Return at most this many results

    Returns:
        Single merged and re-ranked list of SearchResults.
    """
    rrf_scores: dict[str, float] = defaultdict(float)
    doc_data: dict[str, dict] = {}  # Store text and metadata by doc_id

    for ranked_list in ranked_lists:
        for result in ranked_list:
            # Core RRF formula: 1 / (k + rank)
            rrf_scores[result.doc_id] += 1.0 / (k + result.rank)
            # Store doc data from first occurrence
            if result.doc_id not in doc_data:
                doc_data[result.doc_id] = {
                    "text": result.text,
                    "metadata": result.metadata,
                }

    # Sort by RRF score descending
    sorted_docs = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)

    if max_results is not None:
        sorted_docs = sorted_docs[:max_results]

    return [
        SearchResult(
            doc_id=doc_id,
            text=doc_data[doc_id]["text"],
            score=score,
            rank=rank,
            retriever="hybrid_rrf",
            metadata=doc_data[doc_id]["metadata"],
        )
        for rank, (doc_id, score) in enumerate(sorted_docs, 1)
    ]


def weighted_fusion(
    dense_results: list[SearchResult],
    sparse_results: list[SearchResult],
    dense_weight: float = 0.7,
    sparse_weight: float = 0.3,
) -> list[SearchResult]:
    """
    Weighted combination using min-max normalized scores.
    Use when you have empirical evidence one retriever outperforms the other.
    """
    def normalize(results: list[SearchResult]) -> dict[str, float]:
        if not results:
            return {}
        scores = [r.score for r in results]
        min_s, max_s = min(scores), max(scores)
        span = max_s - min_s
        if span < 1e-10:
            return {r.doc_id: 1.0 for r in results}
        return {r.doc_id: (r.score - min_s) / span for r in results}

    dense_norm  = normalize(dense_results)
    sparse_norm = normalize(sparse_results)

    # Collect all text and metadata
    doc_data: dict[str, dict] = {}
    for r in dense_results + sparse_results:
        if r.doc_id not in doc_data:
            doc_data[r.doc_id] = {"text": r.text, "metadata": r.metadata}

    all_ids = set(dense_norm) | set(sparse_norm)
    combined = {
        doc_id: (
            dense_weight  * dense_norm.get(doc_id, 0.0) +
            sparse_weight * sparse_norm.get(doc_id, 0.0)
        )
        for doc_id in all_ids
    }

    sorted_docs = sorted(combined.items(), key=lambda x: x[1], reverse=True)

    return [
        SearchResult(
            doc_id=doc_id,
            text=doc_data[doc_id]["text"],
            score=score,
            rank=rank,
            retriever="hybrid_weighted",
            metadata=doc_data[doc_id]["metadata"],
        )
        for rank, (doc_id, score) in enumerate(sorted_docs, 1)
    ]


# ─────────────────────────────────────────────
# Hybrid Searcher with Adaptive Weights
# ─────────────────────────────────────────────

class HybridSearcher:
    """
    Combines BM25 and dense retrieval with RRF or weighted fusion.
    Supports adaptive weighting based on query type detection.
    """

    def __init__(
        self,
        bm25_retriever: BM25Retriever,
        dense_retriever: DenseRetriever,
        fusion: str = "rrf",       # "rrf" or "weighted"
        rrf_k: int = 60,
        dense_weight: float = 0.7,
        sparse_weight: float = 0.3,
    ):
        self.bm25    = bm25_retriever
        self.dense   = dense_retriever
        self.fusion  = fusion
        self.rrf_k   = rrf_k
        self.dense_weight  = dense_weight
        self.sparse_weight = sparse_weight

    def _detect_query_type(self, query: str) -> str:
        """
        Heuristic classification of query type.

        Returns "lexical" for identifier/exact-match queries,
        "semantic" for conceptual queries.

        In production: use claude-haiku-4-5 for more accurate classification.
        """
        query_lower = query.lower()

        # Signals of a lexical/identifier query
        lexical_signals = [
            bool(re.search(r'[_][a-z]', query_lower)),  # snake_case identifiers
            bool(re.search(r'[a-z][A-Z]', query)),  # camelCase / PascalCase (lowercase followed by uppercase)
            bool(re.search(r'0x[0-9a-f]+', query_lower)),  # hex codes
            bool(re.search(r'\b[A-Z]{3,}\b', query)),  # ALLCAPS codes
            bool(re.search(r'\d{4,}', query)),  # Long numbers (error codes)
            len(query.split()) <= 3,  # Very short query
        ]

        if sum(lexical_signals) >= 2:
            return "lexical"

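        # Signals of a semantic query.
        # Match whole words so that "show" does not count as "how".
        query_words = set(re.findall(r"[a-z]+", query_lower))
        semantic_signals = [
            len(query.split()) > 5,
            any(w in query_words for w in [
                "how", "why", "what", "explain", "describe",
                "understand", "overview", "difference", "compare",
            ]),
        ]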

        if sum(semantic_signals) >= 1:
            return "semantic"

        return "balanced"

    def search(
        self,
        query: str,
        k: int = 10,
        adaptive: bool = True,
    ) -> list[SearchResult]:
        """
        Hybrid search combining BM25 and dense retrieval.

        Args:
            query: User query string
            k: Number of results to return
            adaptive: If True, adjust weights based on detected query type
        """
        # Retrieve candidates from both retrievers
        retrieve_n = k * 3  # Cast wide, then fuse down to k
        bm25_results  = self.bm25.retrieve(query, k=retrieve_n)
        dense_results = self.dense.retrieve(query, k=retrieve_n)

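        # Adaptive weight adjustment. Weights only feed weighted fusion;
        # RRF is rank-based and ignores them, so skip detection in RRF mode.
        dense_w  = self.dense_weight
        sparse_w = self.sparse_weight

        if adaptive and self.fusion != "rrf":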
            query_type = self._detect_query_type(query)
            if query_type == "lexical":
                dense_w, sparse_w = 0.3, 0.7  # Lean toward BM25 for identifiers
                logger.debug(f"Lexical query detected - weights: dense={dense_w}, sparse={sparse_w}")
            elif query_type == "semantic":
                dense_w, sparse_w = 0.8, 0.2  # Lean toward dense for concepts
                logger.debug(f"Semantic query detected - weights: dense={dense_w}, sparse={sparse_w}")

        # Fuse results
        if self.fusion == "rrf":
            merged = rrf_fusion(bm25_results, dense_results, k=self.rrf_k, max_results=k)
        else:
            merged = weighted_fusion(
                dense_results, bm25_results,
                dense_weight=dense_w, sparse_weight=sparse_w
            )
            merged = merged[:k]

        return merged


# ─────────────────────────────────────────────
# Cross-Encoder Reranker
# ─────────────────────────────────────────────

class CrossEncoderReranker:
    """
    Cross-encoder reranker interface.

    This mock shows the interface. In production, use one of:

    Option 1 - Cohere Rerank (managed):
        import cohere
        co = cohere.Client(api_key=COHERE_API_KEY)
        results = co.rerank(
            model="rerank-english-v3.0",
            query=query,
            documents=[r.text for r in candidates],
            top_n=top_k,
        )

    Option 2 - sentence-transformers (self-hosted):
        from sentence_transformers import CrossEncoder
        model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
        scores = model.predict([(query, r.text) for r in candidates])

    Option 3 - BGE Reranker (self-hosted):
        from FlagEmbedding import FlagReranker
        reranker = FlagReranker("BAAI/bge-reranker-large")
        scores = reranker.compute_score([(query, r.text) for r in candidates])
    """

    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.model_name = model_name
        self._model = None  # Lazy load in production

    def _load_model(self):
        """Load the cross-encoder model (lazy initialization)."""
        try:
            from sentence_transformers import CrossEncoder
            self._model = CrossEncoder(self.model_name)
        except ImportError:
            logger.warning(
                "sentence-transformers not installed. Using mock scores. "
                "pip install sentence-transformers"
            )
            self._model = None

    def rerank(
        self,
        query: str,
        candidates: list[SearchResult],
        top_k: int = 5,
    ) -> list[SearchResult]:
        """
        Rerank candidates using cross-encoder relevance scoring.
        Returns top_k results sorted by cross-encoder score.
        """
        if not candidates:
            return []

        if self._model is None:
            self._load_model()

        if self._model is not None:
            # Real cross-encoder scoring
            pairs = [(query, r.text) for r in candidates]
            scores = self._model.predict(pairs)

            scored = list(zip(candidates, scores))
            scored.sort(key=lambda x: x[1], reverse=True)

            return [
                SearchResult(
                    doc_id=result.doc_id,
                    text=result.text,
                    score=float(score),
                    rank=rank,
                    retriever="reranked",
                    metadata=result.metadata,
                )
                for rank, (result, score) in enumerate(scored[:top_k], 1)
            ]

        else:
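            # Mock: simulate reranking with perturbed original scores.
            # In a real mock test, you would inject scores via dependency injection.
            # zlib.crc32 gives a seed that is stable across processes; the built-in
            # hash() of a str is salted per interpreter run, so it is not reproducible.
            import zlib
            rng = np.random.default_rng(zlib.crc32(query.encode("utf-8")))
            perturbations = rng.normal(0, 0.1, len(candidates))
            scored = [
                (r, r.score + p)
                for r, p in zip(candidates, perturbations)
            ]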
            scored.sort(key=lambda x: x[1], reverse=True)

            return [
                SearchResult(
                    doc_id=result.doc_id,
                    text=result.text,
                    score=float(score),
                    rank=rank,
                    retriever="reranked_mock",
                    metadata=result.metadata,
                )
                for rank, (result, score) in enumerate(scored[:top_k], 1)
            ]


# ─────────────────────────────────────────────
# Complete Hybrid RAG Pipeline
# ─────────────────────────────────────────────

class HybridRAGPipeline:
    """
    Production RAG pipeline with hybrid retrieval and optional reranking.

    Architecture:
      Query → BM25 + Dense (3 × retrieve_k candidates each) → RRF (top retrieve_k)
            → [Reranker] (top final_k) → Claude
    """

    def __init__(
        self,
        hybrid_searcher: HybridSearcher,
        anthropic_api_key: str,
        reranker: Optional[CrossEncoderReranker] = None,
        retrieve_k: int = 20,
        final_k: int = 5,
    ):
        self.searcher    = hybrid_searcher
        self.reranker    = reranker
        self.retrieve_k  = retrieve_k
        self.final_k     = final_k
        self.claude      = anthropic.Anthropic(api_key=anthropic_api_key)

    def retrieve_and_rerank(
        self, query: str
    ) -> tuple[list[SearchResult], list[SearchResult]]:
        """
        Two-stage retrieval: hybrid search → optional reranking.

        Returns (hybrid_results, final_results) for diagnostics.
        """
        # Stage 1: Hybrid retrieval
        hybrid_results = self.searcher.search(query, k=self.retrieve_k)

        if not hybrid_results:
            return [], []

        # Stage 2: Optional reranking
        if self.reranker:
            final_results = self.reranker.rerank(
                query, hybrid_results, top_k=self.final_k
            )
        else:
            final_results = hybrid_results[:self.final_k]

        return hybrid_results, final_results

    def generate(
        self,
        query: str,
        final_results: list[SearchResult],
    ) -> str:
        """Generate a cited answer using Claude."""
        if not final_results:
            return "No relevant documents found for this query."

        context = "\n\n".join(
            f'<document id="{r.doc_id}" '
            f'relevance="{r.score:.4f}" '
            f'retriever="{r.retriever}">\n{r.text}\n</document>'
            for r in final_results
        )

        prompt = f"""You are an expert assistant with access to retrieved documents.

Answer the question using ONLY the information in the provided documents.
When making a claim, cite the relevant document by its id attribute, e.g., [doc-42].
If documents don't contain sufficient information, say so clearly.

<retrieved_documents>
{context}
</retrieved_documents>

<question>{query}</question>

Provide a comprehensive answer with inline citations."""

        response = self.claude.messages.create(
            model="claude-opus-4-6",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text

    def answer(self, query: str) -> dict:
        """Complete pipeline: retrieve → rerank → generate."""
        t0 = time.perf_counter()  # monotonic clock, better suited to measuring durations

        hybrid_results, final_results = self.retrieve_and_rerank(query)
        t1 = time.perf_counter()
        retrieve_ms = (t1 - t0) * 1000

        answer_text = self.generate(query, final_results)
        t2 = time.perf_counter()
        generate_ms = (t2 - t1) * 1000

        total_ms = (t2 - t0) * 1000

        return {
            "query":   query,
            "answer":  answer_text,
            "sources": [
                {
                    "doc_id":    r.doc_id,
                    "score":     round(r.score, 4),
                    "retriever": r.retriever,
                    # Append an ellipsis only when the text was actually truncated
                    "preview":   r.text[:120] + ("..." if len(r.text) > 120 else ""),
                }
                for r in final_results
            ],
            "diagnostics": {
                "hybrid_candidates": len(hybrid_results),
                "final_results":     len(final_results),
                "retrieve_ms":       round(retrieve_ms, 1),
                "generate_ms":       round(generate_ms, 1),
                "total_ms":          round(total_ms, 1),
                "reranking_used":    self.reranker is not None,
            },
        }


# ─────────────────────────────────────────────
# Benchmark: Dense vs. Sparse vs. Hybrid
# ─────────────────────────────────────────────

def benchmark_retrievers():
    """
    Compare dense-only, sparse-only, and hybrid retrieval
    on a set of test queries demonstrating different failure modes.
    """
    print("=== Hybrid Search Benchmark ===\n")

    # Sample corpus mixing code, concepts, and documentation
    docs = [
        Document("d1", "def validate_jwt_token(token: str) -> bool: ..."),
        Document("d2", "class JWTValidator: validates json web tokens"),
        Document("d3", "Authentication flow: user login, token generation, validate_jwt_token call"),
        Document("d4", "How does our authentication system work? It uses JWT for stateless auth"),
        Document("d5", "JSON Web Tokens (JWT) provide stateless authentication for REST APIs"),
        Document("d6", "User login process: POST /auth/login returns access_token and refresh_token"),
        Document("d7", "Token expiration: access tokens expire in 15 minutes, refresh in 7 days"),
        Document("d8", "Security: never store tokens in localStorage, use httpOnly cookies"),
        Document("d9", "Rate limiting is applied to the /auth endpoints to prevent brute force"),
        Document("d10", "validate_jwt_token raises InvalidTokenError if signature is tampered"),
    ]

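    # Mock embedding function using pseudo-random vectors (stable per text).
    # These vectors carry no semantic signal, so the dense rankings below are arbitrary.
    def mock_embed(texts: list[str]) -> list[list[float]]:
        """Deterministic mock embeddings for testing."""
        import zlib  # crc32 is stable across runs, unlike the salted hash() of a str
        result = []
        for text in texts:
            rng = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
            vec = rng.standard_normal(64).astype(np.float32)
            vec /= np.linalg.norm(vec)
            result.append(vec.tolist())
        return result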

    bm25    = BM25Retriever(docs)
    dense   = DenseRetriever(docs, mock_embed)
    hybrid  = HybridSearcher(bm25, dense, fusion="rrf")

    test_queries = [
        ("validate_jwt_token", "LEXICAL - should find d1, d3, d10"),
        ("how does authentication work", "SEMANTIC - should find d4, d5, d6"),
        ("token expiration policy", "BALANCED - should find d7"),
    ]

    for query, description in test_queries:
        print(f"Query: '{query}'")
        print(f"  Expected: {description}")

        bm25_r   = bm25.retrieve(query, k=3)
        dense_r  = dense.retrieve(query, k=3)
        hybrid_r = hybrid.search(query, k=3)

        print(f"  BM25 top-3:   {[r.doc_id for r in bm25_r]}")
        print(f"  Dense top-3:  {[r.doc_id for r in dense_r]}")
        print(f"  Hybrid top-3: {[r.doc_id for r in hybrid_r]}")

        # Quick analysis: how many of hybrid's top-k also appear in either single-retriever top-k?
        bm25_set   = set(r.doc_id for r in bm25_r)
        dense_set  = set(r.doc_id for r in dense_r)
        hybrid_set = set(r.doc_id for r in hybrid_r)

        union_single = bm25_set | dense_set
        print(
            f"  Union coverage: {len(union_single)} unique docs across both retrievers. "
            f"Hybrid covers: {len(hybrid_set & union_single)}/{len(hybrid_set)} from union."
        )
        print()


def demonstrate_rrf():
    """Show RRF fusion arithmetic in detail."""
    print("=== RRF Fusion Arithmetic Demo ===\n")

    # Two ranked lists
    bm25_results = [
        SearchResult("doc-A", "Doc A text", score=42.7, rank=1, retriever="bm25"),
        SearchResult("doc-C", "Doc C text", score=38.1, rank=2, retriever="bm25"),
        SearchResult("doc-B", "Doc B text", score=31.5, rank=3, retriever="bm25"),
        SearchResult("doc-E", "Doc E text", score=18.2, rank=4, retriever="bm25"),
    ]

    dense_results = [
        SearchResult("doc-B", "Doc B text", score=0.94, rank=1, retriever="dense"),
        SearchResult("doc-A", "Doc A text", score=0.87, rank=2, retriever="dense"),
        SearchResult("doc-D", "Doc D text", score=0.81, rank=3, retriever="dense"),
        SearchResult("doc-C", "Doc C text", score=0.71, rank=4, retriever="dense"),
    ]

    print("BM25 ranking:  A(1), C(2), B(3), E(4)")
    print("Dense ranking: B(1), A(2), D(3), C(4)")
    print()

    k = 60
    print(f"RRF with k={k}:")
    rrf_scores = {}
    doc_ranks = {}

    for r in bm25_results:
        rrf_scores[r.doc_id] = rrf_scores.get(r.doc_id, 0) + 1 / (k + r.rank)
        doc_ranks.setdefault(r.doc_id, {})["bm25"] = r.rank

    for r in dense_results:
        rrf_scores[r.doc_id] = rrf_scores.get(r.doc_id, 0) + 1 / (k + r.rank)
        doc_ranks.setdefault(r.doc_id, {})["dense"] = r.rank

    sorted_docs = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)

    print(f"{'Doc':>6} | {'BM25 rank':>10} | {'Dense rank':>10} | {'RRF score':>12}")
    print("-" * 46)
    for doc_id, score in sorted_docs:
        bm25_rank  = doc_ranks[doc_id].get("bm25",  "-")
        dense_rank = doc_ranks[doc_id].get("dense", "-")
        print(f"{doc_id:>6} | {str(bm25_rank):>10} | {str(dense_rank):>10} | {score:.6f}")

    print()
    merged = rrf_fusion(bm25_results, dense_results, k=k)
    print(f"Final RRF ranking: {[r.doc_id for r in merged]}")
    print("Observation: Doc-A edges out Doc-B - both appear in both lists, but A's ranks (1, 2) beat B's (3, 1).")
    print("Doc-D appeared only in dense list - lower combined score than dual-list docs.")


if __name__ == "__main__":
    benchmark_retrievers()
    demonstrate_rrf()

When to Use Each Approach

| Query Type | Best Approach | Example |
|---|---|---|
| Exact identifiers, function names, error codes | BM25-heavy hybrid | `validate_jwt_token`, `NullPointerException` |
| Conceptual / semantic questions | Dense-heavy hybrid | "how does our caching work" |
| Product names, people names | BM25-heavy | `Stripe` API, `John_Smith` |
| Cross-language queries | Dense (multilingual model) | English query, Spanish docs |
| General knowledge base search | Balanced hybrid + reranking | Default for most RAG |
| Compliance / adversarial queries | Reranking essential | Legal, medical, financial |

Common Mistakes

:::danger Forgetting BM25 for Exact-Match Queries
Pure dense retrieval systems fail silently on identifier queries. Error codes, function names, configuration keys, and product SKUs require exact term matching that BM25 provides. If your RAG system serves a technical domain, BM25 is not optional - it is required.
:::

:::danger Normalizing BM25 Scores Before RRF
Do not normalize raw BM25 scores before feeding them into RRF. RRF does not use scores at all - only ranks - so a monotonic rescaling such as min-max leaves the fused ranking unchanged and only adds work and confusion. Normalization belongs in weighted fusion, where scores are actually combined. RRF is scale-invariant by design.
:::
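To make the scale-invariance point concrete, here is a minimal sketch showing that min-max normalization does not change the ordering that RRF consumes. The scores and the helper names `to_ranking` and `min_max` are made up for illustration and are not part of the production code above.

```python
def to_ranking(scores: dict[str, float]) -> list[str]:
    """Doc ids ordered by descending score; this ordering is all RRF consumes."""
    return sorted(scores, key=scores.get, reverse=True)


def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; a monotonic transform, so order is preserved."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {doc_id: (s - lo) / span for doc_id, s in scores.items()}


# Hypothetical raw BM25 scores for three documents
bm25_scores = {"d1": 42.7, "d10": 38.1, "d3": 31.5}

raw_ranking = to_ranking(bm25_scores)
normalized_ranking = to_ranking(min_max(bm25_scores))

# Same ranking either way, so RRF would produce the same fused list
print(raw_ranking == normalized_ranking)
```

Because the two rankings are identical, any normalization step placed in front of RRF is dead code.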

:::warning Skipping Reranking for High-Stakes Queries
If incorrect retrieval could cause real harm (medical, legal, financial), always add a cross-encoder reranker. The 23% accuracy improvement in the opening story's case is typical - reranking consistently closes the gap between "probably correct" and "reliably correct."
:::

:::tip Set RRF k Based on Your Retrieve-N
The RRF parameter $k$ interacts with how many candidates you retrieve. A common mistake: retrieving N=5 from each retriever but leaving k=60. With only 5 results, rank 1 scores 1/61 and rank 5 scores 1/65 - barely different. Use a smaller k (10–20) when N is small so that rank differences matter more.
:::
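The interaction between k and N can be checked with a few lines of arithmetic. This is a small sketch, with helper names chosen here for illustration, that compares how much more a rank-1 hit contributes than a rank-5 hit for several values of k.

```python
def rrf_contribution(rank: int, k: int) -> float:
    """Score a single ranked list contributes for a document at 1-based `rank`."""
    return 1.0 / (k + rank)


def top_to_bottom_ratio(n: int, k: int) -> float:
    """How many times larger the rank-1 contribution is than the rank-n one."""
    return rrf_contribution(1, k) / rrf_contribution(n, k)


# Compare the spread between rank 1 and rank 5 for different smoothing constants
for k in (60, 20, 10):
    print(f"k={k:>2}: rank 1 contributes {top_to_bottom_ratio(5, k):.3f}x as much as rank 5")
```

With N=5, the ratio grows from about 1.07 at k=60 to about 1.36 at k=10, which is why a smaller k lets rank position carry more weight in short lists.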

:::tip Use Haiku for Query Classification
Classifying whether a query is lexical or semantic can meaningfully improve hybrid search weights. Use claude-haiku-4-5-20251001 (cheap and fast) to classify queries at runtime. A prompt like "Is this query looking for exact terms or conceptual information? Answer: lexical | semantic | balanced" adds about 100ms and can improve retrieval quality noticeably.
:::

:::warning Reranking Increases Latency - Budget for It
A cross-encoder reranking 30 documents with MiniLM adds 50–150ms. Cohere Rerank adds 100–300ms. This is usually acceptable given the quality improvement, but factor it into your latency SLA. If P99 latency must be under 1 second total, reranking 50 candidates may not be feasible.
:::

Interview Questions and Answers

Q1: What is BM25 and why does it sometimes outperform dense retrieval?

BM25 is a probabilistic term-frequency ranking function. Given a query and a document, it computes a relevance score based on how often the query terms appear in the document (term frequency, with saturation to prevent runaway scores from repeated terms) and how rare those terms are across the entire corpus (inverse document frequency - rare terms are more informative than common ones). It also normalizes for document length. BM25 outperforms dense retrieval on exact-match queries because dense models generalize away from surface form - they understand that "car" and "automobile" are similar, but this generalization means they cannot distinguish validate_jwt_token from validate_oauth_token. BM25 sees these as completely different token sequences and ranks exactly matching documents at the top. BM25 also tends to outperform dense retrieval on technical documentation where users already know the precise term they want.
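The saturation and rarity effects described in this answer can be seen numerically. The sketch below assumes the commonly used defaults k1=1.5 and b=0.75 and a Lucene-style IDF variant; the lesson's own BM25 implementation may use different constants or a different IDF form.

```python
import math


def bm25_tf_component(
    tf: int,
    doc_len: int,
    avg_doc_len: float,
    k1: float = 1.5,   # assumed default: controls how quickly term frequency saturates
    b: float = 0.75,   # assumed default: strength of document-length normalization
) -> float:
    """Term-frequency part of BM25 for one query term in one document."""
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


def bm25_idf(num_docs: int, doc_freq: int) -> float:
    """One common IDF variant (non-negative); rarer terms get larger weights."""
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)


# Saturation: repeating a term 100 times is worth far less than 100x one occurrence
for tf in (1, 10, 100):
    print(f"tf={tf:>3}: {bm25_tf_component(tf, doc_len=100, avg_doc_len=100.0):.3f}")

# Rarity: a term in 1 of 1000 docs outweighs a term in 500 of 1000 docs
print(f"rare idf={bm25_idf(1000, 1):.3f}, common idf={bm25_idf(1000, 500):.3f}")
```

The term-frequency component can never exceed k1 + 1, which is the "saturation" that stops a keyword-stuffed document from running away with the ranking.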

Q2: Explain Reciprocal Rank Fusion and why it is preferred over weighted score averaging.

RRF fuses multiple ranked lists by summing $1 / (k + \text{rank})$ for each document across all lists. A document that ranks first in both BM25 and dense retrieval scores higher than one that ranks first in only one list. The key insight is that RRF works only with ranks, not scores. BM25 scores are unbounded and might range from 0 to 50; cosine similarities are bounded in $[-1, 1]$ and in practice usually fall between 0 and 1. These scales are incomparable - naively adding them together lets the larger-scale scorer dominate regardless of relevance. RRF is immune to this because it uses position, not magnitude. A document at rank 1 in BM25 contributes the same $1/(k+1)$ whether its BM25 score was 2.1 or 42.7. The $k=60$ smoothing constant prevents the first-place document from dominating too strongly - with it, the gap between ranks 1 and 2 is only about 0.00026 ($1/61 - 1/62$). This robustness to scale differences is why RRF is a strong default: it performs well without the per-corpus calibration that weighted averaging needs.
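The scale problem is easy to reproduce. The following sketch reuses the illustrative scores from the RRF demo in the production code and compares naive score addition with RRF; the function names `naive_sum` and `rrf` are chosen here for the example.

```python
# (doc_id, score) pairs, already sorted best-first by each retriever
bm25 = [("doc-A", 42.7), ("doc-C", 38.1), ("doc-B", 31.5), ("doc-E", 18.2)]
dense = [("doc-B", 0.94), ("doc-A", 0.87), ("doc-D", 0.81), ("doc-C", 0.71)]


def naive_sum(*ranked_lists: list[tuple[str, float]]) -> list[str]:
    """Add raw scores across lists; the larger-scale retriever dominates."""
    totals: dict[str, float] = {}
    for results in ranked_lists:
        for doc_id, score in results:
            totals[doc_id] = totals.get(doc_id, 0.0) + score
    return sorted(totals, key=totals.get, reverse=True)


def rrf(*ranked_lists: list[tuple[str, float]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: sum 1 / (k + rank), ignoring the raw scores."""
    totals: dict[str, float] = {}
    for results in ranked_lists:
        for rank, (doc_id, _score) in enumerate(results, 1):
            totals[doc_id] = totals.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(totals, key=totals.get, reverse=True)


naive_order = naive_sum(bm25, dense)
rrf_order = rrf(bm25, dense)
print("Naive sum:", naive_order)
print("RRF:      ", rrf_order)
```

The naive sum reproduces the BM25 order (A, C, B, E) with the dense-only document D pushed to the bottom, so the dense retriever has almost no say. RRF instead promotes doc-B, the dense retriever's top hit, above doc-C, and places doc-D above doc-E.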

Q3: When would you use a cross-encoder reranker and what are the trade-offs?

A cross-encoder reranker is warranted when retrieval quality is critical enough to justify additional latency (50–500ms) and cost. Cross-encoders work by processing the query and document together through a transformer - every query token attends to every document token, enabling much more nuanced relevance scoring than a bi-encoder (which processes them independently). The practical use case: retrieve 20–50 candidates with fast hybrid search, then rerank with a cross-encoder to get the top 5 with high precision. The trade-off: latency increases by 50–500ms depending on model size and number of candidates, and cost increases if using a managed service like Cohere Rerank. For conversational applications, this is often acceptable. For real-time applications requiring sub-500ms total latency, you may need to limit candidates to 10–15 or use a smaller cross-encoder model.

Q4: How do you detect whether a query should be handled by BM25 or dense retrieval?

There are several heuristic signals. Lexical queries (favor BM25) tend to be short, contain underscores or camelCase identifiers, include hex codes or all-caps abbreviations, or use specific product names. Semantic queries (favor dense retrieval) tend to be longer sentences, include question words (how, why, what), or use conceptual language. In production, I would implement this as a two-phase decision: first, a fast regex-based heuristic that catches obvious cases (snake_case identifiers, error codes) with no API cost; second, for ambiguous queries, use claude-haiku-4-5-20251001 with a classification prompt - it is cheap (less than 1 cent per 1000 queries) and adds only about 100ms. The classification output adjusts the weight split in weighted fusion, not the RRF formula itself - if you use RRF, both retrievers always contribute, and the fusion algorithm naturally handles the case where one retriever has no relevant results.
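A minimal sketch of that two-phase decision follows. The regex fast path handles obvious identifiers; everything else is passed to `llm_classify`, a hypothetical callable standing in for the model call (it is an assumption of this example, not an API from any SDK). Unrecognized model output falls back to "balanced".

```python
import re
from typing import Callable, Optional

# Phase 1 patterns: cheap, local signals that a query names an identifier or code
IDENTIFIER_PATTERNS = [
    re.compile(r"[A-Za-z0-9]_[A-Za-z0-9]"),  # snake_case
    re.compile(r"[a-z][A-Z]"),               # camelCase / PascalCase
    re.compile(r"0x[0-9a-fA-F]+"),           # hex codes
    re.compile(r"\d{4,}"),                   # long numeric codes
]

VALID_LABELS = {"lexical", "semantic", "balanced"}


def classify_query(
    query: str,
    llm_classify: Optional[Callable[[str], str]] = None,  # hypothetical model wrapper
) -> str:
    """Return "lexical", "semantic", or "balanced" for a query."""
    # Phase 1: obvious identifier queries never reach the model
    if any(p.search(query) for p in IDENTIFIER_PATTERNS):
        return "lexical"

    # Phase 2: ambiguous queries go to the classifier, if one is configured
    if llm_classify is not None:
        label = llm_classify(query).strip().lower()
        if label in VALID_LABELS:
            return label

    # No classifier, or the classifier returned something unexpected
    return "balanced"


# Usage with a stub in place of a real model call
print(classify_query("where is validate_jwt_token called"))
print(classify_query("how does caching work", llm_classify=lambda q: "semantic"))
```

Injecting the classifier as a callable keeps the fast path testable without network access and makes the fallback behavior explicit.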

Q5: A user complains that your RAG system returns great answers for general questions but completely fails for questions about specific functions in the codebase. What is wrong and how do you fix it?

This is the classic pure dense retrieval failure mode on exact-match queries. The user is searching for specific function names like validate_jwt_token or parse_config_file. Dense embedding models generalize these identifiers to their semantic neighborhood (authentication, configuration), losing the exact identifier. The fix is adding BM25 sparse retrieval to create hybrid search. For a codebase, I would weight BM25 more heavily (60–70%) because developers searching for function names want exact matches. I would also check how the BM25 tokenizer treats identifiers. A plain whitespace split() keeps validate_jwt_token as a single token, which is what exact matching needs, but it also leaves punctuation attached, so validate_jwt_token(token) becomes a different token. Tokenizers that split on underscores or other non-alphanumeric characters produce ["validate", "jwt", "token"], which then matches any chunk that mentions those three words. Index the full identifier, optionally alongside its parts. The full fix: add BM25 indexing to the corpus, implement RRF to fuse BM25 and dense results, add query type detection (snake_case identifiers → upweight BM25), and validate the fix with a test set of 50–100 function-name queries.
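The tokenizer choice is worth checking directly, because Python's `str.split()` with no arguments splits only on whitespace. The sketch below contrasts three tokenizers; the function names are illustrative, and `identifier_aware_tokenize` is one possible design rather than the lesson's implementation.

```python
import re


def whitespace_tokenize(text: str) -> list[str]:
    """Split on whitespace only: identifiers survive, but so does punctuation."""
    return text.lower().split()


def word_tokenize(text: str) -> list[str]:
    """Split on every non-alphanumeric character, including underscores."""
    return re.findall(r"[a-z0-9]+", text.lower())


def identifier_aware_tokenize(text: str) -> list[str]:
    """Keep each identifier whole and also emit its underscore-separated parts."""
    tokens: list[str] = []
    for tok in re.findall(r"[a-z0-9_]+", text.lower()):
        tokens.append(tok)
        if "_" in tok:
            tokens.extend(part for part in tok.split("_") if part)
    return tokens


sample = "call validate_jwt_token(token)"
print(whitespace_tokenize(sample))        # identifier stays glued to "(token)"
print(word_tokenize(sample))              # identifier is broken into parts
print(identifier_aware_tokenize(sample))  # whole identifier plus its parts
```

Emitting both the whole identifier and its parts lets an exact query for validate_jwt_token match strongly while a looser query such as "jwt token validation" still finds the same chunk. Whichever tokenizer is used, it has to be applied identically at index time and query time.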

Q6: What is the two-stage retrieval + reranking architecture and why does it work better than single-stage?

The two-stage architecture acknowledges that retrieval and ranking are different problems requiring different algorithms. Stage one (retrieval) must be fast - sub-50ms for millions of documents - so it relies on cheap scoring over precomputed indexes: approximate nearest-neighbor search (HNSW) for dense, and an inverted index for BM25. The inverted-index lookup itself is exact, but the ANN search is approximate and both methods score the query and the document without ever reading them together, so the top-k results are "probably" the most relevant but not guaranteed. Stage two (reranking) only needs to process the small set of candidates from stage one - typically 20–100 documents. Because the candidate set is small, you can use a slow but accurate cross-encoder that reads each query-document pair with full attention. The cross-encoder makes nuanced relevance decisions that the retrieval stage cannot: it can reason about whether the document's mention of "authentication" is specifically about JWT authentication (what the user asked) or about a different authentication method. This two-stage architecture is why production RAG systems achieve 85–95% top-5 recall, while single-stage pure dense retrieval typically achieves 75–85%. The quality improvement is consistent and significant enough that adding reranking is standard practice in any serious RAG deployment.

Summary

Hybrid search and reranking represent the state of the art in production RAG retrieval. The key architectural principles:

No single retriever is best everywhere - dense retrieval fails on exact-match queries; BM25 fails on semantic paraphrasing. Use both.
RRF is the default fusion - it is robust to score scale differences, requires no calibration, and empirically outperforms most alternatives on diverse query sets.
Adaptive weighting improves quality - detecting whether a query is lexical or semantic and adjusting fusion weights accordingly improves retrieval quality at little cost. The weights apply to weighted fusion; RRF ignores them.
Reranking is the quality multiplier - retrieve broadly (N=30), rerank precisely (top-5). The two-stage pipeline consistently outperforms single-stage retrieval.
Match retrieval to your domain - code search, legal search, medical search each have different query patterns. Profile your actual queries before committing to a fixed architecture.

The developer platform from the opening story now handles both "how does authentication work" and "find validate_jwt_token call sites" with equal accuracy. The pure dense retrieval system served the first query well and failed on the second. Hybrid search handles both, with a 23% overall improvement in retrieval accuracy. That is the payoff of understanding why each retrieval method exists.
