Search and Retrieval Systems

From 40% to 72% User Satisfaction: Rebuilding Search with Neural Retrieval

The support ticket came from the head of product at a mid-size B2B SaaS company. Their internal knowledge base had 80,000 documents - product documentation, support articles, API references, release notes, internal wikis. Their search system was Elasticsearch with BM25 ranking. User satisfaction with search was measured at 40% via post-search surveys: "Did you find what you were looking for?"

The failures were predictable in retrospect. A developer searching for "how to authenticate" got articles about "authentication errors" at rank 1, not the authentication setup guide. A support engineer searching for "customer cannot log in" got zero results - the relevant articles all used the phrase "SSO login failure." A product manager searching for "pricing tier comparison" got an article about "pricing calculator" when what they needed was the "plan comparison" page that never used the word "tier."

All three failures share the same root cause: BM25 is a keyword matcher. It finds documents that contain the query terms. It has no understanding of synonyms, paraphrase, semantic equivalence, or intent. A user who types "cannot log in" and a document titled "SSO login failure" are speaking different vocabularies, and BM25 cannot bridge them.

The engineering challenge: Elasticsearch is deeply embedded in their infrastructure. They cannot rip it out. The solution must layer neural understanding on top of the existing BM25 baseline, not replace it. The new system needs to go from 40% to at least 65% satisfaction within three months, using a team of three engineers, with a latency SLA of 300ms.

This case study covers the full redesign, from architecture through evaluation.

Requirements Analysis

Functional requirements:

Full-text search over 80K documents
Real-time indexing of new documents (within 5 minutes of publication)
Support for filters (document type, team, date range)
Ranked results with snippets highlighting relevant passages
Spell correction and query completion

Non-functional requirements:

Latency: 300ms p99 end-to-end
Relevance: user satisfaction rate above 65% (measured via survey)
Scale: 10K queries per day, growing to 100K
Index freshness: new documents searchable within 5 minutes

Constraints:

Must retain Elasticsearch - too much existing infrastructure depends on it
No labeled relevance judgments exist - must bootstrap evaluation
Team of 3 engineers, 3-month timeline

System Architecture

Component 1: BM25 Baseline (Keep and Improve)

BM25 is the TF-IDF-based ranking function that Elasticsearch uses by default. The BM25 score for document $d$ given query $q$ with terms $t_1, \ldots, t_n$ is:

$\text{BM25}(d, q) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{\text{TF}(t, d) \cdot (k_1 + 1)}{\text{TF}(t, d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})}$

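where $\text{TF}(t, d)$ is the frequency of term $t$ in $d$, $|d|$ is the document length in tokens, $\text{avgdl}$ is the average document length in the corpus, $k_1 = 1.2$ controls term frequency saturation, and $b = 0.75$ controls length normalization (both values are the Elasticsearch defaults).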
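
To make the formula concrete, here is a minimal sketch of the score for one document. The function name `bm25_score` and its arguments are illustrative, and the IDF term is assumed to be the Lucene-style variant $\ln(1 + \frac{N - n_t + 0.5}{n_t + 0.5})$, since the text above does not define IDF.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_score(
    query_terms: List[str],
    doc_terms: List[str],
    doc_freq: Dict[str, int],  # number of documents containing each term
    num_docs: int,
    avgdl: float,
    k1: float = 1.2,
    b: float = 0.75,
) -> float:
    """Score one tokenized document against a tokenized query."""
    tf = Counter(doc_terms)
    # Length normalization: documents longer than average are penalized.
    length_norm = 1 - b + b * len(doc_terms) / avgdl
    score = 0.0
    for term in query_terms:
        n_t = doc_freq.get(term, 0)
        if n_t == 0 or tf[term] == 0:
            continue  # a term missing from the document contributes nothing
        # Assumed IDF variant (Lucene-style); other variants exist.
        idf = math.log(1 + (num_docs - n_t + 0.5) / (n_t + 0.5))
        score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
    return score


doc_a = ["sso", "login", "failure"]
doc_b = ["login", "login", "guide"]
doc_freq = {"sso": 1, "login": 2, "failure": 1, "guide": 1}

score_a = bm25_score(["login"], doc_a, doc_freq, num_docs=2, avgdl=3.0)
score_b = bm25_score(["login"], doc_b, doc_freq, num_docs=2, avgdl=3.0)
no_match = bm25_score(["authenticate"], doc_a, doc_freq, num_docs=2, avgdl=3.0)
```

The last call shows the vocabulary mismatch problem directly: a query term that never appears in the corpus scores zero no matter how relevant the document is.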

BM25 improvements before adding neural search:

Field weighting: title matches should count more than body matches. In Elasticsearch, set field boosts: title^3, headings^2, body^1.
Synonyms: add a custom synonym filter to the Elasticsearch analyzer. "authenticate" expands to "authenticate, auth, login, sign in". This directly fixes the vocabulary mismatch for known synonym pairs.
Phrase queries: add a phrase match boost so exact phrase matches in the title rank higher.

These improvements alone typically move satisfaction from 40% to 50-55%. They are the cheapest wins and should be done first.
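
The three BM25 improvements can be sketched as Elasticsearch request bodies built in Python. The field names `title`, `headings`, and `body` come from the boosts above; the filter and analyzer names (`kb_synonyms`, `kb_search`), the phrase boost of 2.0, and the helper `build_bm25_query` are assumptions for illustration.

```python
from typing import Any, Dict

# Index settings: a search-time analyzer with a synonym_graph filter.
# Analyzer and filter names are placeholders.
SYNONYM_SETTINGS: Dict[str, Any] = {
    "settings": {
        "analysis": {
            "filter": {
                "kb_synonyms": {
                    "type": "synonym_graph",
                    "synonyms": ["authenticate, auth, login, sign in"],
                }
            },
            "analyzer": {
                "kb_search": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "kb_synonyms"],
                }
            },
        }
    }
}


def build_bm25_query(user_query: str, size: int = 50) -> Dict[str, Any]:
    """Build a search body with field boosts and a title phrase boost."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Field weighting: title counts 3x, headings 2x, body 1x.
                "must": [
                    {
                        "multi_match": {
                            "query": user_query,
                            "fields": ["title^3", "headings^2", "body"],
                        }
                    }
                ],
                # Optional clause: exact phrase in the title adds to the score.
                "should": [
                    {
                        "match_phrase": {
                            "title": {"query": user_query, "boost": 2.0}
                        }
                    }
                ],
            }
        },
    }


query_body = build_bm25_query("how to authenticate")
```

Putting the phrase match in `should` rather than `must` means it only raises the score of documents that already match; it never filters results out.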

Component 2: Dense Retrieval

Dense retrieval uses a bi-encoder: a neural network that encodes queries and documents into a shared embedding space. Similar documents and queries end up close together even when they use different words.

Model selection: For a team of three with a 3-month timeline, training a large model from scratch is impractical. Use a pre-trained bi-encoder from the Sentence Transformers library and fine-tune it on domain-specific data.

Recommended starting points:

msmarco-roberta-base-v2: trained on MS MARCO passage retrieval, good general-purpose baseline
bge-base-en-v1.5: strong BEIR benchmark performance, efficient for self-hosting
voyage-2 (API): hosted embedding API, strong quality, no infrastructure management

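Domain adaptation without labeled data: Use the existing BM25 results to generate pseudo-labeled training data. For each logged query, treat BM25's top-3 results as pseudo-positive documents and its rank 50-100 results as hard pseudo-negatives, then fine-tune the bi-encoder on these triples. This is weak supervision from BM25. It is similar in spirit to GPL (Generative Pseudo Labeling), which goes further by generating synthetic queries and scoring candidate pairs with a cross-encoder. Pseudo-labeling of this kind often improves domain-specific performance, but the positives inherit BM25's keyword bias, so spot-check a sample before training.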

from typing import List, Optional, Tuple

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader


class NeuralSearchEngine:
    def __init__(
        self,
        model_name: str = "msmarco-roberta-base-v2",
        embedding_dim: int = 768,
        index_path: Optional[str] = None,
    ):
        self.model = SentenceTransformer(model_name)
        self.embedding_dim = embedding_dim
        # doc_ids[i] is the ID of the vector at FAISS position i. These lists
        # are not stored in the FAISS file, so restore them separately when
        # loading a saved index.
        self.doc_ids = []
        self.doc_texts = []

        # HNSW index for fast approximate search - good for 80K docs
        if index_path:
            self.index = faiss.read_index(index_path)
        else:
            # Inner-product metric: with L2-normalized embeddings this equals
            # cosine similarity, so higher scores mean more similar documents.
            self.index = faiss.IndexHNSWFlat(
                embedding_dim, 32, faiss.METRIC_INNER_PRODUCT
            )  # M=32
            self.index.hnsw.efConstruction = 200  # higher = better recall
            self.index.hnsw.efSearch = 50         # higher = better recall, slower

    def index_documents(self, documents: List[dict], batch_size: int = 128):
        """Encode and index documents. Can be called more than once."""
        texts = [f"{doc['title']} {doc['body']}" for doc in documents]

        print(f"Encoding {len(texts)} documents...")
        embeddings = self.model.encode(
            texts,
            batch_size=batch_size,
            show_progress_bar=True,
            normalize_embeddings=True,  # unit-length vectors
        )
        self.index.add(embeddings.astype(np.float32))
        # Append rather than overwrite: index.add() appends vectors, so the ID
        # list must grow in step to keep FAISS positions aligned with doc_ids.
        self.doc_ids.extend(doc["id"] for doc in documents)
        self.doc_texts.extend(texts)

    def retrieve(self, query: str, top_k: int = 50) -> List[Tuple[str, float]]:
        """Retrieve top-k documents by dense similarity."""
        query_embedding = self.model.encode(
            [query], normalize_embeddings=True
        ).astype(np.float32)
        distances, indices = self.index.search(query_embedding, top_k)
        return [
            (self.doc_ids[idx], float(dist))
            for idx, dist in zip(indices[0], distances[0])
            if idx >= 0  # HNSW returns -1 for padded results
        ]

    def fine_tune_on_pseudo_labels(
        self,
        pseudo_labeled_pairs: List[Tuple[str, str, str]],
        # [(query, positive_doc_text, negative_doc_text), ...]
        epochs: int = 3,
    ):
        """Fine-tune the bi-encoder on pseudo-labeled training data."""
        examples = [
            InputExample(texts=[q, pos, neg], label=1.0)
            for q, pos, neg in pseudo_labeled_pairs
        ]
        loader = DataLoader(examples, batch_size=16, shuffle=True)
        loss = losses.TripletLoss(self.model)

        self.model.fit(
            train_objectives=[(loader, loss)],
            epochs=epochs,
            warmup_steps=100,
            output_path="./fine_tuned_model",
        )
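
The `fine_tune_on_pseudo_labels` method expects (query, positive, negative) triples. One way to mine them from BM25 rankings is sketched below. The helper `mine_triplets` and the `bm25_search` callable (query and depth in, ranked document texts out) are hypothetical; the top-3 and rank 50-100 cutoffs follow the description above.

```python
import random
from typing import Callable, List, Sequence, Tuple


def mine_triplets(
    queries: Sequence[str],
    bm25_search: Callable[[str, int], Sequence[str]],
    n_pos: int = 3,
    neg_start: int = 50,
    neg_end: int = 100,
    negs_per_pos: int = 1,
    seed: int = 0,
) -> List[Tuple[str, str, str]]:
    """Build (query, positive_text, negative_text) triples from BM25 rankings."""
    rng = random.Random(seed)
    triplets: List[Tuple[str, str, str]] = []
    for query in queries:
        ranked = list(bm25_search(query, neg_end))
        positives = ranked[:n_pos]
        # Ranks are 1-based in the text, so rank 50 is list position 49.
        negatives = ranked[neg_start - 1:neg_end]
        if not positives or not negatives:
            continue  # query returned too few results to mine from
        for pos in positives:
            sampled = rng.sample(negatives, min(negs_per_pos, len(negatives)))
            for neg in sampled:
                triplets.append((query, pos, neg))
    return triplets


def fake_search(query: str, depth: int) -> List[str]:
    """Stand-in for a real BM25 call; returns doc1..docN in rank order."""
    return [f"doc{i}" for i in range(1, depth + 1)]


triples = mine_triplets(["customer cannot log in"], fake_search)
```

Queries with fewer than 50 BM25 hits produce no triples here, which is a deliberate choice: shallow result lists do not contain usable hard negatives.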

Component 3: Query Understanding

Query understanding transforms the raw user query before retrieval:

Spell correction: Use a character-level language model or the Symspell library. Essential for technical queries where users misspell product-specific terms.

Query expansion: Append synonyms from a domain-specific synonym dictionary. "auth" expands to "auth OR authentication OR authorization OR SSO." Keep the original query as the primary signal; expansion is additive.

Intent classification: Classify queries into: navigational (user wants a specific page), informational (user wants to learn something), troubleshooting (user has a problem). Different intents may benefit from different ranking weights.

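import re
from typing import List, Optional

from transformers import pipeline


class QueryUnderstanding:
    def __init__(
        self,
        synonym_dict: Optional[dict] = None,
        intent_model: Optional[str] = None,
    ):
        self.synonym_dict = synonym_dict or {}
        # Optional fine-tuned intent classifier. Left unset by default, in
        # which case _classify_intent relies on keyword heuristics only.
        self.intent_classifier = (
            pipeline("text-classification", model=intent_model)
            if intent_model
            else None
        )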

    def process(self, raw_query: str) -> dict:
        """Full query understanding pipeline."""
        cleaned = raw_query.strip().lower()
        spell_corrected = self._spell_correct(cleaned)
        expanded_terms = self._expand_synonyms(spell_corrected)
        intent = self._classify_intent(spell_corrected)

        return {
            "original": raw_query,
            "cleaned": spell_corrected,
            "expanded_terms": expanded_terms,
            "intent": intent,
            "retrieval_query": spell_corrected,  # base query for retrieval
            "bm25_boost": expanded_terms,         # additional terms for BM25
        }

    def _expand_synonyms(self, query: str) -> List[str]:
        """Expand query terms with domain synonyms."""
        expanded = []
        for term in query.split():
            if term in self.synonym_dict:
                expanded.extend(self.synonym_dict[term])
        return list(set(expanded))

    def _spell_correct(self, query: str) -> str:
        # Placeholder - use Symspell or a custom domain spell checker
        return query

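    def _classify_intent(self, query: str) -> str:
        # Simple heuristic - in production, use a fine-tuned classifier.
        # Matching is by substring, so "fail" also matches "failure" and "failed".
        # Navigational intent is not detected here; that needs a classifier or
        # a lookup against known page titles.
        if any(w in query for w in ["error", "fail", "broken", "cannot", "not working"]):
            return "troubleshooting"
        if re.match(r"^how (to|do)", query):
            return "how-to"
        return "informational"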
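
The `_spell_correct` placeholder can be filled in several ways. As a minimal sketch, the standard library's `difflib` can snap each unknown token to the closest term in a domain vocabulary. This is a stand-in for SymSpell, not its API; the function name `spell_correct`, the 0.8 cutoff, and the vocabulary are assumptions.

```python
import difflib
from typing import Iterable


def spell_correct(query: str, vocabulary: Iterable[str], cutoff: float = 0.8) -> str:
    """Replace each out-of-vocabulary token with its closest vocabulary term."""
    vocab = sorted(set(vocabulary))
    known = set(vocab)
    corrected = []
    for token in query.split():
        if token in known:
            corrected.append(token)
            continue
        # Returns at most one candidate whose similarity ratio is >= cutoff.
        matches = difflib.get_close_matches(token, vocab, n=1, cutoff=cutoff)
        # Leave the token unchanged when nothing is close enough.
        corrected.append(matches[0] if matches else token)
    return " ".join(corrected)


domain_vocab = ["authentication", "error", "login", "sso"]
fixed = spell_correct("authentcation eror", domain_vocab)
untouched = spell_correct("kubernetes login", domain_vocab)
```

Building the vocabulary from the indexed documents themselves keeps product-specific terms from being "corrected" into common English words.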

Component 4: Cross-Encoder Reranking

The cross-encoder is the highest-quality but most expensive component. It processes query and document together (no precomputation), allowing deep interaction modeling.

Applied to the top-20 candidates from RRF, the cross-encoder produces a relevance score that is significantly more accurate than bi-encoder similarity. Its cost scales with the number of candidates rather than with corpus size: scoring 20 candidates takes roughly 50-80ms on a GPU, which fits within the 300ms latency budget.

from typing import List

from sentence_transformers import CrossEncoder


class CrossEncoderReranker:
    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        """
        ms-marco-MiniLM-L-6-v2: fast (6-layer MiniLM), good quality.
        cross-encoder/ms-marco-electra-base: slower, higher quality.
        """
        self.model = CrossEncoder(model_name, max_length=512)

    def rerank(
        self,
        query: str,
        candidates: List[dict],
        top_k: int = 10,
    ) -> List[dict]:
        """Rerank candidates using cross-encoder, return top_k."""
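        # The [:512] slice cuts characters as a cheap pre-truncation; the
        # model's max_length=512 is counted in tokens, not characters.
        pairs = [(query, doc["text"][:512]) for doc in candidates]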
        scores = self.model.predict(pairs, show_progress_bar=False)

        for doc, score in zip(candidates, scores):
            doc["rerank_score"] = float(score)

        reranked = sorted(candidates, key=lambda x: x["rerank_score"], reverse=True)
        return reranked[:top_k]

Component 5: Hybrid Search with RRF

Combining BM25 and dense retrieval:

def hybrid_search(
    query_processed: dict,
    bm25_retriever,        # wrapper around the Elasticsearch client (search, get_documents_by_ids)
    dense_retriever: NeuralSearchEngine,
    reranker: CrossEncoderReranker,
    bm25_top_k: int = 50,
    dense_top_k: int = 50,
    final_k: int = 10,
) -> List[dict]:
    """Full hybrid search pipeline."""
    # BM25 retrieval (with synonym expansion)
    bm25_query = query_processed["retrieval_query"]
    if query_processed["bm25_boost"]:
        bm25_query += " " + " ".join(query_processed["bm25_boost"])

    bm25_results = bm25_retriever.search(bm25_query, top_k=bm25_top_k)

    # Dense retrieval
    dense_results = dense_retriever.retrieve(
        query_processed["retrieval_query"],
        top_k=dense_top_k
    )

    # RRF fusion
    fused = reciprocal_rank_fusion(
        [[(doc["id"], doc["score"]) for doc in bm25_results],
         dense_results],
        k=60,
        weights=[0.4, 0.6],  # dense slightly higher for semantic-heavy corpus
    )

    # Get top-20 candidates with full text
    top_20_ids = [doc_id for doc_id, _ in fused[:20]]
    candidates = bm25_retriever.get_documents_by_ids(top_20_ids)

    # Cross-encoder reranking
    reranked = reranker.rerank(query_processed["retrieval_query"], candidates, top_k=final_k)
    return reranked
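
The pipeline above calls `reciprocal_rank_fusion` without defining it. A minimal sketch consistent with that call is shown below, assuming each input list is already sorted best-first. Each document receives the sum over lists of `weight / (k + rank)`; raw scores are ignored because BM25 and dense scores are not on comparable scales. The `weights` parameter is an extension of plain RRF, which weights all lists equally.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple


def reciprocal_rank_fusion(
    ranked_lists: Sequence[Sequence[Tuple[str, float]]],
    k: int = 60,
    weights: Optional[Sequence[float]] = None,
) -> List[Tuple[str, float]]:
    """Fuse several best-first (doc_id, score) lists into one ranking."""
    if weights is None:
        weights = [1.0] * len(ranked_lists)
    if len(weights) != len(ranked_lists):
        raise ValueError("weights must have one entry per ranked list")

    fused: Dict[str, float] = defaultdict(float)
    for weight, results in zip(weights, ranked_lists):
        # Only the position matters; the original score is discarded.
        for rank, (doc_id, _score) in enumerate(results, start=1):
            fused[doc_id] += weight / (k + rank)

    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


bm25_list = [("a", 12.3), ("b", 9.1)]
dense_list = [("b", 0.91), ("c", 0.80)]
fused_example = reciprocal_rank_fusion([bm25_list, dense_list], k=60)
```

In the example, document `b` appears in both lists and therefore outranks `a`, even though `a` is first in the BM25 list. This agreement bonus is the main reason RRF works well without score calibration.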

Evaluation: Building a Test Set Without Labels

The hardest part of this case study: there are no relevance judgments. How do you measure progress?

Step 1: Collect implicit feedback labels. Log every search query and every document click. A query with a click on document $d$ is a weak positive label for $(q, d)$ . This is noisy but abundant.

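Step 2: Build a curated evaluation set. Sample 200 representative queries. Have 3 domain experts manually rate the top-5 results from the baseline system on a 4-point scale (0 = irrelevant, 3 = highly relevant). This gives you an NDCG@5 baseline. As each new variant surfaces documents the experts have not seen, have those rated as well; otherwise unjudged results count as irrelevant and the metric understates the improvement.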

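Step 3: Measure NDCG@5 and MRR on the curated set. With $r_i$ the graded relevance of the result at rank $i$ and IDCG@k the DCG of the ideal ordering: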

$\text{NDCG@k} = \frac{\text{DCG@k}}{\text{IDCG@k}} \quad \text{where} \quad \text{DCG@k} = \sum_{i=1}^{k} \frac{2^{r_i} - 1}{\log_2(i+1)}$

Step 4: User satisfaction survey. After each search, show a thumbs up/thumbs down prompt. Track satisfaction rate weekly as the primary business metric.

Step 5: Zero-result rate. What fraction of queries return zero results? This is a direct failure signal.
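
Steps 1 and 5 both start from the search log. A minimal sketch of that aggregation follows, assuming a hypothetical log schema in which each event carries `query`, `num_results`, and `clicked_doc_ids`; a real logging pipeline will likely use different field names.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Tuple


def summarize_search_logs(
    events: Iterable[Dict[str, Any]],
) -> Tuple[float, Counter]:
    """Return (zero_result_rate, click counts keyed by (query, doc_id))."""
    total = 0
    zero_results = 0
    click_counts: Counter = Counter()
    for event in events:
        total += 1
        if event["num_results"] == 0:
            zero_results += 1
        # Normalize the query so trivially different spellings pool together.
        query = event["query"].strip().lower()
        for doc_id in event["clicked_doc_ids"]:
            click_counts[(query, doc_id)] += 1
    rate = zero_results / total if total else 0.0
    return rate, click_counts


log = [
    {"query": "SSO login failure", "num_results": 8, "clicked_doc_ids": ["kb-12"]},
    {"query": "sso login failure ", "num_results": 8, "clicked_doc_ids": ["kb-12", "kb-40"]},
    {"query": "customer cannot log in", "num_results": 0, "clicked_doc_ids": []},
    {"query": "plan comparison", "num_results": 3, "clicked_doc_ids": []},
]
zero_rate, weak_labels = summarize_search_logs(log)
```

Pairs with repeated clicks are stronger weak positives than single clicks, so the counts can be thresholded before they are used as training or evaluation labels.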

import numpy as np
from typing import List


def ndcg_at_k(relevance_scores: List[float], k: int) -> float:
    """
    Compute NDCG@k for a single query.

    Args:
        relevance_scores: relevance grades of all judged documents for the
            query, in the ranked order returned by the system
        k: cutoff position
    """
    top_k_scores = relevance_scores[:k]
    if not top_k_scores:
        return 0.0

    # DCG: discount later positions (i is 0-based, so rank = i + 1)
    dcg = sum(
        (2 ** r - 1) / np.log2(i + 2)
        for i, r in enumerate(top_k_scores)
    )

    # Ideal DCG: best possible ordering of all judged documents, cut at k.
    # Sort before truncating: if the system buried a highly relevant document
    # below rank k, the ideal ranking must still include it.
    ideal_scores = sorted(relevance_scores, reverse=True)[:k]
    idcg = sum(
        (2 ** r - 1) / np.log2(i + 2)
        for i, r in enumerate(ideal_scores)
    )

    return float(dcg / idcg) if idcg > 0 else 0.0


def mean_reciprocal_rank(results_with_relevance: List[List[float]]) -> float:
    """
    Mean Reciprocal Rank across multiple queries.
    MRR = mean(1/rank_of_first_relevant_result).
    """
    rr_scores = []
    for relevance_list in results_with_relevance:
        for rank, rel in enumerate(relevance_list, start=1):
            if rel > 0:
                rr_scores.append(1.0 / rank)
                break
        else:
            rr_scores.append(0.0)
    # Guard against an empty query set, where np.mean would return nan.
    return float(np.mean(rr_scores)) if rr_scores else 0.0

Learning to Rank (Future Direction)

Once you have labeled data (from the evaluation set + accumulated click logs), the system can graduate to a Learning to Rank model:

LambdaMART: Gradient boosted trees trained with LambdaRank objective. Takes a feature vector per (query, document) pair and outputs a relevance score. Features: BM25 score, dense similarity score, cross-encoder score, document recency, document popularity, query-document click rate.

LambdaMART learns optimal weights for combining these signals, outperforming hand-tuned RRF weights for complex queries. The downside: it requires labeled training data (relevance judgments), which takes time to accumulate.

Common Mistakes

danger

Mistake: Deploying dense retrieval alone and removing BM25.

Dense retrieval excels at semantic similarity but fails at exact match. A user searching for a specific error code ("ERR_SSL_PROTOCOL_ERROR") gets better results from BM25 (exact token match) than from dense retrieval (semantic similarity to vague concepts). Hybrid search consistently outperforms either approach alone. Never remove BM25 entirely.

warning

Mistake: Encoding full-length documents as a single vector.

Most bi-encoders are trained on query-passage pairs where queries are short (5-10 tokens) and passages are longer (100-200 tokens). If you encode full documents (500-2000 tokens) at indexing time, the representation will be poor: the model was not trained to encode long texts as a single vector, and text beyond its maximum sequence length is silently truncated. Always chunk documents before indexing, and retrieve chunks rather than full documents.
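
A minimal chunking sketch follows, splitting on whitespace with a sliding window. The helper `chunk_document`, the 150-word window, the 30-word overlap, and the `#chunkN` ID scheme are assumptions; word counts only approximate token counts, so a production version would measure length with the model's tokenizer.

```python
from typing import Dict, List


def chunk_document(
    doc: Dict[str, str], max_words: int = 150, overlap: int = 30
) -> List[Dict[str, str]]:
    """Split a document body into overlapping word windows."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = doc["body"].split()
    step = max_words - overlap
    chunks: List[Dict[str, str]] = []
    for start in range(0, max(len(words), 1), step):
        window = words[start:start + max_words]
        if not window:
            break  # empty body
        chunks.append(
            {
                "id": f"{doc['id']}#chunk{len(chunks)}",
                "doc_id": doc["id"],  # parent ID, for grouping results later
                "title": doc["title"],
                "body": " ".join(window),
            }
        )
        if start + max_words >= len(words):
            break  # this window already reached the end of the document
    return chunks


long_doc = {
    "id": "kb-7",
    "title": "SSO setup guide",
    "body": " ".join(f"w{i}" for i in range(400)),
}
chunks = chunk_document(long_doc)
```

Each chunk keeps the `id`, `title`, and `body` keys, so the list can be indexed like ordinary documents, and `doc_id` lets the result list collapse several matching chunks back into one parent document.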

tip

Tip: Implement search result monitoring before optimization.

Before changing anything, set up monitoring: query logs, click rates per position, zero-result rate, search abandonment rate, time-to-first-click. This establishes baselines and lets you measure the impact of each change. The most impactful improvements are often invisible without instrumentation.

Interview Q&A

Q: How would you improve a BM25-only search system to handle semantic queries?

A: I would take a layered approach, adding neural components on top of the existing BM25 baseline rather than replacing it. First, quick wins with BM25 itself: field weighting (title matches worth more than body), synonym expansion using a domain-specific synonym dictionary, and phrase matching boosts. These typically move satisfaction by 10-15 percentage points with low engineering cost. Second, add dense retrieval: deploy a bi-encoder (Sentence Transformers, pre-trained on MS MARCO) to produce document embeddings indexed in FAISS. At query time, embed the query and retrieve the top-50 candidates by cosine similarity. Combine with BM25's top-50 using Reciprocal Rank Fusion. Third, add cross-encoder reranking on the top-20 RRF results. The cross-encoder processes query and document jointly, producing much more accurate relevance scores than bi-encoder similarity. This three-stage system (BM25 + dense retrieval + cross-encoder reranking) typically achieves 25-35 percentage point improvements in NDCG@5 over BM25 alone.

Q: What is NDCG and why is it the standard metric for search evaluation?

A: NDCG (Normalized Discounted Cumulative Gain) measures ranking quality when documents have graded relevance scores (0-3, not just binary relevant/irrelevant). It applies position discounts - a relevant document at rank 1 is worth more than the same document at rank 5. The "normalized" part divides by the ideal DCG (the score if all relevant documents were shown at the top), making it comparable across queries with different numbers of relevant documents. NDCG is preferred over precision@K because it: (1) accounts for graded relevance - "very relevant" should count more than "somewhat relevant"; (2) accounts for position - showing relevant documents at rank 1 vs rank 5 matters; (3) is normalized - allows averaging across queries with different relevance distributions. MRR (Mean Reciprocal Rank) is preferred when only finding any one relevant document matters (e.g., navigational queries), while NDCG is preferred when the full ranking quality matters.

Q: How do you handle queries for which your corpus has no relevant documents?

A: First, measure the scope: track the zero-result rate and the low-satisfaction rate. Understand whether the issue is vocabulary mismatch (documents exist but weren't retrieved) or a corpus gap (the information truly doesn't exist). For vocabulary mismatch: synonym expansion, dense retrieval, and query expansion typically fix this. For true corpus gaps: implement a "no results" experience that suggests related documents and captures the failed query for content gap analysis. Route zero-result and low-satisfaction queries to a content team for corpus improvement. Consider adding an LLM-backed fallback for questions where the corpus is incomplete - the LLM answers from its parametric knowledge with a disclaimer that the answer is not from the internal corpus. Track which queries consistently result in user dissatisfaction and prioritize them for content creation.
