Search and Retrieval Systems
From 40% to 72% User Satisfaction: Rebuilding Search with Neural Retrieval
The support ticket came from the head of product at a mid-size B2B SaaS company. Their internal knowledge base had 80,000 documents - product documentation, support articles, API references, release notes, internal wikis. Their search system was Elasticsearch with BM25 ranking. User satisfaction with search was measured at 40% via post-search surveys: "Did you find what you were looking for?"
The failures were predictable in retrospect. A developer searching for "how to authenticate" got articles about "authentication errors" at rank 1, not the authentication setup guide. A support engineer searching for "customer cannot log in" got zero results - the relevant articles all used the phrase "SSO login failure." A product manager searching for "pricing tier comparison" got an article about "pricing calculator" when what they needed was the "plan comparison" page that never used the word "tier."
All three failures share one root cause: BM25 is a keyword matcher. It finds documents that contain the query terms and has no notion of synonyms, paraphrase, semantic equivalence, or intent. A user who types "cannot log in" and a document that says "SSO login failure" are using different vocabularies, so BM25 ranks the match poorly or misses it entirely.
The engineering challenge: Elasticsearch is deeply embedded in their infrastructure. They cannot rip it out. The solution must layer neural understanding on top of the existing BM25 baseline, not replace it. The new system needs to go from 40% to at least 65% satisfaction within three months, using a team of three engineers, with a latency SLA of 300ms.
This case study covers the full redesign, from architecture through evaluation.
Requirements Analysis
Functional requirements:
- Full-text search over 80K documents
- Real-time indexing of new documents (within 5 minutes of publication)
- Support for filters (document type, team, date range)
- Ranked results with snippets highlighting relevant passages
- Spell correction and query completion
Non-functional requirements:
- Latency: 300ms p99 end-to-end
- Relevance: user satisfaction rate above 65% (measured via survey)
- Scale: 10K queries per day, growing to 100K
- Index freshness: new documents searchable within 5 minutes
Constraints:
- Must retain Elasticsearch - too much existing infrastructure depends on it
- No labeled relevance judgments exist - must bootstrap evaluation
- Team of 3 engineers, 3-month timeline
System Architecture
Component 1: BM25 Baseline (Keep and Improve)
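BM25 is the TF-IDF-based ranking function that Elasticsearch uses by default. The BM25 score for document $D$ given query $Q$ with terms $q_1, \dots, q_n$:

$$
\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}
$$

where $f(q_i, D)$ is the frequency of term $q_i$ in $D$, $|D|$ is the length of $D$, $\text{avgdl}$ is the average document length in the corpus, $k_1$ controls term frequency saturation, and $b$ controls length normalization.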
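To make the saturation and length-normalization behavior concrete, here is a minimal pure-Python sketch of BM25 scoring for a single document. It is illustrative only, not what Elasticsearch runs internally: the function name, the pre-tokenized inputs, and the `doc_freq` mapping are assumptions, and the IDF shown is one common smoothed variant. The defaults `k1=1.2` and `b=0.75` match Elasticsearch's default BM25 settings.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_score(
    query_terms: List[str],
    doc_terms: List[str],
    doc_freq: Dict[str, int],   # term -> number of documents containing it
    num_docs: int,              # total documents in the corpus
    avg_doc_len: float,         # average document length in tokens
    k1: float = 1.2,
    b: float = 0.75,
) -> float:
    """Score one tokenized document against a tokenized query (sketch)."""
    tf = Counter(doc_terms)
    # Length normalization: > 1 for long documents, < 1 for short ones
    norm = 1 - b + b * len(doc_terms) / avg_doc_len
    score = 0.0
    for term in query_terms:
        f = tf.get(term, 0)
        if f == 0:
            continue  # terms absent from the document contribute nothing
        n = doc_freq.get(term, 0)
        # Smoothed IDF (assumed variant); rare terms get higher weight
        idf = math.log(1 + (num_docs - n + 0.5) / (n + 0.5))
        # Term-frequency component saturates as f grows (controlled by k1)
        score += idf * f * (k1 + 1) / (f + k1 * norm)
    return score
```

With the same term frequency, a short document outscores a long one, and ten occurrences of a term score far less than ten times one occurrence. A query term that never appears in the document contributes zero, which is exactly the vocabulary-mismatch failure described above.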
BM25 improvements before adding neural search:
- Field weighting: title matches should count more than body matches. In Elasticsearch, set field boosts: title^3, headings^2, body^1.
- Synonyms: add a custom synonym filter to the Elasticsearch analyzer. "authenticate" expands to "authenticate, auth, login, sign in". This directly fixes the vocabulary mismatch for known synonym pairs.
- Phrase queries: add a phrase match boost so exact phrase matches in the title rank higher.
These improvements alone typically move satisfaction from 40% to 50-55%. They are the cheapest wins and should be done first.
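The three BM25 improvements can be sketched as Elasticsearch request bodies, written here as Python dictionaries. The names `kb_synonyms`, `kb_analyzer`, and `build_bm25_query`, and the phrase boost of 2.0, are assumptions for illustration; only the field names and boosts come from the text above. Multi-word synonyms such as "sign in" are usually handled with the `synonym_graph` filter in a search-time analyzer, so check this against your Elasticsearch version before adopting it.

```python
from typing import Any, Dict

# Index settings: a custom analyzer with a synonym filter (names are hypothetical)
INDEX_SETTINGS: Dict[str, Any] = {
    "settings": {
        "analysis": {
            "filter": {
                "kb_synonyms": {
                    "type": "synonym",
                    # One line per group of equivalent terms
                    "synonyms": ["authenticate, auth, login, sign in"],
                }
            },
            "analyzer": {
                "kb_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "kb_synonyms"],
                }
            },
        }
    }
}


def build_bm25_query(user_query: str) -> Dict[str, Any]:
    """Build a query body with field boosts and a title phrase boost."""
    return {
        "query": {
            "bool": {
                # Field weighting: title^3, headings^2, body^1
                "must": {
                    "multi_match": {
                        "query": user_query,
                        "fields": ["title^3", "headings^2", "body"],
                    }
                },
                # Exact phrase matches in the title rank higher (boost is assumed)
                "should": {
                    "match_phrase": {
                        "title": {"query": user_query, "boost": 2.0}
                    }
                },
            }
        }
    }
```

The `should` clause only adds score, so documents without a phrase match in the title are still returned; they just rank lower than documents that have one.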
Component 2: Dense Retrieval
Dense retrieval uses a bi-encoder: a neural network that encodes queries and documents into a shared embedding space. Similar documents and queries end up close together even when they use different words.
Model selection: For a team of three with a 3-month timeline, fine-tuning a large model from scratch is impractical. Use a pre-trained bi-encoder from the Sentence Transformers library and fine-tune it on domain-specific data.
Recommended starting points:
- msmarco-roberta-base-v2: trained on MS MARCO passage retrieval, good general-purpose baseline
- bge-base-en-v1.5: strong BEIR benchmark performance, efficient for self-hosting
- voyage-2 (API): highest quality, no infrastructure management

Domain adaptation without labeled data: Use the existing BM25 results to generate pseudo-labeled training data. For each query, BM25's top-3 results are pseudo-positive documents; BM25's rank 50-100 results are hard pseudo-negatives. Fine-tune the bi-encoder on this pseudo-labeled data. This is a form of weak supervision. It is related in spirit to GPL (Generative Pseudo Labeling), which goes further by generating synthetic queries and scoring pairs with a cross-encoder. Because the labels inherit BM25's blind spots, validate the fine-tuned model on a held-out evaluation set before deploying it.
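The pseudo-labeling step can be sketched as a small function that turns logged BM25 rankings into (query, positive, negative) triplets. This is a sketch under stated assumptions: the function name, the `bm25_rankings` and `doc_texts` input shapes, and sampling one random negative per positive are illustrative choices, not part of the original design.

```python
import random
from typing import Dict, List, Tuple


def build_pseudo_triplets(
    bm25_rankings: Dict[str, List[str]],  # query -> doc ids, best first
    doc_texts: Dict[str, str],            # doc id -> text
    num_positives: int = 3,
    negative_range: Tuple[int, int] = (50, 100),  # 1-based ranks, inclusive
    seed: int = 0,
) -> List[Tuple[str, str, str]]:
    """Build (query, positive_text, negative_text) triplets from BM25 output."""
    rng = random.Random(seed)
    start, end = negative_range
    triplets = []
    for query, ranked_ids in bm25_rankings.items():
        positives = ranked_ids[:num_positives]
        negatives = ranked_ids[start - 1:end]  # convert 1-based ranks to slice
        if not negatives:
            continue  # too few results to mine hard negatives for this query
        for pos_id in positives:
            neg_id = rng.choice(negatives)
            triplets.append((query, doc_texts[pos_id], doc_texts[neg_id]))
    return triplets
```

The output has the shape that a triplet-style fine-tuning routine expects. Queries with fewer than 50 BM25 results are skipped rather than paired with near-top results, since those are likely to be relevant and would make poor negatives.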
```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
from typing import List, Optional, Tuple
import faiss
import numpy as np


class NeuralSearchEngine:
    def __init__(
        self,
        model_name: str = "msmarco-roberta-base-v2",
        embedding_dim: int = 768,
        index_path: Optional[str] = None,
    ):
        self.model = SentenceTransformer(model_name)
        self.embedding_dim = embedding_dim
        self.doc_ids = []
        self.doc_texts = []

        # HNSW index for fast approximate search - good for 80K docs
        if index_path:
            # FAISS stores only vectors; doc_ids must be restored separately
            self.index = faiss.read_index(index_path)
        else:
            # Inner-product metric: with normalized embeddings this equals
            # cosine similarity (the default metric would be L2 distance)
            self.index = faiss.IndexHNSWFlat(
                embedding_dim, 32, faiss.METRIC_INNER_PRODUCT
            )  # M=32
            self.index.hnsw.efConstruction = 200  # higher = better recall
            self.index.hnsw.efSearch = 50  # higher = better recall, slower

    def index_documents(self, documents: List[dict], batch_size: int = 128):
        """Encode and index documents, appending to any already indexed."""
        texts = [f"{doc['title']} {doc['body']}" for doc in documents]
        # Extend rather than overwrite so ids stay aligned with FAISS row numbers
        self.doc_ids.extend(doc["id"] for doc in documents)
        self.doc_texts.extend(texts)

        print(f"Encoding {len(texts)} documents...")
        embeddings = self.model.encode(
            texts,
            batch_size=batch_size,
            show_progress_bar=True,
            normalize_embeddings=True,  # cosine similarity via dot product
        )
        self.index.add(embeddings.astype(np.float32))

    def retrieve(self, query: str, top_k: int = 50) -> List[Tuple[str, float]]:
        """Retrieve top-k documents by dense similarity (higher = more similar)."""
        query_embedding = self.model.encode(
            [query], normalize_embeddings=True
        ).astype(np.float32)
        scores, indices = self.index.search(query_embedding, top_k)
        return [
            (self.doc_ids[idx], float(score))
            for idx, score in zip(indices[0], scores[0])
            if idx >= 0  # HNSW returns -1 for padded results
        ]

    def fine_tune_on_pseudo_labels(
        self,
        pseudo_labeled_pairs: List[Tuple[str, str, str]],
        # [(query, positive_doc_text, negative_doc_text), ...]
        epochs: int = 3,
    ):
        """Fine-tune the bi-encoder on pseudo-labeled training data."""
        # TripletLoss uses the (anchor, positive, negative) order; no label needed
        examples = [
            InputExample(texts=[q, pos, neg])
            for q, pos, neg in pseudo_labeled_pairs
        ]
        loader = DataLoader(examples, batch_size=16, shuffle=True)
        loss = losses.TripletLoss(self.model)
        self.model.fit(
            train_objectives=[(loader, loss)],
            epochs=epochs,
            warmup_steps=100,
            output_path="./fine_tuned_model",
        )
        # Embeddings change after fine-tuning: rebuild the index before serving
```
Component 3: Query Understanding
Query understanding transforms the raw user query before retrieval:
Spell correction: Use a character-level language model or the SymSpell library. Essential for technical queries where users misspell product-specific terms.
Query expansion: Append synonyms from a domain-specific synonym dictionary. "auth" expands to "auth OR authentication OR authorization OR SSO." Keep the original query as the primary signal; expansion is additive.
Intent classification: Classify queries into: navigational (user wants a specific page), informational (user wants to learn something), troubleshooting (user has a problem). Different intents may benefit from different ranking weights.
```python
from transformers import pipeline
from typing import List, Optional
import re


class QueryUnderstanding:
    def __init__(self, synonym_dict: Optional[dict] = None):
        self.synonym_dict = synonym_dict or {}
        # Placeholder: this checkpoint is a sentiment model, not an intent model.
        # It marks where a fine-tuned intent classifier would be loaded;
        # _classify_intent below uses a keyword heuristic instead.
        self.intent_classifier = pipeline(
            "text-classification",
            model="distilbert-base-uncased-finetuned-sst-2-english",  # placeholder
        )

    def process(self, raw_query: str) -> dict:
        """Full query understanding pipeline."""
        cleaned = raw_query.strip().lower()
        spell_corrected = self._spell_correct(cleaned)
        expanded_terms = self._expand_synonyms(spell_corrected)
        intent = self._classify_intent(spell_corrected)
        return {
            "original": raw_query,
            "cleaned": spell_corrected,
            "expanded_terms": expanded_terms,
            "intent": intent,
            "retrieval_query": spell_corrected,  # base query for retrieval
            "bm25_boost": expanded_terms,  # additional terms for BM25
        }

    def _expand_synonyms(self, query: str) -> List[str]:
        """Expand query terms with domain synonyms."""
        expanded = []
        for term in query.split():
            if term in self.synonym_dict:
                expanded.extend(self.synonym_dict[term])
        # dict.fromkeys removes duplicates while keeping a stable order
        return list(dict.fromkeys(expanded))

    def _spell_correct(self, query: str) -> str:
        # Placeholder - use SymSpell or a custom domain spell checker
        return query

    def _classify_intent(self, query: str) -> str:
        # Simple heuristic - in production, use a fine-tuned classifier.
        # "how-to" is a finer-grained informational label; navigational
        # detection is not implemented in this heuristic.
        if any(w in query for w in ["error", "fail", "broken", "cannot", "not working"]):
            return "troubleshooting"
        if re.match(r"^how (to|do)", query):
            return "how-to"
        return "informational"
```
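The spell-correction step is left as a placeholder above. As one hedged way to fill it without extra dependencies, the sketch below swaps in Python's standard-library `difflib` in place of SymSpell or a character-level language model. It corrects each query term against a domain vocabulary (for example, terms harvested from document titles). The class name, the `cutoff` of 0.8, and the rule of skipping terms shorter than four characters are assumptions for illustration.

```python
import difflib
from typing import Iterable, List


class VocabularySpellCorrector:
    """Correct query terms against a domain vocabulary using difflib (sketch)."""

    def __init__(self, vocabulary: Iterable[str], cutoff: float = 0.8):
        self.vocabulary: List[str] = sorted({w.lower() for w in vocabulary})
        self._known = set(self.vocabulary)
        self.cutoff = cutoff  # minimum similarity ratio to accept a correction

    def correct(self, query: str) -> str:
        corrected = []
        for term in query.lower().split():
            # Leave known terms and very short tokens untouched
            if term in self._known or len(term) < 4:
                corrected.append(term)
                continue
            matches = difflib.get_close_matches(
                term, self.vocabulary, n=1, cutoff=self.cutoff
            )
            # Fall back to the original term when nothing is close enough
            corrected.append(matches[0] if matches else term)
        return " ".join(corrected)
```

A conservative cutoff matters here: an over-eager corrector that rewrites valid but rare product terms does more damage than the typos it fixes. `difflib` scans the whole vocabulary per term, so a large vocabulary would call for SymSpell or a similar indexed approach.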
Component 4: Cross-Encoder Reranking
The cross-encoder is the highest-quality but most expensive component. It processes query and document together (no precomputation), allowing deep interaction modeling.
Applied to the top-20 candidates from Reciprocal Rank Fusion (RRF, see Component 5), the cross-encoder produces a relevance score that is typically more accurate than bi-encoder similarity. Reranking cost scales with the number of candidates, not with the 80K-document corpus: scoring 20 candidates takes roughly 50-80ms on a GPU, which fits within the 300ms latency budget.
```python
from sentence_transformers import CrossEncoder
from typing import List


class CrossEncoderReranker:
    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        """
        ms-marco-MiniLM-L-6-v2: fast (6-layer MiniLM), good quality.
        cross-encoder/ms-marco-electra-base: slower, higher quality.
        """
        # max_length is in tokens; longer query+document pairs are truncated
        self.model = CrossEncoder(model_name, max_length=512)

    def rerank(
        self,
        query: str,
        candidates: List[dict],
        top_k: int = 10,
    ) -> List[dict]:
        """Rerank candidates using cross-encoder, return top_k."""
        # Pass the full text and let the tokenizer truncate to max_length;
        # slicing to 512 characters would discard most of the usable context
        pairs = [(query, doc["text"]) for doc in candidates]
        scores = self.model.predict(pairs, show_progress_bar=False)

        # Note: adds a "rerank_score" key to each candidate dict in place
        for doc, score in zip(candidates, scores):
            doc["rerank_score"] = float(score)

        reranked = sorted(candidates, key=lambda x: x["rerank_score"], reverse=True)
        return reranked[:top_k]
```
Component 5: Hybrid Search with RRF
Combining BM25 and dense retrieval:
```python
from typing import List

# Assumes NeuralSearchEngine and CrossEncoderReranker from the sections above.
# reciprocal_rank_fusion is assumed to be defined separately; it takes
# best-first lists of (doc_id, score) and returns fused (doc_id, score) pairs.


def hybrid_search(
    query_processed: dict,
    bm25_retriever,  # thin wrapper around the Elasticsearch client
    dense_retriever: NeuralSearchEngine,
    reranker: CrossEncoderReranker,
    bm25_top_k: int = 50,
    dense_top_k: int = 50,
    final_k: int = 10,
) -> List[dict]:
    """Full hybrid search pipeline."""
    # BM25 retrieval (with synonym expansion)
    bm25_query = query_processed["retrieval_query"]
    if query_processed["bm25_boost"]:
        bm25_query += " " + " ".join(query_processed["bm25_boost"])
    bm25_results = bm25_retriever.search(bm25_query, top_k=bm25_top_k)

    # Dense retrieval
    dense_results = dense_retriever.retrieve(
        query_processed["retrieval_query"],
        top_k=dense_top_k,
    )

    # RRF fusion
    fused = reciprocal_rank_fusion(
        [
            [(doc["id"], doc["score"]) for doc in bm25_results],
            dense_results,
        ],
        k=60,
        weights=[0.4, 0.6],  # dense slightly higher for semantic-heavy corpus
    )

    # Get top-20 candidates with full text
    top_20_ids = [doc_id for doc_id, _ in fused[:20]]
    candidates = bm25_retriever.get_documents_by_ids(top_20_ids)

    # Cross-encoder reranking
    reranked = reranker.rerank(
        query_processed["retrieval_query"], candidates, top_k=final_k
    )
    return reranked
```
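The pipeline above calls `reciprocal_rank_fusion` without defining it. A minimal sketch follows, assuming the signature implied by that call: each input list holds (doc_id, score) pairs ordered best first, and each document's fused score is the weighted sum of 1 / (k + rank) over the lists that contain it. Only ranks are used, so BM25 scores and dense similarities never need to be put on a common scale. The weighted form is an extension of plain RRF, which corresponds to all weights equal to 1.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple


def reciprocal_rank_fusion(
    ranked_lists: Sequence[Sequence[Tuple[str, float]]],
    k: int = 60,
    weights: Optional[Sequence[float]] = None,
) -> List[Tuple[str, float]]:
    """Fuse best-first ranked lists into one list of (doc_id, fused_score)."""
    if weights is None:
        weights = [1.0] * len(ranked_lists)
    if len(weights) != len(ranked_lists):
        raise ValueError("weights must have one entry per ranked list")

    fused: Dict[str, float] = defaultdict(float)
    for weight, ranking in zip(weights, ranked_lists):
        # Raw retriever scores are ignored; only the 1-based rank matters
        for rank, (doc_id, _score) in enumerate(ranking, start=1):
            fused[doc_id] += weight / (k + rank)

    # Highest fused score first
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

A document that appears in both lists accumulates two contributions, so agreement between BM25 and dense retrieval is rewarded even when neither ranks the document first. The constant k dampens the gap between adjacent ranks; 60 is the conventional choice.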
Evaluation: Building a Test Set Without Labels
The hardest part of this case study: there are no relevance judgments. How do you measure progress?
Step 1: Collect implicit feedback labels. Log every search query and every document click. A query $q$ with a click on document $d$ is a weak positive label for the pair $(q, d)$. This is noisy but abundant.
Step 2: Build a curated evaluation set. Sample 200 representative queries. Have 3 domain experts manually rate the top-5 results from the baseline system on a 4-point scale (0 = irrelevant, 3 = highly relevant). This gives you an NDCG@5 baseline.
Step 3: Measure NDCG@5 and MRR on the curated set.
Step 4: User satisfaction survey. After each search, show a thumbs up/thumbs down prompt. Track satisfaction rate weekly as the primary business metric.
Step 5: Zero-result rate. What fraction of queries return zero results? This is a direct failure signal.
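Steps 1 and 5 can both be computed from the same query log. The sketch below assumes a hypothetical log schema in which each event is a dictionary with `query`, `num_results`, and an optional `clicked_doc_ids` list; the `min_clicks` threshold is likewise an assumption, included to filter out one-off accidental clicks.

```python
from collections import Counter
from typing import List, Tuple


def summarize_search_log(
    events: List[dict], min_clicks: int = 2
) -> Tuple[float, List[Tuple[str, str, int]]]:
    """Return (zero_result_rate, weak positive labels) from search log events."""
    if not events:
        return 0.0, []

    # Step 5: fraction of searches that returned nothing
    zero_results = sum(1 for e in events if e["num_results"] == 0)

    # Step 1: count clicks per (normalized query, document) pair
    click_counts: Counter = Counter()
    for e in events:
        query = e["query"].strip().lower()
        for doc_id in e.get("clicked_doc_ids", []):
            click_counts[(query, doc_id)] += 1

    # Keep only pairs clicked at least min_clicks times, most clicked first
    weak_labels = [
        (query, doc_id, n)
        for (query, doc_id), n in click_counts.most_common()
        if n >= min_clicks
    ]
    return zero_results / len(events), weak_labels
```

Clicks are biased toward top positions, so these labels are best treated as training signal or as a sanity check, not as a replacement for the curated set in Step 2.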
```python
import numpy as np
from typing import List


def ndcg_at_k(relevance_scores: List[float], k: int) -> float:
    """
    Compute NDCG@k for a single query.

    Args:
        relevance_scores: list of relevance grades in the ranked order returned by system
        k: cutoff position

    Note: the ideal ordering is built from the grades passed in, so include
    every judged document for the query, not only the top k. Otherwise the
    ideal DCG is too low and the score is inflated.
    """
    if not relevance_scores or k <= 0:
        return 0.0

    def dcg(scores: List[float]) -> float:
        # Exponential gain, logarithmic position discount
        return sum((2 ** r - 1) / np.log2(i + 2) for i, r in enumerate(scores))

    # DCG: discount later positions
    actual = dcg(relevance_scores[:k])
    # Ideal DCG: best possible ordering of all judged documents, cut at k
    ideal = dcg(sorted(relevance_scores, reverse=True)[:k])
    return float(actual / ideal) if ideal > 0 else 0.0


def mean_reciprocal_rank(results_with_relevance: List[List[float]]) -> float:
    """
    Mean Reciprocal Rank across multiple queries.
    MRR = mean(1/rank_of_first_relevant_result).
    """
    rr_scores = []
    for relevance_list in results_with_relevance:
        for rank, rel in enumerate(relevance_list, start=1):
            if rel > 0:
                rr_scores.append(1.0 / rank)
                break
        else:
            # No relevant result for this query
            rr_scores.append(0.0)
    # Guard against an empty query set (np.mean([]) would return nan)
    return float(np.mean(rr_scores)) if rr_scores else 0.0
```
Learning to Rank (Future Direction)
Once you have labeled data (from the evaluation set + accumulated click logs), the system can graduate to a Learning to Rank model:
LambdaMART: Gradient boosted trees trained with LambdaRank objective. Takes a feature vector per (query, document) pair and outputs a relevance score. Features: BM25 score, dense similarity score, cross-encoder score, document recency, document popularity, query-document click rate.
LambdaMART learns optimal weights for combining these signals, outperforming hand-tuned RRF weights for complex queries. The downside: it requires labeled training data (relevance judgments), which takes time to accumulate.
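Ranking libraries generally expect training data as a feature matrix, a label vector, and a list of per-query group sizes, with rows for the same query stored contiguously. The sketch below builds that layout from judged (query, document) rows. The input schema, the feature key names, and the function name are assumptions for illustration; the features themselves are the ones listed above.

```python
from typing import Dict, List, Tuple

# Feature order used for every row (key names are hypothetical)
FEATURE_NAMES = [
    "bm25_score",
    "dense_score",
    "cross_encoder_score",
    "recency_days",
    "popularity",
    "click_rate",
]


def build_ltr_dataset(
    judgments: List[dict],
) -> Tuple[List[List[float]], List[int], List[int]]:
    """Return (X, y, group) with rows grouped contiguously by query."""
    by_query: Dict[str, List[dict]] = {}
    for row in judgments:
        by_query.setdefault(row["query"], []).append(row)

    X: List[List[float]] = []
    y: List[int] = []
    group: List[int] = []
    for rows in by_query.values():
        for row in rows:
            X.append([float(row["features"][name]) for name in FEATURE_NAMES])
            y.append(int(row["grade"]))  # graded relevance, e.g. 0-3
        group.append(len(rows))  # number of candidate documents for this query
    return X, y, group

# With LightGBM installed, a LambdaMART-style model could then be trained with
# something like: lightgbm.LGBMRanker(objective="lambdarank").fit(X, y, group=group)
```

The group sizes tell the ranker which rows compete with each other; a row is only ever compared against other candidates for the same query.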
Common Mistakes
Mistake: Deploying dense retrieval alone and removing BM25.
Dense retrieval excels at semantic similarity but fails at exact match. A user searching for a specific error code ("ERR_SSL_PROTOCOL_ERROR") gets better results from BM25 (exact token match) than from dense retrieval (semantic similarity to vague concepts). Hybrid search consistently outperforms either approach alone. Never remove BM25 entirely.
Mistake: Encoding full-length documents as a single vector.
Most bi-encoders are trained on query-passage pairs where queries are short (5-10 tokens) and passages are longer (100-200 tokens). If you encode full documents (500-2000 tokens) at indexing time, the representation will be poor: the model was not trained to encode long texts as a single vector, and any text beyond the model's maximum sequence length is silently truncated. Always chunk documents before indexing, and retrieve chunks, not full documents.
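A minimal chunking sketch, assuming documents are dictionaries with `id`, `title`, and `body` keys. It splits on whitespace with a sliding window; the window of 150 words, the overlap of 30, and the `#chunk` id suffix are assumptions chosen to stay near the passage lengths mentioned above. Splitting on headings or paragraphs is often better when the document structure is available.

```python
from typing import Dict, List


def chunk_document(
    doc: Dict[str, str],
    chunk_words: int = 150,
    overlap_words: int = 30,
) -> List[Dict[str, str]]:
    """Split a document body into overlapping word-window chunks."""
    if overlap_words >= chunk_words:
        raise ValueError("overlap_words must be smaller than chunk_words")

    words = doc["body"].split()
    step = chunk_words - overlap_words
    chunks: List[Dict[str, str]] = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_words]
        chunks.append({
            "id": f"{doc['id']}#chunk{len(chunks)}",  # hypothetical id scheme
            "doc_id": doc["id"],       # lets results be grouped per document
            "title": doc["title"],     # repeat the title so each chunk has context
            "body": " ".join(window),
        })
        if start + chunk_words >= len(words):
            break  # this window already reached the end of the document
    return chunks
```

Each chunk keeps the parent `doc_id`, so search results can be collapsed back to one entry per document, for example by keeping the best-scoring chunk.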
Tip: Implement search result monitoring before optimization.
Before changing anything, set up monitoring: query logs, click rates per position, zero-result rate, search abandonment rate, time-to-first-click. This establishes baselines and lets you measure the impact of each change. The most impactful improvements are often invisible without instrumentation.
Interview Q&A
Q: How would you improve a BM25-only search system to handle semantic queries?
A: I would take a layered approach, adding neural components on top of the existing BM25 baseline rather than replacing it. First, quick wins with BM25 itself: field weighting (title matches worth more than body), synonym expansion using a domain-specific synonym dictionary, and phrase matching boosts. These typically move satisfaction by 10-15 percentage points with low engineering cost. Second, add dense retrieval: deploy a bi-encoder (Sentence Transformers, pre-trained on MS MARCO) to produce document embeddings indexed in FAISS. At query time, embed the query and retrieve the top-50 candidates by cosine similarity. Combine with BM25's top-50 using Reciprocal Rank Fusion. Third, add cross-encoder reranking on the top-20 RRF results. The cross-encoder processes query and document jointly, producing much more accurate relevance scores than bi-encoder similarity. This three-stage system (BM25 + dense retrieval + cross-encoder reranking) typically achieves 25-35 percentage point improvements in NDCG@5 over BM25 alone.
Q: What is NDCG and why is it the standard metric for search evaluation?
A: NDCG (Normalized Discounted Cumulative Gain) measures ranking quality when documents have graded relevance scores (0-3, not just binary relevant/irrelevant). It applies position discounts - a relevant document at rank 1 is worth more than the same document at rank 5. The "normalized" part divides by the ideal DCG (the score if all relevant documents were shown at the top), making it comparable across queries with different numbers of relevant documents. NDCG is preferred over precision@K because it: (1) accounts for graded relevance - "very relevant" should count more than "somewhat relevant"; (2) accounts for position - showing relevant documents at rank 1 vs rank 5 matters; (3) is normalized - allows averaging across queries with different relevance distributions. MRR (Mean Reciprocal Rank) is preferred when only finding any one relevant document matters (e.g., navigational queries), while NDCG is preferred when the full ranking quality matters.
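A small worked example makes the definition concrete. Assume a query whose top three results were graded 3, 0, and 2 in ranked order, with gain $2^{r} - 1$ and discount $\log_2(i + 1)$ at position $i$ (the numbers are illustrative, not from the case study):

$$
\text{DCG@3} = \frac{2^3 - 1}{\log_2 2} + \frac{2^0 - 1}{\log_2 3} + \frac{2^2 - 1}{\log_2 4} = 7 + 0 + 1.5 = 8.5
$$

$$
\text{IDCG@3} = \frac{2^3 - 1}{\log_2 2} + \frac{2^2 - 1}{\log_2 3} + \frac{2^0 - 1}{\log_2 4} \approx 7 + 1.893 + 0 = 8.893
$$

$$
\text{NDCG@3} = \frac{8.5}{8.893} \approx 0.956
$$

The only flaw in this ranking is that the grade-2 document sits at rank 3 instead of rank 2, and the score drops by about 4 points as a result.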
Q: How do you handle queries for which your corpus has no relevant documents?
A: First, measure the scope: track the zero-result rate and the low-satisfaction rate. Understand whether the issue is vocabulary mismatch (documents exist but weren't retrieved) or a corpus gap (the information truly doesn't exist). For vocabulary mismatch: synonym expansion, dense retrieval, and query expansion typically fix this. For true corpus gaps: implement a "no results" experience that suggests related documents and captures the failed query for content gap analysis. Route zero-result and low-satisfaction queries to a content team for corpus improvement. Consider adding an LLM-backed fallback for questions where the corpus is incomplete - the LLM answers from its parametric knowledge with a disclaimer that the answer is not from the internal corpus. Track which queries consistently result in user dissatisfaction and prioritize them for content creation.
