Designing a Search Ranking System

The Relevance Crisis

The engineering team at an e-commerce platform runs the numbers and discovers something alarming: 23% of searches return zero-click results. Users search, see the results page, and leave without clicking anything. A further 18% of searches result in a click on the first result followed immediately by a back-navigation - the "pogo-sticking" signal that the result was wrong.

The current system uses Elasticsearch with BM25 ranking and a set of manually tuned boosts: recency boost, in-stock boost, category match boost. The boosts were set two years ago and have not been updated since. Product search quality has degraded as the catalog expanded from 2 million to 8 million SKUs without corresponding updates to the ranking logic.

The root causes are distinct. Zero-click results often involve queries where the correct answer exists in the catalog but BM25 fails to surface it - users query by synonym ("sneakers" vs "trainers"), attribute ("waterproof jacket" when the attribute is stored as "water-resistant"), or semantic concept ("good laptop for programming" when no product description contains all those words). Pogo-sticking results often involve queries where BM25 finds the right product family but ranks the wrong variant at the top - the most textually similar item is not the most relevant one.

The solution requires two things: better retrieval (getting the right item into the candidate set) and better ranking (ordering the candidates correctly). This lesson builds both.


Requirements

Functional requirements:

  • Given a text query, return a ranked list of items
  • Support the full query taxonomy: navigational (specific item), informational (category exploration), transactional (ready to buy)
  • Support spelling correction, synonym expansion, and multi-language queries

Non-functional requirements:

  • Serving latency: p99 under 150ms for a full search response
  • Freshness: new items indexed within 5 minutes
  • Scale: 5 million queries per day (roughly 60 queries per second average, 500 QPS peak)
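The scale figures can be sanity-checked with a few lines of arithmetic. This is only a back-of-envelope sketch; the function name `average_qps` is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_qps(queries_per_day: int) -> float:
    """Average queries per second, assuming traffic spread evenly over the day."""
    return queries_per_day / SECONDS_PER_DAY


avg = average_qps(5_000_000)
peak_to_avg = 500 / avg  # 500 QPS peak from the requirements

print(f"average QPS: {avg:.1f}")
print(f"peak / average ratio: {peak_to_avg:.1f}x")
```

The peak is several times the average, so capacity planning should target the peak figure rather than the daily mean.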

The Search Ranking Pipeline


Query Understanding

Before retrieving results, the query must be interpreted. Query understanding consists of three components: spell correction, intent classification, and query expansion.

Spell Correction

Noisy user queries ("snekaers", "laptpo") are corrected before retrieval. A simple edit-distance approach (Peter Norvig's spell corrector) works well for common misspellings. For a product catalog, supplement with a catalog-aware spell corrector: prefer corrections that produce known product names.

from collections import Counter
import math


class CatalogAwareSpellCorrector:
    """
    Spell corrector that prefers corrections matching catalog terms.
    Combines Norvig-style edit distance with catalog term frequency.
    """

    def __init__(self, catalog_terms: list):
        # Build a unigram frequency model over catalog terms
        self.catalog_vocab = Counter(catalog_terms)
        total = sum(self.catalog_vocab.values())
        self.log_probs = {
            term: math.log(count / total)
            for term, count in self.catalog_vocab.items()
        }

    def edits1(self, word: str) -> set:
        """All strings 1 edit away from word."""
        letters = "abcdefghijklmnopqrstuvwxyz"
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def correct(self, word: str) -> str:
        """Return most probable correction for word."""
        if word in self.catalog_vocab:
            return word  # already a known term

        candidates = self.edits1(word) & set(self.catalog_vocab.keys())
        if not candidates:
            return word  # no known correction

        # Pick candidate with highest catalog probability
        best = max(candidates, key=lambda c: self.log_probs.get(c, float("-inf")))
        return best

    def correct_query(self, query: str) -> str:
        """Correct all words in a query."""
        words = query.lower().split()
        corrected = [self.correct(w) for w in words]
        return " ".join(corrected)

Query Expansion

Synonym expansion addresses vocabulary mismatch. "Sneakers" and "trainers" refer to the same product type. Manually curated synonym dictionaries work for the most common cases; word embeddings (Word2Vec, GloVe trained on product descriptions) handle the long tail.

class QueryExpander:
    """
    Expand queries with synonyms and related terms.
    Injects synonyms as OR clauses in the Elasticsearch query.
    """

    SYNONYM_DICT = {
        "sneakers": ["trainers", "athletic shoes", "running shoes"],
        "couch": ["sofa", "settee", "loveseat"],
        "fridge": ["refrigerator", "freezer"],
        "laptop": ["notebook", "ultrabook"],
        "waterproof": ["water-resistant", "water-repellent", "weatherproof"],
    }

    def expand(self, query: str) -> list:
        """
        Return list of query variants including synonym expansions.
        Elasticsearch bool/should query takes all variants.
        """
        tokens = query.lower().split()
        variants = [query]

        for i, token in enumerate(tokens):
            for synonym in self.SYNONYM_DICT.get(token, []):
                # Replace only this token. str.replace on the raw query would
                # miss mixed-case input and rewrite substrings of other words.
                expanded = tokens[:i] + [synonym] + tokens[i + 1:]
                variants.append(" ".join(expanded))

        return variants
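For the long tail that a curated dictionary misses, the embedding approach mentioned above amounts to a nearest-neighbour lookup in vector space. The sketch below uses tiny hand-made 2-D vectors as stand-ins for Word2Vec or GloVe embeddings; the function name `nearest_terms` and the `min_sim` threshold are assumptions for illustration, not part of any library.

```python
import numpy as np


def nearest_terms(term, vocab, vectors, top_n=2, min_sim=0.6):
    """Return up to top_n vocabulary terms whose cosine similarity to `term` is >= min_sim."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(term)]
    order = np.argsort(-sims)  # most similar first
    neighbours = [vocab[i] for i in order if vocab[i] != term and sims[i] >= min_sim]
    return neighbours[:top_n]


# Toy vectors only: real embeddings would be trained on product descriptions
vocab = ["sneakers", "trainers", "sofa", "couch"]
vectors = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
])

print(nearest_terms("sneakers", vocab, vectors))
```

The similarity threshold matters in practice: set it too low and unrelated terms leak into the expanded query, which hurts precision.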

BM25 Sparse Retrieval

BM25 (Best Match 25) is the workhorse of keyword search. It scores documents by term frequency (how often query terms appear) weighted by inverse document frequency (rare terms score higher) and document length normalization.
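To make the scoring concrete, here is a minimal in-memory sketch of Okapi BM25 using the non-negative IDF variant. It is not how Elasticsearch implements it internally (that relies on an inverted index); `k1=1.2` and `b=0.75` are the usual defaults, and whitespace tokenization is a simplifying assumption.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: number of documents containing each term
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Longer-than-average documents are penalised via b
        length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


docs = [
    "red running sneakers",
    "waterproof hiking jacket",
    "leather office shoes for men",
]
match = bm25_scores("running sneakers", docs)
mismatch = bm25_scores("trainers", docs)
print(match)
print(mismatch)
```

The second query illustrates vocabulary mismatch: "trainers" shares no token with any document, so every score is zero even though the first product is a reasonable answer.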

from typing import Optional

from elasticsearch import Elasticsearch


class ElasticsearchRetriever:
    """
    BM25-based sparse retrieval via Elasticsearch.
    Handles query expansion with multi_match boost.
    """

    def __init__(self, es_host: str = "http://localhost:9200", index: str = "products"):
        # Recent client versions require the scheme (http/https) in the host URL
        self.es = Elasticsearch([es_host])
        self.index = index

    def search(
        self,
        query: str,
        expanded_queries: list,
        top_k: int = 1000,
        filters: Optional[dict] = None,
    ) -> list:
        """
        Multi-field BM25 search with expanded queries.
        Returns list of (doc_id, bm25_score) tuples.
        """
        # Build should clauses for expanded queries
        should_clauses = []
        for expanded_q in expanded_queries:
            should_clauses.append({
                "multi_match": {
                    "query": expanded_q,
                    "fields": [
                        "title^3",  # title match is worth 3x body match
                        "description",
                        "brand^2",
                        "category^1.5",
                        "tags",
                    ],
                    "type": "best_fields",
                    "operator": "or",
                }
            })

        filter_clauses = []
        if filters:
            if filters.get("in_stock"):
                filter_clauses.append({"term": {"in_stock": True}})
            # Compare against None so that an explicit price_max of 0 is respected
            if filters.get("price_max") is not None:
                filter_clauses.append({
                    "range": {"price": {"lte": filters["price_max"]}}
                })

        query_body = {
            "size": top_k,
            "query": {
                "bool": {
                    "should": should_clauses,
                    "filter": filter_clauses,
                    "minimum_should_match": 1,
                }
            },
            "_source": ["product_id", "title", "category", "price"],
        }

        response = self.es.search(index=self.index, body=query_body)
        return [
            (hit["_source"]["product_id"], hit["_score"])
            for hit in response["hits"]["hits"]
        ]

Dense Retrieval (Bi-Encoder)

Dense retrieval encodes both the query and documents as dense vectors, then retrieves by vector similarity. It captures semantic meaning that BM25 misses ("sneakers" and "trainers" will have similar embeddings even if they share no tokens).

import torch
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss


class DenseRetriever:
    """
    Bi-encoder dense retrieval: encode query and documents independently.
    Query encoding: online (real-time)
    Document encoding: offline (precomputed, indexed in FAISS)
    """

    def __init__(
        self,
        model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
        embedding_dim: int = 384,
    ):
        self.model = SentenceTransformer(model_name)
        self.embedding_dim = embedding_dim
        self.index = None
        self.doc_ids = None

    def index_catalog(self, products: list) -> None:
        """
        Pre-compute and index all product embeddings.
        Run this offline daily (or on product updates).
        """
        # Concatenate title + category + brand for richer document representation
        texts = [
            f"{p['title']} {p['category']} {p['brand']}"
            for p in products
        ]

        print(f"[Dense] Encoding {len(texts):,} products...")
        embeddings = self.model.encode(
            texts,
            batch_size=256,
            show_progress_bar=True,
            normalize_embeddings=True,  # L2 normalize for cosine similarity
        )

        self.doc_ids = [p["product_id"] for p in products]
        embeddings = embeddings.astype("float32")

        # Use HNSW index for better recall than IVF at moderate scale
        self.index = faiss.IndexHNSWFlat(self.embedding_dim, 32)
        self.index.hnsw.efConstruction = 200
        self.index.add(embeddings)
        print(f"[Dense] Indexed {self.index.ntotal:,} products")

    def search(
        self,
        query: str,
        top_k: int = 1000,
    ) -> list:
        """Encode query and retrieve top-K similar documents."""
        query_embedding = self.model.encode(
            [query],
            normalize_embeddings=True,
        ).astype("float32")

        # Increase efSearch for higher recall at serving time
        self.index.hnsw.efSearch = 100
        distances, indices = self.index.search(query_embedding, top_k)

        # IndexHNSWFlat defaults to the L2 metric and returns squared L2
        # distances (smaller = closer). For unit vectors, cosine = 1 - d^2 / 2,
        # so convert to a similarity where higher = more relevant.
        return [
            (self.doc_ids[idx], 1.0 - float(dist) / 2.0)
            for idx, dist in zip(indices[0], distances[0])
            if idx >= 0
        ]
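With L2-normalized embeddings, cosine similarity and squared Euclidean distance are tied together by d² = 2 - 2·cos, so they produce the same ranking but on different scales and in opposite directions. The brute-force NumPy sketch below checks that identity on random stand-in vectors (not real product embeddings).

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for document and query embeddings, L2-normalized
docs = rng.normal(size=(5, 8)).astype("float32")
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = rng.normal(size=8).astype("float32")
query /= np.linalg.norm(query)

cosine = docs @ query                        # higher = more similar
sq_l2 = ((docs - query) ** 2).sum(axis=1)    # lower = more similar

by_cosine = np.argsort(-cosine)
by_l2 = np.argsort(sq_l2)

print(np.allclose(sq_l2, 2 - 2 * cosine, atol=1e-5))
print(by_cosine.tolist() == by_l2.tolist())
```

This matters when a distance-based index feeds a downstream feature: a raw distance passed along as a "similarity score" would have the wrong sign and scale.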

Hybrid Fusion: Combining BM25 and Dense

from collections import defaultdict


def reciprocal_rank_fusion(
    bm25_results: list,
    dense_results: list,
    k: int = 60,
    bm25_weight: float = 0.5,
    dense_weight: float = 0.5,
) -> list:
    """
    Reciprocal Rank Fusion (RRF) combines BM25 and dense retrieval results.
    RRF score = sum(weight / (k + rank)) for each retrieval method.

    RRF uses only ranks, so it is robust to score scale differences between
    BM25 and dense models (BM25 scores are unbounded and corpus-dependent;
    cosine similarities lie in the -1 to 1 range).
    """
    scores: dict = defaultdict(float)

    # enumerate is 0-based, so rank + 1 is the 1-based rank
    for rank, (doc_id, _) in enumerate(bm25_results):
        scores[doc_id] += bm25_weight / (k + rank + 1)

    for rank, (doc_id, _) in enumerate(dense_results):
        scores[doc_id] += dense_weight / (k + rank + 1)

    # Sort by combined RRF score
    return sorted(scores.items(), key=lambda x: -x[1])
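As a worked example of the fusion formula, assume equal weights of 0.5, k = 60, and 1-based ranks r. Document A is ranked 1st by BM25 and 2nd by dense retrieval; document C is ranked 3rd by BM25 and 1st by dense retrieval. The document labels are illustrative only.

```latex
\mathrm{RRF}(d) = \sum_{m \in \{\text{bm25},\,\text{dense}\}} \frac{w_m}{k + r_m(d)}

\mathrm{RRF}(A) = \frac{0.5}{60 + 1} + \frac{0.5}{60 + 2} \approx 0.01626

\mathrm{RRF}(C) = \frac{0.5}{60 + 3} + \frac{0.5}{60 + 1} \approx 0.01613
```

A narrowly beats C because it sits near the top of both lists. A document missing from one list simply contributes nothing for that retriever. The large constant k keeps the gap between adjacent ranks small, so agreement between retrievers counts for more than a single first place.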

Learning to Rank

LTR takes the top 1,000 candidates from hybrid retrieval and produces a ranked list using hundreds of features per (query, document) pair.

Feature Engineering for Search Ranking

import numpy as np
from typing import Optional


class SearchRankingFeatureExtractor:
    """
    Extract features for LTR model from (query, document) pairs.
    """

    def extract(self, query: str, doc: dict, user_context: dict) -> np.ndarray:
        """
        Extract feature vector for one (query, document) pair.
        Returns array of floats for the LTR model.
        """
        features = []

        # --- Text match features ---
        query_tokens = set(query.lower().split())
        title_tokens = set(doc.get("title", "").lower().split())

        # Title token overlap
        features.append(len(query_tokens & title_tokens) / max(len(query_tokens), 1))

        # Title starts with query
        features.append(
            1.0 if doc.get("title", "").lower().startswith(query.lower()) else 0.0
        )

        # Exact phrase match in title
        features.append(1.0 if query.lower() in doc.get("title", "").lower() else 0.0)

        # BM25 score (from retrieval stage, already computed)
        features.append(doc.get("bm25_score", 0.0))

        # Dense similarity score (from retrieval stage)
        features.append(doc.get("dense_score", 0.0))

        # --- Item quality features ---
        features.append(float(doc.get("avg_rating", 3.0)))
        features.append(min(float(doc.get("review_count", 0)), 1000) / 1000)
        features.append(1.0 if doc.get("in_stock") else 0.0)
        features.append(float(doc.get("price", 50.0)) / 1000.0)  # normalized price

        # Recency: days since listing (capped at 365)
        features.append(
            min(float(doc.get("days_since_listed", 365)), 365) / 365.0
        )

        # --- User personalization features ---
        # Has user viewed this category before?
        user_cats = set(user_context.get("viewed_categories", []))
        doc_cat = doc.get("category", "")
        features.append(1.0 if doc_cat in user_cats else 0.0)

        # User's price range match
        user_avg_price = user_context.get("avg_purchase_price", 50.0)
        doc_price = doc.get("price", 50.0)
        price_ratio = min(doc_price, user_avg_price) / max(doc_price, user_avg_price, 1)
        features.append(price_ratio)

        # --- Historical query-item performance ---
        # CTR for this (query, item) pair (from query logs)
        features.append(float(doc.get("historical_ctr", 0.02)))

        # Conversion rate for this item on this query type
        features.append(float(doc.get("conversion_rate", 0.01)))

        return np.array(features, dtype=np.float32)

LambdaMART for Listwise Ranking

import lightgbm as lgb
import numpy as np
from sklearn.model_selection import GroupShuffleSplit


class LambdaMARTRanker:
    """
    LambdaMART ranking model (LightGBM implementation).
    Optimizes NDCG directly using listwise ranking loss.
    """

    def __init__(self, n_estimators: int = 500, learning_rate: float = 0.05):
        self.model = lgb.LGBMRanker(
            objective="lambdarank",
            metric="ndcg",
            ndcg_eval_at=[1, 5, 10],
            n_estimators=n_estimators,
            learning_rate=learning_rate,
            num_leaves=127,
            max_depth=-1,
            subsample=0.8,
            subsample_freq=1,  # row subsampling is only active when this is > 0
            colsample_bytree=0.8,
            min_child_samples=20,
        )

    def train(
        self,
        X: np.ndarray,       # (N, F) feature matrix, rows grouped by query
        y: np.ndarray,       # (N,) relevance labels (0=irrelevant, 1=clicked, 2=purchased)
        groups: np.ndarray,  # (Q,) documents per query, sum(groups) == N
        X_val: np.ndarray = None,
        y_val: np.ndarray = None,
        groups_val: np.ndarray = None,
    ) -> None:
        """
        Train LambdaMART on query-grouped training data.
        groups[i] = number of documents for the i-th query.
        """
        has_val = X_val is not None and y_val is not None and groups_val is not None

        callbacks = [lgb.log_evaluation(100)]
        if has_val:
            # Early stopping needs a validation set to monitor
            callbacks.append(lgb.early_stopping(50))

        self.model.fit(
            X, y,
            group=groups,
            eval_set=[(X_val, y_val)] if has_val else None,
            eval_group=[groups_val] if has_val else None,
            callbacks=callbacks,
        )

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Score documents. Higher score = more relevant."""
        return self.model.predict(X)

    def rank_candidates(
        self,
        query: str,
        candidates: list,
        feature_extractor: SearchRankingFeatureExtractor,
        user_context: dict,
        top_n: int = 100,
    ) -> list:
        """Rank a list of candidates for a query."""
        features = np.vstack([
            feature_extractor.extract(query, doc, user_context)
            for doc in candidates
        ])
        scores = self.predict(features)
        ranked = sorted(
            zip(candidates, scores),
            key=lambda x: -x[1],
        )
        return [(doc, score) for doc, score in ranked[:top_n]]
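A common stumbling block with grouped ranking data is the group array: it holds one count per query (not one entry per row), and rows belonging to the same query must be contiguous. The helper below is a small sketch of deriving it from per-row query IDs; the name `build_groups` is hypothetical.

```python
import numpy as np


def build_groups(query_ids):
    """
    Return (order, groups): a row order that makes each query's rows
    contiguous, and the number of rows per query in that order.
    """
    query_ids = np.asarray(query_ids)
    order = np.argsort(query_ids, kind="stable")
    _, counts = np.unique(query_ids[order], return_counts=True)
    return order, counts


# One query ID per (query, document) training row, in arbitrary order
qids = ["q2", "q1", "q2", "q1", "q3"]
order, groups = build_groups(qids)

print(order.tolist())   # apply as X[order], y[order] before training
print(groups.tolist())  # one entry per query; sums to the number of rows
```

Reordering the feature matrix and labels with the same `order` keeps features, labels, and group sizes aligned.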

Semantic Reranking: Cross-Encoder

The bi-encoder (used in dense retrieval) encodes query and document independently. The cross-encoder encodes them jointly, allowing full cross-attention. It is much slower (cannot precompute document embeddings) but much more accurate. Run it only on the top 100 candidates.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch


class CrossEncoderReranker:
    """
    Cross-encoder reranker for the top-100 candidates.
    Encodes (query, document) pairs jointly - full cross-attention.
    Much more accurate than bi-encoder but too slow for full retrieval.
    """

    def __init__(
        self,
        model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
    ):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.model.eval()

    def rerank(
        self,
        query: str,
        candidates: list,
        top_k: int = 20,
    ) -> list:
        """
        Rerank candidates using cross-encoder scores.
        candidates: list of {"product_id": ..., "title": ..., "description": ...}
        """
        if not candidates:
            return []  # nothing to tokenize or score

        # Build (query, document) pairs
        pairs = [
            (query, f"{doc['title']} {doc.get('description', '')[:200]}")
            for doc in candidates
        ]

        # Tokenize in batch
        inputs = self.tokenizer(
            pairs,
            padding=True,
            truncation=True,
            max_length=256,
            return_tensors="pt",
        )

        with torch.no_grad():
            # The model emits one relevance logit per pair
            logits = self.model(**inputs).logits.squeeze(-1)
            scores = torch.sigmoid(logits).cpu().numpy()

        # Sort candidates by cross-encoder score
        scored = sorted(
            zip(candidates, scores.tolist()),
            key=lambda x: -x[1],
        )
        return [(doc, score) for doc, score in scored[:top_k]]

Search Metrics and A/B Testing

Offline Metrics

import numpy as np


def dcg_at_k(relevances: list, k: int) -> float:
    """
    Discounted Cumulative Gain at k.
    relevances: list of relevance labels (0=bad, 1=ok, 2=good, 3=perfect)
    """
    relevances = relevances[:k]
    if not relevances:
        return 0.0
    gains = [
        (2 ** r - 1) / np.log2(i + 2)
        for i, r in enumerate(relevances)
    ]
    return sum(gains)


def ndcg_at_k(relevances: list, k: int) -> float:
    """Normalized DCG at k. Normalizes by ideal ranking."""
    actual_dcg = dcg_at_k(relevances, k)
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return actual_dcg / ideal_dcg


def mean_reciprocal_rank(relevance_lists: list) -> float:
    """MRR: average of 1/rank of first relevant result."""
    reciprocal_ranks = []
    for relevances in relevance_lists:
        for rank, rel in enumerate(relevances, start=1):
            if rel > 0:
                reciprocal_ranks.append(1.0 / rank)
                break
        else:
            # No relevant result anywhere in the list
            reciprocal_ranks.append(0.0)
    return float(np.mean(reciprocal_ranks))


def evaluate_ranker(
    ranker,
    test_queries: list,
    ground_truth: dict,
    ks: tuple = (1, 5, 10),  # tuple avoids a shared mutable default
) -> dict:
    """
    Evaluate a ranker on a set of test queries.
    ranker: any object exposing rank(query, candidates) -> list of (doc, score),
    for example a thin wrapper around LambdaMARTRanker.rank_candidates.
    """
    ndcg_scores = {k: [] for k in ks}
    mrr_scores = []

    for query_data in test_queries:
        query = query_data["query"]
        candidates = query_data["candidates"]
        ranked = ranker.rank(query, candidates)

        relevances = [
            ground_truth.get((query, doc["product_id"]), 0)
            for doc, _ in ranked
        ]

        for k in ks:
            ndcg_scores[k].append(ndcg_at_k(relevances, k))
        mrr_scores.append(
            mean_reciprocal_rank([relevances])
        )

    return {
        **{f"ndcg@{k}": float(np.mean(ndcg_scores[k])) for k in ks},
        "mrr": float(np.mean(mrr_scores)),
    }

Search A/B testing requires large sample sizes because the treatment effect (ranking quality improvement) is small. As a rough guide, a typical search ranking improvement of NDCG@10 from 0.62 to 0.64 (a 3.2% relative improvement) requires roughly 50,000 queries per variant to detect with 80% power at p < 0.05; the exact number depends on the variance of the metric being compared.

The key metrics for search A/B tests are click-through rate at position 1 (P1 CTR) and time to first click (users who find the right result immediately click faster). Zero-click rate (the user abandons without clicking) is the primary failure metric.
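How sample size scales with effect size can be sketched with the standard two-sample formula n = 2(z_alpha/2 + z_beta)² σ² / δ². This is a textbook approximation under a normality assumption, not the author's exact calculation; the per-query standard deviation `sigma` is an input you would estimate from your own logs, and the function name is made up for illustration.

```python
import math
from statistics import NormalDist


def queries_per_variant(delta, sigma, alpha=0.05, power=0.80):
    """
    Approximate queries needed per variant to detect an absolute difference
    `delta` in a per-query metric with standard deviation `sigma`
    (two-sided test at level alpha, with the given power).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)


# sigma = 0.5 is an illustrative placeholder, not a measured value
n_small_lift = queries_per_variant(delta=0.01, sigma=0.5)
n_large_lift = queries_per_variant(delta=0.02, sigma=0.5)
print(n_small_lift, n_large_lift)
```

Halving the detectable effect roughly quadruples the required traffic, which is why small ranking improvements are expensive to confirm online.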


:::danger BM25 Exact Match Bias

BM25 over-ranks documents that contain the exact query tokens and under-ranks semantically equivalent documents that use different vocabulary. For a query "running shoes for women," a product titled "women's athletic running shoes" ranks higher than a product titled "ladies' jogging footwear" even if the latter is more popular and better-rated. This creates the vocabulary mismatch problem.

Solution: hybrid retrieval (BM25 + dense) with proper fusion weighting. Empirically, for e-commerce search, BM25 handles navigational queries (exact brand/SKU lookups) better, while dense retrieval handles informational queries better. Set the RRF weights based on query intent classification: 0.7 BM25 / 0.3 dense for navigational, 0.3 BM25 / 0.7 dense for informational. :::
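The intent-dependent weighting can be sketched as follows. The weights come from the text above; the intent classifier is a deliberately crude heuristic (a hypothetical SKU-like regex plus a brand lookup) standing in for a trained model, and only the two intents with stated weights are handled.

```python
import re
from collections import defaultdict

# (bm25_weight, dense_weight) per query intent, as given in the text
INTENT_WEIGHTS = {
    "navigational": (0.7, 0.3),
    "informational": (0.3, 0.7),
}

# Hypothetical SKU shape: letters followed by 3+ digits, e.g. "AB-12345"
SKU_PATTERN = re.compile(r"\b[a-z]{2,}-?\d{3,}\b", re.IGNORECASE)


def classify_intent(query, known_brands):
    """Heuristic stand-in for an intent classifier."""
    tokens = query.lower().split()
    if SKU_PATTERN.search(query) or any(t in known_brands for t in tokens):
        return "navigational"
    return "informational"


def fuse(bm25_ids, dense_ids, intent, k=60):
    """Weighted RRF over two ranked lists of document IDs."""
    w_bm25, w_dense = INTENT_WEIGHTS[intent]
    scores = defaultdict(float)
    for rank, doc_id in enumerate(bm25_ids, start=1):
        scores[doc_id] += w_bm25 / (k + rank)
    for rank, doc_id in enumerate(dense_ids, start=1):
        scores[doc_id] += w_dense / (k + rank)
    return [doc_id for doc_id, _ in sorted(scores.items(), key=lambda x: -x[1])]


brands = {"nike"}
bm25_ids, dense_ids = ["A", "B"], ["B", "A"]  # the two retrievers disagree

nav = fuse(bm25_ids, dense_ids, classify_intent("nike pegasus 40", brands))
info = fuse(bm25_ids, dense_ids, classify_intent("good laptop for programming", brands))
print(nav, info)
```

When the two retrievers disagree, the intent decides which one wins the top slot: BM25's pick for the navigational query, the dense retriever's pick for the informational one.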

:::warning Click Bias in LTR Training Data

LTR training labels come from user clicks. Clicks are heavily biased by position - position 1 gets 5-10x the clicks of position 5, regardless of actual relevance. A model trained on raw click data learns that "position 1 items are relevant" rather than "the item the user actually needed is relevant."

Solution: use position-debiased labels. One option is randomized logging, run as a small exploration bucket: randomly shuffle the results for a small fraction (1-5%) of searches and use those queries as an unbiased training signal. Alternatively, use inverse propensity scoring: weight each clicked training example by 1/P(examined | position), where the propensity P (the probability that a user examines a result at that position) is estimated from randomized exploration data. :::
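A minimal sketch of inverse propensity weighting, assuming a small log of randomized traffic is available. Propensities are approximated by per-position CTR on that randomized traffic, scaled relative to position 1; the clipping threshold `max_weight` and both function names are assumptions added for illustration.

```python
import numpy as np


def estimate_propensities(positions, clicks, n_positions):
    """
    Relative examination propensity per position, estimated as the CTR at
    each position on randomized traffic, scaled so position 1 equals 1.0.
    Assumes every position 1..n_positions appears in the log.
    """
    positions = np.asarray(positions)
    clicks = np.asarray(clicks, dtype=float)
    ctr = np.array([
        clicks[positions == p].mean() for p in range(1, n_positions + 1)
    ])
    return ctr / ctr[0]


def ips_weights(click_positions, propensities, max_weight=10.0):
    """Weight each clicked example by 1 / propensity, clipped to limit variance."""
    idx = np.asarray(click_positions) - 1  # positions are 1-based
    return np.minimum(1.0 / propensities[idx], max_weight)


# Toy randomized log: position shown and whether it was clicked
rand_positions = [1, 1, 1, 1, 2, 2, 2, 2]
rand_clicks = [1, 1, 0, 0, 1, 0, 0, 0]

propensities = estimate_propensities(rand_positions, rand_clicks, n_positions=2)
weights = ips_weights([1, 2], propensities)
print(propensities, weights)
```

A click at position 2 counts double here because users were half as likely to examine that slot. Weights like these can be supplied as per-example sample weights when training the ranker.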


Interview Q&A

Q1: What is the difference between BM25 and dense retrieval, and when should you use each?

BM25 is a bag-of-words keyword matching algorithm. It scores a document by summing TF-IDF-like weights for query tokens that appear in the document. Strengths: fast (inverted index lookup), exact brand/product name matching, handles rare tokens well (a query for a specific SKU number will find that SKU). Weaknesses: vocabulary mismatch (cannot match "sneakers" to "trainers"), no semantic understanding, sensitive to query term choice.

Dense retrieval encodes queries and documents into a shared semantic space using neural networks. Similar meaning maps to similar vectors. Strengths: handles vocabulary mismatch, captures semantic similarity, works well for informational queries. Weaknesses: slower (needs ANN search), poor at exact matching (a specific 16-digit product code will not have a unique semantic embedding), requires large training datasets, index must be rebuilt when embeddings change.

Best practice: hybrid retrieval with both, fused using RRF. BM25 handles the tail of queries with specific product codes, brand names, and technical specifications. Dense handles the broader informational queries where semantic understanding matters.


Q2: Explain Learning to Rank and the difference between pointwise, pairwise, and listwise approaches.

Learning to Rank (LTR) is a family of supervised ML approaches that learn to rank documents given a query. The three paradigms differ in what the training objective operates on.

Pointwise: treat ranking as regression or classification on individual (query, document) pairs. Predict a relevance score for each pair independently. Simple to train, but the model does not directly optimize ranking quality - it is possible to get every individual score right and still produce a bad ranking.

Pairwise: train on (query, document_A, document_B) triples where document_A is more relevant than document_B. RankNet (Burges et al., 2005) uses this approach. The model learns to produce higher scores for the more relevant document. Better than pointwise because it directly compares documents, but still does not optimize the full ranking.

Listwise: train on the full ranked list for each query, directly optimizing a ranking metric like NDCG. LambdaMART (Burges, 2010) is the most widely deployed listwise approach. It computes "lambda gradients" that measure how swapping each pair of documents would change NDCG, then trains a gradient boosted tree model (MART) to predict these gradients. LambdaMART consistently outperforms pointwise and pairwise approaches and is the industry standard for production search ranking.
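The difference between the pairwise and listwise signals can be illustrated numerically. The sketch below shows a RankNet-style pairwise logistic loss and the |ΔNDCG| swap term that LambdaMART uses to scale pair gradients; it is a toy illustration of the idea, not LightGBM's implementation, and the function names are made up.

```python
import numpy as np


def pairwise_logistic_loss(score_pos, score_neg):
    """RankNet-style loss: small when the more relevant document scores higher."""
    return float(np.log1p(np.exp(-(score_pos - score_neg))))


def dcg(labels):
    labels = np.asarray(labels, dtype=float)
    discounts = np.log2(np.arange(2, len(labels) + 2))
    return float(((2 ** labels - 1) / discounts).sum())


def delta_ndcg(labels, i, j):
    """|Change in NDCG| if the documents at 0-based ranks i and j swap places."""
    ideal = dcg(sorted(labels, reverse=True))
    if ideal == 0:
        return 0.0  # no relevant documents, NDCG is undefined
    swapped = list(labels)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(dcg(swapped) - dcg(labels)) / ideal


# Relevance labels in the order the current model ranks the documents
labels = [0, 2, 1, 0]

top_swap = delta_ndcg(labels, 0, 1)     # fixing a mistake at the top
bottom_swap = delta_ndcg(labels, 2, 3)  # reordering near the bottom
print(round(top_swap, 4), round(bottom_swap, 4))
print(pairwise_logistic_loss(2.0, 0.0), pairwise_logistic_loss(0.0, 2.0))
```

A plain pairwise loss treats both mis-ordered pairs alike, while the ΔNDCG term gives the mistake at the top of the list far more weight. That metric-aware weighting is what distinguishes LambdaMART from RankNet.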


Q3: How do you evaluate a search ranking system? What metrics matter?

Offline evaluation uses judgement-based relevance labels. Human annotators (or query logs) label (query, document) pairs with relevance grades (0=irrelevant, 1=fair, 2=good, 3=perfect). NDCG@k measures the quality of the top-k ranking, discounting items at lower positions. NDCG@10 is the standard for web search; NDCG@5 is more common for e-commerce where users rarely scroll below position 5.

MRR (Mean Reciprocal Rank) averages 1/rank of the first relevant result across queries, so it rewards placing the first correct result as high as possible. It is useful for navigational queries where there is one correct answer.

Online A/B metrics: CTR@1 (click rate on the first result), time to first click (how quickly the user found what they needed), zero-click rate (fraction of queries where the user did not click anything - the primary failure mode), reformulation rate (fraction of queries followed by a modified query - indicates the user did not find what they wanted).

The key challenge: offline NDCG improvements do not always translate to online engagement improvements. Always validate offline gains with an A/B test before deploying.


Q4: How does cross-encoder reranking improve search quality?

A bi-encoder (used in dense retrieval) encodes queries and documents independently, then uses dot product for similarity. This means each document's embedding is fixed regardless of the query - it cannot capture query-specific relevance signals. A document about "red running shoes" has the same embedding whether the query is "running shoes" or "shoes for pain relief."

A cross-encoder encodes the (query, document) pair jointly using full cross-attention in a transformer. The attention mechanism can relate every query token to every document token. This allows the model to identify that a query for "waterproof jacket for hiking" is highly relevant to a product titled "breathable rain shell for mountain activities" even though the words do not overlap much - the cross-attention identifies semantic equivalence.

The catch: cross-encoders require a forward pass for every (query, document) pair, and document representations cannot be precomputed. Within the 150ms p99 latency budget set in the requirements, you can afford a cross-encoder on the top 100 candidates (roughly 30ms for a MiniLM cross-encoder on 100 pairs, typically with batched GPU inference), but not on all 8 million products. The staged architecture - bi-encoder for retrieval, cross-encoder for reranking - gets you the best of both: the speed of the bi-encoder over the full catalog and the accuracy of the cross-encoder for the final top 20.


Q5: How did LinkedIn improve their job search using ML?

LinkedIn's job search (published in engineering blog posts 2018-2021) evolved from BM25 to a multi-stage ML system. The key improvements:

Stage 1: improved candidate retrieval by adding a query understanding layer that classifies job search queries into intents (skill-based, company-based, role-based) and expands queries with skill synonyms from LinkedIn's skill graph. This improved recall - more relevant jobs entered the candidate set.

Stage 2: LTR ranking with 200+ features including job-seeker fit features (skill match between job description and member profile, seniority match), relevance features (BM25 score, semantic similarity), and quality features (company reputation, salary range, freshness). The model is LambdaMART, trained on historical applications as positive labels and ignored jobs as negatives, with position debiasing.

Stage 3: personalization through multi-task learning (MTL) - the ranking model jointly predicts click probability and apply probability. Items with high apply probability (the user actually submitted an application) are weighted more heavily in the training signal than pure clicks, which reduced the incentive to optimize for clickbait job listings.

The result: 30% increase in job application rates and significant improvement in job seeker satisfaction as measured by post-application surveys.


Summary

A production search ranking system has four stages: query understanding (spell correction, intent classification, expansion), hybrid candidate retrieval (BM25 sparse + dense bi-encoder, fused with RRF), LTR ranking (LambdaMART with 200+ features across the top 1,000 candidates), and cross-encoder reranking (the top 100 to produce the final 20). BM25 handles exact-match navigational queries; dense retrieval handles semantic informational queries; their fusion via RRF captures both. LambdaMART directly optimizes NDCG and is the industry standard for production ranking. Cross-encoders provide the highest-quality final reranking but are too slow for large candidate sets. A/B testing with online metrics (CTR@1, zero-click rate, time to first click) validates offline NDCG improvements before full deployment.

© 2026 EngineersOfAI. All rights reserved.