Design: Search Ranking - Multi-Stage Ranking at Scale

Reading time: ~25 min | Interview relevance: Critical | Roles: MLE, AI Eng

The Real Interview Moment

"Design the search ranking system for an e-commerce platform." You start describing TF-IDF and BM25. The interviewer nods but then asks: "A user searches for 'running shoes.' They've been browsing hiking gear all session. How does your system handle that?" You hadn't thought about personalization. "Also, how do you balance relevance with revenue? The highest-relevance result might be a $20 shoe, but the business wants to promote the$ 120 premium version."

Search ranking is harder than recommendation because it's query-dependent, latency-sensitive, and involves multi-objective trade-offs. The best candidates show they understand the full pipeline - from query understanding to serving.

What You Will Master

Query understanding: spelling correction, query expansion, intent classification
Multi-stage retrieval: boolean → BM25 → semantic → re-ranking
Learning-to-rank: pointwise, pairwise, and listwise approaches
Feature engineering for search relevance
Personalization within search results
Balancing relevance with business objectives
Online evaluation for search quality (interleaving, NDCG)

The Complete Design

Step 1: Requirements (5 min)

Functional requirements:

E-commerce search for 50M products, 500M queries/day
Autocomplete, spell correction, query suggestions
Ranked results with filtering (price, category, brand)
Personalized results based on user history

Non-functional requirements:

Latency: <150ms p99 for search results
Relevance: Users find what they want in top 5 results 80% of the time
Freshness: New products searchable within 1 hour
Availability: 99.99% - search is the primary product experience

Step 2: Problem Formulation (5 min)

Business Goal	ML Objective	Primary Metric	Guardrail Metrics
Users find and buy products	Learning-to-rank: order results by P(purchase | query, item, user)	NDCG@10, Purchase-through rate	Query abandonment rate, zero-result rate, latency

ML problem type: Learning-to-rank (LTR)

Three approaches to LTR:

Approach	How It Works	Pro	Con
Pointwise	Predict relevance score per item independently	Simple, easy to train	Ignores relative ordering
Pairwise (RankNet, LambdaMART)	Learn to order pairs correctly; LambdaMART also weights each pair by its effect on a list metric such as NDCG	Better ranking quality	Number of training pairs grows quickly with list length
Listwise (ListNet, Deep Ranking)	Optimize a loss defined over the whole ranked list	Best quality, closest to the ranking metric	Complex, slower training
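
To make the three formulations concrete, here is a minimal sketch of one loss per family for a single query. The specific choices (squared error for pointwise, a RankNet-style logistic loss for pairwise, the ListNet top-one cross-entropy for listwise) are illustrative assumptions, not the only options, and the function names are hypothetical.

```python
import math

def pointwise_loss(scores, labels):
    """Squared error between each score and its label, item by item."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)

def pairwise_loss(scores, labels):
    """RankNet-style logistic loss over every pair with different labels."""
    total, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                # Penalize when the more relevant item i is not scored above j.
                total += math.log(1.0 + math.exp(-(scores[i] - scores[j])))
                pairs += 1
    return total / pairs if pairs else 0.0

def _softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    norm = sum(exps)
    return [e / norm for e in exps]

def listwise_loss(scores, labels):
    """ListNet top-one loss: cross-entropy between softmax(labels) and softmax(scores)."""
    target = _softmax([float(y) for y in labels])
    predicted = _softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))

labels = [3, 1, 0]            # graded relevance for three items of one query
good_order = [2.0, 1.0, 0.0]  # correct ordering
bad_order = [0.0, 1.0, 2.0]   # reversed ordering

for loss in (pointwise_loss, pairwise_loss, listwise_loss):
    print(loss.__name__, round(loss(good_order, labels), 3), round(loss(bad_order, labels), 3))
```

All three losses prefer the correct ordering here. The difference is what they look at: the pointwise loss depends on absolute score values, while the pairwise and listwise losses depend only on how items score relative to each other within the query.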

Step 3: Features & Data (8 min)

The Search Pipeline

Search Ranking Pipeline - Query → Query Understanding → Retrieval → Ranking → Re-Ranking → Results

Query Understanding

Component	What It Does	Example
Spell correction	Fix typos	"runnign shoes" → "running shoes"
Query expansion	Add synonyms	"laptop" → "laptop OR notebook OR computer"
Intent classification	Detect query type	"cheap running shoes" → intent: price-sensitive
Entity recognition	Extract structured info	"Nike Air Max size 10" → brand: Nike, product: Air Max, size: 10
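
The components in the table can be sketched as a toy pipeline. This is a simplified illustration, assuming a tiny hand-written vocabulary, brand list, and price-word list (all hypothetical); a production system would typically use learned models and a much larger dictionary.

```python
import difflib
import re

# Hypothetical dictionaries for illustration only.
VOCAB = ["running", "shoes", "nike", "air", "max", "size", "cheap", "laptop"]
BRANDS = {"nike", "adidas"}
PRICE_WORDS = {"cheap", "budget", "affordable"}

def correct_spelling(query):
    """Replace each unknown token with its closest vocabulary word, if any."""
    fixed = []
    for token in query.lower().split():
        if token in VOCAB or token.isdigit():
            fixed.append(token)
            continue
        match = difflib.get_close_matches(token, VOCAB, n=1, cutoff=0.8)
        fixed.append(match[0] if match else token)
    return " ".join(fixed)

def understand(query):
    """Return the corrected query plus simple entity and intent annotations."""
    corrected = correct_spelling(query)
    tokens = corrected.split()
    size = re.search(r"\bsize (\d+)\b", corrected)
    return {
        "query": corrected,
        "brand": next((t for t in tokens if t in BRANDS), None),
        "size": int(size.group(1)) if size else None,
        "intent": "price-sensitive" if PRICE_WORDS & set(tokens) else "general",
    }

print(understand("runnign shoes"))
print(understand("Nike Air Max size 10"))
print(understand("cheap running shoes"))
```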

Feature Categories

Category	Features	Signal Strength
Query-Item	BM25 score, semantic similarity, title match, description match	Very high
Item	Sales rank, review rating, review count, price, conversion rate, return rate	High
Query	Query length, query frequency, query category, commercial intent score	Medium
User-Query	Past clicks for similar queries, purchase history in category	High
Context	Device, time of day, location, session browse history	Medium
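
As an example of a Query-Item feature, the sketch below computes a BM25 score over product titles. It assumes whitespace tokenization, the common defaults k1=1.5 and b=0.75, and the non-negative IDF variant log(1 + (N - n + 0.5) / (n + 0.5)); real engines differ in tokenization and in the exact IDF formula.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer documents need more matches.
            norm = tf[term] + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores

titles = ["nike running shoes", "leather hiking boots", "running socks three pack"]
scores = bm25_scores("running shoes", titles)
for title, score in zip(titles, scores):
    print(f"{score:.3f}  {title}")
```

The first title matches both query terms and ranks highest, the third matches only "running", and the second matches nothing and scores zero.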

Training Data

Human relevance judgments: Expert annotators label query-item pairs (0-4 relevance scale). Expensive but high quality.
Click-through data: Implicit labels from clicks. Abundant but biased - users can only click what was shown, and position 1 gets more clicks regardless.
Purchase data: Strongest signal but sparse - only 2-3% of searches lead to purchase.

Common Trap

Click data has severe position bias. An item at position 1 can receive many times more clicks than the same item at position 5 (10x is a commonly quoted rule of thumb; the real ratio depends on the product and page layout). If you train on raw click data, the model learns to replicate position bias rather than relevance. Use inverse propensity weighting or counterfactual learning to debias. Mention this in the interview - it separates Strong Hire from Lean Hire.
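
A minimal sketch of inverse propensity weighting, assuming the examination propensity of each position is already known. The propensity values and the click log below are made up for illustration; in practice propensities are estimated, for example from randomization experiments.

```python
# Hypothetical probability that a user examines each position.
PROPENSITY = {1: 1.0, 2: 0.6, 3: 0.4, 4: 0.3, 5: 0.1}

def raw_ctr(impressions):
    """Naive click-through rate per item, ignoring position."""
    clicks, shown = {}, {}
    for item, _position, clicked in impressions:
        clicks[item] = clicks.get(item, 0) + clicked
        shown[item] = shown.get(item, 0) + 1
    return {item: clicks[item] / shown[item] for item in shown}

def debiased_relevance(impressions):
    """IPW estimate: each click counts 1 / propensity(position)."""
    totals, shown = {}, {}
    for item, position, clicked in impressions:
        totals[item] = totals.get(item, 0.0) + clicked / PROPENSITY[position]
        shown[item] = shown.get(item, 0) + 1
    return {item: totals[item] / shown[item] for item in shown}

# (item, position, clicked) tuples: A is always shown first, B always fifth.
log = [("A", 1, 1)] * 4 + [("A", 1, 0)] * 6 + [("B", 5, 1)] * 1 + [("B", 5, 0)] * 19

print(raw_ctr(log))             # A looks far better on raw clicks
print(debiased_relevance(log))  # after reweighting, B is at least as relevant
```

On raw clicks, item A appears eight times better than item B. After dividing each click by the propensity of the position where it happened, B comes out slightly ahead, which is the signal a ranker should train on. In practice the weights are often clipped to control variance.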

Step 4: Model (8 min)

The Progression

Search Ranking Model Progression - BM25 → BM25+LR → LambdaMART → Deep Ranking

Model	NDCG@10 (typical)	Latency	When to Use
BM25	0.35	5ms	Baseline, always have as fallback
BM25 + LR	0.42	8ms	First ML model, easy to debug
LambdaMART	0.48	15ms	Production workhorse, interpretable
Deep Ranking (BERT-based)	0.52	50ms	Best quality, requires GPU serving

Recommendation: LambdaMART (XGBoost with LambdaRank loss) is the sweet spot for most production systems - strong quality, fast inference, interpretable feature importance.

Step 5: Serving (8 min)

Multi-Stage Architecture

Multi-Stage Search Architecture - Query Understanding → L0 Boolean → L1 BM25 → L2 Light Rank → L3 Full Rank → Re-Rank

Why so many stages? Each stage trades off quality for speed. L0 is exact match (milliseconds on an inverted index). L1 adds BM25 scoring. L2 uses a lightweight model (small tree ensemble). L3 uses the full ranking model with all features. Each stage reduces the candidate set for the next, more expensive stage.

Latency Budget

Stage	Candidates	Time Budget	Model
Query Understanding	1 query	10ms	Rule-based + small classifier
L0: Boolean Filter	50M → 100K	5ms	Inverted index (Elasticsearch)
L1: BM25 Retrieval	100K → 10K	10ms	BM25 scoring
L2: Lightweight Rank	10K → 500	20ms	Small GBDT (20 features)
L3: Full Rank	500 → 50	40ms	Full LambdaMART or deep model
Re-Ranking	50 → 50	10ms	Business rules, diversity
Total		95ms (<100ms, leaving headroom under the 150ms p99 requirement)
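
The funnel can be sketched as a cascade in which each stage scores the surviving candidates and keeps only the top few. This is a simplified illustration: the stage names, the toy scoring functions, and the synthetic items are assumptions, and the budget figures are copied from the table above.

```python
import heapq

# Per-stage time budgets in milliseconds, taken from the latency table.
BUDGET_MS = {
    "query_understanding": 10,
    "l0_boolean": 5,
    "l1_bm25": 10,
    "l2_light_rank": 20,
    "l3_full_rank": 40,
    "re_rank": 10,
}
total_ms = sum(BUDGET_MS.values())

def run_cascade(candidates, stages):
    """Apply (scorer, keep) stages in order, keeping the top `keep` each time."""
    for scorer, keep in stages:
        candidates = heapq.nlargest(keep, candidates, key=scorer)
    return candidates

# Synthetic items with a cheap score and an expensive score (both hypothetical).
items = [{"id": i, "cheap_score": i % 7, "full_score": (i * 37) % 101} for i in range(1000)]

top = run_cascade(
    items,
    [
        (lambda d: d["cheap_score"], 100),  # cheap model prunes 1000 -> 100
        (lambda d: d["full_score"], 10),    # expensive model ranks 100 -> 10
    ],
)
print(total_ms, [d["id"] for d in top])
```

The expensive scorer only ever sees the 100 survivors of the cheap stage, which is what keeps the full model inside its budget.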

Step 6: Evaluation & Iteration (8 min)

Offline Evaluation

Metric	What It Measures	Target
NDCG@10	Ranking quality	> 0.45
MRR	Position of first relevant result	> 0.6
Zero-result rate	% of queries with no results	< 2%
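
For reference, a minimal sketch of NDCG@k and MRR on graded relevance labels. It assumes the exponential gain form (2^rel - 1) with a log2 position discount, and it computes the ideal ordering from the same list, which is only valid when every judged item for the query appears in that list.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k results."""
    return sum((2 ** rel - 1) / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def mrr(ranked_relevance_lists):
    """Mean reciprocal rank of the first relevant result across queries."""
    total = 0.0
    for rels in ranked_relevance_lists:
        rank = next((i + 1 for i, rel in enumerate(rels) if rel > 0), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(ranked_relevance_lists)

print(ndcg_at_k([3, 2, 1, 0]))          # ideal ordering
print(ndcg_at_k([0, 1, 2, 3]))          # reversed ordering scores lower
print(mrr([[0, 1, 0], [1, 0, 0]]))      # first relevant result at ranks 2 and 1
```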

Online Evaluation

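Interleaving: Merge results from the control and treatment models into a single list, alternating between them, and show it to the same user. Credit each click to the model that contributed the clicked result. Because both models are compared within the same session, it is typically more sensitive than A/B testing for ranking changes.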
A/B Testing: Measure purchase-through rate, revenue per search, session abandonment.
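
One common way to implement interleaving is team-draft interleaving, sketched below. This is a named variant chosen for illustration, not necessarily what any particular platform uses. The fixed seed is only there to make the sketch reproducible; a real system would randomize per request.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k, rng=None):
    """Build a merged list of up to k results and record which model picked each."""
    rng = rng or random.Random(0)
    rankings = {"A": ranking_a, "B": ranking_b}
    count = {"A": 0, "B": 0}
    interleaved, team = [], {}
    while len(interleaved) < k:
        # The model with fewer picks goes next; ties are broken by coin flip.
        if count["A"] != count["B"]:
            side = "A" if count["A"] < count["B"] else "B"
        else:
            side = rng.choice(["A", "B"])
        pick = next((d for d in rankings[side] if d not in team), None)
        if pick is None:
            # This model has nothing new to offer; let the other one pick.
            side = "B" if side == "A" else "A"
            pick = next((d for d in rankings[side] if d not in team), None)
            if pick is None:
                break
        team[pick] = side
        count[side] += 1
        interleaved.append(pick)
    return interleaved, team

def credit_clicks(team, clicked):
    """Count clicks per model; the model with more credited clicks wins."""
    wins = {"A": 0, "B": 0}
    for doc in clicked:
        if doc in team:
            wins[team[doc]] += 1
    return wins

merged, team = team_draft_interleave(["a", "b", "c", "d"], ["b", "e", "a", "f"], k=4)
print(merged, team, credit_clicks(team, clicked=[merged[0]]))
```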

Key Trade-Off: Relevance vs. Revenue

This is often the deepest interview conversation:

Pure relevance ranking → users find cheap products, lower revenue
Pure revenue ranking → users see expensive products, leave frustrated
Solution: Multi-objective ranking with constraints. Primary objective: relevance. Add revenue as a secondary signal with a tunable weight. Monitor user satisfaction (abandonment rate) as a guardrail.
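
A minimal sketch of that idea: blend relevance with a normalized revenue signal, but only for items that clear a relevance floor. The weight, the floor, the field names, and the example scores are all hypothetical; in practice the weight would be tuned through A/B tests against the abandonment guardrail.

```python
def blended_score(relevance, revenue_norm, revenue_weight=0.2):
    """Linear blend; revenue_weight=0 falls back to pure relevance ranking."""
    return (1.0 - revenue_weight) * relevance + revenue_weight * revenue_norm

def rerank(items, revenue_weight=0.2, min_relevance=0.5):
    """Blend revenue in only for items that clear the relevance floor."""
    eligible = [it for it in items if it["relevance"] >= min_relevance]
    rest = [it for it in items if it["relevance"] < min_relevance]
    eligible.sort(
        key=lambda it: blended_score(it["relevance"], it["revenue_norm"], revenue_weight),
        reverse=True,
    )
    # Low-relevance items never benefit from revenue, whatever their price.
    rest.sort(key=lambda it: it["relevance"], reverse=True)
    return eligible + rest

items = [
    {"name": "$20 shoe", "relevance": 0.90, "revenue_norm": 0.15},
    {"name": "$120 premium shoe", "relevance": 0.85, "revenue_norm": 0.90},
    {"name": "$300 treadmill", "relevance": 0.20, "revenue_norm": 1.00},
]

print([it["name"] for it in rerank(items, revenue_weight=0.0)])
print([it["name"] for it in rerank(items, revenue_weight=0.2)])
```

With a weight of 0 the $20 shoe stays on top. With a weight of 0.2 the nearly-as-relevant premium shoe moves ahead, while the expensive but irrelevant item stays last in both cases.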

Practice Problems

Problem 1: Handle the Query "Apple"

Direction

"Apple" is ambiguous - it could mean the fruit, the tech company, or Apple Records. How does your search system handle this?

Key Insight

Use query intent classification to detect ambiguity. For ambiguous queries, diversify results across intents rather than committing to one. Show Apple (tech) products, apple (fruit) products, and use user history to personalize the ranking. If the user has been browsing electronics, boost Apple Inc. results. Key concept: intent diversification - hedge your bets on ambiguous queries.
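
Intent diversification can be sketched as proportional slot allocation: each slot goes to the intent with the highest probability divided by one plus the number of slots it already holds. The intent probabilities, intent names, and result lists below are hypothetical; the probabilities would come from the intent classifier combined with user history.

```python
def diversify(results_by_intent, intent_probs, k):
    """Fill k slots, giving each intent a share roughly proportional to its probability."""
    picked = {intent: 0 for intent in results_by_intent}
    page = []
    while len(page) < k:
        open_intents = [i for i in results_by_intent if picked[i] < len(results_by_intent[i])]
        if not open_intents:
            break
        # Each slot already given to an intent reduces its claim on the next one.
        best = max(open_intents, key=lambda i: intent_probs.get(i, 0.0) / (picked[i] + 1))
        page.append(results_by_intent[best][picked[best]])
        picked[best] += 1
    return page

results = {
    "tech": ["iPhone", "MacBook", "AirPods", "iPad"],
    "fruit": ["Gala apples", "Apple juice"],
}

# No strong user signal: hedge across both intents.
print(diversify(results, {"tech": 0.7, "fruit": 0.3}, k=4))
# User has been browsing electronics: the tech intent dominates.
print(diversify(results, {"tech": 0.9, "fruit": 0.1}, k=4))
```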

Problem 2: Latency Optimization

Direction

Your ranking model takes 80ms for 500 candidates, exceeding your 50ms budget. How do you speed it up without sacrificing quality?

Key Insight

Options: (1) Reduce candidates from 500 to 200 with a faster L2 stage. (2) Model distillation - train a smaller model to mimic the large model. (3) Feature pruning - remove features with low importance. (4) Quantization - INT8 inference. (5) Caching - cache scores for popular query-item pairs. The best answer combines multiple approaches and quantifies the quality-latency trade-off.

Interview Cheat Sheet

Question Pattern	Framework	Key Phrases
"Design search for X"	Multi-stage retrieval + LTR	"L0 boolean → L1 BM25 → L2 lightweight → L3 full model"
"How do you handle typos?"	Query understanding pipeline	"Spell correction, query expansion, intent classification"
"How do you train the ranking model?"	Training data + debiasing	"Click data with inverse propensity weighting to correct for position bias"
"Relevance vs. revenue?"	Multi-objective ranking	"Relevance as primary, revenue as secondary signal with guardrails on abandonment"

Spaced Repetition Checkpoints

Day 0: Draw the multi-stage search architecture from memory (Query Understanding → L0 → L1 → L2 → L3 → Re-rank).
Day 3: Explain pointwise vs. pairwise vs. listwise LTR. When would you use each?
Day 7: Design search ranking for a job board in 45 minutes. Focus on query understanding and personalization.
Day 14: Explain position bias and how to correct for it in training data.
Day 21: Mock interview on search ranking with follow-ups on latency optimization and relevance-revenue trade-offs.

What's Next

Fraud Detection - Real-time classification with extreme class imbalance
Ad Click Prediction - Related ranking problem in advertising context
