Design: Search Ranking - Multi-Stage Ranking at Scale

Reading time: ~25 min | Interview relevance: Critical | Roles: MLE, AI Eng

The Real Interview Moment

"Design the search ranking system for an e-commerce platform." You start describing TF-IDF and BM25. The interviewer nods but then asks: "A user searches for 'running shoes.' They've been browsing hiking gear all session. How does your system handle that?" You hadn't thought about personalization. "Also, how do you balance relevance with revenue? The highest-relevance result might be a $20 shoe, but the business wants to promote the $120 premium version."

Search ranking is harder than recommendation because it's query-dependent, latency-sensitive, and involves multi-objective trade-offs. The best candidates show they understand the full pipeline - from query understanding to serving.

What You Will Master

  • Query understanding: spelling correction, query expansion, intent classification
  • Multi-stage retrieval: boolean → BM25 → semantic → re-ranking
  • Learning-to-rank: pointwise, pairwise, and listwise approaches
  • Feature engineering for search relevance
  • Personalization within search results
  • Balancing relevance with business objectives
  • Online evaluation for search quality (interleaving, NDCG)

The Complete Design

Step 1: Requirements (5 min)

Functional requirements:

  • E-commerce search for 50M products, 500M queries/day
  • Autocomplete, spell correction, query suggestions
  • Ranked results with filtering (price, category, brand)
  • Personalized results based on user history

Non-functional requirements:

  • Latency: <150ms p99 for search results
  • Relevance: Users find what they want in top 5 results 80% of the time
  • Freshness: New products searchable within 1 hour
  • Availability: 99.99% - search is the primary product experience

Step 2: Problem Formulation (5 min)

| Business Goal | ML Objective | Primary Metric | Guardrail Metrics |
| --- | --- | --- | --- |
| Users find and buy products | Learning-to-rank: order results by P(purchase \| query, item, user) | NDCG@10, Purchase-through rate | Query abandonment rate, zero-result rate, latency |

ML problem type: Learning-to-rank (LTR)

Three approaches to LTR:

| Approach | How It Works | Pro | Con |
| --- | --- | --- | --- |
| Pointwise | Predict relevance score per item independently | Simple, easy to train | Ignores relative ordering |
| Pairwise (LambdaMART) | Learn to order pairs correctly | Better ranking quality | More training data needed |
| Listwise (ListNet, Deep Ranking) | Optimize ranking metrics directly | Best quality, handles position bias | Complex, slower training |
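
To make the three families concrete, here is a minimal sketch of one representative loss for each. The function names and toy scores are illustrative assumptions, not a reference implementation: squared error stands in for pointwise, a RankNet-style logistic loss for pairwise, and a ListNet-style top-one cross-entropy for listwise.

```python
import math

def pointwise_loss(scores, labels):
    """Squared error per item: each item is judged on its own."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)

def pairwise_loss(scores, labels):
    """RankNet-style logistic loss over pairs (i, j) where i is more relevant than j."""
    losses = [
        math.log1p(math.exp(-(scores[i] - scores[j])))
        for i in range(len(scores))
        for j in range(len(scores))
        if labels[i] > labels[j]
    ]
    return sum(losses) / len(losses) if losses else 0.0

def listwise_loss(scores, labels):
    """ListNet-style cross-entropy between softmax(labels) and softmax(scores)."""
    def softmax(xs):
        m = max(xs)
        exps = [math.exp(x - m) for x in xs]
        total = sum(exps)
        return [e / total for e in exps]
    p_true, p_pred = softmax(labels), softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(p_true, p_pred))

# Toy example: graded labels for three items returned for one query.
labels = [3, 1, 0]
good = [2.0, 1.0, 0.0]            # correct order
bad = [0.0, 1.0, 2.0]             # reversed order
shifted = [s + 5.0 for s in good]  # same order as `good`, different scale

print(pointwise_loss(good, labels), pointwise_loss(shifted, labels))
print(pairwise_loss(good, labels), pairwise_loss(shifted, labels))
print(listwise_loss(good, labels), listwise_loss(bad, labels))
```

Shifting every score by a constant leaves the ordering unchanged, so the pairwise and listwise losses stay the same while the pointwise loss grows. That is the sense in which pointwise training ignores relative ordering.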

Step 3: Features & Data (8 min)

The Search Pipeline

Search Ranking Pipeline - Query → Query Understanding → Retrieval → Ranking → Re-Ranking → Results

Query Understanding

| Component | What It Does | Example |
| --- | --- | --- |
| Spell correction | Fix typos | "runnign shoes" → "running shoes" |
| Query expansion | Add synonyms | "laptop" → "laptop OR notebook OR computer" |
| Intent classification | Detect query type | "cheap running shoes" → intent: price-sensitive |
| Entity recognition | Extract structured info | "Nike Air Max size 10" → brand: Nike, product: Air Max, size: 10 |
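
A rule-based version of these components can be sketched in a few lines. Everything below is a toy assumption for illustration: the vocabulary, synonym map, brand list, and price words are hypothetical stand-ins for resources a real system would build from the catalog and query logs, and production systems typically use learned models for intent and entities.

```python
import difflib
import re

# Hypothetical toy resources (assumptions, not real catalog data).
VOCAB = ["running", "shoes", "laptop", "nike", "air", "max", "size", "cheap"]
SYNONYMS = {"laptop": ["notebook", "computer"]}
BRANDS = {"nike"}
PRICE_WORDS = {"cheap", "budget", "affordable"}

def correct_token(token):
    """Replace a token with its closest vocabulary word if one is similar enough."""
    if token.isdigit():
        return token
    matches = difflib.get_close_matches(token, VOCAB, n=1, cutoff=0.8)
    return matches[0] if matches else token

def understand(query):
    """Run spell correction, expansion, intent rules, and entity extraction."""
    tokens = [correct_token(t) for t in query.lower().split()]
    corrected = " ".join(tokens)
    entities = {}
    for t in tokens:
        if t in BRANDS:
            entities["brand"] = t
    size = re.search(r"\bsize (\d+(?:\.\d+)?)", corrected)
    if size:
        entities["size"] = size.group(1)
    return {
        "corrected": corrected,
        # Each token becomes an OR-group of itself plus its synonyms.
        "expanded": [[t] + SYNONYMS.get(t, []) for t in tokens],
        "intent": "price-sensitive" if PRICE_WORDS & set(tokens) else "general",
        "entities": entities,
    }

print(understand("runnign shoes")["corrected"])
print(understand("Nike Air Max size 10")["entities"])
```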

Feature Categories

| Category | Features | Signal Strength |
| --- | --- | --- |
| Query-Item | BM25 score, semantic similarity, title match, description match | Very high |
| Item | Sales rank, review rating, review count, price, conversion rate, return rate | High |
| Query | Query length, query frequency, query category, commercial intent score | Medium |
| User-Query | Past clicks for similar queries, purchase history in category | High |
| Context | Device, time of day, location, session browse history | Medium |
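
The BM25 score in the Query-Item row is simple enough to compute by hand. The sketch below assumes the common Okapi BM25 form with the smoothed IDF used by Lucene-style engines; the `k1` and `b` defaults and the three toy documents are illustrative choices.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "trail running shoes".split(),
    "leather dress shoes".split(),
    "hiking backpack".split(),
]
scores = bm25_scores(["running", "shoes"], docs)
print(scores)
```

The first document matches both query terms, the second matches only the more common term "shoes", and the third matches nothing, so the scores decrease in that order.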

Training Data

  • Human relevance judgments: Expert annotators label query-item pairs (0-4 relevance scale). Expensive but high quality.
  • Click-through data: Implicit labels from clicks. Abundant but biased - users can only click what was shown, and position 1 gets more clicks regardless.
  • Purchase data: Strongest signal but sparse - only 2-3% of searches lead to purchase.

Common Trap

Click data has severe position bias. An item at position 1 can get on the order of 10x more clicks than the same item at position 5. If you train on raw click data, you'll learn to replicate position bias, not relevance. Use inverse propensity weighting or counterfactual learning to debias. Mention this in the interview - it separates Strong Hire from Lean Hire.
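
As a rough sketch of inverse propensity weighting: each click is weighted by one over the probability that its position was examined at all. The propensity values here are assumptions chosen to match the 10x example (in practice they would be estimated, for instance from randomized result swaps), and `min_propensity` is a hypothetical clipping floor to keep the weights from exploding.

```python
def ipw_relevance(impressions, propensity, min_propensity=0.05):
    """Estimate per-item relevance from clicks with position bias divided out.

    impressions: iterable of (item_id, position, clicked) tuples.
    propensity: dict mapping position -> assumed probability of examination.
    """
    weighted_clicks, shown = {}, {}
    for item_id, position, clicked in impressions:
        shown[item_id] = shown.get(item_id, 0) + 1
        if clicked:
            p = max(propensity[position], min_propensity)  # clip tiny propensities
            weighted_clicks[item_id] = weighted_clicks.get(item_id, 0.0) + 1.0 / p
    return {item: weighted_clicks.get(item, 0.0) / n for item, n in shown.items()}

# Assumed propensities: position 5 is examined 10x less often than position 1.
PROPENSITY = {1: 1.0, 5: 0.1}

# Toy log: item A shown at position 1, item B shown at position 5.
log = (
    [("A", 1, True)] * 100 + [("A", 1, False)] * 900
    + [("B", 5, True)] * 10 + [("B", 5, False)] * 990
)
raw_ctr = {"A": 100 / 1000, "B": 10 / 1000}
debiased = ipw_relevance(log, PROPENSITY)
print(raw_ctr, debiased)
```

On raw click-through rate, A looks ten times better than B. After weighting, the two items come out equally relevant, which is what the toy log was constructed to show.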

Step 4: Model (8 min)

The Progression

Search Ranking Model Progression - BM25 → BM25+LR → LambdaMART → Deep Ranking

| Model | NDCG@10 (typical) | Latency | When to Use |
| --- | --- | --- | --- |
| BM25 | 0.35 | 5ms | Baseline, always have as fallback |
| BM25 + LR | 0.42 | 8ms | First ML model, easy to debug |
| LambdaMART | 0.48 | 15ms | Production workhorse, interpretable |
| Deep Ranking (BERT-based) | 0.52 | 50ms | Best quality, requires GPU serving |

Recommendation: LambdaMART (XGBoost with LambdaRank loss) is the sweet spot for most production systems - strong quality, fast inference, interpretable feature importance.

Step 5: Serving (8 min)

Multi-Stage Architecture

Multi-Stage Search Architecture - Query Understanding → L0 Boolean → L1 BM25 → L2 Light Rank → L3 Full Rank → Re-Rank

Why so many stages? Scoring all 50M products with the full model is not feasible within the latency budget, so each stage spends more compute per candidate on fewer candidates. L0 is exact match on an inverted index (a few milliseconds). L1 adds BM25 scoring. L2 uses a lightweight model (small tree ensemble). L3 uses the full ranking model with all features. Each stage reduces the candidate set for the next, more expensive stage.

Latency Budget

| Stage | Candidates | Time Budget | Model |
| --- | --- | --- | --- |
| Query Understanding | 1 query | 10ms | Rule-based + small classifier |
| L0: Boolean Filter | 50M → 100K | 5ms | Inverted index (Elasticsearch) |
| L1: BM25 Retrieval | 100K → 10K | 10ms | BM25 scoring |
| L2: Lightweight Rank | 10K → 500 | 20ms | Small GBDT (20 features) |
| L3: Full Rank | 500 → 50 | 40ms | Full LambdaMART or deep model |
| Re-Ranking | 50 → 50 | 10ms | Business rules, diversity |
| Total | | <100ms | |
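
The funnel can be sketched as a loop over stages, each scoring the surviving candidates and keeping the top few. The `Stage` class, the two toy scorers, and the six-item catalog are hypothetical stand-ins. Only the per-stage budgets are taken from the table, and they sum to 95ms.

```python
from dataclasses import dataclass
from typing import Callable, List

# Per-stage budgets from the table above, in milliseconds.
BUDGET_MS = {
    "query_understanding": 10, "l0_boolean": 5, "l1_bm25": 10,
    "l2_light_rank": 20, "l3_full_rank": 40, "re_rank": 10,
}
total_budget_ms = sum(BUDGET_MS.values())

@dataclass
class Stage:
    name: str
    keep: int                           # candidates passed to the next stage
    score: Callable[[str, str], float]  # (query, item) -> score

def cheap_overlap(query, item):
    """Cheap stand-in scorer: number of shared tokens."""
    return len(set(query.split()) & set(item.split()))

def jaccard(query, item):
    """Stand-in for a costlier, more precise scorer: Jaccard similarity."""
    q, i = set(query.split()), set(item.split())
    return len(q & i) / len(q | i)

def run_cascade(query: str, candidates: List[str], stages: List[Stage]) -> List[str]:
    for stage in stages:
        ranked = sorted(candidates, key=lambda item: stage.score(query, item), reverse=True)
        candidates = ranked[: stage.keep]  # shrink the set before the next stage
    return candidates

catalog = [
    "trail running shoes", "running shoes", "running socks",
    "dress shoes", "hiking backpack", "shoes for running and hiking",
]
stages = [Stage("cheap", keep=4, score=cheap_overlap), Stage("full", keep=2, score=jaccard)]
top = run_cascade("running shoes", catalog, stages)
print(total_budget_ms, top)
```

The cheap stage cannot tell the three two-token matches apart, so the second stage is what puts the exact match first. The 95ms total leaves headroom under the 150ms p99 requirement.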

Step 6: Evaluation & Iteration (8 min)

Offline Evaluation

| Metric | What It Measures | Target |
| --- | --- | --- |
| NDCG@10 | Ranking quality | > 0.45 |
| MRR | Position of first relevant result | > 0.6 |
| Zero-result rate | % of queries with no results | < 2% |
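
For reference, NDCG@k and MRR can be computed as in the sketch below. It assumes graded labels on the 0-4 scale, the exponential gain `2^rel - 1`, and an ideal ordering taken over the judged items in the returned list only. These are common conventions, though other variants exist.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top k graded relevance labels."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def mrr(ranked_relevance_lists):
    """Mean reciprocal rank of the first relevant result across queries."""
    total = 0.0
    for rels in ranked_relevance_lists:
        for rank, rel in enumerate(rels, start=1):
            if rel > 0:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance_lists)

perfect = ndcg_at_k([3, 2, 1, 0], k=10)
reversed_order = ndcg_at_k([0, 1, 2, 3], k=10)
two_query_mrr = mrr([[0, 1, 0], [1, 0, 0]])  # first hits at ranks 2 and 1
print(perfect, reversed_order, two_query_mrr)
```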

Online Evaluation

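  • Interleaving: Merge results from the control and treatment models into a single list shown to each user, alternating between the two, and credit each click to the model that contributed the clicked result. Because every user sees both models, interleaving is more sensitive than A/B testing for ranking changes.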
  • A/B Testing: Measure purchase-through rate, revenue per search, session abandonment.
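
One widely used interleaving scheme is team-draft interleaving. The source does not say which variant it has in mind, so treat the sketch below as one possible reading. In each round a coin flip decides which model drafts first, each model adds its best not-yet-shown result, and clicks are credited to the model that drafted the item.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k, seed=0):
    """Build one interleaved list and remember which model drafted each item."""
    rng = random.Random(seed)
    rankings = {"A": ranking_a, "B": ranking_b}
    interleaved, team_of = [], {}
    while len(interleaved) < k:
        # Coin flip decides which model drafts first in this round.
        order = ["A", "B"] if rng.random() < 0.5 else ["B", "A"]
        picked_any = False
        for team in order:
            if len(interleaved) >= k:
                break
            pick = next((x for x in rankings[team] if x not in team_of), None)
            if pick is not None:
                team_of[pick] = team
                interleaved.append(pick)
                picked_any = True
        if not picked_any:
            break  # both rankings are exhausted
    return interleaved, team_of

def credit_clicks(clicked_items, team_of):
    """Count clicks per model; the model with more credited clicks wins."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in team_of:
            wins[team_of[item]] += 1
    return wins

shown, team_of = team_draft_interleave(["a", "b", "c"], ["c", "a", "d"], k=4)
wins = credit_clicks(["c", "d"], team_of)
print(shown, wins)
```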

Key Trade-Off: Relevance vs. Revenue

This is often the deepest interview conversation:

  • Pure relevance ranking → users find cheap products, lower revenue
  • Pure revenue ranking → users see expensive products, leave frustrated
  • Solution: Multi-objective ranking with constraints. Primary objective: relevance. Add revenue as a secondary signal with a tunable weight. Monitor user satisfaction (abandonment rate) as a guardrail.
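
One simple way to express "relevance first, revenue as a weighted secondary signal" is a blended score with a relevance floor. The weight, the floor, and the normalized `revenue` field below are illustrative assumptions. In practice the weight would be tuned in A/B tests against the abandonment guardrail.

```python
def rerank(candidates, revenue_weight=0.2, min_relevance=0.5):
    """Blend relevance with revenue, but never surface items below a relevance floor.

    candidates: dicts with 'relevance' and 'revenue', both assumed scaled to [0, 1].
    """
    eligible = [c for c in candidates if c["relevance"] >= min_relevance]
    return sorted(
        eligible,
        key=lambda c: c["relevance"] + revenue_weight * c["revenue"],
        reverse=True,
    )

candidates = [
    {"name": "$20 shoe", "relevance": 0.90, "revenue": 0.1},
    {"name": "$120 premium shoe", "relevance": 0.85, "revenue": 0.6},
    {"name": "$300 unrelated watch", "relevance": 0.20, "revenue": 1.0},
]
relevance_only = [c["name"] for c in rerank(candidates, revenue_weight=0.0)]
blended = [c["name"] for c in rerank(candidates, revenue_weight=0.2)]
print(relevance_only, blended)
```

With a weight of 0.2 the premium shoe overtakes the slightly more relevant budget shoe. The relevance floor keeps the high-margin but irrelevant item out regardless of the weight.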

Practice Problems

Problem 1: Handle the Query "Apple"

Direction

"Apple" is ambiguous - it could mean the fruit, the tech company, or Apple Records. How does your search system handle this?

Key Insight

Use query intent classification to detect ambiguity. For ambiguous queries, diversify results across intents rather than committing to one. Show Apple (tech) products, apple (fruit) products, and use user history to personalize the ranking. If the user has been browsing electronics, boost Apple Inc. results. Key concept: intent diversification - hedge your bets on ambiguous queries.
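
Intent diversification can be sketched as proportional slot allocation: each slot goes to the intent with the highest probability divided by one plus the number of slots it already holds. The intent probabilities and product names below are made up for illustration. A personalization boost would simply raise the probability of the intent that matches the user's history before calling the function.

```python
def diversify(results_by_intent, intent_probs, k):
    """Fill k slots across intents roughly in proportion to intent probability."""
    taken = {intent: 0 for intent in results_by_intent}
    output = []
    while len(output) < k:
        # Only intents that still have unshown results can receive a slot.
        available = [i for i in results_by_intent if taken[i] < len(results_by_intent[i])]
        if not available:
            break
        best = max(available, key=lambda i: intent_probs.get(i, 0.0) / (taken[i] + 1))
        output.append(results_by_intent[best][taken[best]])
        taken[best] += 1
    return output

results_by_intent = {
    "tech": ["iPhone", "MacBook", "AirPods"],
    "fruit": ["Gala apples", "Apple juice"],
    "music": ["Apple Records vinyl"],
}
# Hypothetical classifier output for the query "apple".
intent_probs = {"tech": 0.5, "fruit": 0.3, "music": 0.2}
page = diversify(results_by_intent, intent_probs, k=5)
print(page)
```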

Problem 2: Latency Optimization

Direction

Your ranking model takes 80ms for 500 candidates, exceeding your 50ms budget. How do you speed it up without sacrificing quality?

Key Insight

Options: (1) Reduce candidates from 500 to 200 with a faster L2 stage. (2) Model distillation - train a smaller model to mimic the large model. (3) Feature pruning - remove features with low importance. (4) Quantization - INT8 inference. (5) Caching - cache scores for popular query-item pairs. The best answer combines multiple approaches and quantifies the quality-latency trade-off.
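
Of these options, caching is the easiest to sketch. The example below wraps a placeholder scorer in `functools.lru_cache`. `expensive_score` is a hypothetical stand-in for the full model, and the approach only applies to scores that do not depend on user-specific features, since personalized scores cannot be shared across users.

```python
from functools import lru_cache

CALLS = {"expensive": 0}

def expensive_score(query, item_id):
    """Placeholder for the full ranking model (assumption: user-independent)."""
    CALLS["expensive"] += 1
    return float(len(query) + len(item_id))

@lru_cache(maxsize=100_000)
def cached_score(query, item_id):
    """Serve repeated (query, item) pairs from memory instead of re-scoring."""
    return expensive_score(query, item_id)

first = cached_score("running shoes", "sku-1")
second = cached_score("running shoes", "sku-1")  # served from the cache
print(CALLS["expensive"], cached_score.cache_info().hits)
```

Cached scores go stale when item features change, so a real deployment would also need an expiry or invalidation policy, and that cost belongs in the quality-latency trade-off.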

Interview Cheat Sheet

| Question Pattern | Framework | Key Phrases |
| --- | --- | --- |
| "Design search for X" | Multi-stage retrieval + LTR | "L0 boolean → L1 BM25 → L2 lightweight → L3 full model" |
| "How do you handle typos?" | Query understanding pipeline | "Spell correction, query expansion, intent classification" |
| "How do you train the ranking model?" | Training data + debiasing | "Click data with inverse propensity weighting to correct for position bias" |
| "Relevance vs. revenue?" | Multi-objective ranking | "Relevance as primary, revenue as secondary signal with guardrails on abandonment" |

Spaced Repetition Checkpoints

  • Day 0: Draw the multi-stage search architecture from memory (Query Understanding → L0 → L1 → L2 → L3 → Re-rank).
  • Day 3: Explain pointwise vs. pairwise vs. listwise LTR. When would you use each?
  • Day 7: Design search ranking for a job board in 45 minutes. Focus on query understanding and personalization.
  • Day 14: Explain position bias and how to correct for it in training data.
  • Day 21: Mock interview on search ranking with follow-ups on latency optimization and relevance-revenue trade-offs.

© 2026 EngineersOfAI. All rights reserved.