Design: Recommendation System - The Most Common Design Question
Reading time: ~25 min | Interview relevance: Critical | Roles: MLE, AI Eng
The Real Interview Moment
The interviewer says: "Design a recommendation system for a video streaming platform like Netflix." You immediately start talking about matrix factorization and collaborative filtering. Five minutes in, the interviewer asks: "How many users? What's your latency budget? Are you optimizing for engagement or retention?" You haven't asked. You've been designing in a vacuum.
The strongest candidates start by asking questions that change the design. "Are we recommending for the homepage or for a specific context like 'watch next'? Do we need real-time personalization or can we pre-compute? How do we handle new users with no history?" These questions shape every subsequent decision.
What You Will Master
- How to formulate recommendation as an ML problem with clear metrics
- Multi-stage retrieval + ranking architecture used by Netflix, YouTube, Spotify
- Collaborative filtering vs. content-based vs. hybrid approaches with trade-offs
- Cold start strategies for new users and new items
- Feature engineering for user-item interactions
- Real-time vs. batch serving trade-offs
- Online evaluation with A/B testing and interleaving
- Diversity, freshness, and business rule re-ranking
The Complete Design
Step 1: Requirements (5 min)
Functional requirements:
- Personalized recommendations for 100M+ users across 500K items
- Homepage recommendations (top 50 candidates across genres)
- "Because you watched X" contextual rows
- New user onboarding experience
Non-functional requirements:
- Latency: <200ms for homepage load
- Freshness: New content surfaced within hours
- Coverage: 80%+ of catalog recommended to at least some users
- Availability: 99.9% - fallback to popularity-based if model fails
When a candidate asks "Are we optimizing for watch time or number of sessions?", I know they've built real systems. This single question changes whether you use a regression model (predict watch time) or a classification model (predict will-click). It's the kind of question that separates Senior from Mid-level.
Step 2: Problem Formulation (5 min)
| Business Goal | ML Objective | Primary Metric | Guardrail Metrics |
|---|---|---|---|
| Increase engagement | Predict P(watch > 2 min \| user, item) | Watch-through rate | Content diversity, catalog coverage, session length |
ML problem type: Two-stage system
- Retrieval: Given a user, retrieve 1000 candidate items from 500K (recall-focused)
- Ranking: Score and rank 1000 → top 50 (precision-focused)
Why two stages? Scoring all 500K items with an expensive ranking model on every request would not fit in the <200ms latency budget. The retrieval stage uses cheap models to narrow the candidate set to ~1000 items; the ranking stage then runs the expensive model on that small set only.
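The funnel described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not a production design: the random embeddings, the 5,000-item catalog, and `toy_ranker` (a stand-in for the deep ranking model) are all hypothetical.

```python
import heapq
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(user_vec, item_vecs, k):
    """Cheap, recall-oriented stage: top-k item ids by embedding dot product."""
    return heapq.nlargest(k, item_vecs, key=lambda i: dot(user_vec, item_vecs[i]))

def rank(user_vec, candidates, item_vecs, score_fn, n):
    """Expensive, precision-oriented stage: score only the retrieved candidates."""
    scored = sorted(((score_fn(user_vec, item_vecs[i]), i) for i in candidates), reverse=True)
    return [item_id for _, item_id in scored[:n]]

def toy_ranker(user_vec, item_vec):
    # Hypothetical stand-in for a deep model with many more features.
    return dot(user_vec, item_vec) - 0.1 * sum(abs(x) for x in item_vec)

rng = random.Random(0)
DIM = 8
item_vecs = {i: [rng.gauss(0, 1) for _ in range(DIM)] for i in range(5000)}
user_vec = [rng.gauss(0, 1) for _ in range(DIM)]

candidates = retrieve(user_vec, item_vecs, k=1000)               # catalog -> 1000
top50 = rank(user_vec, candidates, item_vecs, toy_ranker, n=50)  # 1000 -> 50
```

The expensive scorer runs 1,000 times per request instead of once per catalog item, which is the whole point of the split. In a real system the brute-force `retrieve` would typically be replaced by an approximate nearest neighbor index.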
"I'd design this as a two-stage system. First, a retrieval stage that uses multiple candidate generators - collaborative filtering, content-based, and trending - to pull ~1000 candidates from the full catalog. Second, a ranking stage that scores each candidate with a deep model using user features, item features, and context features, predicting the probability the user watches more than 2 minutes. Finally, a re-ranking layer applies business rules for diversity, freshness, and content policies. This architecture lets us balance quality with latency - the retrieval stage is fast and recall-oriented, the ranking stage is precise but only scores 1000 items."
Step 3: Features & Data (8 min)
Feature Categories
| Category | Features | Freshness |
|---|---|---|
| User | Watch history, genre preferences, avg session length, time-of-day patterns, account age, subscription tier | Updated hourly |
| Item | Genre, director, cast, release year, avg rating, popularity score, duration, content embeddings | Updated on change |
| Context | Time of day, day of week, device type, recent watches (last 3), current session length | Real-time |
| User-Item Cross | User-genre affinity scores, user-actor preference, collaborative filtering scores | Batch (daily) |
Training Data
- Positive labels: User watched > 2 minutes (implicit feedback)
- Negative sampling: Items shown but not clicked + random negatives (ratio 1:4)
- Label delay: Short - we know within minutes whether someone watched past the 2-minute mark
- Data volume: ~1B interactions/day for a Netflix-scale platform
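A minimal sketch of how the labels and negatives above might be assembled from impression logs. Several choices here are assumptions rather than anything stated in the text: "1:4" is read as four negatives per positive, half of them come from shown-but-not-played items, and plays shorter than the threshold are skipped as ambiguous. The function name and log format are hypothetical.

```python
import random

WATCH_THRESHOLD_SECONDS = 120  # "watched > 2 minutes" from the label definition

def build_examples(impressions, catalog, negatives_per_positive=4, seed=0):
    """Turn impression logs into (user_id, item_id, label) training rows.

    impressions: iterable of (user_id, item_id, watched_seconds) for shown items.
    ASSUMPTIONS: four negatives per positive; half are shown-but-not-played
    items (watched_seconds == 0), the rest are random catalog items the user
    was never shown; short plays are skipped as ambiguous.
    """
    rng = random.Random(seed)
    by_user = {}
    for user, item, seconds in impressions:
        bucket = by_user.setdefault(user, {"pos": [], "shown_neg": [], "seen": set()})
        bucket["seen"].add(item)
        if seconds > WATCH_THRESHOLD_SECONDS:
            bucket["pos"].append(item)
        elif seconds == 0:
            bucket["shown_neg"].append(item)

    rows = []
    for user, bucket in by_user.items():
        unseen = [item for item in catalog if item not in bucket["seen"]]
        for pos in bucket["pos"]:
            rows.append((user, pos, 1))
            n_shown = min(negatives_per_positive // 2, len(bucket["shown_neg"]))
            shown = rng.sample(bucket["shown_neg"], n_shown)
            n_random = min(negatives_per_positive - n_shown, len(unseen))
            rows.extend((user, neg, 0) for neg in shown + rng.sample(unseen, n_random))
    return rows
```

Shown-but-not-played items act as harder negatives than random ones, but they also carry the position and selection biases discussed in this step, which is one reason to mix the two sources.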
Don't use "user rated the item" as your primary signal - only 5-10% of users rate content. Implicit feedback (watch time, clicks, completions) is far more abundant and often more predictive. But be careful: implicit feedback is noisy. A user who fell asleep during a movie looks like a highly engaged viewer.
Data Challenges
- Selection bias: Users only interact with items that were recommended - you never observe what they'd do with items they weren't shown
- Popularity bias: Popular items get more interactions, creating a feedback loop
- Position bias: Items shown at position 1 get more clicks regardless of quality
Step 4: Model (8 min)
The Progression
Retrieval Models
| Method | How It Works | Pros | Cons |
|---|---|---|---|
| Collaborative Filtering (ALS) | Matrix factorization on user-item interaction matrix | Captures taste patterns, no content features needed | Cold start, can't explain recommendations |
| Content-Based | TF-IDF or embeddings on item metadata, match to user profile | Handles new items, explainable | Filter bubble, limited discovery |
| Two-Tower Model | Separate user and item encoder networks, dot product similarity | Best retrieval quality, handles cold start via features | Requires training infrastructure |
| Trending/Popular | Recently popular items by category | Simple, no cold start | Not personalized |
Recommendation: Use multiple retrieval sources in parallel, merge candidates, then rank.
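One way to merge candidates from parallel retrieval sources is a round-robin interleave with de-duplication. The sketch below assumes each source returns item ids best-first; the interleaving policy and the `merge_candidates` name are illustrative choices, not something the text prescribes.

```python
def merge_candidates(sources, limit):
    """Merge ranked candidate lists from several retrieval sources.

    sources: dict of source name -> list of item ids, best first.
    Returns (merged_ids, provenance), where provenance maps each kept item to
    the sources that proposed it (usable later as a ranking feature).
    """
    merged, provenance = [], {}
    longest = max((len(items) for items in sources.values()), default=0)
    for position in range(longest):
        for name, items in sources.items():
            if position >= len(items):
                continue
            item = items[position]
            if item in provenance:
                provenance[item].append(name)  # already kept: just record the source
            elif len(merged) < limit:
                merged.append(item)
                provenance[item] = [name]
    return merged, provenance

merged, provenance = merge_candidates(
    {"cf": [1, 2, 3], "content": [2, 4], "trending": [5]}, limit=4
)
```

Round-robin keeps any single generator from crowding out the others before the ranker has a chance to score everything.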
Ranking Model
- Architecture: Deep neural network with feature crosses
- Input: User features + Item features + Context features + Retrieval source scores
- Output: P(watch > 2 min), predicted watch time (multi-task)
- Loss: Binary cross-entropy for the P(watch > 2 min) head, MSE for the watch-time head, combined with task weights
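The combined multi-task loss can be written out directly. This is a plain-Python sketch for clarity, assuming one classification head (watched past 2 minutes) and one regression head (watch time in minutes); the task weights `w_cls` and `w_reg` are hypothetical values that would be tuned in practice.

```python
import math

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one example, with clipping for numerical safety."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def multitask_loss(batch, w_cls=1.0, w_reg=0.1):
    """batch: list of (p_watch, watched_label, predicted_minutes, actual_minutes).

    Returns (total, classification_loss, regression_loss).
    """
    cls = sum(bce(p, y) for p, y, _, _ in batch) / len(batch)
    reg = sum((pred - actual) ** 2 for _, _, pred, actual in batch) / len(batch)
    return w_cls * cls + w_reg * reg, cls, reg

total, cls_loss, reg_loss = multitask_loss([(0.5, 1, 10.0, 10.0), (0.5, 0, 0.0, 2.0)])
```

Because MSE on minutes can be orders of magnitude larger than cross-entropy, the regression weight usually needs to be small (or the target normalized) so one task does not dominate the gradient.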
Step 5: Serving (8 min)
Key Architecture Decisions
| Component | Decision | Rationale |
|---|---|---|
| Candidate generation | Pre-computed ANN index (FAISS/ScaNN) | Sub-10ms retrieval from millions |
| Feature store | Redis for real-time features, Hive for batch | Fast lookups + large-scale aggregation |
| Model serving | TensorFlow Serving with batching | GPU utilization, latency optimization |
| Caching | Cache recommendations for 1 hour per user | Reduces compute, acceptable staleness |
| Fallback | Popularity-based recommendations by genre | When model is down or new user has no features |
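
The caching and fallback rows above can be sketched as a thin serving wrapper. This is an in-process illustration only: the dict cache stands in for an external store, and `RecommendationService`, `model_fn`, and `popular_by_genre` are hypothetical names.

```python
import time

class RecommendationService:
    def __init__(self, model_fn, popular_by_genre, ttl_seconds=3600, clock=time.time):
        self.model_fn = model_fn                  # user_id -> list of item ids
        self.popular_by_genre = popular_by_genre  # genre -> list of item ids
        self.ttl = ttl_seconds                    # 1 hour, per the caching decision
        self.clock = clock
        self.cache = {}                           # user_id -> (expires_at, items)

    def recommend(self, user_id, genre="all"):
        entry = self.cache.get(user_id)
        if entry and entry[0] > self.clock():
            return entry[1], "cache"
        try:
            items = self.model_fn(user_id)
        except Exception:
            # Model down or user has no features: serve popularity, do not cache it.
            return self.popular_by_genre.get(genre, []), "fallback"
        self.cache[user_id] = (self.clock() + self.ttl, items)
        return items, "model"
```

Not caching the fallback response means a user gets personalized results as soon as the model recovers, rather than an hour later.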
Cold Start Strategy
| Scenario | Strategy |
|---|---|
| New user, no history | Popularity-based + onboarding quiz (select 3 genres) |
| New user, 1-5 interactions | Content-based using metadata of watched items |
| New user, more than 5 interactions | Full personalized pipeline kicks in |
| New item, no interactions | Content-based retrieval using metadata embeddings |
| New item, some interactions | Exploration boost - show to diverse user segments |
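
The cold start ladder amounts to a small routing function. The sketch below assumes the boundary sits at 5 interactions for users, and `exploration_budget` is a hypothetical cutoff for when a new item stops receiving an exploration boost; neither the strategy names nor that cutoff come from the table.

```python
def user_strategy(num_interactions, onboarding_genres=None):
    """Pick a recommendation path for a user from the cold start ladder."""
    if num_interactions == 0:
        return "popularity_in_quiz_genres" if onboarding_genres else "global_popularity"
    if num_interactions <= 5:
        return "content_based"
    return "full_personalized"

def item_strategy(num_interactions, exploration_budget=1000):
    """Pick a retrieval path for an item; exploration_budget is an assumed cutoff."""
    if num_interactions == 0:
        return "content_based_retrieval"
    if num_interactions < exploration_budget:
        return "exploration_boost"
    return "standard_pipeline"
```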
Step 6: Evaluation & Iteration (8 min)
Offline Metrics
| Metric | What It Measures | Target |
|---|---|---|
| Recall@K (retrieval) | Share of the items a user actually watched that appear in the top K candidates | Recall@1000 > 0.8 |
| NDCG@K (ranking) | Quality of ranking order | NDCG@50 > 0.4 |
| Hit Rate@K | % of users with at least one relevant item in top K | HR@20 > 0.9 |
| Coverage | % of catalog recommended to any user | > 80% |
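
For reference, the four offline metrics can be computed as follows. This sketch assumes binary relevance (an item is relevant if the user watched it in the held-out period), so NDCG uses gains of 0 or 1.

```python
import math

def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(recommended[:k]) & relevant) / len(relevant)

def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(recommended[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0

def hit_rate_at_k(recs_by_user, relevant_by_user, k):
    """Fraction of users with at least one relevant item in their top k."""
    hits = sum(1 for user, recs in recs_by_user.items()
               if set(recs[:k]) & relevant_by_user.get(user, set()))
    return hits / len(recs_by_user)

def coverage(recs_by_user, catalog_size):
    """Fraction of the catalog that appears in at least one user's list."""
    return len({item for recs in recs_by_user.values() for item in recs}) / catalog_size
```

Recall@K is the natural metric for the retrieval stage because order inside the candidate set does not matter there; NDCG@K only becomes meaningful once the ranker has imposed an order.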
Online Metrics
- Primary: Watch-through rate (% of recommendations watched > 2 min)
- Secondary: Session length, number of sessions per week
- Guardrails: Catalog coverage, genre diversity per user, content policy violations
A/B Testing
- Unit: User-level randomization (not session-level - users need consistent experience)
- Duration: 2 weeks minimum for engagement metrics, 4 weeks for retention
- Power analysis: Detecting a 1% relative lift needs on the order of 500K users per arm; the exact number depends on the baseline rate and the metric's variance
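Both the randomization unit and the sample size can be made concrete. The sketch below assumes the primary metric is treated as a per-user proportion and uses the standard two-proportion normal approximation; the 25% baseline rate is an illustrative assumption, and `assign_arm` is a hypothetical helper showing deterministic user-level bucketing.

```python
import hashlib
import math
from statistics import NormalDist

def assign_arm(user_id, experiment, arms=("control", "treatment")):
    """Deterministic user-level assignment: same user, same arm, every session."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return arms[int(digest, 16) % len(arms)]

def users_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Sample size per arm for a two-sided test on two proportions."""
    delta = baseline_rate * relative_lift
    p1, p2 = baseline_rate, baseline_rate + delta
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)

n = users_per_arm(baseline_rate=0.25, relative_lift=0.01)
```

With the assumed 25% baseline, the result lands in the same ballpark as the ~500K figure above; lower baselines or smaller lifts push the requirement up quickly, since sample size scales with the inverse square of the absolute difference.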
Monitoring
- Feature drift: Monitor embedding distributions and feature statistics daily
- Popularity bias: Track Gini coefficient of recommendation distribution
- Feedback loops: Monitor if recommendations become less diverse over time
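The Gini coefficient used for popularity-bias monitoring can be computed from per-item recommendation counts. A minimal sketch, assuming `counts` holds how many times each catalog item was recommended in the monitoring window (including zeros for items never shown):

```python
def gini(counts):
    """Gini coefficient of a list of non-negative counts.

    0.0 means every item is recommended equally often; values near 1.0 mean a
    few items receive almost all of the exposure.
    """
    values = sorted(counts)
    n, total = len(values), sum(values)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((rank + 1) * value for rank, value in enumerate(values))
    return (2 * weighted) / (n * total) - (n + 1) / n
```

Tracking this value over time is more useful than any single reading: a steady upward drift suggests the feedback loop is concentrating exposure on already-popular items.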
Company Variations
| Company | Key Difference | What They Test |
|---|---|---|
| Netflix | Emphasizes artwork personalization + row generation | Multi-armed bandits, contextual bandits |
| YouTube | Watch time optimization, two-tower at massive scale | Candidate generation at 1B+ items |
| Spotify | Sequential recommendation (playlists), audio embeddings | Sequence models, content understanding |
| Amazon | Purchase prediction, "frequently bought together" | Session-based recommendation, cross-sell |
| TikTok | Real-time adaptation, short content, explore/exploit | Online learning, cold start for items |
Practice Problems
Problem 1: Design "People You May Know" for LinkedIn
Direction
Think about what signals indicate two people should connect. Consider graph-based features, mutual connections, and professional similarity.
Key Insight
This is a link prediction problem on a social graph. The strongest signal is mutual connections (triadic closure), but you also need professional features (same company, same school, same industry). The challenge is scale - LinkedIn has 1B+ users, so candidate generation must be extremely efficient. Graph-based retrieval (friends-of-friends) is the primary candidate source.
Full Solution & Scoring
Strong Hire: Frames as link prediction. Uses graph features (mutual connections, Jaccard similarity) + profile features (industry, company, location). Two-stage: graph-based retrieval → ML ranking. Discusses position bias in PYMK lists. Mentions that showing too many suggestions from one cluster reduces diversity.
Lean Hire: Reasonable approach but misses graph-based retrieval or only uses content features.
No Hire: Treats it as a generic recommendation problem without leveraging the graph structure.
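Friends-of-friends retrieval with mutual-connection and Jaccard features can be sketched as below. It assumes an undirected graph stored as an adjacency dict of sets, which would not fit in memory at LinkedIn scale; the function name and tie-breaking are illustrative.

```python
def pymk_candidates(graph, user, limit=10):
    """Friends-of-friends candidates scored by Jaccard similarity.

    graph: dict of user -> set of connections (assumed undirected).
    Returns [(candidate, mutual_connections, jaccard)], best first.
    """
    friends = graph.get(user, set())
    mutual_counts = {}
    for friend in friends:
        for fof in graph.get(friend, set()):
            if fof == user or fof in friends:
                continue  # skip self and existing connections
            mutual_counts[fof] = mutual_counts.get(fof, 0) + 1

    scored = []
    for candidate, mutual in mutual_counts.items():
        union = len(friends | graph.get(candidate, set()))
        scored.append((candidate, mutual, mutual / union))
    scored.sort(key=lambda row: (-row[2], -row[1], str(row[0])))
    return scored[:limit]

graph = {
    "a": {"b", "c"}, "b": {"a", "d"}, "c": {"a", "d", "e"},
    "d": {"b", "c"}, "e": {"c"},
}
suggestions = pymk_candidates(graph, "a")
```

In a two-stage design these graph scores would be retrieval features; the ML ranker would then combine them with profile features such as shared company or school.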
Problem 2: Cold Start for a New Podcast App
Direction
You launch a podcast recommendation app with zero user data. How do you bootstrap recommendations? Consider what signals are available from day one.
Key Insight
With zero interaction data, you rely on: (1) Content features - podcast transcripts, categories, episode descriptions. (2) Onboarding - ask users to select topics/shows they like. (3) Transfer learning - use pre-trained audio/text embeddings. (4) Popularity - chart rankings, social media mentions. As data accumulates, gradually blend collaborative signals. The key insight is having a plan for transitioning from content-based to hybrid recommendations.
Problem 3: Diversity in Recommendations
Direction
Your recommendation system has high accuracy but users complain that recommendations are "all the same." How do you add diversity without sacrificing relevance?
Key Insight
Use Maximal Marginal Relevance (MMR) or a determinantal point process (DPP) in the re-ranking stage. Score = λ * relevance + (1-λ) * diversity. Diversity can be measured as intra-list distance using category or embedding-based similarity. The trade-off: diversity reduces immediate click rates but improves long-term engagement and catalog coverage. Run A/B tests measuring both short-term (CTR) and long-term (retention) metrics.
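A greedy MMR re-ranker following the score formula above might look like this. It assumes diversity is defined as one minus the maximum similarity to anything already selected, and that `similarity` returns values in [0, 1]; the genre-prefix similarity in the usage lines is a toy assumption.

```python
def mmr_rerank(relevance, similarity, k, lam=0.7):
    """Greedy Maximal Marginal Relevance.

    relevance: dict of item -> relevance score.
    similarity: function (item_a, item_b) -> value in [0, 1].
    Each step picks the item maximizing lam * relevance + (1 - lam) * diversity.
    """
    selected, remaining = [], set(relevance)
    while remaining and len(selected) < k:
        def mmr_score(item):
            max_sim = max((similarity(item, chosen) for chosen in selected), default=0.0)
            return lam * relevance[item] + (1 - lam) * (1 - max_sim)
        best = max(sorted(remaining), key=mmr_score)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected

relevance = {"action_1": 0.9, "action_2": 0.85, "comedy_1": 0.6}
same_genre = lambda a, b: 1.0 if a.split("_")[0] == b.split("_")[0] else 0.0

diverse = mmr_rerank(relevance, same_genre, k=2, lam=0.5)
pure_relevance = mmr_rerank(relevance, same_genre, k=2, lam=1.0)
```

With `lam=0.5` the second slot goes to the comedy title despite its lower relevance; with `lam=1.0` the re-ranker reduces to sorting by relevance. λ is the knob to sweep in the A/B test.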
Interview Cheat Sheet
| Question Pattern | Framework | Key Phrases |
|---|---|---|
| "Design recommendations for X" | Two-stage retrieval + ranking | "Candidate generation narrows from millions to thousands, then a ranking model scores precisely" |
| "How do you handle new users?" | Cold start ladder | "Onboarding → content-based → collaborative as data accumulates" |
| "How do you measure success?" | Offline + online + guardrails | "NDCG offline, A/B test watch-through rate online, monitor diversity as a guardrail" |
| "How do you handle scale?" | ANN + caching + pre-computation | "Pre-computed embeddings with FAISS/ScaNN for sub-10ms retrieval" |
| "How do you avoid filter bubbles?" | Re-ranking + exploration | "MMR for diversity, epsilon-greedy exploration, coverage monitoring" |
Spaced Repetition Checkpoints
- Day 0: Memorize the two-stage architecture (retrieval → ranking → re-ranking). Draw it from memory.
- Day 3: Explain collaborative filtering vs. content-based vs. two-tower models. When would you use each?
- Day 7: Design a complete recommendation system for a music app in 45 minutes. Time yourself.
- Day 14: Practice the cold start question - explain your strategy for new users AND new items.
- Day 21: Do a mock interview. Have your partner ask follow-up questions about diversity, bias, and online evaluation.
What's Next
- Search Ranking - Multi-stage ranking with query understanding
- News Feed Ranking - Multi-objective optimization with real-time features
