Design: Recommendation System - The Most Common Design Question

Reading time: ~25 min | Interview relevance: Critical | Roles: MLE, AI Eng

The Real Interview Moment

The interviewer says: "Design a recommendation system for a video streaming platform like Netflix." You immediately start talking about matrix factorization and collaborative filtering. Five minutes in, the interviewer asks: "How many users? What's your latency budget? Are you optimizing for engagement or retention?" You haven't asked. You've been designing in a vacuum.

The strongest candidates start by asking questions that change the design. "Are we recommending for the homepage or for a specific context like 'watch next'? Do we need real-time personalization or can we pre-compute? How do we handle new users with no history?" These questions shape every subsequent decision.

What You Will Master

How to formulate recommendation as an ML problem with clear metrics
Multi-stage retrieval + ranking architecture used by Netflix, YouTube, Spotify
Collaborative filtering vs. content-based vs. hybrid approaches with trade-offs
Cold start strategies for new users and new items
Feature engineering for user-item interactions
Real-time vs. batch serving trade-offs
Online evaluation with A/B testing and interleaving
Diversity, freshness, and business rule re-ranking

The Complete Design

Step 1: Requirements (5 min)

Functional requirements:

Personalized recommendations for 100M+ users across 500K items
Homepage recommendations (top 50 candidates across genres)
"Because you watched X" contextual rows
New user onboarding experience

Non-functional requirements:

Latency: <200ms for homepage load
Freshness: New content surfaced within hours
Coverage: 80%+ of catalog recommended to at least some users
Availability: 99.9% - fallback to popularity-based if model fails

Interviewer's Perspective

When a candidate asks "Are we optimizing for watch time or number of sessions?", I know they've built real systems. This single question changes whether you use a regression model (predict watch time) or a classification model (predict will-click). It's the kind of question that separates Senior from Mid-level.

Step 2: Problem Formulation (5 min)

Business Goal	ML Objective	Primary Metric	Guardrail Metrics
Increase engagement	Predict P(watch > 2 min | user, item)	Watch-through rate	Content diversity, catalog coverage, session length

ML problem type: Two-stage system

Retrieval: Given a user, retrieve 1000 candidate items from 500K (recall-focused)
Ranking: Score and rank 1000 → top 50 (precision-focused)

Why two stages? Scoring all 500K items per user per request is computationally infeasible at <200ms. The retrieval stage uses cheap models to narrow the candidate set; the ranking stage uses an expensive model on a small set.
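A minimal sketch of the funnel described above, assuming items and users are represented by small embedding vectors. The names `retrieve`, `rank`, and `recommend` are hypothetical, and the brute-force dot product stands in for the ANN index a production system would use.

```python
import heapq
from typing import Callable, Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve(user_vec: Sequence[float],
             item_vecs: Dict[str, Sequence[float]],
             k: int) -> List[str]:
    """Cheap, recall-oriented stage: top-k items by dot product.

    Brute force here for clarity; an ANN index would replace this loop.
    """
    return heapq.nlargest(k, item_vecs, key=lambda i: dot(user_vec, item_vecs[i]))


def rank(candidates: List[str],
         score_fn: Callable[[str], float],
         n: int) -> List[str]:
    """Expensive, precision-oriented stage: score only the retrieved candidates."""
    return sorted(candidates, key=score_fn, reverse=True)[:n]


def recommend(user_vec, item_vecs, score_fn, k: int = 1000, n: int = 50) -> List[str]:
    return rank(retrieve(user_vec, item_vecs, k), score_fn, n)
```

The point of the split is that `score_fn` (the heavy model) runs on `k` items rather than the whole catalog, so its cost is bounded regardless of catalog size.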

60-Second Answer

"I'd design this as a two-stage system. First, a retrieval stage that uses multiple candidate generators - collaborative filtering, content-based, and trending - to pull ~1000 candidates from the full catalog. Second, a ranking stage that scores each candidate with a deep model using user features, item features, and context features, predicting the probability the user watches more than 2 minutes. Finally, a re-ranking layer applies business rules for diversity, freshness, and content policies. This architecture lets us balance quality with latency - the retrieval stage is fast and recall-oriented, the ranking stage is precise but only scores 1000 items."

Step 3: Features & Data (8 min)

Feature Categories

Category	Features	Freshness
User	Watch history, genre preferences, avg session length, time-of-day patterns, account age, subscription tier	Updated hourly
Item	Genre, director, cast, release year, avg rating, popularity score, duration, content embeddings	Updated on change
Context	Time of day, day of week, device type, recent watches (last 3), current session length	Real-time
User-Item Cross	User-genre affinity scores, user-actor preference, collaborative filtering scores	Batch (daily)

Training Data

Positive labels: User watched > 2 minutes (implicit feedback)
Negative sampling: Items shown but not clicked + random negatives (ratio 1:4)
Label delay: Short - we know within minutes whether someone watched
Data volume: ~1B interactions/day for a Netflix-scale platform
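The sampling scheme above could look roughly like the sketch below. It assumes the 1:4 ratio means four negatives per positive, and that impressions without a click are preferred over random catalog items; both readings are assumptions, and `build_training_rows` is a hypothetical helper.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (user_id, item_id)


def build_training_rows(positives: List[Pair],
                        shown_not_clicked: List[Pair],
                        catalog: List[str],
                        neg_per_pos: int = 4,
                        seed: int = 0) -> List[Tuple[str, str, int]]:
    """Return (user, item, label) rows with neg_per_pos negatives per positive."""
    rng = random.Random(seed)
    pos_set = set(positives)
    rows = [(u, i, 1) for u, i in positives]
    for user, _ in positives:
        # Prefer impressions the user ignored: these are informative negatives.
        hard = [i for u, i in shown_not_clicked if u == user and (u, i) not in pos_set]
        rng.shuffle(hard)
        negs = hard[:neg_per_pos]
        # Top up with random catalog items the user has not watched.
        pool = [i for i in catalog if (user, i) not in pos_set and i not in negs]
        negs += rng.sample(pool, min(neg_per_pos - len(negs), len(pool)))
        rows.extend((user, i, 0) for i in negs)
    return rows
```

Random negatives are unlabeled rather than truly negative, so some of them will be items the user would have enjoyed; this is usually tolerated because the catalog is large relative to any one user's taste.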

Common Trap

Don't use "user rated the item" as your primary signal - only 5-10% of users rate content. Implicit feedback (watch time, clicks, completions) is far more abundant and often more predictive. But be careful: implicit feedback is noisy. A user who fell asleep during a movie looks like a highly engaged viewer.

Data Challenges

Selection bias: Users only interact with items that were recommended - you never observe what they'd do with items they weren't shown
Popularity bias: Popular items get more interactions, creating a feedback loop
Position bias: Items shown at position 1 get more clicks regardless of quality
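One common mitigation for position bias is inverse propensity weighting: clicks at rarely examined positions count for more during training. The sketch below assumes examination propensities per position are already known (for example from a small randomized experiment); the propensity values, the clipping threshold, and the `ipw_weights` name are all illustrative assumptions.

```python
from typing import Dict, List, Tuple


def ipw_weights(impressions: List[Tuple[str, int, bool]],
                propensity: Dict[int, float],
                max_weight: float = 10.0) -> List[Tuple[str, int, float]]:
    """Turn (item, position, clicked) logs into (item, label, weight) rows.

    Clicked rows are weighted by 1 / P(position examined), clipped to limit
    variance. Non-clicked rows keep weight 1.0, which is a simplification.
    """
    rows = []
    for item, position, clicked in impressions:
        weight = min(1.0 / propensity[position], max_weight) if clicked else 1.0
        rows.append((item, int(clicked), weight))
    return rows
```

The resulting weights would be passed as per-example sample weights to whatever loss the ranking model uses.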

Step 4: Model (8 min)

The Progression

Recommendation Model Progression - Popularity + CF → Two-Tower → Deep Ranking → Multi-Task

Retrieval Models

Method	How It Works	Pros	Cons
Collaborative Filtering (ALS)	Matrix factorization on user-item interaction matrix	Captures taste patterns, no content features needed	Cold start, can't explain recommendations
Content-Based	TF-IDF or embeddings on item metadata, match to user profile	Handles new items, explainable	Filter bubble, limited discovery
Two-Tower Model	Separate user and item encoder networks, dot product similarity	Best retrieval quality, handles cold start via features	Requires training infrastructure
Trending/Popular	Recently popular items by category	Simple, no cold start	Not personalized

Recommendation: Use multiple retrieval sources in parallel, merge candidates, then rank.
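Merging candidates from parallel sources might be sketched as follows. This is one possible policy (round-robin interleaving with de-duplication), not a prescribed one; `merge_candidates` and the source names are hypothetical. Recording which sources produced each item lets the ranker use that as a feature.

```python
from itertools import zip_longest
from typing import Dict, List


def merge_candidates(sources: Dict[str, List[str]], limit: int) -> Dict[str, List[str]]:
    """Interleave ranked lists from several retrieval sources.

    Returns {item_id: [source names that retrieved it]}, ordered by first
    appearance in round-robin order and truncated to `limit` items.
    Assumes item ids are never None.
    """
    merged: Dict[str, List[str]] = {}
    for tier in zip_longest(*sources.values()):
        for name, item in zip(sources, tier):
            if item is None:  # this source ran out of candidates
                continue
            merged.setdefault(item, []).append(name)
    return dict(list(merged.items())[:limit])
```

Round-robin keeps any single source from crowding out the others before the ranking model has a chance to score everything.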

Ranking Model

Architecture: Deep neural network with feature crosses
Input: User features + Item features + Context features + Retrieval source scores
Output: P(watch > 2 min), predicted watch time (multi-task)
Loss: Binary cross-entropy for the watch > 2 min label, MSE for watch time, combined with task weights
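For a single example, the combined multi-task loss could be written as below. The task weights (`w_cls`, `w_reg`) are placeholder values that would need tuning, and measuring watch time in minutes is an assumption.

```python
import math


def binary_cross_entropy(p: float, y: int, eps: float = 1e-7) -> float:
    """BCE for one example; p is the predicted probability, y is 0 or 1."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def multitask_loss(p_watch: float, y_watch: int,
                   pred_minutes: float, true_minutes: float,
                   w_cls: float = 1.0, w_reg: float = 0.1) -> float:
    """Weighted sum of the classification loss and the watch-time regression loss."""
    cls_loss = binary_cross_entropy(p_watch, y_watch)
    reg_loss = (pred_minutes - true_minutes) ** 2
    return w_cls * cls_loss + w_reg * reg_loss
```

Because squared error on raw minutes can be orders of magnitude larger than cross-entropy, the regression weight (or a log transform of watch time) usually has to be chosen so one task does not dominate the gradient.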

Step 5: Serving (8 min)

Recommendation Serving Pipeline - Candidate Generation → Scoring → Re-Ranking → Response

Key Architecture Decisions

Component	Decision	Rationale
Candidate generation	Pre-computed ANN index (FAISS/ScaNN)	Sub-10ms retrieval from millions
Feature store	Redis for real-time features, Hive for batch	Fast lookups + large-scale aggregation
Model serving	TensorFlow Serving with batching	GPU utilization, latency optimization
Caching	Cache recommendations for 1 hour per user	Reduces compute, acceptable staleness
Fallback	Popularity-based recommendations by genre	When model is down or new user has no features
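The caching and fallback rows can be illustrated with a small in-process sketch. `RecommendationService`, `model_fn`, and `popular_by_genre` are hypothetical names; a real deployment would use a shared cache rather than a Python dict.

```python
import time
from typing import Callable, Dict, List, Tuple


class RecommendationService:
    """Serve cached recommendations, falling back to popularity on model failure."""

    def __init__(self,
                 model_fn: Callable[[str], List[str]],
                 popular_by_genre: Dict[str, List[str]],
                 ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.time) -> None:
        self.model_fn = model_fn
        self.popular_by_genre = popular_by_genre
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._cache: Dict[str, Tuple[float, List[str]]] = {}

    def get(self, user_id: str, genre: str = "all") -> List[str]:
        now = self.clock()
        hit = self._cache.get(user_id)
        if hit is not None and now - hit[0] < self.ttl_seconds:
            return hit[1]  # still fresh
        try:
            recs = self.model_fn(user_id)
        except Exception:
            # Model unavailable: serve popular items and do not cache them,
            # so the next request retries the model.
            return self.popular_by_genre.get(genre, [])
        self._cache[user_id] = (now, recs)
        return recs
```

Not caching the fallback is a deliberate choice in this sketch: a brief outage should not pin users to unpersonalized results for the full hour.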

Cold Start Strategy

Scenario	Strategy
New user, no history	Popularity-based + onboarding quiz (select 3 genres)
New user, 1-5 interactions	Content-based using metadata of watched items
New user, more than 5 interactions	Full personalized pipeline kicks in
New item, no interactions	Content-based retrieval using metadata embeddings
New item, some interactions	Exploration boost - show to diverse user segments
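The cold start table amounts to a routing rule on interaction counts, sketched below. The strategy labels are hypothetical, the code treats exactly 5 interactions as still content-based, and `min_for_standard` (when an item stops needing an exploration boost) is an invented threshold since the table only says "some interactions".

```python
def user_strategy(n_interactions: int) -> str:
    """Pick a recommendation strategy for a user by history size."""
    if n_interactions <= 0:
        return "popularity_plus_onboarding"
    if n_interactions <= 5:
        return "content_based"
    return "full_personalized"


def item_strategy(n_interactions: int, min_for_standard: int = 50) -> str:
    """Pick a retrieval strategy for an item by interaction count.

    min_for_standard is an assumed cutoff, not a value from the table.
    """
    if n_interactions <= 0:
        return "content_based_retrieval"
    if n_interactions < min_for_standard:
        return "exploration_boost"
    return "standard"
```

In practice the hand-off is often gradual (blending scores by a weight that grows with history) rather than a hard switch at a threshold.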

Step 6: Evaluation & Iteration (8 min)

Offline Metrics

Metric	What It Measures	Target
Recall@K (retrieval)	% of the items the user actually watched (held out) that appear in the top K candidates	Recall@1000 > 0.8
NDCG@K (ranking)	Quality of ranking order	NDCG@50 > 0.4
Hit Rate@K	% of users with at least one relevant item in top K	HR@20 > 0.9
Coverage	% of catalog recommended to any user	> 80%
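For reference, the per-user versions of these metrics can be computed as in the sketch below, assuming binary relevance (an item is relevant if the user watched it in the held-out period). Reported numbers are averages over users.

```python
import math
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def hit_at_k(ranked: List[str], relevant: Set[str], k: int) -> int:
    """1 if at least one relevant item is in the top k, else 0."""
    return int(any(item in relevant for item in ranked[:k]))


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the list divided by the best possible DCG."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0


def coverage(all_recommendations: List[List[str]], catalog_size: int) -> float:
    """Fraction of the catalog that appears in at least one user's list."""
    distinct = {item for recs in all_recommendations for item in recs}
    return len(distinct) / catalog_size
```

Hit Rate@K is the mean of `hit_at_k` across users.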

Online Metrics

Primary: Watch-through rate (% of recommendations watched > 2 min)
Secondary: Session length, number of sessions per week
Guardrails: Catalog coverage, genre diversity per user, content policy violations

A/B Testing

Unit: User-level randomization (not session-level - users need consistent experience)
Duration: 2 weeks minimum for engagement metrics, 4 weeks for retention
Power analysis: Need ~500K users per arm for 1% relative lift detection
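The per-arm sample size depends heavily on the baseline rate, so it is worth being able to derive it. The sketch below uses the standard normal-approximation formula for comparing two proportions; the 20% baseline watch-through rate in the usage note is an assumed value, not one given in this design.

```python
import math
from statistics import NormalDist


def users_per_arm(baseline: float, relative_lift: float,
                  alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm for a two-sided two-proportion test."""
    p1 = baseline
    p2 = baseline * (1.0 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)               # desired power
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)
```

With an assumed 20% baseline and a 1% relative lift, `users_per_arm(0.20, 0.01)` comes out at roughly 630K, the same order of magnitude as the figure above. A lower baseline rate pushes the requirement up sharply, which is why stating the baseline matters when quoting a number in an interview.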

Monitoring

Feature drift: Monitor embedding distributions and feature statistics daily
Popularity bias: Track Gini coefficient of recommendation distribution
Feedback loops: Monitor if recommendations become less diverse over time
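The Gini coefficient used for popularity-bias monitoring can be computed from per-item recommendation counts, as in this sketch. A value of 0 means every item is recommended equally often; values approaching 1 mean a few items receive nearly all the exposure.

```python
from typing import Sequence


def gini(counts: Sequence[float]) -> float:
    """Gini coefficient of non-negative exposure counts (one entry per item)."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted counts.
    weighted = sum((rank + 1) * x for rank, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1) / n
```

Tracking this value per day makes a feedback loop visible as a steady upward drift, even when accuracy metrics look stable.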

Company Variations

Company Variation

Company	Key Difference	What They Test
Netflix	Emphasizes artwork personalization + row generation	Multi-armed bandits, contextual bandits
YouTube	Watch time optimization, two-tower at massive scale	Candidate generation at 1B+ items
Spotify	Sequential recommendation (playlists), audio embeddings	Sequence models, content understanding
Amazon	Purchase prediction, "frequently bought together"	Session-based recommendation, cross-sell
TikTok	Real-time adaptation, short content, explore/exploit	Online learning, cold start for items

Practice Problems

Problem 1: Design "People You May Know" for LinkedIn

Direction

Think about what signals indicate two people should connect. Consider graph-based features, mutual connections, and professional similarity.

Key Insight

This is a link prediction problem on a social graph. The strongest signal is mutual connections (triadic closure), but you also need professional features (same company, same school, same industry). The challenge is scale - LinkedIn has 1B+ users, so candidate generation must be extremely efficient. Graph-based retrieval (friends-of-friends) is the primary candidate source.

Full Solution & Scoring

Strong Hire: Frames as link prediction. Uses graph features (mutual connections, Jaccard similarity) + profile features (industry, company, location). Two-stage: graph-based retrieval → ML ranking. Discusses position bias in PYMK lists. Mentions that showing too many suggestions from one cluster reduces diversity.

Lean Hire: Reasonable approach but misses graph-based retrieval or only uses content features.

No Hire: Treats it as a generic recommendation problem without leveraging the graph structure.
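The graph-based retrieval and Jaccard feature mentioned in this problem could be sketched as follows, assuming the social graph fits in an adjacency dict of sets (a toy assumption; at LinkedIn scale this would be a precomputed or sharded job). The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Set

Graph = Dict[str, Set[str]]


def friends_of_friends(graph: Graph, user: str) -> Counter:
    """Candidate generation: second-degree connections with mutual-connection counts."""
    direct = graph.get(user, set())
    counts: Counter = Counter()
    for friend in direct:
        for fof in graph.get(friend, set()):
            if fof != user and fof not in direct:
                counts[fof] += 1  # one more mutual connection
    return counts


def jaccard(graph: Graph, a: str, b: str) -> float:
    """Ranking feature: overlap of the two users' connection sets."""
    na, nb = graph.get(a, set()), graph.get(b, set())
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0
```

Mutual-connection counts double as a cheap first-pass score, so `friends_of_friends(graph, user).most_common(k)` gives a retrieval set that the ML ranker can then refine with profile features.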

Problem 2: Cold Start for a New Podcast App

Direction

You launch a podcast recommendation app with zero user data. How do you bootstrap recommendations? Consider what signals are available from day one.

Key Insight

With zero interaction data, you rely on: (1) Content features - podcast transcripts, categories, episode descriptions. (2) Onboarding - ask users to select topics/shows they like. (3) Transfer learning - use pre-trained audio/text embeddings. (4) Popularity - chart rankings, social media mentions. As data accumulates, gradually blend collaborative signals. The key insight is having a plan for transitioning from content-based to hybrid recommendations.

Problem 3: Diversity in Recommendations

Direction

Your recommendation system has high accuracy but users complain that recommendations are "all the same." How do you add diversity without sacrificing relevance?

Key Insight

Use Maximal Marginal Relevance (MMR) or a determinantal point process (DPP) in the re-ranking stage. Score = λ * relevance + (1-λ) * diversity. Diversity can be measured as intra-list distance using category or embedding-based similarity. The trade-off: diversity reduces immediate click rates but improves long-term engagement and catalog coverage. Run A/B tests measuring both short-term (CTR) and long-term (retention) metrics.
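A greedy MMR re-ranker following the score formula above might look like this sketch. Here diversity is taken to be 1 minus the maximum similarity to any already selected item, which is one common choice; `mmr_rerank`, the default λ of 0.7, and the similarity function are assumptions.

```python
from typing import Callable, Dict, List


def mmr_rerank(candidates: List[str],
               relevance: Dict[str, float],
               similarity: Callable[[str, str], float],
               lam: float = 0.7,
               k: int = 10) -> List[str]:
    """Greedily pick items maximizing lam * relevance + (1 - lam) * diversity."""
    selected: List[str] = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(item: str) -> float:
            # Diversity is 1.0 for the first pick, since nothing is selected yet.
            max_sim = max((similarity(item, s) for s in selected), default=0.0)
            return lam * relevance[item] + (1.0 - lam) * (1.0 - max_sim)
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Setting `lam=1.0` recovers the pure relevance ordering, which makes λ a convenient knob to sweep in the A/B test.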

Interview Cheat Sheet

Question Pattern	Framework	Key Phrases
"Design recommendations for X"	Two-stage retrieval + ranking	"Candidate generation narrows from millions to thousands, then a ranking model scores precisely"
"How do you handle new users?"	Cold start ladder	"Onboarding → content-based → collaborative as data accumulates"
"How do you measure success?"	Offline + online + guardrails	"NDCG offline, A/B test watch-through rate online, monitor diversity as a guardrail"
"How do you handle scale?"	ANN + caching + pre-computation	"Pre-computed embeddings with FAISS/ScaNN for sub-10ms retrieval"
"How do you avoid filter bubbles?"	Re-ranking + exploration	"MMR for diversity, epsilon-greedy exploration, coverage monitoring"

Spaced Repetition Checkpoints

Day 0: Memorize the two-stage architecture (retrieval → ranking → re-ranking). Draw it from memory.
Day 3: Explain collaborative filtering vs. content-based vs. two-tower models. When would you use each?
Day 7: Design a complete recommendation system for a music app in 45 minutes. Time yourself.
Day 14: Practice the cold start question - explain your strategy for new users AND new items.
Day 21: Do a mock interview. Have your partner ask follow-up questions about diversity, bias, and online evaluation.

What's Next

Search Ranking - Multi-stage ranking with query understanding
News Feed Ranking - Multi-objective optimization with real-time features
