
Design: Recommendation System - The Most Common Design Question

Reading time: ~25 min | Interview relevance: Critical | Roles: MLE, AI Eng

The Real Interview Moment

The interviewer says: "Design a recommendation system for a video streaming platform like Netflix." You immediately start talking about matrix factorization and collaborative filtering. Five minutes in, the interviewer asks: "How many users? What's your latency budget? Are you optimizing for engagement or retention?" You haven't asked. You've been designing in a vacuum.

The strongest candidates start by asking questions that change the design. "Are we recommending for the homepage or for a specific context like 'watch next'? Do we need real-time personalization or can we pre-compute? How do we handle new users with no history?" These questions shape every subsequent decision.

What You Will Master

  • How to formulate recommendation as an ML problem with clear metrics
  • Multi-stage retrieval + ranking architecture used by Netflix, YouTube, Spotify
  • Collaborative filtering vs. content-based vs. hybrid approaches with trade-offs
  • Cold start strategies for new users and new items
  • Feature engineering for user-item interactions
  • Real-time vs. batch serving trade-offs
  • Online evaluation with A/B testing and interleaving
  • Diversity, freshness, and business rule re-ranking

The Complete Design

Step 1: Requirements (5 min)

Functional requirements:

  • Personalized recommendations for 100M+ users across 500K items
  • Homepage recommendations (top 50 items across genres)
  • "Because you watched X" contextual rows
  • New user onboarding experience

Non-functional requirements:

  • Latency: <200ms for homepage load
  • Freshness: New content surfaced within hours
  • Coverage: 80%+ of catalog recommended to at least some users
  • Availability: 99.9% - fallback to popularity-based if model fails

Interviewer's Perspective

When a candidate asks "Are we optimizing for watch time or number of sessions?", I know they've built real systems. This single question changes whether you use a regression model (predict watch time) or a classification model (predict will-click). It's the kind of question that separates Senior from Mid-level.

Step 2: Problem Formulation (5 min)

Business Goal: Increase engagement
ML Objective: Predict P(watch > 2 min | user, item)
Primary Metric: Watch-through rate
Guardrail Metrics: Content diversity, catalog coverage, session length

ML problem type: Two-stage system

  1. Retrieval: Given a user, retrieve 1000 candidate items from 500K (recall-focused)
  2. Ranking: Score and rank 1000 → top 50 (precision-focused)

Why two stages? Running the expensive ranking model over all 500K items for every request does not fit in a <200ms budget. The retrieval stage uses cheap models to narrow the candidate set; the ranking stage then applies the expensive model to that small set.
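
A back-of-envelope check makes the latency argument concrete. The sketch below assumes a per-item ranking cost of 50 microseconds, which is an illustrative number and not a measurement; the catalog size, candidate count, and latency budget come from the requirements above.

```python
CATALOG_SIZE = 500_000      # items, from the requirements
CANDIDATES = 1_000          # size of the retrieval output
LATENCY_BUDGET_MS = 200     # homepage latency budget
RANKER_US_PER_ITEM = 50     # ASSUMPTION: microseconds of ranking-model time per item


def ranking_time_ms(n_items, us_per_item=RANKER_US_PER_ITEM):
    """Time to score n_items one after another with the heavy ranker, in ms."""
    return n_items * us_per_item / 1000


full_catalog_ms = ranking_time_ms(CATALOG_SIZE)  # 25,000 ms, i.e. 25 seconds
two_stage_ms = ranking_time_ms(CANDIDATES)       # 50 ms

print(f"rank everything: {full_catalog_ms:.0f} ms, rank candidates: {two_stage_ms:.0f} ms")
```

Under this assumed cost, ranking the full catalog overshoots the budget by more than 100x, while ranking 1000 candidates leaves room for retrieval, feature lookups, and re-ranking.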

60-Second Answer

"I'd design this as a two-stage system. First, a retrieval stage that uses multiple candidate generators - collaborative filtering, content-based, and trending - to pull ~1000 candidates from the full catalog. Second, a ranking stage that scores each candidate with a deep model using user features, item features, and context features, predicting the probability the user watches more than 2 minutes. Finally, a re-ranking layer applies business rules for diversity, freshness, and content policies. This architecture lets us balance quality with latency - the retrieval stage is fast and recall-oriented, the ranking stage is precise but only scores 1000 items."

Step 3: Features & Data (8 min)

Feature Categories

Category | Features | Freshness
User | Watch history, genre preferences, avg session length, time-of-day patterns, account age, subscription tier | Updated hourly
Item | Genre, director, cast, release year, avg rating, popularity score, duration, content embeddings | Updated on change
Context | Time of day, day of week, device type, recent watches (last 3), current session length | Real-time
User-Item Cross | User-genre affinity scores, user-actor preference, collaborative filtering scores | Batch (daily)

Training Data

  • Positive labels: User watched > 2 minutes (implicit feedback)
  • Negative sampling: Items shown but not clicked + random negatives (ratio 1:4)
  • Label delay: Short - we know within minutes whether someone watched
  • Data volume: ~1B interactions/day for a Netflix-scale platform
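
The labeling and negative-sampling rules above could be sketched roughly as follows. This assumes the 1:4 ratio means one positive to four negatives per user, and that impressions arrive as `(user_id, item_id, watched_seconds)` tuples; both are assumptions, and `build_examples` is a hypothetical helper name.

```python
import random
from collections import defaultdict

WATCH_THRESHOLD_SECONDS = 120  # "watched > 2 minutes" from the text


def build_examples(impressions, catalog, negatives_per_positive=4, seed=0):
    """Turn (user, item, watched_seconds) impressions into (user, item, label) rows."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, item, watched_seconds in impressions:
        by_user[user].append((item, watched_seconds))

    examples = []
    for user, rows in by_user.items():
        positives = [item for item, secs in rows if secs > WATCH_THRESHOLD_SECONDS]
        # Shown but not watched past the threshold: the strongest negatives.
        shown_negatives = [item for item, secs in rows if secs <= WATCH_THRESHOLD_SECONDS]
        seen = {item for item, _ in rows}
        unseen = [item for item in catalog if item not in seen]

        target = negatives_per_positive * len(positives)
        negatives = shown_negatives[:target]
        # Top up with random catalog items the user was never shown.
        n_random = min(target - len(negatives), len(unseen))
        negatives += rng.sample(unseen, n_random)

        examples += [(user, item, 1) for item in positives]
        examples += [(user, item, 0) for item in negatives]
    return examples


rows = build_examples(
    impressions=[("u1", "a", 600), ("u1", "b", 10), ("u1", "c", 0)],
    catalog=list("abcdefgh"),
)
```

Here `u1` yields one positive (`a`), two shown-but-skipped negatives (`b`, `c`), and two random negatives drawn from the rest of the catalog.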

Common Trap

Don't use "user rated the item" as your primary signal - only 5-10% of users rate content. Implicit feedback (watch time, clicks, completions) is far more abundant and often more predictive. But be careful: implicit feedback is noisy. A user who fell asleep during a movie looks like a highly engaged viewer.

Data Challenges

  • Selection bias: Users only interact with items that were recommended - you never observe what they'd do with items they weren't shown
  • Popularity bias: Popular items get more interactions, creating a feedback loop
  • Position bias: Items shown at position 1 get more clicks regardless of quality

Step 4: Model (8 min)

The Progression

Recommendation Model Progression - Popularity + CF → Two-Tower → Deep Ranking → Multi-Task

Retrieval Models

Method | How It Works | Pros | Cons
Collaborative Filtering (ALS) | Matrix factorization on user-item interaction matrix | Captures taste patterns, no content features needed | Cold start, can't explain recommendations
Content-Based | TF-IDF or embeddings on item metadata, match to user profile | Handles new items, explainable | Filter bubble, limited discovery
Two-Tower Model | Separate user and item encoder networks, dot product similarity | Best retrieval quality, handles cold start via features | Requires training infrastructure
Trending/Popular | Recently popular items by category | Simple, no cold start | Not personalized

Recommendation: Use multiple retrieval sources in parallel, merge candidates, then rank.
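
One simple way to merge candidates from parallel sources is a round-robin interleave with de-duplication, keeping track of which sources proposed each item so the ranker can use that as a feature. This is a minimal sketch; `merge_candidates` and the source names are hypothetical, and production systems often use per-source quotas instead.

```python
from itertools import zip_longest

_MISSING = object()  # padding marker for sources that run out early


def merge_candidates(sources, limit=1000):
    """Round-robin merge of best-first candidate lists.

    sources: dict of source name -> list of item ids, best first.
    Returns a dict of item id -> list of source names that proposed it,
    in merge order, holding at most `limit` distinct items.
    """
    merged = {}
    names = list(sources)
    for tier in zip_longest(*(sources[n] for n in names), fillvalue=_MISSING):
        for name, item in zip(names, tier):
            if item is _MISSING:
                continue
            if item in merged:
                merged[item].append(name)   # record extra provenance
            elif len(merged) < limit:
                merged[item] = [name]
    return merged


merged = merge_candidates(
    {"cf": ["a", "b", "c"], "content": ["b", "d"], "trending": ["e", "a"]},
    limit=4,
)
```

With a limit of 4, the merged order is `a, b, e, d`; item `c` is dropped because the limit is reached before the third tier, and `a` is credited to both `cf` and `trending`.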

Ranking Model

  • Architecture: Deep neural network with feature crosses
  • Input: User features + Item features + Context features + Retrieval source scores
  • Output: P(watch > 2 min) and predicted watch time (multi-task)
  • Loss: Binary cross-entropy for the watch > 2 min label, MSE for watch time, combined with task weights
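
The weighted multi-task loss can be written out in a few lines. The sketch below is plain Python rather than a training framework, and the task weights (`w_watch`, `w_time`) are placeholder values that would normally be tuned.

```python
import math


def binary_cross_entropy(p, y, eps=1e-7):
    """Log loss for one example; p is the predicted probability, y is 0 or 1."""
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def multitask_loss(batch, w_watch=1.0, w_time=0.1):
    """Weighted sum of the classification and regression losses.

    batch: list of (p_watch, y_watch, predicted_minutes, true_minutes).
    w_watch and w_time are ASSUMED task weights, not recommended values.
    """
    n = len(batch)
    bce = sum(binary_cross_entropy(p, y) for p, y, _, _ in batch) / n
    mse = sum((pred - true) ** 2 for _, _, pred, true in batch) / n
    return w_watch * bce + w_time * mse


loss = multitask_loss([(0.5, 1, 10.0, 12.0)])
```

Because watch time in minutes has a much larger numeric range than a log loss, the regression weight is usually set small (or the target is normalized) so one task does not dominate the gradient.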

Step 5: Serving (8 min)

Recommendation Serving Pipeline - Candidate Generation → Scoring → Re-Ranking → Response

Key Architecture Decisions

Component | Decision | Rationale
Candidate generation | Pre-computed ANN index (FAISS/ScaNN) | Sub-10ms retrieval from millions
Feature store | Redis for real-time features, Hive for batch | Fast lookups + large-scale aggregation
Model serving | TensorFlow Serving with batching | GPU utilization, latency optimization
Caching | Cache recommendations for 1 hour per user | Reduces compute, acceptable staleness
Fallback | Popularity-based recommendations by genre | When model is down or new user has no features
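
The caching and fallback rows interact: a cache hit skips the model entirely, and a model failure should degrade to popular items without poisoning the cache. A minimal sketch follows, with an in-process dict standing in for a real cache and `RecommendationService` as a hypothetical class name.

```python
import time

CACHE_TTL_SECONDS = 3600  # "cache recommendations for 1 hour per user"


class RecommendationService:
    def __init__(self, personalized_fn, popular_fn, clock=time.time):
        self._personalized_fn = personalized_fn  # full retrieval + ranking pipeline
        self._popular_fn = popular_fn            # popularity-based fallback
        self._clock = clock                      # injectable for testing
        self._cache = {}                         # user_id -> (expires_at, items)

    def recommend(self, user_id):
        now = self._clock()
        cached = self._cache.get(user_id)
        if cached and cached[0] > now:
            return cached[1]
        try:
            items = self._personalized_fn(user_id)
        except Exception:
            # Pipeline unavailable: serve the fallback and do not cache it,
            # so the user gets personalized results as soon as it recovers.
            return self._popular_fn()
        self._cache[user_id] = (now + CACHE_TTL_SECONDS, items)
        return items
```

Passing the clock in as a parameter is a small design choice that makes the TTL behavior testable without sleeping.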

Cold Start Strategy

Scenario | Strategy
New user, no history | Popularity-based + onboarding quiz (select 3 genres)
New user, 1-4 interactions | Content-based using metadata of watched items
New user, 5+ interactions | Full personalized pipeline kicks in
New item, no interactions | Content-based retrieval using metadata embeddings
New item, some interactions | Exploration boost - show to diverse user segments
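
The cold start ladder amounts to routing on interaction counts. The sketch below treats 5 interactions as the cutoff for the full pipeline, which is one reading of the table; the function names and returned strategy labels are hypothetical.

```python
def user_strategy(n_interactions, full_pipeline_threshold=5):
    """Pick a recommendation strategy for a user by interaction count."""
    if n_interactions == 0:
        return "popularity_plus_onboarding"
    if n_interactions < full_pipeline_threshold:
        return "content_based"
    return "full_personalized"


def new_item_strategy(n_interactions):
    """Pick a strategy for a newly added item."""
    if n_interactions == 0:
        return "content_based_retrieval"
    return "exploration_boost"
```

In practice the threshold would be chosen by checking at what history length the personalized model starts beating the content-based one offline.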

Step 6: Evaluation & Iteration (8 min)

Offline Metrics

Metric | What It Measures | Target
Recall@K (retrieval) | % of items user would watch in top K candidates | Recall@1000 > 0.8
NDCG@K (ranking) | Quality of ranking order | NDCG@50 > 0.4
Hit Rate@K | % of users with at least one relevant item in top K | HR@20 > 0.9
Coverage | % of catalog recommended to any user | > 80%
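
For reference, the four offline metrics can be computed as below. This sketch uses binary relevance (an item is either in the user's held-out watch set or not), which is an assumption; graded relevance would change the NDCG gains.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


def hit_rate_at_k(ranked_by_user, relevant_by_user, k):
    """Fraction of users with at least one relevant item in their top k."""
    hits = sum(1 for user, ranked in ranked_by_user.items()
               if set(ranked[:k]) & relevant_by_user.get(user, set()))
    return hits / len(ranked_by_user)


def coverage(ranked_by_user, catalog_size, k):
    """Fraction of the catalog that shows up in any user's top k."""
    shown = set()
    for ranked in ranked_by_user.values():
        shown.update(ranked[:k])
    return len(shown) / catalog_size
```

Recall@K is the natural metric for the retrieval stage (did the right items make it into the candidate set), while NDCG@K rewards the ranking stage for putting them near the top.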

Online Metrics

  • Primary: Watch-through rate (% of recommendations watched > 2 min)
  • Secondary: Session length, number of sessions per week
  • Guardrails: Catalog coverage, genre diversity per user, content policy violations

A/B Testing

  • Unit: User-level randomization (not session-level - users need consistent experience)
  • Duration: 2 weeks minimum for engagement metrics, 4 weeks for retention
  • Power analysis: Need on the order of 500K users per arm to detect a 1% relative lift; the exact number depends on the baseline rate, significance level, and power
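
The users-per-arm figure can be sanity-checked with the standard two-proportion sample size formula. The sketch below assumes a two-sided test at alpha = 0.05 with 80% power and a baseline watch-through rate of 25%; the baseline in particular is an illustrative assumption, and the required sample grows quickly as it drops.

```python
import math
from statistics import NormalDist


def users_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate sample size per arm for comparing two proportions."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


n = users_per_arm(baseline_rate=0.25, relative_lift=0.01)
print(f"{n:,} users per arm")
```

Under these assumptions the result lands a little under 500K per arm, in line with the estimate above. A 10% baseline would need well over a million users per arm for the same relative lift.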

Monitoring

  • Feature drift: Monitor embedding distributions and feature statistics daily
  • Popularity bias: Track Gini coefficient of recommendation distribution
  • Feedback loops: Monitor if recommendations become less diverse over time
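
The Gini coefficient for popularity-bias monitoring is computed over how many times each item was recommended. A minimal sketch, using the sorted-values form of the formula:

```python
def gini(counts):
    """Gini coefficient of non-negative counts.

    0 means every item is recommended equally often; values near 1 mean
    a handful of items receive almost all recommendations.
    """
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * v for i, v in enumerate(values))
    return 2 * weighted / (n * total) - (n + 1) / n


even = gini([5, 5, 5, 5])     # every item recommended equally
skewed = gini([0, 0, 0, 10])  # one item gets everything
```

Items that were never recommended should be included as zeros; leaving them out understates the concentration. Tracking this value over time is one way to spot the feedback loop described above.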

Company Variations

Company | Key Difference | What They Test
Netflix | Emphasizes artwork personalization + row generation | Multi-armed bandits, contextual bandits
YouTube | Watch time optimization, two-tower at massive scale | Candidate generation at 1B+ items
Spotify | Sequential recommendation (playlists), audio embeddings | Sequence models, content understanding
Amazon | Purchase prediction, "frequently bought together" | Session-based recommendation, cross-sell
TikTok | Real-time adaptation, short content, explore/exploit | Online learning, cold start for items

Practice Problems

Problem 1: Design "People You May Know" for LinkedIn

Direction

Think about what signals indicate two people should connect. Consider graph-based features, mutual connections, and professional similarity.

Key Insight

This is a link prediction problem on a social graph. The strongest signal is mutual connections (triadic closure), but you also need professional features (same company, same school, same industry). The challenge is scale - LinkedIn has 1B+ users, so candidate generation must be extremely efficient. Graph-based retrieval (friends-of-friends) is the primary candidate source.

Full Solution & Scoring

Strong Hire: Frames as link prediction. Uses graph features (mutual connections, Jaccard similarity) + profile features (industry, company, location). Two-stage: graph-based retrieval → ML ranking. Discusses position bias in PYMK lists. Mentions that showing too many suggestions from one cluster reduces diversity.

Lean Hire: Reasonable approach but misses graph-based retrieval or only uses content features.

No Hire: Treats it as a generic recommendation problem without leveraging the graph structure.
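
The graph signals named in the solution (friends-of-friends retrieval, mutual connections, Jaccard similarity) are straightforward to compute on an adjacency-set representation. The sketch below uses a toy in-memory graph; at LinkedIn scale these would come from a precomputed or distributed graph store, and the function names are hypothetical.

```python
from collections import Counter


def friends_of_friends(graph, user):
    """Candidate retrieval: second-degree connections with mutual-connection counts."""
    counts = Counter()
    for friend in graph[user]:
        for fof in graph[friend]:
            if fof != user and fof not in graph[user]:
                counts[fof] += 1  # one more mutual connection
    return counts


def jaccard(graph, a, b):
    """Overlap of two users' connection sets, between 0 and 1."""
    union = graph[a] | graph[b]
    return len(graph[a] & graph[b]) / len(union) if union else 0.0


graph = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a", "d", "e"},
    "d": {"b", "c"},
    "e": {"c"},
}
candidates = friends_of_friends(graph, "a")
```

For user `a`, the candidates are `d` (2 mutual connections) and `e` (1). The counts and Jaccard scores then become features for the ML ranking stage alongside profile features.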

Problem 2: Cold Start for a New Podcast App

Direction

You launch a podcast recommendation app with zero user data. How do you bootstrap recommendations? Consider what signals are available from day one.

Key Insight

With zero interaction data, you rely on: (1) Content features - podcast transcripts, categories, episode descriptions. (2) Onboarding - ask users to select topics/shows they like. (3) Transfer learning - use pre-trained audio/text embeddings. (4) Popularity - chart rankings, social media mentions. As data accumulates, gradually blend collaborative signals. The key insight is having a plan for transitioning from content-based to hybrid recommendations.

Problem 3: Diversity in Recommendations

Direction

Your recommendation system has high accuracy but users complain that recommendations are "all the same." How do you add diversity without sacrificing relevance?

Key Insight

Use Maximal Marginal Relevance (MMR) or a determinantal point process (DPP) in the re-ranking stage. With MMR, items are picked greedily and each remaining candidate is scored as λ * relevance + (1-λ) * diversity, where diversity is its dissimilarity to the items already selected. Across the whole list, diversity can be measured as intra-list distance using category or embedding-based similarity. The trade-off: diversity tends to reduce immediate click rates but can improve long-term engagement and catalog coverage. Run A/B tests measuring both short-term (CTR) and long-term (retention) metrics.
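
A greedy MMR re-ranker fits in a few lines. This sketch takes diversity to be one minus the candidate's maximum similarity to anything already selected, and uses a same-genre indicator as the similarity function; both choices, and the value of `lam`, are assumptions for illustration.

```python
def mmr_rerank(candidates, relevance, similarity, k, lam=0.7):
    """Greedy Maximal Marginal Relevance.

    candidates: item ids; relevance: dict of item -> score in [0, 1];
    similarity: function (item, item) -> value in [0, 1];
    lam: weight on relevance (1.0 ignores diversity entirely).
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(item):
            max_sim = max((similarity(item, s) for s in selected), default=0.0)
            return lam * relevance[item] + (1 - lam) * (1 - max_sim)

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


genre = {"a1": "action", "a2": "action", "d1": "drama"}
relevance = {"a1": 0.9, "a2": 0.85, "d1": 0.6}
same_genre = lambda x, y: 1.0 if genre[x] == genre[y] else 0.0

diverse = mmr_rerank(["a1", "a2", "d1"], relevance, same_genre, k=2, lam=0.5)
greedy = mmr_rerank(["a1", "a2", "d1"], relevance, same_genre, k=2, lam=1.0)
```

With `lam=0.5` the drama title displaces the second action title (`["a1", "d1"]`); with `lam=1.0` the list is pure relevance order (`["a1", "a2"]`). The value of λ is the knob to tune in the A/B test.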

Interview Cheat Sheet

Question Pattern | Framework | Key Phrases
"Design recommendations for X" | Two-stage retrieval + ranking | "Candidate generation narrows from millions to thousands, then a ranking model scores precisely"
"How do you handle new users?" | Cold start ladder | "Onboarding → content-based → collaborative as data accumulates"
"How do you measure success?" | Offline + online + guardrails | "NDCG offline, A/B test watch-through rate online, monitor diversity as a guardrail"
"How do you handle scale?" | ANN + caching + pre-computation | "Pre-computed embeddings with FAISS/ScaNN for sub-10ms retrieval"
"How do you avoid filter bubbles?" | Re-ranking + exploration | "MMR for diversity, epsilon-greedy exploration, coverage monitoring"

Spaced Repetition Checkpoints

  • Day 0: Memorize the two-stage architecture (retrieval → ranking → re-ranking). Draw it from memory.
  • Day 3: Explain collaborative filtering vs. content-based vs. two-tower models. When would you use each?
  • Day 7: Design a complete recommendation system for a music app in 45 minutes. Time yourself.
  • Day 14: Practice the cold start question - explain your strategy for new users AND new items.
  • Day 21: Do a mock interview. Have your partner ask follow-up questions about diversity, bias, and online evaluation.

© 2026 EngineersOfAI. All rights reserved.