Recommendation Systems at Scale
Design YouTube Recommendations From Scratch
The interview question is direct: "Design the YouTube recommendation system. You have 2.7 billion logged-in users, 800 million videos, and 500 hours of video uploaded every minute. Users spend an average of 40 minutes per session. The recommendation system must serve personalized recommendations in under 200ms end-to-end. Walk me through your design."
This is not a hypothetical. It is the problem the YouTube engineering team solves every day, and it is one of the most common system design questions in ML engineering interviews at Google, Meta, Netflix, and TikTok.
The stakes in this system are enormous: a 1% improvement in watch time represents roughly $1.5 billion in annual ad revenue for YouTube. A failure in the recommendation system - serving irrelevant content or collapsing into a feedback loop of viral content - risks user churn and regulatory scrutiny. Every architectural decision has a quantifiable business consequence.
Where most engineers start: "I would train a neural network to predict what video the user will watch next." Where experienced ML engineers start: "The problem has three distinct subproblems - what videos are candidates, how to rank them, and how to ensure the output serves long-term user value rather than short-term engagement." The difference in starting point determines the quality of the entire subsequent design.
Let's build this from scratch, the way an experienced ML systems engineer would design it.
Requirements Analysis
Functional requirements:
- Serve personalized home page recommendations (12-24 videos)
- Serve "Up Next" recommendations for currently-watching videos
- Surface new content from subscribed creators within 24 hours of upload
- Handle users with no history (new user cold start)
- Handle new videos with no interactions (item cold start)
Non-functional requirements:
- Latency: 200ms p99 end-to-end at the recommendation API level
- Scale: 2.7 billion users, 800 million videos
- Freshness: new video embeddings available within 30 minutes of upload
- Availability: 99.99% uptime (4.4 minutes downtime per month)
- Throughput: ~50,000 recommendation requests per second (peak)
Constraints:
- Cannot run any model on all 800M videos at query time - the compute cost and the 200ms latency budget make it infeasible
- Must serve minority-language content creators with few views alongside massive channels
- Must not amplify harmful content even if it would maximize engagement
System Architecture Overview
Stage 1: Candidate Generation
The first stage reduces 800M videos to a manageable candidate set. Multiple retrieval signals run in parallel:
Two-Tower Neural Retrieval (primary signal): The user tower encodes the current user context - user ID, recent watch history (last 100 videos), search queries from the past week, time of day, device type - into a 256-dimensional vector. All 800M video embeddings are pre-computed and indexed in FAISS IVF-PQ. ANN search retrieves the top-3,000 most similar videos.
Subscription signal: Videos from the user's subscribed creators, uploaded in the last 48 hours. This is a direct retrieval with no ML component - users expect to see content from creators they explicitly followed.
Trending and exploration pool: The top-1,000 trending videos globally, and 200 randomly sampled videos for exploration. The trending pool ensures users see culturally relevant content. The exploration pool combats the feedback loop.
Collaborative filtering recall: Videos watched by users with similar watch histories (computed by a neighborhood model offline). This captures serendipitous content the neural retrieval might miss.
All retrieval streams are merged with deduplication. The total candidate set is approximately 5,000-7,000 videos.
```python
import asyncio
from typing import Dict, List

import numpy as np


class CandidateGenerator:
    def __init__(self, video_index, subscription_db, trending_pool, cf_model):
        self.video_index = video_index  # ANN index exposing a FAISS-style search()
        self.subscription_db = subscription_db
        self.trending_pool = trending_pool
        self.cf_model = cf_model

    async def generate_candidates(
        self,
        user_vector: np.ndarray,
        user_id: str,
        max_candidates: int = 7000,
    ) -> List[Dict]:
        # Issue all retrievals concurrently. The index, subscription, and CF
        # lookups are blocking calls, so each runs in a worker thread.
        tasks = [
            self._neural_retrieval(user_vector, k=3000),
            self._subscription_retrieval(user_id, k=500),
            self._trending_retrieval(k=1000),
            self._cf_retrieval(user_id, k=500),
            self._exploration_retrieval(k=200),
        ]
        results = await asyncio.gather(*tasks)

        # Merge and deduplicate. The first stream to return a video keeps it,
        # so the task order above is also the priority order on truncation.
        seen = set()
        merged = []
        for result_list in results:
            for video in result_list:
                if video["video_id"] not in seen:
                    seen.add(video["video_id"])
                    merged.append(video)
        return merged[:max_candidates]

    async def _neural_retrieval(self, user_vector, k):
        # FAISS expects a float32 query matrix of shape (n_queries, dim).
        query = np.ascontiguousarray(user_vector[np.newaxis, :], dtype=np.float32)
        distances, indices = await asyncio.to_thread(self.video_index.search, query, k)
        # FAISS pads the result with -1 when fewer than k neighbors are found.
        return [{"video_id": int(idx), "retrieval_score": float(dist), "source": "neural"}
                for idx, dist in zip(indices[0], distances[0]) if idx >= 0]

    async def _subscription_retrieval(self, user_id, k):
        recent_uploads = await asyncio.to_thread(
            self.subscription_db.get_recent_uploads, user_id, hours=48, limit=k
        )
        return [{"video_id": v, "retrieval_score": 1.0, "source": "subscription"}
                for v in recent_uploads]

    async def _trending_retrieval(self, k):
        # Assumed to be a cheap in-memory lookup, so it is called directly.
        return [{"video_id": v, "retrieval_score": score, "source": "trending"}
                for v, score in self.trending_pool.get_top_k(k)]

    async def _cf_retrieval(self, user_id, k):
        similar_videos = await asyncio.to_thread(
            self.cf_model.get_similar_user_videos, user_id, k=k
        )
        return [{"video_id": v, "retrieval_score": score, "source": "cf"}
                for v, score in similar_videos]

    async def _exploration_retrieval(self, k):
        # Generator.choice samples k ids without replacement without building a
        # permutation of the whole catalog, which the legacy np.random.choice does.
        # total_size is the catalog size (ntotal on a raw FAISS index).
        rng = np.random.default_rng()
        sampled = rng.choice(self.video_index.total_size, size=k, replace=False)
        return [{"video_id": int(v), "retrieval_score": 0.5, "source": "exploration"}
                for v in sampled]
```
Stage 2: Pre-Ranking
Pre-ranking filters 5,000-7,000 candidates to 200-500 using a lightweight model. The pre-ranker uses only precomputed features:
- Video category embedding similarity to user's top categories
- Historical CTR of the video in the past 7 days
- Video age and upload recency
- Creator subscriber count
- User's historical preference for this video's duration bucket
The model is a 3-layer MLP with 64-128 hidden units, chosen for its sub-millisecond per-candidate inference speed. It scores all candidates in a single GPU batch in under 20ms.
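To make the cost argument concrete, here is a minimal sketch of a pre-ranker of this shape in NumPy. The five input features (one per item in the list above), the hidden width of 64, the random untrained weights, and the names `PreRanker`, `score`, and `top_k` are all assumptions for illustration; a real pre-ranker would be trained and served on an accelerator.

```python
import numpy as np


class PreRanker:
    """Tiny 3-layer MLP over precomputed features (illustrative, untrained)."""

    def __init__(self, n_features: int = 5, hidden: int = 64, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Random weights stand in for a trained model.
        self.w1 = rng.normal(0.0, 0.1, size=(n_features, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.1, size=(hidden, hidden))
        self.b2 = np.zeros(hidden)
        self.w3 = rng.normal(0.0, 0.1, size=(hidden, 1))
        self.b3 = np.zeros(1)

    def score(self, features: np.ndarray) -> np.ndarray:
        """features: (n_candidates, n_features) -> scores: (n_candidates,)"""
        h = np.maximum(features @ self.w1 + self.b1, 0.0)  # ReLU
        h = np.maximum(h @ self.w2 + self.b2, 0.0)  # ReLU
        return (h @ self.w3 + self.b3).ravel()

    def top_k(self, features: np.ndarray, k: int = 500) -> np.ndarray:
        """Return indices of the k highest-scoring candidates, best first."""
        scores = self.score(features)
        k = min(k, scores.shape[0])
        if k == 0:
            return np.empty(0, dtype=np.int64)
        # argpartition selects the top k in linear time; only those k are sorted.
        top = np.argpartition(-scores, k - 1)[:k]
        return top[np.argsort(-scores[top])]
```

The whole candidate set is scored as one matrix multiplication chain, which is why a model this small fits comfortably inside the pre-ranking budget.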
Stage 3: Full Ranking
The full ranker is the heart of the system. It operates on 200-500 candidates and uses the richest feature set. It is implemented as a multi-task learning (MTL) model with one output head per objective:
- Watch time prediction: how many minutes will the user watch?
- CTR prediction: will the user click?
- Like prediction: will the user like the video?
- Completion rate: will the user watch to completion?
- Satisfaction score: would the user rate this positively?
The architecture follows DLRM (Deep Learning Recommendation Model): explicit product interaction layers capture cross-feature interactions between the user, video, and context embeddings before a shared MLP.
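The final ranking score is a weighted combination of the head outputs. In the form used by `compute_ranking_score` in the code below:

$$
\text{score} = \alpha_{\text{wt}}\,\hat{y}_{\text{wt}} + \alpha_{\text{ctr}}\,\sigma(\hat{y}_{\text{ctr}}) + \alpha_{\text{sat}}\,\sigma(\hat{y}_{\text{sat}}) - \alpha_{\text{comp}}\,\bigl(1 - \sigma(\hat{y}_{\text{comp}})\bigr)
$$

where $\hat{y}_t$ is the raw output of head $t$, $\sigma$ is the logistic sigmoid, and $\alpha_t = \sigma(w_t)$ is the weight of task $t$.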
The weights are tuned via offline experiments and online A/B testing.
```python
from typing import Dict, Sequence

import torch
import torch.nn as nn


class DLRMRankingModel(nn.Module):
    """
    DLRM-style ranking model with multi-task (MTL) output heads.
    """

    def __init__(
        self,
        user_embed_dim: int = 256,
        video_embed_dim: int = 256,
        context_dim: int = 64,
        mlp_dims: Sequence[int] = (512, 256, 128),
        task_names: Sequence[str] = ("watch_time", "ctr", "like", "completion", "satisfaction"),
    ):
        super().__init__()
        # The truncating slices in forward() rely on this ordering of widths.
        assert user_embed_dim >= video_embed_dim >= context_dim
        self.task_names = list(task_names)

        # Concatenated base features
        base_dim = user_embed_dim + video_embed_dim + context_dim
        # Element-wise product interactions: user×video has video_embed_dim
        # features; user×context and video×context have context_dim each.
        total_input = base_dim + video_embed_dim + 2 * context_dim

        # Shared deep layers
        layers = []
        prev_dim = total_input
        for dim in mlp_dims:
            layers.extend([nn.Linear(prev_dim, dim), nn.ReLU(), nn.BatchNorm1d(dim)])
            prev_dim = dim
        self.shared_mlp = nn.Sequential(*layers)

        # Task-specific output heads
        self.heads = nn.ModuleDict({
            task: nn.Sequential(
                nn.Linear(prev_dim, 64),
                nn.ReLU(),
                nn.Linear(64, 1),
            )
            for task in self.task_names
        })

        # Task weights for the final score, squashed to (0, 1) by a sigmoid
        self.task_weights = nn.ParameterDict({
            task: nn.Parameter(torch.ones(1)) for task in self.task_names
        })

    def forward(
        self,
        user_embedding: torch.Tensor,    # (batch, user_embed_dim)
        video_embedding: torch.Tensor,   # (batch, video_embed_dim)
        context_features: torch.Tensor,  # (batch, context_dim)
    ) -> Dict[str, torch.Tensor]:
        # Concatenate base features
        base = torch.cat([user_embedding, video_embedding, context_features], dim=-1)

        # Add product interactions (the wider tensor is truncated to match)
        uv_interact = user_embedding[:, :video_embedding.size(-1)] * video_embedding
        uc_interact = user_embedding[:, :context_features.size(-1)] * context_features
        vc_interact = video_embedding[:, :context_features.size(-1)] * context_features

        combined = torch.cat([base, uv_interact, uc_interact, vc_interact], dim=-1)
        shared = self.shared_mlp(combined)

        # One raw (pre-sigmoid) prediction per task, each of shape (batch,)
        predictions = {}
        for task in self.task_names:
            predictions[task] = self.heads[task](shared).squeeze(-1)
        return predictions

    def compute_ranking_score(self, predictions: Dict[str, torch.Tensor]) -> torch.Tensor:
        """Combine task predictions into a single ranking score."""
        score = (
            torch.sigmoid(self.task_weights["watch_time"]) * predictions["watch_time"]
            + torch.sigmoid(self.task_weights["ctr"]) * torch.sigmoid(predictions["ctr"])
            + torch.sigmoid(self.task_weights["satisfaction"]) * torch.sigmoid(predictions["satisfaction"])
            # Penalize videos the user is predicted not to finish
            - torch.sigmoid(self.task_weights["completion"]) * (1 - torch.sigmoid(predictions["completion"]))
        )
        return score
```
Stage 4: Post-Processing and Re-Ranking
The post-processing stage applies constraints the ranking model cannot encode:
Diversity enforcement: No more than 3 videos from the same creator on the home page. No more than 2 videos on the same topic in consecutive positions.
Content policy enforcement: Any video flagged by the trust and safety system (still under review, age-restricted for this user, region-locked) is removed.
Freshness boost: Subscription videos uploaded in the last 6 hours receive a position boost to ensure creators' audiences see new content.
Session deduplication: Remove any video of which the user has already watched more than 80% in the current session.
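The diversity rules above can be sketched as a greedy pass over the ranked list. This is one possible implementation, not a description of YouTube's: the function name `enforce_diversity`, the dictionary keys `creator_id` and `topic`, and the choice to defer (rather than drop) a video that would break the topic rule are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List


def enforce_diversity(
    ranked: List[Dict],
    max_per_creator: int = 3,
    max_topic_run: int = 2,
    page_size: int = 24,
) -> List[Dict]:
    """Greedy re-rank: fill each slot with the best candidate that breaks no rule."""
    page: List[Dict] = []
    creator_counts: Counter = Counter()
    remaining = list(ranked)  # assumed sorted by ranking score, best first

    while remaining and len(page) < page_size:
        placed = False
        for i, video in enumerate(remaining):
            # Rule 1: cap the number of videos per creator on the page.
            if creator_counts[video["creator_id"]] >= max_per_creator:
                continue
            # Rule 2: do not extend a same-topic run past max_topic_run.
            tail = page[-max_topic_run:]
            if len(tail) == max_topic_run and all(
                v["topic"] == video["topic"] for v in tail
            ):
                continue
            page.append(video)
            creator_counts[video["creator_id"]] += 1
            del remaining[i]
            placed = True
            break
        if not placed:
            break  # no remaining candidate satisfies both constraints
    return page
```

A video skipped for the topic rule stays in `remaining` and can still be placed in a later slot once the run is broken, so strong candidates are delayed rather than lost.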
Cold Start Solutions
New video cold start: The item tower takes video metadata - title token embeddings, category, channel statistics, thumbnail visual embedding from a pre-trained vision model, upload time - and produces a meaningful embedding within 5 minutes of upload. The embedding improves as early interaction data arrives and triggers re-embedding. New videos also enter the trending pool if their early engagement velocity is high.
New user cold start: Users with fewer than 5 watch events are served from the "best content by demographic" pool - curated lists of high-quality content segmented by rough user attributes (geography, device, language). As users interact, they exit the cold-start pool and transition to personalized recommendations.
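As a rough sketch of the new-user routing described above: the threshold of 5 watch events comes from the text, while the pool key layout `(geography, language, device)`, the back-off order, and the function names are hypothetical choices made for this example.

```python
from typing import Dict, List, Optional, Tuple

COLD_START_THRESHOLD = 5  # watch events, per the text above

# Hypothetical pool key: (geography, language, device); None means "any".
PoolKey = Tuple[Optional[str], Optional[str], Optional[str]]


def is_cold_start(n_watch_events: int) -> bool:
    """Users below the threshold are served from curated pools."""
    return n_watch_events < COLD_START_THRESHOLD


def cold_start_pool(
    pools: Dict[PoolKey, List[str]],
    geography: Optional[str],
    language: Optional[str],
    device: Optional[str],
) -> List[str]:
    """Pick the most specific curated pool available, backing off to broader ones."""
    fallbacks: List[PoolKey] = [
        (geography, language, device),
        (geography, language, None),
        (None, language, None),
        (None, None, None),  # global default pool
    ]
    for key in fallbacks:
        if key in pools:
            return pools[key]
    return []
```

Backing off from specific to general keeps sparse segments (a rare language on a rare device) from receiving an empty page.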
Session-Based vs Long-Term Modeling
YouTube solves a subtle tension: the user's current session context (they are binge-watching cooking tutorials) may differ from their long-term taste profile (they also love documentary films). The system maintains two user representations:
- Long-term taste vector: Average of the video embeddings from the user's last 10,000 watch events, updated daily.
- Session vector: A Transformer model over the last 20 videos watched in the current session, updated at each recommendation request.
The final user vector is a learned weighted combination of both, where the session weight increases as the session lengthens.
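One simple way to realize such a combination is a sigmoid gate on session length. The sketch below assumes that form; `slope` and `bias` stand in for learned parameters, and their default values, like the function name `blend_user_vectors`, are made up for illustration.

```python
import numpy as np


def blend_user_vectors(
    long_term: np.ndarray,
    session: np.ndarray,
    session_length: int,
    slope: float = 0.3,   # hypothetical learned parameter
    bias: float = -2.0,   # hypothetical learned parameter
):
    """Return (blended_vector, session_weight).

    The session weight is sigmoid(slope * session_length + bias), so with a
    positive slope it grows toward 1 as the session gets longer.
    """
    gate = 1.0 / (1.0 + np.exp(-(slope * session_length + bias)))
    blended = gate * session + (1.0 - gate) * long_term
    return blended, float(gate)
```

With these placeholder values the long-term vector dominates at the start of a session, and the session vector takes over after roughly a dozen videos.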
Production Engineering Notes
Training Data Pipeline
Training data is collected from the recommendation system's own exposure decisions - a feedback loop by design. Mitigation strategies:
- Propensity correction: weight training examples by inverse exposure probability
- Exploration bucket: 5% of traffic receives random recommendations, generating unbiased labels
- Position bias correction: train a position bias model and remove position effects from labels
- Watch-time correction: only count watch time during which the user was actively watching (exclude background-tab plays)
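Propensity correction can be sketched as a weighted loss. The version below assumes the logging system records an exposure probability for each training example; the weight clipping and mean-normalization are common variance-control choices added here as assumptions, and the function names are illustrative.

```python
import numpy as np


def ipw_weights(propensity: np.ndarray, clip: float = 10.0) -> np.ndarray:
    """Inverse propensity weights, clipped and rescaled to mean 1.

    Clipping (an assumption of this sketch) limits the variance that
    very rarely shown items would otherwise add to the loss.
    """
    w = 1.0 / np.clip(propensity, 1e-6, 1.0)
    w = np.minimum(w, clip)
    return w / w.mean()


def weighted_bce(
    y_true: np.ndarray,
    y_prob: np.ndarray,
    weights: np.ndarray,
    eps: float = 1e-7,
) -> float:
    """Binary cross-entropy where each example is scaled by its weight."""
    p = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
    return float(np.sum(weights * losses) / np.sum(weights))
```

An example shown with probability 0.1 counts five times as much as one shown with probability 0.5, which pushes the model to fit rarely exposed items instead of only the ones it already favors.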
Model Freshness
Training frequency:
- Two-tower retrieval model: weekly retraining, embeddings refreshed daily
- Pre-ranker: daily retraining
- Full ranker: continuous online learning with mini-batch updates every hour, full retrain weekly
Common Mistakes
Mistake: Optimizing only for watch time as the primary objective.
Pure watch time optimization produces recommendation systems that favor long videos regardless of quality (longer = more watch time), push users toward increasingly extreme content, and exploit addictive content patterns. YouTube learned this lesson publicly. The ranking objective must include satisfaction signals, completion rates, and explicit "not interested" signals alongside watch time.
Mistake: Serving home page and "Up Next" recommendations with the same model.
Home page recommendations optimize for cold session discovery - you don't know what the user wants right now. "Up Next" recommendations must be highly contextually relevant - you know exactly what they are watching. The feature set, training objective, and acceptable diversity levels differ significantly. Build separate models or at least separate feature pipelines for each surface.
Tip: Implement the "interest evolution" model early.
User interests change over time. A model trained on the user's full watch history will recommend content relevant to their 2019 interests, which may be irrelevant in 2024. Time-decay the watch history (recent watches count more) and track interest velocity (how fast their topic interests are shifting). This dramatically improves recommendations for users with evolving tastes.
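A minimal sketch of both ideas, assuming each watch event is represented by a video embedding and an age in days. The exponential half-life form, the default half-lives, and the definition of interest velocity as the cosine distance between a short-horizon and a long-horizon profile are assumptions of this example, not something the text prescribes.

```python
import numpy as np


def time_decayed_profile(
    embeddings: np.ndarray,   # (n_events, dim)
    ages_days: np.ndarray,    # (n_events,), 0 = watched today
    half_life_days: float = 30.0,
) -> np.ndarray:
    """Weighted mean of watch embeddings; a watch loses half its weight per half-life."""
    embeddings = np.asarray(embeddings, dtype=float)
    weights = 0.5 ** (np.asarray(ages_days, dtype=float) / half_life_days)
    return (weights[:, None] * embeddings).sum(axis=0) / weights.sum()


def interest_velocity(
    embeddings: np.ndarray,
    ages_days: np.ndarray,
    short_half_life: float = 7.0,
    long_half_life: float = 180.0,
) -> float:
    """Cosine distance between recent and long-run profiles (0 = stable tastes)."""
    short = time_decayed_profile(embeddings, ages_days, short_half_life)
    long_run = time_decayed_profile(embeddings, ages_days, long_half_life)
    denom = np.linalg.norm(short) * np.linalg.norm(long_run) + 1e-12
    return float(1.0 - short @ long_run / denom)
```

A user whose recent watches point in a different direction from their long-run average gets a high velocity, which could be used to shorten the half-life for that user.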
Interview Q&A
Q: Walk me through your design for YouTube recommendations. Start with requirements.
A: I'd start with constraints: 800M videos, 2.7B users, sub-200ms latency. This immediately tells me no single model can score all videos per request - I need a cascade architecture. The functional requirements are home page recommendations, "Up Next", subscription freshness, and cold start handling. My design is a four-stage cascade: (1) Multi-signal candidate generation using two-tower neural retrieval, subscription signals, trending pool, and exploration - produces 5-7K candidates in 25ms. (2) Lightweight pre-ranking MLP that reduces to 500 candidates in 20ms using only precomputed features. (3) Full MTL neural ranker (DLRM architecture) optimizing simultaneously for watch time, CTR, satisfaction, and completion rate - reduces to 100 candidates in 80ms. (4) Deterministic post-processing for diversity, content policy, and freshness - produces final 24 results in 15ms. Total: about 140ms, leaving margin for network and serving overhead.
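The budget arithmetic in this answer is worth being able to reproduce on a whiteboard. A small sketch, using the per-stage figures quoted above (the dictionary keys and function name are illustrative):

```python
from typing import Dict

# Per-stage latency budgets in milliseconds, as quoted in the answer above.
STAGE_BUDGET_MS: Dict[str, int] = {
    "candidate_generation": 25,
    "pre_ranking": 20,
    "full_ranking": 80,
    "post_processing": 15,
}
SLA_MS = 200


def headroom_ms(stage_budget: Dict[str, int], sla_ms: int) -> int:
    """Time left for network and serving overhead once every stage has run."""
    return sla_ms - sum(stage_budget.values())


print(sum(STAGE_BUDGET_MS.values()), headroom_ms(STAGE_BUDGET_MS, SLA_MS))
```

The stages are sequential, so their budgets add: 140ms of compute leaves 60ms of headroom under the 200ms target.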
Q: How do you handle the cold start problem for new videos uploaded by small creators?
A: Cold start for items is actually well-handled by the two-tower architecture - the item tower takes content features (title embeddings, category, thumbnail visual features, channel quality signals) and produces an embedding even with zero interaction history. A new video gets embedded within 5 minutes of upload. The embedding will be noisy initially but it is not zero. Additionally: new videos from subscribed channels enter the subscription retrieval pool directly, bypassing the neural retrieval stage. New videos that show high early engagement velocity (high CTR in first 100 impressions) are fast-promoted to the trending pool. New videos from new creators with no channel history are represented using channel-category priors - a cooking video from a new creator gets a content embedding similar to established cooking content, which is a reasonable starting point.
Q: How do you prevent the recommendation system from creating a feedback loop that amplifies popular content?
A: Three mechanisms. First, maintain a permanent 5% exploration bucket where random videos are shown, generating unbiased training signal about the full catalog. Second, apply inverse propensity weighting to all training data - the training loss for each example is weighted by the inverse probability that the system would have shown it. Items that are rarely shown (low propensity) receive higher training weights, preventing the model from ignoring unpopular content. Third, monitor catalog coverage metrics continuously - what fraction of the catalog receives at least one recommendation per week? If coverage declines below a threshold, increase the exploration rate automatically. The diversity enforcement in post-processing also helps by hard-capping the number of videos from any single creator on the home page.
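The coverage monitor in the third mechanism can be sketched in a few lines. The coverage threshold, step size, and cap below are placeholder values chosen for the example, and the function names are hypothetical.

```python
from typing import Iterable


def catalog_coverage(recommended_ids: Iterable[str], catalog_size: int) -> float:
    """Fraction of the catalog recommended at least once in the window."""
    return len(set(recommended_ids)) / catalog_size


def adjust_exploration_rate(
    coverage: float,
    current_rate: float,
    min_coverage: float = 0.30,  # placeholder threshold
    step: float = 0.01,          # placeholder increment
    max_rate: float = 0.10,      # placeholder cap on exploration traffic
) -> float:
    """Raise the exploration rate by one step while coverage is below the threshold."""
    if coverage < min_coverage:
        return min(current_rate + step, max_rate)
    return current_rate
```

Capping the rate matters because exploration traffic receives worse recommendations by construction, so the monitor should trade coverage against user experience rather than maximize coverage alone.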
Q: How would you evaluate the recommendation system beyond CTR and watch time?
A: The north star metric is long-term user retention - do users return to the platform tomorrow, next week, next month? This is hard to optimize for directly because it requires long experiments, so we use a hierarchy of proxy metrics. For the short-term experiment, we measure: watch time per session (primary), satisfaction survey response rate, completion rate, "not interested" rate (a negative signal), and diversity index (are we broadening or narrowing user content exposure?). For longer-term monitoring, we run a permanent 1% holdout that never receives new model updates - comparing this holdout to production tells us the compounding long-term value of all model improvements. We also track ecosystem health metrics: creator content production rate (are creators growing because recommendations reward quality?) and user diversity index (are users' tastes expanding over time?).
