Recommendation Systems at Scale

Design YouTube Recommendations From Scratch

The interview question is direct: "Design the YouTube recommendation system. You have 2.7 billion logged-in users, 800 million videos, and 500 hours of video uploaded every minute. Users spend an average of 40 minutes per session. The recommendation system must serve personalized recommendations in under 200ms end-to-end. Walk me through your design."

This is not a hypothetical. This is the exact problem the YouTube engineering team solves every day, and it is the most common system design question in ML engineering interviews at Google, Meta, Netflix, and TikTok.

The stakes in this system are enormous: a 1% improvement in watch time represents roughly $1.5 billion in annual ad revenue for YouTube. A failure in the recommendation system - serving irrelevant content or collapsing into a feedback loop of viral content - risks user churn and regulatory scrutiny. Every architectural decision has a quantifiable business consequence.

Where most engineers start: "I would train a neural network to predict what video the user will watch next." Where experienced ML engineers start: "The problem has three distinct subproblems - what videos are candidates, how to rank them, and how to ensure the output serves long-term user value rather than short-term engagement." The difference in starting point determines the quality of the entire subsequent design.

Let's build this from scratch, the way an experienced ML systems engineer would design it.

Requirements Analysis

Functional requirements:

Serve personalized home page recommendations (12-24 videos)
Serve "Up Next" recommendations for currently-watching videos
Surface new content from subscribed creators within 24 hours of upload
Handle users with no history (new user cold start)
Handle new videos with no interactions (item cold start)

Non-functional requirements:

Latency: 200ms p99 end-to-end at the recommendation API level
Scale: 2.7 billion users, 800 million videos
Freshness: new video embeddings available within 30 minutes of upload
Availability: 99.99% uptime (4.4 minutes downtime per month)
Throughput: ~50,000 recommendation requests per second (peak)

Constraints:

Cannot run any model on all 800M videos at query time - exhaustive scoring cannot fit inside the 200ms latency budget
Must serve minority-language content creators with few views alongside massive channels
Must not amplify harmful content even if it would maximize engagement
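
A quick back-of-envelope calculation makes these requirements concrete. The sketch below uses only the figures stated in this design (800M videos, 256-dimensional embeddings, 99.99% availability); the 64-byte compressed code size and the 1 microsecond per-video scoring cost are illustrative assumptions, not measured values.

```python
# Back-of-envelope sizing for the stated requirements.
NUM_VIDEOS = 800_000_000
EMBED_DIM = 256              # embedding width used by the retrieval stage
BYTES_PER_FLOAT32 = 4
PQ_BYTES_PER_VECTOR = 64     # assumption: 64 one-byte product-quantization codes
SCORE_COST_SECONDS = 1e-6    # assumption: an optimistic 1 microsecond per video

# Memory needed to hold every video embedding uncompressed vs. compressed.
raw_gb = NUM_VIDEOS * EMBED_DIM * BYTES_PER_FLOAT32 / 1e9
pq_gb = NUM_VIDEOS * PQ_BYTES_PER_VECTOR / 1e9

# Time for one sequential pass of a model over the whole catalog.
full_scan_seconds = NUM_VIDEOS * SCORE_COST_SECONDS

# Allowed downtime in a 30-day month at 99.99% availability.
downtime_minutes = (1 - 0.9999) * 30 * 24 * 60

print(f"raw float32 embeddings: {raw_gb:.1f} GB")
print(f"compressed embeddings:  {pq_gb:.1f} GB")
print(f"full catalog scan:      {full_scan_seconds:.0f} s per request")
print(f"monthly downtime:       {downtime_minutes:.2f} min")
```

Even under the optimistic per-video cost, a full scan takes minutes rather than milliseconds, which is why the rest of the design is a cascade that narrows the catalog in stages.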

System Architecture Overview

Stage 1: Candidate Generation

The first stage reduces 800M videos to a manageable candidate set. Multiple retrieval signals run in parallel:

Two-Tower Neural Retrieval (primary signal): The user tower encodes the current user context - user ID, recent watch history (last 100 videos), search queries from the past week, time of day, device type - into a 256-dimensional vector. All 800M video embeddings are pre-computed by the item tower and stored in a FAISS IVF-PQ index. Approximate nearest neighbor (ANN) search retrieves the 3,000 videos whose embeddings are most similar to the user vector.

Subscription signal: Videos from the user's subscribed creators, uploaded in the last 48 hours. This is a direct retrieval with no ML component - users expect to see content from creators they explicitly followed.

Trending and exploration pool: The top-1,000 trending videos globally, and 200 randomly sampled videos for exploration. The trending pool ensures users see culturally relevant content. The exploration pool counteracts the feedback loop in which the system only collects data on videos it already recommends.

Collaborative filtering recall: Videos watched by users with similar watch histories (computed by a neighborhood model offline). This captures serendipitous content the neural retrieval might miss.
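
To illustrate the two-tower retrieval step in isolation, here is a minimal sketch that scores a user vector against a small table of video embeddings by inner product and keeps the top k. It uses exact brute-force search in NumPy as a stand-in for the ANN index, and random vectors as a stand-in for trained tower outputs; `l2_normalize` and `retrieve_top_k` are hypothetical helper names.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so inner product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def retrieve_top_k(user_vector: np.ndarray, video_embeddings: np.ndarray, k: int):
    """Exact inner-product search; a production system would use an ANN index."""
    scores = video_embeddings @ user_vector        # one score per video
    top = np.argpartition(-scores, k - 1)[:k]      # unordered k best
    top = top[np.argsort(-scores[top])]            # order them best first
    return top, scores[top]


# Random stand-ins for the outputs of the item tower and the user tower.
rng = np.random.default_rng(0)
video_embeddings = l2_normalize(rng.standard_normal((10_000, 256)).astype(np.float32))
user_vector = l2_normalize(rng.standard_normal(256).astype(np.float32))

ids, scores = retrieve_top_k(user_vector, video_embeddings, k=30)
print(ids[:5], scores[:5])
```

The key property is that the video side is computed ahead of time, so the only per-request model work is one pass through the user tower followed by a nearest-neighbor lookup.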

All retrieval streams are merged with deduplication. The total candidate set is approximately 5,000-7,000 videos.

import asyncio
from typing import List, Dict
import numpy as np


class CandidateGenerator:
    def __init__(self, video_index, subscription_db, trending_pool, cf_model):
        self.video_index = video_index
        self.subscription_db = subscription_db
        self.trending_pool = trending_pool
        self.cf_model = cf_model

    async def generate_candidates(
        self,
        user_vector: np.ndarray,
        user_id: str,
        max_candidates: int = 7000,
    ) -> List[Dict]:
        # All retrievals run in parallel
        tasks = [
            self._neural_retrieval(user_vector, k=3000),
            self._subscription_retrieval(user_id, k=500),
            self._trending_retrieval(k=1000),
            self._cf_retrieval(user_id, k=500),
            self._exploration_retrieval(k=200),
        ]
        results = await asyncio.gather(*tasks)

        # Merge and deduplicate
        seen = set()
        merged = []
        for result_list in results:
            for video in result_list:
                if video["video_id"] not in seen:
                    seen.add(video["video_id"])
                    merged.append(video)

        return merged[:max_candidates]

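    async def _neural_retrieval(self, user_vector, k):
        # FAISS expects a contiguous float32 matrix of shape (n_queries, dim)
        query = np.ascontiguousarray(user_vector[np.newaxis, :], dtype=np.float32)
        # search() blocks, so run it in a worker thread (Python 3.9+);
        # otherwise gather() would execute the retrievals one after another
        distances, indices = await asyncio.to_thread(self.video_index.search, query, k)
        # FAISS pads results with -1 when fewer than k neighbors are found
        return [{"video_id": int(idx), "retrieval_score": float(dist), "source": "neural"}
                for idx, dist in zip(indices[0], distances[0]) if idx != -1]

    async def _subscription_retrieval(self, user_id, k):
        recent_uploads = await asyncio.to_thread(
            self.subscription_db.get_recent_uploads, user_id, hours=48, limit=k)
        return [{"video_id": v, "retrieval_score": 1.0, "source": "subscription"}
                for v in recent_uploads]

    async def _trending_retrieval(self, k):
        top_k = await asyncio.to_thread(self.trending_pool.get_top_k, k)
        return [{"video_id": v, "retrieval_score": score, "source": "trending"}
                for v, score in top_k]

    async def _cf_retrieval(self, user_id, k):
        similar_videos = await asyncio.to_thread(
            self.cf_model.get_similar_user_videos, user_id, k=k)
        return [{"video_id": v, "retrieval_score": score, "source": "cf"}
                for v, score in similar_videos]

    async def _exploration_retrieval(self, k):
        # ntotal is the number of vectors stored in a FAISS index.
        # Generator.choice samples k ids without permuting the whole catalog.
        rng = np.random.default_rng()
        sampled = rng.choice(self.video_index.ntotal, size=k, replace=False)
        return [{"video_id": int(v), "retrieval_score": 0.5, "source": "exploration"}
                for v in sampled]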

Stage 2: Pre-Ranking

Pre-ranking filters 5,000-7,000 candidates to 200-500 using a lightweight model. The pre-ranker uses only precomputed features:

Video category embedding similarity to user's top categories
Historical CTR of the video in the past 7 days
Video age and upload recency
Creator subscriber count
User's historical preference for this video's duration bucket

The model is a 3-layer MLP with 64-128 hidden units, chosen for its sub-millisecond per-candidate inference speed. It scores all candidates in a single GPU batch in under 20ms.
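
As a rough sketch of this stage, the snippet below runs a 3-layer MLP over a batch of precomputed feature rows and keeps the highest-scoring candidates. The weights are random placeholders rather than trained parameters, the five input columns stand for the five features listed above, and `PreRanker` and `pre_rank` are hypothetical names.

```python
import numpy as np


class PreRanker:
    """3-layer MLP (5 -> 128 -> 64 -> 1) with placeholder random weights."""

    def __init__(self, num_features: int = 5, hidden=(128, 64), seed: int = 0):
        rng = np.random.default_rng(seed)
        dims = (num_features, *hidden, 1)
        self.weights = [
            rng.standard_normal((d_in, d_out)).astype(np.float32) * (2.0 / d_in) ** 0.5
            for d_in, d_out in zip(dims[:-1], dims[1:])
        ]
        self.biases = [np.zeros(d_out, dtype=np.float32) for d_out in dims[1:]]

    def score(self, features: np.ndarray) -> np.ndarray:
        """Score every candidate row in one batched forward pass."""
        h = features
        for w, b in zip(self.weights[:-1], self.biases[:-1]):
            h = np.maximum(h @ w + b, 0.0)          # linear layer + ReLU
        return (h @ self.weights[-1] + self.biases[-1]).squeeze(-1)


def pre_rank(features: np.ndarray, keep: int, ranker: PreRanker) -> np.ndarray:
    """Return the row indices of the `keep` best candidates, best first."""
    scores = ranker.score(features)
    return np.argsort(-scores)[:keep]


# 6,000 candidates with 5 precomputed features each (random stand-in data).
rng = np.random.default_rng(1)
features = rng.standard_normal((6000, 5)).astype(np.float32)
ranker = PreRanker()
kept = pre_rank(features, keep=500, ranker=ranker)
print(kept.shape)
```

Because every feature is precomputed, the whole stage is a handful of matrix multiplications over one batch, which is what keeps it within a tight latency budget.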

Stage 3: Full Ranking

The full ranker is the heart of the system. It operates on 200-500 candidates and uses the richest feature set. It is implemented as a multi-task learning (MTL) model with one output head per objective:

Watch time prediction: how many minutes will the user watch?
CTR prediction: will the user click?
Like prediction: will the user like the video?
Completion rate: will the user watch to completion?
Satisfaction score: would the user rate this positively?

The architecture is DLRM (Deep Learning Recommendation Model) with pairwise product interaction layers that capture cross-feature interactions between user and video embeddings.

The final ranking score is a weighted combination:

$\text{score} = w_{\text{wt}} \cdot \hat{y}_{\text{watch\_time}} + w_{\text{ctr}} \cdot \hat{y}_{\text{ctr}} + w_{\text{sat}} \cdot \hat{y}_{\text{satisfaction}} - w_{\text{not\_int}} \cdot \hat{y}_{\text{not\_interested}}$

The weights are tuned via offline experiments and online A/B testing.
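
A small worked example shows how the weighted combination trades off the objectives. All weights and predictions below are made-up numbers chosen for illustration, and `ranking_score` is a hypothetical helper that mirrors the formula term by term.

```python
def ranking_score(pred: dict, weights: dict) -> float:
    """score = w_wt*watch_time + w_ctr*ctr + w_sat*satisfaction - w_not_int*not_interested"""
    return (
        weights["wt"] * pred["watch_time"]
        + weights["ctr"] * pred["ctr"]
        + weights["sat"] * pred["satisfaction"]
        - weights["not_int"] * pred["not_interested"]
    )


# Illustrative weights only; real values come from offline tuning and A/B tests.
weights = {"wt": 0.1, "ctr": 1.0, "sat": 2.0, "not_int": 3.0}

# A solid tutorial: modest click rate, long watch time, high satisfaction.
tutorial = {"watch_time": 8.0, "ctr": 0.10, "satisfaction": 0.7, "not_interested": 0.02}
# A clickbait video: high click rate, short watch time, frequent dismissals.
clickbait = {"watch_time": 3.0, "ctr": 0.30, "satisfaction": 0.3, "not_interested": 0.15}

tutorial_score = ranking_score(tutorial, weights)
clickbait_score = ranking_score(clickbait, weights)
print(f"tutorial:  {tutorial_score:.2f}")
print(f"clickbait: {clickbait_score:.2f}")
```

With these numbers the clickbait video has three times the predicted CTR but still ranks lower, because the satisfaction and "not interested" terms outweigh the click advantage.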

import torch
import torch.nn as nn
from typing import Dict


class DLRMRankingModel(nn.Module):
    """
    YouTube-style DLRM ranking model with MTL output heads.
    """

    def __init__(
        self,
        user_embed_dim: int = 256,
        video_embed_dim: int = 256,
        context_dim: int = 64,
        mlp_dims: list = [512, 256, 128],
        task_names: list = ["watch_time", "ctr", "like", "completion", "satisfaction"],
    ):
        super().__init__()
        self.task_names = task_names

        # Concatenated base features: user, video, and context vectors
        interaction_dim = user_embed_dim + video_embed_dim + context_dim
        # Element-wise pairwise interactions, sized to match forward():
        # user×video has video_embed_dim entries; user×context and
        # video×context each have context_dim entries.
        # Assumes user_embed_dim >= video_embed_dim >= context_dim.
        total_input = interaction_dim + video_embed_dim + 2 * context_dim

        # Shared deep layers
        layers = []
        prev_dim = total_input
        for dim in mlp_dims:
            layers.extend([nn.Linear(prev_dim, dim), nn.ReLU(), nn.BatchNorm1d(dim)])
            prev_dim = dim
        self.shared_mlp = nn.Sequential(*layers)

        # Task-specific output heads
        self.heads = nn.ModuleDict({
            task: nn.Sequential(
                nn.Linear(prev_dim, 64),
                nn.ReLU(),
                nn.Linear(64, 1),
            )
            for task in task_names
        })

        # Learned task weights for final score
        self.task_weights = nn.ParameterDict({
            task: nn.Parameter(torch.ones(1)) for task in task_names
        })

    def forward(
        self,
        user_embedding: torch.Tensor,    # (batch, user_embed_dim)
        video_embedding: torch.Tensor,   # (batch, video_embed_dim)
        context_features: torch.Tensor,  # (batch, context_dim)
    ) -> Dict[str, torch.Tensor]:
        # Concatenate base features
        base = torch.cat([user_embedding, video_embedding, context_features], dim=-1)

        # Add product interactions
        uv_interact = user_embedding[:, :video_embedding.size(-1)] * video_embedding
        uc_interact = user_embedding[:, :context_features.size(-1)] * context_features
        vc_interact = video_embedding[:, :context_features.size(-1)] * context_features
        combined = torch.cat([base, uv_interact, uc_interact, vc_interact], dim=-1)

        shared = self.shared_mlp(combined)

        predictions = {}
        for task in self.task_names:
            predictions[task] = self.heads[task](shared).squeeze(-1)

        return predictions

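    def compute_ranking_score(self, predictions: Dict[str, torch.Tensor]) -> torch.Tensor:
        """Combine task predictions into a single ranking score.

        This model has no "not_interested" head, so the predicted probability
        of non-completion serves as the negative term in its place.
        """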
        score = (
            torch.sigmoid(self.task_weights["watch_time"]) * predictions["watch_time"]
            + torch.sigmoid(self.task_weights["ctr"]) * torch.sigmoid(predictions["ctr"])
            + torch.sigmoid(self.task_weights["satisfaction"]) * torch.sigmoid(predictions["satisfaction"])
            - torch.sigmoid(self.task_weights["completion"]) * (1 - torch.sigmoid(predictions["completion"]))
        )
        return score

Stage 4: Post-Processing and Re-Ranking

The post-processing stage applies constraints the ranking model cannot encode:

Diversity enforcement: No more than 3 videos from the same creator on the home page. No more than 2 videos on the same topic in consecutive positions.

Content policy enforcement: Any video flagged by the trust and safety system (still under review, age-restricted for this user, region-locked) is removed.

Freshness boost: Subscription videos uploaded in the last 6 hours receive a position boost to ensure creators' audiences see new content.

Session deduplication: Remove videos the user has watched more than 80% of in the current session.
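
The two diversity rules can be enforced with a greedy pass over the ranked list. The following is a minimal sketch, assuming each candidate is a dict with hypothetical `creator_id` and `topic` fields: it walks the list in rank order and picks the best remaining video that breaks neither rule.

```python
from collections import Counter


def enforce_diversity(ranked, page_size=24, max_per_creator=3, max_topic_run=2):
    """Greedy re-rank: cap videos per creator and cap consecutive same-topic runs."""
    selected = []
    creator_counts = Counter()
    remaining = list(ranked)
    while remaining and len(selected) < page_size:
        for i, video in enumerate(remaining):
            if creator_counts[video["creator_id"]] >= max_per_creator:
                continue  # creator already has the maximum number of slots
            tail = selected[-max_topic_run:]
            extends_run = len(tail) == max_topic_run and all(
                v["topic"] == video["topic"] for v in tail
            )
            if extends_run:
                continue  # would create a run longer than max_topic_run
            selected.append(remaining.pop(i))
            creator_counts[video["creator_id"]] += 1
            break
        else:
            break  # no remaining candidate satisfies both rules
    return selected


ranked = [{"video_id": i, "creator_id": "a", "topic": "cooking"} for i in range(5)]
ranked += [
    {"video_id": 5, "creator_id": "b", "topic": "cooking"},
    {"video_id": 6, "creator_id": "c", "topic": "travel"},
    {"video_id": 7, "creator_id": "d", "topic": "cooking"},
]
page = enforce_diversity(ranked)
print([v["video_id"] for v in page])
```

A deferred video is not discarded: it stays in the pool and is reconsidered for every later slot, so the ranker's ordering is preserved wherever the rules allow it.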

Cold Start Solutions

New video cold start: The item tower takes video metadata - title token embeddings, category, channel statistics, thumbnail visual embedding from a pre-trained vision model, upload time - and produces a meaningful embedding within 5 minutes of upload. The embedding improves as early interaction data arrives and triggers re-embedding. New videos also enter the trending pool if their early engagement velocity is high.

New user cold start: Users with fewer than 5 watch events are served from the "best content by demographic" pool - curated lists of high-quality content segmented by rough user attributes (geography, device, language). As users interact, they exit the cold-start pool and transition to personalized recommendations.
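
The new-user routing rule can be sketched as a threshold check plus a segment lookup. The threshold of 5 watch events comes from the description above; the fallback order from specific to general segments, the pool layout, and the function names are assumptions made for this example.

```python
COLD_START_THRESHOLD = 5  # users with fewer watch events are treated as cold


def pick_strategy(watch_event_count: int) -> str:
    """Choose between the demographic pool and personalized recommendations."""
    if watch_event_count < COLD_START_THRESHOLD:
        return "demographic_pool"
    return "personalized"


def cold_start_pool(pools: dict, geography: str, language: str, device: str) -> list:
    """Look up the most specific curated pool available, then fall back."""
    fallback_keys = [
        (geography, language, device),
        (geography, language),
        (geography,),
        (),  # global default pool
    ]
    for key in fallback_keys:
        if key in pools:
            return pools[key]
    return []


# Hypothetical curated pools keyed by segment tuple.
pools = {("US", "en"): ["v1", "v2"], (): ["global1"]}
print(pick_strategy(3), cold_start_pool(pools, "US", "en", "mobile"))
print(pick_strategy(12))
```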

Session-Based vs Long-Term Modeling

YouTube solves a subtle tension: the user's current session context (they are binge-watching cooking tutorials) may differ from their long-term taste profile (they also love documentary films). The system maintains two user representations:

Long-term taste vector: Average of the user's last 10,000 watch events, updated daily.
Session vector: A Transformer model over the last 20 videos watched in the current session, updated at each recommendation request.

The final user vector is a learned weighted combination of both, where the session weight increases as the session lengthens.
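
One simple way to realize such a combination is a sigmoid gate on session length. In the sketch below, `slope` and `bias` are placeholders for parameters that would be learned in a real system, and the two-dimensional vectors are toy stand-ins for the taste and session vectors.

```python
import numpy as np


def session_weight(session_length: int, slope: float = 0.3, bias: float = -2.0) -> float:
    """Sigmoid gate: the session vector gains weight as the session gets longer."""
    return float(1.0 / (1.0 + np.exp(-(slope * session_length + bias))))


def blend_user_vectors(long_term: np.ndarray, session: np.ndarray, session_length: int) -> np.ndarray:
    """Convex combination of the two user vectors, rescaled to unit length."""
    alpha = session_weight(session_length)
    mixed = alpha * session + (1.0 - alpha) * long_term
    return mixed / np.linalg.norm(mixed)


# Toy vectors: axis 0 stands for long-term taste, axis 1 for the current session.
long_term = np.array([1.0, 0.0])
session = np.array([0.0, 1.0])

early = blend_user_vectors(long_term, session, session_length=0)
late = blend_user_vectors(long_term, session, session_length=20)
print(early, late)
```

At the start of a session the blended vector stays close to the long-term profile; twenty videos into a cooking binge it points almost entirely along the session direction.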

Production Engineering Notes

Training Data Pipeline

Training data is collected from the recommendation system's own exposure decisions - a feedback loop by design. Mitigation strategies:

Propensity correction: weight training examples by inverse exposure probability
Exploration bucket: 5% of traffic receives random recommendations, generating unbiased labels
Position bias correction: train a position bias model and remove position effects from labels
Watch-time correction: only count watch time where the user actively watched (remove background tab plays)
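
Propensity correction can be illustrated with a weighted log loss. This is a sketch under stated assumptions: the propensities are taken as given (in practice they must be logged or estimated), and clipping the weights at `max_weight` is a common variance-control choice that the text above does not specify.

```python
import math


def ipw_weights(propensities, max_weight=20.0):
    """Inverse propensity weights, clipped so rare exposures cannot dominate."""
    return [min(1.0 / p, max_weight) for p in propensities]


def weighted_log_loss(labels, predictions, weights):
    """Binary cross-entropy where each example counts in proportion to its weight."""
    eps = 1e-7
    total = 0.0
    for y, p, w in zip(labels, predictions, weights):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / sum(weights)


# Two logged impressions: one very likely to be shown, one rarely shown.
propensities = [0.5, 0.01]
labels = [1, 1]
predictions = [0.9, 0.4]

weights = ipw_weights(propensities)
loss = weighted_log_loss(labels, predictions, weights)
print(weights, round(loss, 4))
```

The rarely shown item ends up with ten times the weight of the frequently shown one, so a poor prediction on it costs the model far more than its raw frequency in the logs would suggest.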

Model Freshness

Training frequency:

Two-tower retrieval model: weekly retraining, embeddings refreshed daily
Pre-ranker: daily retraining
Full ranker: continuous online learning with mini-batch updates every hour, full retrain weekly

Common Mistakes

Mistake: Optimizing only for watch time as the primary objective.

Pure watch time optimization produces recommendation systems that recommend long videos regardless of quality (a longer video accumulates more watch time), push users toward increasingly extreme content, and exploit addictive content patterns. YouTube learned this lesson publicly. The ranking objective must include satisfaction signals, completion rates, and explicit "not interested" signals alongside watch time.

Mistake: Serving home page and "Up Next" recommendations with the same model.

Home page recommendations optimize for cold session discovery - you don't know what the user wants right now. "Up Next" recommendations must be highly contextually relevant - you know exactly what they are watching. The feature set, training objective, and acceptable diversity levels differ significantly. Build separate models or at least separate feature pipelines for each surface.

Tip: Implement the "interest evolution" model early.

User interests change over time. A model trained on the user's full watch history will recommend content relevant to their 2019 interests, which may be irrelevant in 2024. Time-decay the watch history (recent watches count more) and track interest velocity (how fast their topic interests are shifting). This dramatically improves recommendations for users with evolving tastes.
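
Time decay can be sketched with an exponential half-life over watch events. The 90-day half-life, the `(topic, day)` event format, and the function name are assumptions for illustration; interest velocity is not covered here.

```python
from collections import defaultdict


def decayed_topic_profile(watch_events, now_days, half_life_days=90.0):
    """Topic distribution where each watch loses half its weight per half-life."""
    profile = defaultdict(float)
    for topic, watched_at_days in watch_events:
        age_days = now_days - watched_at_days
        profile[topic] += 0.5 ** (age_days / half_life_days)
    total = sum(profile.values())
    return {topic: weight / total for topic, weight in profile.items()}


# Three gaming watches five years ago, one cooking watch 25 days ago.
watch_events = [("gaming", 0), ("gaming", 1), ("gaming", 2), ("cooking", 1800)]
profile = decayed_topic_profile(watch_events, now_days=1825)
print(profile)
```

Without decay the profile would be 75% gaming; with decay the single recent cooking watch dominates, which matches what the user is interested in now.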

Interview Q&A

Q: Walk me through your design for YouTube recommendations. Start with requirements.

A: I'd start with constraints: 800M videos, 2.7B users, sub-200ms latency. This immediately tells me no single model can score all videos per request - I need a cascade architecture. The functional requirements are home page recommendations, "Up Next", subscription freshness, and cold start handling. My design is a four-stage cascade: (1) Multi-signal candidate generation using two-tower neural retrieval, subscription signals, trending pool, and exploration - produces 5-7K candidates in 25ms. (2) Lightweight pre-ranking MLP that reduces to 500 candidates in 20ms using only precomputed features. (3) Full MTL neural ranker (DLRM architecture) optimizing simultaneously for watch time, CTR, satisfaction, and completion rate - reduces to 100 candidates in 80ms. (4) Deterministic post-processing for diversity, content policy, and freshness - produces final 24 results in 15ms. Total: about 140ms, leaving margin for network and serving overhead.
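
The latency arithmetic in this answer is easy to check. The stage budgets below are the ones quoted in the answer; the dictionary keys are only labels chosen for this example.

```python
# Per-stage latency budgets in milliseconds, as quoted in the answer.
STAGE_BUDGET_MS = {
    "candidate_generation": 25,
    "pre_ranking": 20,
    "full_ranking": 80,
    "post_processing": 15,
}
END_TO_END_BUDGET_MS = 200

total_ms = sum(STAGE_BUDGET_MS.values())
headroom_ms = END_TO_END_BUDGET_MS - total_ms

print(f"pipeline total: {total_ms} ms, headroom: {headroom_ms} ms")
```

The stages run sequentially, so their budgets add up, and the remaining headroom has to cover network hops, feature fetching, and serialization.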

Q: How do you handle the cold start problem for new videos uploaded by small creators?

A: Cold start for items is actually well-handled by the two-tower architecture - the item tower takes content features (title embeddings, category, thumbnail visual features, channel quality signals) and produces an embedding even with zero interaction history. A new video gets embedded within 5 minutes of upload. The embedding will be noisy initially but it is not zero. Additionally: new videos from subscribed channels enter the subscription retrieval pool directly, bypassing the neural retrieval stage. New videos that show high early engagement velocity (high CTR in first 100 impressions) are fast-promoted to the trending pool. New videos from new creators with no channel history are represented using channel-category priors - a cooking video from a new creator gets a content embedding similar to established cooking content, which is a reasonable starting point.

Q: How do you prevent the recommendation system from creating a feedback loop that amplifies popular content?

A: Three mechanisms. First, maintain a permanent 5% exploration bucket where random videos are shown, generating unbiased training signal about the full catalog. Second, apply inverse propensity weighting to all training data - the training loss for each example is weighted by the inverse probability that the system would have shown it. Items that are rarely shown (low propensity) receive higher training weights, preventing the model from ignoring unpopular content. Third, monitor catalog coverage metrics continuously - what fraction of the catalog receives at least one recommendation per week? If coverage declines below a threshold, increase the exploration rate automatically. The diversity enforcement in post-processing also helps by hard-capping the number of videos from any single creator on the home page.
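
The coverage monitor described in this answer can be sketched in a few lines. The 30% coverage threshold, the step size, and the 10% ceiling on the exploration rate are illustrative assumptions, as are the function names.

```python
def catalog_coverage(recommended_ids, catalog_size: int) -> float:
    """Fraction of the catalog that was recommended at least once in the window."""
    return len(set(recommended_ids)) / catalog_size


def adjust_exploration_rate(current_rate, coverage, min_coverage=0.30,
                            step=0.01, max_rate=0.10):
    """Raise the exploration rate one step when coverage drops below the threshold."""
    if coverage < min_coverage:
        return min(current_rate + step, max_rate)
    return current_rate


# One week of logged recommendations over a toy catalog of 10 videos.
weekly_recommendations = [1, 2, 2, 3, 1, 1, 2]
coverage = catalog_coverage(weekly_recommendations, catalog_size=10)
new_rate = adjust_exploration_rate(current_rate=0.05, coverage=0.20)
print(coverage, new_rate)
```

Capping the rate matters because every exploration impression is one the ranker considers worse than its own pick, so the exploration budget trades short-term quality for better training data.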

Q: How would you evaluate the recommendation system beyond CTR and watch time?

A: The north star metric is long-term user retention - do users return to the platform tomorrow, next week, next month? This is hard to optimize for directly because it requires long experiments, so we use a hierarchy of proxy metrics. For the short-term experiment, we measure: watch time per session (primary), satisfaction survey response rate, completion rate, "not interested" rate (a negative signal), and diversity index (are we broadening or narrowing user content exposure?). For longer-term monitoring, we run a permanent 1% holdout that never receives new model updates - comparing this holdout to production tells us the compounding long-term value of all model improvements. We also track ecosystem health metrics: creator content production rate (are creators growing because recommendations reward quality?) and user diversity index (are users' tastes expanding over time?).
