Skip to main content

:::tip 🎮 Interactive Playground Visualize this concept: Try the Recommendation System Design demo on the EngineersOfAI Playground - no code required. :::

Designing a Recommendation System at Scale

The Scale Problem

The engineering team at a streaming platform has just crossed 100 million active users and 50 million items in the catalog. The recommendation system - a collaborative filtering model trained weekly - is showing its age. Its offline AUC looks fine. But engineers have noticed that users consistently get recommendations for items they have already seen, items from categories they have never shown interest in, and nothing from the 30% of the catalog added in the past month. Engagement is down 8% quarter-over-quarter.

The architecture meeting goes through the symptoms. The weekly training cycle means the system is blind to items added in the past week. The collaborative filtering approach cannot generate recommendations for items with no interaction history - the new-item cold start problem. And because the model scores all 50 million items for each user at serving time (a brute-force pass over the catalog), it can only run the scoring during nightly precomputation - which means no personalization based on what the user has done in the current session.

The decision: replace the monolithic recommendation model with a two-stage architecture. Stage 1 retrieves a set of 1,000 candidate items from 50 million in under 10 milliseconds. Stage 2 scores those 1,000 items with a rich, expensive model in under 50 milliseconds. Total serving latency: under 100 milliseconds, including feature retrieval. This is the dominant architecture at YouTube, Netflix, Spotify, and LinkedIn. This lesson builds it from scratch.


Requirements

Functional requirements:

  • Given a user and a context (page, time of day, device), return a ranked list of N items
  • Support multiple recommendation surfaces: home feed, "more like this," search, email digest
  • Handle new users (cold start) and new items (cold start)

Non-functional requirements:

  • Serving latency: p99 under 200ms end-to-end
  • Scale: 10 million requests per hour at peak
  • Freshness: recommendations reflect user behavior from the past 15 minutes
  • Coverage: every item in the catalog with at least 10 interactions must be recommendable

The Two-Stage Architecture

No single model can be both fast enough to scan 50 million items and accurate enough to produce good rankings. The solution: a staged funnel.


Stage 1: Candidate Generation

Two-Tower Model

The two-tower model is the dominant architecture for large-scale candidate retrieval. It encodes users and items into the same embedding space. At serving time, you retrieve the top-K nearest item embeddings to the user embedding using Approximate Nearest Neighbor (ANN) search.

import torch
import torch.nn as nn
import torch.nn.functional as F


class UserTower(nn.Module):
"""
Encodes user context into a dense embedding.
Input: user features (demographics, history, context)
Output: L2-normalized embedding of dimension D
"""

def __init__(self, num_users: int, embedding_dim: int = 256):
super().__init__()
self.user_embedding = nn.Embedding(num_users, 64)
self.mlp = nn.Sequential(
nn.Linear(64 + 32, 512), # 64 user emb + 32 context features
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(512, 256),
nn.ReLU(),
nn.Linear(256, embedding_dim),
)

def forward(self, user_ids: torch.Tensor, context_features: torch.Tensor):
user_emb = self.user_embedding(user_ids)
x = torch.cat([user_emb, context_features], dim=1)
output = self.mlp(x)
return F.normalize(output, dim=1) # L2 normalize for cosine similarity


class ItemTower(nn.Module):
"""
Encodes item attributes into the same embedding space as users.
"""

def __init__(self, num_items: int, num_categories: int, embedding_dim: int = 256):
super().__init__()
self.item_embedding = nn.Embedding(num_items, 64)
self.category_embedding = nn.Embedding(num_categories, 32)
self.mlp = nn.Sequential(
nn.Linear(64 + 32 + 16, 512), # id + category + scalar features
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(512, 256),
nn.ReLU(),
nn.Linear(256, embedding_dim),
)

def forward(
self,
item_ids: torch.Tensor,
category_ids: torch.Tensor,
scalar_features: torch.Tensor,
):
item_emb = self.item_embedding(item_ids)
cat_emb = self.category_embedding(category_ids)
x = torch.cat([item_emb, cat_emb, scalar_features], dim=1)
output = self.mlp(x)
return F.normalize(output, dim=1)


class TwoTowerModel(nn.Module):
"""
Combined two-tower model.
Training: contrastive loss (positive item vs sampled negatives).
Serving: user tower online + item tower offline (precomputed).
"""

def __init__(self, num_users: int, num_items: int, num_categories: int):
super().__init__()
self.user_tower = UserTower(num_users)
self.item_tower = ItemTower(num_items, num_categories)

def forward(
self,
user_ids: torch.Tensor,
context: torch.Tensor,
pos_item_ids: torch.Tensor,
pos_categories: torch.Tensor,
pos_scalars: torch.Tensor,
neg_item_ids: torch.Tensor,
neg_categories: torch.Tensor,
neg_scalars: torch.Tensor,
):
user_emb = self.user_tower(user_ids, context)
pos_emb = self.item_tower(pos_item_ids, pos_categories, pos_scalars)
neg_emb = self.item_tower(neg_item_ids, neg_categories, neg_scalars)

# Scores: dot product (embeddings are L2-normalized, so this = cosine sim)
pos_score = torch.sum(user_emb * pos_emb, dim=1) # (B,)
neg_scores = torch.bmm(
neg_emb.view(neg_emb.size(0), -1, neg_emb.size(-1)),
user_emb.unsqueeze(-1),
).squeeze(-1) # (B, num_negatives)

# In-batch softmax loss (sampled softmax)
logits = torch.cat([pos_score.unsqueeze(1), neg_scores], dim=1)
labels = torch.zeros(logits.size(0), dtype=torch.long)
loss = F.cross_entropy(logits, labels)
return loss


def train_two_tower(model, dataloader, epochs: int = 5):
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(epochs):
total_loss = 0.0
for batch in dataloader:
optimizer.zero_grad()
loss = model(**batch)
loss.backward()
optimizer.step()
total_loss += loss.item()
print(f"Epoch {epoch+1}: loss={total_loss/len(dataloader):.4f}")

ANN Index: Serving Item Embeddings at Scale

After training, precompute all item embeddings and index them in FAISS or ScaNN for sub-10ms ANN retrieval.

import faiss
import numpy as np
import torch
from pathlib import Path


class ANNItemIndex:
"""
FAISS-based ANN index for item embedding retrieval.
Precomputed offline; served in memory for online retrieval.
"""

def __init__(self, embedding_dim: int = 256, use_gpu: bool = True):
self.embedding_dim = embedding_dim
self.use_gpu = use_gpu
self.index = None
self.item_ids = None

def build_index(
self,
item_tower: ItemTower,
item_dataset,
nlist: int = 1000, # number of Voronoi cells for IVF
nprobe: int = 64, # cells to search at query time
) -> None:
"""
Compute all item embeddings and build a FAISS IVFFlat index.
IVFFlat: approximate search with configurable accuracy/speed tradeoff.
nprobe=64 gives good recall (~95%) with 10-100x speedup over exact search.
"""
# Compute embeddings for all items
all_embeddings = []
all_item_ids = []
item_tower.eval()

with torch.no_grad():
for batch in item_dataset:
embs = item_tower(
batch["item_ids"],
batch["category_ids"],
batch["scalar_features"],
)
all_embeddings.append(embs.cpu().numpy())
all_item_ids.extend(batch["item_ids"].tolist())

embeddings = np.vstack(all_embeddings).astype("float32")
self.item_ids = np.array(all_item_ids)

# Build IVF index
quantizer = faiss.IndexFlatIP(self.embedding_dim) # inner product (= cosine for normalized vecs)
self.index = faiss.IndexIVFFlat(
quantizer, self.embedding_dim, nlist, faiss.METRIC_INNER_PRODUCT
)
self.index.nprobe = nprobe

print(f"[Index] Training on {len(embeddings):,} embeddings...")
self.index.train(embeddings)
self.index.add(embeddings)
print(f"[Index] Built index with {self.index.ntotal:,} items")

def search(
self,
user_embedding: np.ndarray,
top_k: int = 1000,
) -> list:
"""
Retrieve top-K items nearest to the user embedding.
Returns list of (item_id, similarity_score).
"""
query = user_embedding.reshape(1, -1).astype("float32")
scores, indices = self.index.search(query, top_k)

results = []
for score, idx in zip(scores[0], indices[0]):
if idx >= 0: # FAISS returns -1 for invalid results
results.append((int(self.item_ids[idx]), float(score)))
return results

def save(self, path: str) -> None:
faiss.write_index(self.index, f"{path}/faiss.index")
np.save(f"{path}/item_ids.npy", self.item_ids)

def load(self, path: str) -> None:
self.index = faiss.read_index(f"{path}/faiss.index")
self.item_ids = np.load(f"{path}/item_ids.npy")

Stage 2: Ranking

The ranking model takes the 1,000 candidates and scores them with a richer set of features - cross-features between user and item, context features, historical engagement signals. Speed matters less here (you are scoring 1,000 items, not 50 million), so you can use a deeper model.

class WideAndDeepRanker(nn.Module):
"""
Wide and Deep architecture (Google, 2016) for recommendation ranking.
Wide component: memorization (cross-product features)
Deep component: generalization (dense embeddings + MLP)
"""

def __init__(
self,
num_users: int,
num_items: int,
num_categories: int,
num_wide_features: int,
embedding_dim: int = 64,
):
super().__init__()

# Wide component: linear model on cross-product features
# Good for memorizing specific (user, item) patterns
self.wide = nn.Linear(num_wide_features, 1)

# Deep component: embeddings + MLP
self.user_embedding = nn.Embedding(num_users, embedding_dim)
self.item_embedding = nn.Embedding(num_items, embedding_dim)
self.category_embedding = nn.Embedding(num_categories, 16)

deep_input_dim = embedding_dim * 2 + 16 + 32 # + 32 dense features
self.deep = nn.Sequential(
nn.Linear(deep_input_dim, 1024),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(1024, 512),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(512, 256),
nn.ReLU(),
nn.Linear(256, 1),
)

def forward(
self,
user_ids: torch.Tensor,
item_ids: torch.Tensor,
category_ids: torch.Tensor,
dense_features: torch.Tensor,
wide_features: torch.Tensor,
) -> torch.Tensor:
# Wide path
wide_logit = self.wide(wide_features)

# Deep path
user_emb = self.user_embedding(user_ids)
item_emb = self.item_embedding(item_ids)
cat_emb = self.category_embedding(category_ids)
deep_input = torch.cat(
[user_emb, item_emb, cat_emb, dense_features], dim=1
)
deep_logit = self.deep(deep_input)

# Combined: add wide and deep logits
combined = wide_logit + deep_logit
return torch.sigmoid(combined)

Handling the Cold Start Problem

New Users

New users have no interaction history. Approaches in order of effectiveness:

  1. Onboarding flow: ask 3-5 preference questions at signup. Map answers to initial interest categories. Use popularity-based retrieval filtered by those categories.

  2. Content-based bootstrap: use demographic features (age group, location, device type) available from registration to seed initial recommendations from similar demographic segments.

  3. Popular-by-region: default to regionally trending items until the user has generated at least 10 interactions.

New Items

New items have no collaborative filtering signal. The item tower of the two-tower model handles this - it encodes items purely from content features (title, description, category, price, tags) without any interaction history.

class ContentBasedItemEncoder(nn.Module):
"""
Encode new items from content features only.
Used for items with fewer than 10 interactions.
Output is in the same embedding space as the interaction-trained item tower.
"""

def __init__(
self,
text_embedding_dim: int = 768, # sentence-transformers output
num_categories: int = 500,
output_dim: int = 256,
):
super().__init__()
self.category_embedding = nn.Embedding(num_categories, 32)
self.projection = nn.Sequential(
nn.Linear(text_embedding_dim + 32 + 16, 512),
nn.ReLU(),
nn.Linear(512, output_dim),
)

def forward(
self,
text_embeddings: torch.Tensor, # from sentence-transformer
category_ids: torch.Tensor,
scalar_features: torch.Tensor, # price, recency, etc.
) -> torch.Tensor:
cat_emb = self.category_embedding(category_ids)
x = torch.cat([text_embeddings, cat_emb, scalar_features], dim=1)
return F.normalize(self.projection(x), dim=1)

New items are indexed in the ANN index using their content-based embedding. As interactions accumulate (after 10+ interactions), the item is retrained with interaction-based supervision and its embedding is updated in the index.


Freshness and Session Awareness

The two-tower model is trained offline and produces static user embeddings. Session behavior (what the user clicked in the last 15 minutes) must be incorporated dynamically.

Session-based retrieval runs in parallel with the two-tower retrieval. It takes the items the user interacted with in the current session and retrieves items most similar to those (item-to-item similarity, precomputed offline). Session candidates are merged with two-tower candidates before ranking.

Real-time feature injection at the ranking stage: the feature service provides the user's session-level features (session click count, session category affinity, last 5 items viewed) to the ranking model. The ranking model is trained with these features, so it naturally re-weights scores based on session context.


Diversity: Preventing Filter Bubbles

A pure relevance ranker maximizes engagement signals but produces homogeneous results - the user gets 20 items from the same category they just viewed. This is bad for user experience long-term and creates filter bubbles.

from typing import Callable


def mmr_rerank(
scored_items: list,
item_embeddings: dict,
lambda_param: float = 0.5,
top_n: int = 20,
) -> list:
"""
Maximal Marginal Relevance (MMR) reranking for diversity.
Balances relevance (model score) against diversity (distance from selected items).

lambda_param=1.0: pure relevance (no diversity)
lambda_param=0.0: pure diversity (no relevance)
lambda_param=0.5: balanced
"""
selected = []
remaining = list(scored_items)

while len(selected) < top_n and remaining:
best_item = None
best_mmr_score = float("-inf")

for item_id, relevance_score in remaining:
if not selected:
mmr_score = relevance_score
else:
# Maximum similarity to any already-selected item
item_emb = item_embeddings[item_id]
max_sim = max(
float(np.dot(item_emb, item_embeddings[sel_id]))
for sel_id, _ in selected
)
mmr_score = (
lambda_param * relevance_score
- (1 - lambda_param) * max_sim
)

if mmr_score > best_mmr_score:
best_mmr_score = mmr_score
best_item = (item_id, relevance_score)

selected.append(best_item)
remaining.remove(best_item)

return selected

Production Serving Architecture

YouTube DNN Recommendation (2016): The landmark paper that popularized this architecture. Candidate generation: a two-tower neural network trained with cross-entropy loss on positive watch events, serving 100-200 candidates from millions of videos. Ranking: a separate MLP with hundreds of features, optimizing for a proxy of watch time (weighted logistic regression where positive examples are weighted by their watch duration). The serving infrastructure pre-computes video embeddings offline (updated daily), performs FAISS search online, and features are retrieved from a low-latency serving infrastructure.


:::danger Position Bias in Training Data

Recommendation training data comes from user clicks, which are heavily biased by position - users click the first item 5-10x more than the fifth item, regardless of quality. If you train a ranking model on raw click data, it learns to rank items higher simply because they were shown in position 1, not because they are actually better.

Solution: propensity weighting. For each training example, estimate the probability that this item would have been clicked if it were in position 1 (the unbiased position). Divide the click weight by this propensity to correct for position bias. Alternatively, use a propensity-aware ranking model (PAL - Position Aware Learning) that explicitly models position as a bias term separate from relevance. :::

:::warning Popularity Bias in Collaborative Filtering

Collaborative filtering naturally over-represents popular items. An item with 10 million interactions has much stronger signal than an item with 1,000 interactions, so the model learns to recommend popular items much more frequently than their quality warrants. This creates a feedback loop: popular items get recommended → more interactions → model recommends them even more.

Solution: downsample popular items in training (so the loss contribution from a 10M-interaction item equals that of a 1K-interaction item), and explicitly include a freshness/recency score in the ranking model to give newer items a chance. Long-tail coverage should be a monitored metric alongside engagement metrics. :::


Interview Q&A

Q1: Why do large-scale recommendation systems use a two-stage architecture instead of a single model?

A single model cannot simultaneously score 50 million items per request within a 100ms latency budget. A model complex enough to produce high-quality rankings (deep neural network with rich cross-features) takes roughly 1ms to score one item. Scoring 50 million items would take 14 hours. A model fast enough to scan 50 million items in 10ms must be extremely simple - essentially a dot product - which cannot capture the complex interactions needed for high-quality ranking.

The two-stage solution: use a fast, approximate model (two-tower + ANN search) to reduce 50 million items to 1,000 candidates, then use a slow, accurate model to rank those 1,000. The first stage sacrifices some accuracy for massive speed gains (ANN search retrieves top-1000 from 50M in under 10ms). The second stage sacrifices some speed for high accuracy (complex model scores 1,000 items in 30ms). Together, they meet both the latency and accuracy requirements that neither alone can satisfy.


Q2: How does the two-tower model work and how is it trained?

The two-tower model has two separate neural networks (towers) - one for users, one for items - that map their respective inputs to embeddings in a shared D-dimensional space. Users and items that are relevant to each other should have high cosine similarity (small angle) in this space. The model is trained with sampled softmax loss: for each positive (user, item) pair in the training data (a user interaction), you sample K random items as negatives, compute cosine similarity between the user embedding and all K+1 items, and train with cross-entropy loss to push the positive item to the top.

At serving time, the item tower is run offline over all items (daily), producing a static embedding for each item. These are indexed in FAISS. At query time, only the user tower runs (in real time, 1-2ms), producing the user embedding. FAISS performs ANN search to find the top-1000 items with highest cosine similarity. The separation of user (online) and item (offline) computation is the key serving efficiency insight.


Q3: How do you handle the cold start problem for new items in a recommendation system?

New items have no interaction history, so collaborative filtering cannot generate embeddings for them. Three approaches, used in combination:

First, the item tower of the two-tower model is trained on item content features (title, description, category, price, image features) in addition to item ID. Even for a brand-new item with no interactions, you can compute an embedding from its content. This embedding is less accurate than one informed by interactions, but it is much better than nothing.

Second, for items with 10-100 interactions, use a mixture of the content embedding and a weak collaborative filtering signal (few-shot embedding update). Retrain the item embedding more frequently for new items (hourly) than for established items (daily).

Third, implement an explicit exploration policy: deliberately show new items to a small fraction of users who have shown interest in the item's category, collect interaction signal, and use that signal to bootstrap the collaborative filtering embedding faster. The cost is slightly suboptimal recommendations for the exploration users; the benefit is faster high-quality recommendations for all users once the item has sufficient signal.


Q4: How does YouTube measure recommendation quality? What metrics do they use?

YouTube (per the 2016 DNN recommendation paper) optimizes primarily for watch time rather than clicks. A video that gets a lot of clicks but is watched for only 5 seconds is not a quality recommendation - it is clickbait. Watch time is a better proxy for user satisfaction.

In practice, recommendation systems measure a hierarchy of metrics: (1) Online engagement metrics: click-through rate (CTR), watch time, return visits, explicit ratings. These are the north-star business metrics; (2) Ranking metrics (offline): Normalized Discounted Cumulative Gain (NDCG), Mean Reciprocal Rank (MRR), Recall@K - how often does the correct item appear in the top-K recommendations; (3) Diversity metrics: intra-list diversity (average pairwise distance between recommended items), coverage (what fraction of the catalog gets recommended at least once per week); (4) Fairness metrics: whether items from minority content categories are systematically under-recommended.

None of these metrics alone tells the full story. A system that maximizes CTR without watching time is optimizing for clickbait. A system that maximizes diversity without relevance is unhelpful. Production systems optimize a weighted combination.


Q5: How do you serve recommendations for 10 million requests per hour within 200ms?

10 million requests per hour is roughly 2,800 requests per second. At a p99 target of 200ms, you need enough serving capacity to handle bursts (assume 3x average = 8,400 RPS at peak).

Serving architecture: an API gateway routes recommendation requests to the candidate generation service and ranking service. Both services are horizontally scaled behind load balancers. The candidate generation service runs the user tower model in CPU (fast enough), performs FAISS ANN search (FAISS is highly optimized, one index handles thousands of QPS), and fetches session-based candidates from Redis. The ranking service loads the ranking model into GPU memory (or CPU if latency allows) and scores 1,000 items per request. Feature retrieval is batched: fetch features for all 1,000 candidate items in one pipeline call to Redis.

For 8,400 RPS: roughly 20 candidate generation nodes (each handling ~500 RPS) and 40 ranking nodes (GPU-based, each handling ~200 RPS at 50ms per request). Total infrastructure: 60 nodes at peak. With autoscaling, you right-size this for actual traffic patterns rather than provisioning for peak 24/7.


Summary

A production recommendation system at scale uses a two-stage funnel: fast candidate generation (ANN search with two-tower model embeddings, retrieving 1,000 from 50 million in under 10ms) followed by expensive ranking (Wide and Deep model scoring 1,000 items with rich features in 30ms). Cold start for new items is handled by content-based embeddings from the item tower, which does not require interaction history. Freshness comes from parallel session-based candidate retrieval and real-time feature injection at the ranking stage. Diversity is maintained through MMR reranking in post-processing. The serving path is horizontally scaled to handle millions of requests per hour, with strict latency budgets at each stage summing to the 200ms end-to-end target.

© 2026 EngineersOfAI. All rights reserved.