

Two-Tower Models

The Day Matrix Factorization Broke YouTube

It is 2016 and the YouTube recommendations team is staring at a number they cannot explain. Their matrix factorization system - the same architecture that had served them well for five years - is struggling. The corpus has grown to over 800 million videos. Users are uploading 500 hours of video every minute. New videos have no watch history, so the factorization model gives them near-zero scores. Popular videos from two years ago are crowding out fresh content because their embedding vectors are well-trained from millions of interactions.

The deeper problem is architectural. Matrix factorization represents both users and items as latent vectors in a shared space, but it learns those vectors from the co-occurrence matrix of who watched what. You cannot add a new video to the matrix without retraining. You cannot incorporate rich side information - video title, thumbnail features, upload recency - without hacking the model in ways that hurt its core collaborative signal. The model is a snapshot of history, and YouTube is a real-time business.

The team builds a neural network with two separate sub-networks: one that encodes user context into a 256-dimensional vector, another that encodes video features into the same 256-dimensional space. Both towers are trained jointly so that videos a user would watch end up close to that user's vector in the shared embedding space. At serving time, the video embeddings are pre-computed and indexed in a nearest-neighbor structure. A user query takes 1 millisecond. The catalog has 800 million entries. The system serves recommendations in under 10 milliseconds end-to-end.

This is the two-tower model. It became the dominant architecture for large-scale retrieval at Google, Meta, Spotify, LinkedIn, and Airbnb within three years of YouTube's 2016 paper, "Deep Neural Networks for YouTube Recommendations" (Covington et al.). Understanding it deeply - not just what it does but why each design decision exists - is the difference between copying an architecture and being able to design one from scratch.

The insight that made two-tower possible was simple but profound: decouple what you can precompute from what you must compute at query time. Item embeddings do not change between user requests. User context changes with every query. Separating these two computations into two towers lets you precompute one half of the dot product problem for the entire catalog and perform only the second half at inference.

Why This Exists

The Problem with Matrix Factorization

Classic collaborative filtering via matrix factorization (MF) decomposes the user-item interaction matrix $R \in \mathbb{R}^{m \times n}$ into two factor matrices:

$$R \approx U \cdot V^T \quad \text{where } U \in \mathbb{R}^{m \times k},\ V \in \mathbb{R}^{n \times k}$$

This is elegant and it works remarkably well - MF variants were central to the winning Netflix Prize solutions. But MF has three structural failures at YouTube scale:

The cold start problem. A new video has no interaction history. Its column in $R$ is all zeros, so its row of the item factor matrix $V$ receives no gradient signal. You get meaningless embeddings for fresh content, and fresh content is YouTube's competitive advantage.

The feature poverty problem. MF only sees the interaction signal. It does not know that a video is about cooking, that it was uploaded yesterday, that it has a high click-through rate on its thumbnail, or that it is 47 seconds long. Incorporating side features requires auxiliary models or ugly hacks that break the elegant factorization structure.

The staleness problem. MF is a batch process. You train on historical data, deploy the model, and serve stale embeddings until the next training run. A video that goes viral at 9 AM is invisible to recommendations until the next day's training job completes.
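The cold start failure can be seen directly in the gradient. The following is a minimal NumPy sketch, assuming a squared-error objective evaluated only on observed interactions (one common MF setup, not necessarily the one YouTube used); the toy matrix and sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 4, 3, 2                      # users, items, latent dim (toy sizes)
U = rng.normal(size=(m, k))            # user factors
V = rng.normal(size=(n, k))            # item factors

# Toy interaction matrix: item 2 is a "new video" with no interactions.
R = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 0],
              [1, 0, 0]], dtype=float)
observed = R > 0                       # assumption: train on observed entries only

# Gradient of 0.5 * sum_observed (U V^T - R)^2 with respect to V.
err = (U @ V.T - R) * observed         # (m, n), zero wherever nothing was observed
grad_V = err.T @ U                     # (n, k)

print("gradient for item 0:", grad_V[0])
print("gradient for new item 2:", grad_V[2])   # all zeros: nothing to learn from
```

The row of `V` for the new item never moves from its random initialization, which is why its scores are meaningless until interactions arrive.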

Why Two-Tower Solves These Failures

Two-tower addresses each failure directly:

  • Cold start: The item tower takes video features as input. A new video with zero interactions still has a title, description, category, and thumbnail. The item tower produces a meaningful embedding immediately.
  • Feature richness: Both towers are neural networks. Feed them anything: user watch history, search queries, demographic features, item metadata, video transcripts, visual features. The architecture is agnostic to the input modality.
  • Freshness: Item embeddings can be recomputed incrementally. A new video gets its embedding computed within minutes of upload without retraining the full model.

Historical Context

The two-tower architecture appeared in the deep learning era of recommendation, formalized in the 2019 Google paper "Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations" (Yi et al., RecSys 2019), though the underlying dual encoder concept appeared earlier in information retrieval (Gillick et al., 2018 on MIPS for open-domain question answering).

The same architecture was independently published as DSSM (Deep Structured Semantic Model) by Microsoft Research in 2013 for learning query-document similarity, and as the Neural Code Search architecture by Facebook in 2018. By 2020, every major tech company had published variants: Google's MIPS-optimized two-tower, Facebook's FAISS-backed EBR (Embedding Based Retrieval), LinkedIn's two-tower for job recommendations, and Airbnb's dual encoder for listing search.

The key insight that unified all these systems: maximum inner product search (MIPS) is the inference primitive that makes two-tower tractable. Once you reduce recommendations to a dot product between a query vector and a catalog of item vectors, you can use approximate nearest neighbor (ANN) search to find the top-k items in sub-linear time.
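To make the MIPS primitive concrete, here is a minimal NumPy sketch of the exact (brute-force) version that ANN libraries approximate. The function name `exact_mips` and the random data are illustrative assumptions, not part of any library.

```python
import numpy as np

def exact_mips(user_vec, item_vecs, k):
    """Return (indices, scores) of the k items with the largest inner product."""
    scores = item_vecs @ user_vec                    # (N,) one dot product per item
    top = np.argpartition(-scores, k - 1)[:k]        # unordered top-k in O(N)
    top = top[np.argsort(-scores[top])]              # sort only the k survivors
    return top, scores[top]

rng = np.random.default_rng(0)
item_vecs = rng.normal(size=(10_000, 64)).astype(np.float32)  # precomputed offline
user_vec = rng.normal(size=64).astype(np.float32)             # computed per request

top_ids, top_scores = exact_mips(user_vec, item_vecs, k=5)
print(top_ids, top_scores)
```

This scan is linear in the catalog size; ANN indexes trade a small amount of recall to avoid touching every item.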

Core Concepts

Architecture Overview

The Training Objective

Two-tower models are typically trained with a softmax cross-entropy loss over the full item catalog, or an approximation of it. Given a user $u$ and a positive item $v^+$ (an item the user interacted with), the model should assign a higher score to $v^+$ than to all negative items.

The score is the inner product (or cosine similarity) between the user and item vectors:

$$s(u, v) = \langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^T \mathbf{v}$$

The softmax loss over the full catalog is:

$$\mathcal{L} = -\log \frac{\exp(s(u, v^+))}{\sum_{v \in \mathcal{V}} \exp(s(u, v))}$$

The denominator sums over all $|\mathcal{V}|$ items - at YouTube scale, this is 800 million terms. Computing this exactly is intractable. In practice, this is approximated using in-batch negative sampling: the negatives for each positive pair are the other items in the same training batch.

Sampling Bias Correction

In-batch negatives introduce a sampling bias. Popular items appear more frequently as negatives simply because they appear more often in training data. The model learns to push popular items away from all users, even users who would genuinely enjoy them. This is the popularity bias problem in negative sampling.

The Yi et al. (2019) correction adjusts the logit for item $v$ by subtracting the log of its sampling probability:

$$s_{\text{corrected}}(u, v) = s(u, v) - \log p(v)$$

where $p(v)$ is the probability that item $v$ appears in a batch, estimated from a streaming frequency counter. Items that appear frequently as negatives get a lower adjusted logit inside the training softmax, which offsets the extra penalty they would otherwise collect from being sampled so often. The correction is applied only in the training loss; serving still ranks by the raw score $s(u, v)$.
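The correction is a one-line change to the in-batch softmax. The sketch below is a NumPy illustration, assuming the sampling probabilities are already available as an array `item_probs` (in a real system they would come from the streaming frequency estimate); the function name and toy data are hypothetical.

```python
import numpy as np

def in_batch_softmax_loss(user_vecs, item_vecs, item_probs, temperature=0.05):
    """In-batch softmax loss with the log-probability correction on each item column."""
    logits = user_vecs @ item_vecs.T / temperature          # (B, B), diagonal = positives
    logits = logits - np.log(item_probs)[None, :]           # s(u, v) - log p(v)
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))                   # positives sit on the diagonal

rng = np.random.default_rng(0)
B, d = 4, 8
users = rng.normal(size=(B, d))
items = rng.normal(size=(B, d))
users /= np.linalg.norm(users, axis=1, keepdims=True)
items /= np.linalg.norm(items, axis=1, keepdims=True)

uniform = np.full(B, 0.25)                  # every item equally likely to be sampled
skewed = np.array([0.7, 0.1, 0.1, 0.1])     # item 0 is very popular

print("uniform p(v):", in_batch_softmax_loss(users, items, uniform))
print("skewed p(v): ", in_batch_softmax_loss(users, items, skewed))
```

With uniform probabilities the correction shifts every logit by the same constant and the loss is unchanged; it only matters when item frequencies differ.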

Hard Negative Mining

Random negatives are easy. After a few epochs, the model trivially separates positive pairs from random negatives and stops learning useful representations. Hard negatives are items that score high for a user but are not positive interactions - items the user saw but did not click, or semantically similar but contextually wrong items.

Hard negative mining strategies:

Offline hard negatives: After each training epoch, run the current model on the full catalog and retrieve the top-K items for each user. Items in the top-K that are not positive interactions become hard negatives for the next epoch. This is expensive but effective.

Online hard negatives: Within a batch, the hardest negative for user $u$ is the item $v^-$ with the highest score that is not a positive:

$$v^-_{\text{hard}} = \arg\max_{v \neq v^+} s(u, v)$$

Semi-hard negatives (from FaceNet): Negatives that are farther than the positive but still within the margin - they provide gradient signal without being so hard that they destabilize training.
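The online strategy above amounts to a masked argmax over the in-batch score matrix. A minimal NumPy sketch, assuming positives sit on the diagonal as in in-batch training (the function name and toy vectors are illustrative):

```python
import numpy as np

def hardest_in_batch_negatives(user_vecs, item_vecs):
    """For each user, return the batch index of the highest-scoring non-positive item."""
    scores = user_vecs @ item_vecs.T           # (B, B); entry [i, i] is user i's positive
    masked = scores.copy()
    np.fill_diagonal(masked, -np.inf)          # exclude each user's own positive
    return masked.argmax(axis=1)               # (B,)

users = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
items = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.2]])

hard = hardest_in_batch_negatives(users, items)
print(hard)   # item 2 is hardest for users 0 and 1; item 0 is hardest for user 2
```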

warning

Mining hard negatives too aggressively can destabilize training. If the model is not yet good, the "hardest" negatives may be nearly indistinguishable from positives, producing noisy gradient signals that hurt convergence. Start with random negatives, warm up the model, then introduce hard negatives after 2-3 epochs.

Embedding Dimensionality Tradeoffs

The embedding dimension $d$ controls the capacity of the shared representation space. Larger $d$ allows the model to represent more nuanced relationships, but it also increases:

  • Memory: An index of 800M items at $d=256$ with float32 takes $800\text{M} \times 256 \times 4$ bytes = 819 GB. At $d=64$, this drops to 205 GB.
  • Latency: Inner product computation scales linearly with $d$. ANN search time scales with $d$ as well.
  • Index build time: Quantization and IVF index construction are $O(n \cdot d)$.
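The memory figures above are easy to reproduce. The helper names below are illustrative; the PQ line assumes 16 one-byte codes per vector, matching the IVF-PQ sizing used later in this document, and counts only the stored codes.

```python
def flat_index_bytes(n_items, dim, bytes_per_value=4):
    """Raw float32 storage for an uncompressed index."""
    return n_items * dim * bytes_per_value

def pq_code_bytes(n_items, n_subquantizers, bits_per_code=8):
    """Storage for product-quantized codes only (no IDs, no centroids)."""
    return n_items * n_subquantizers * bits_per_code // 8

N = 800_000_000
for d in (64, 128, 256):
    print(f"d={d}: {flat_index_bytes(N, d) / 1e9:.1f} GB as float32")
print(f"PQ, 16 subquantizers: {pq_code_bytes(N, 16) / 1e9:.1f} GB of codes")
```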

Practical guidance from production deployments:

  • $d=64$: sufficient for most item retrieval tasks with simple categorical features
  • $d=128$: standard for dense retrieval, good quality-memory balance
  • $d=256$: used when items have rich multi-modal features (text + visual)
  • $d=512+$: research setting; rarely justified in production

Once item embeddings are computed, they are indexed in an ANN structure. The query is: given user vector $\mathbf{u}$, find the $k$ item vectors with the highest inner product score.

FAISS (Facebook AI Similarity Search): The dominant library. Supports:

  • IndexFlatIP: exact inner product search - correct but $O(n)$
  • IndexIVFFlat: inverted file index - partitions space into Voronoi cells, only searches nearest cells. Approximate. Controlled by nlist (cells) and nprobe (cells searched at query time).
  • IndexIVFPQ: adds product quantization to compress vectors - reduces memory by 8-32x with ~5-10% recall loss.
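The roles of `nlist` and `nprobe` are easier to see in a toy inverted-file index. This NumPy sketch is not how FAISS is implemented: it assumes random items as centroids instead of k-means output, and it does no compression. The function names are hypothetical.

```python
import numpy as np

def build_toy_ivf(item_vecs, nlist, rng):
    # Assumption: centroids are randomly chosen items rather than k-means centroids.
    centroids = item_vecs[rng.choice(len(item_vecs), size=nlist, replace=False)]
    assignments = (item_vecs @ centroids.T).argmax(axis=1)   # best cell by inner product
    cells = [np.flatnonzero(assignments == c) for c in range(nlist)]
    return centroids, cells

def search_toy_ivf(query, item_vecs, centroids, cells, k, nprobe):
    probe = np.argsort(-(centroids @ query))[:nprobe]        # the nprobe closest cells
    candidates = np.concatenate([cells[c] for c in probe])
    scores = item_vecs[candidates] @ query                   # score only those candidates
    return candidates[np.argsort(-scores)[:k]]

rng = np.random.default_rng(0)
item_vecs = rng.normal(size=(2000, 16))
item_vecs /= np.linalg.norm(item_vecs, axis=1, keepdims=True)
query = rng.normal(size=16)
query /= np.linalg.norm(query)

exact = np.argsort(-(item_vecs @ query))[:10]
centroids, cells = build_toy_ivf(item_vecs, nlist=32, rng=rng)

recalls = {}
for nprobe in (1, 4, 32):
    found = search_toy_ivf(query, item_vecs, centroids, cells, k=10, nprobe=nprobe)
    recalls[nprobe] = len(set(found.tolist()) & set(exact.tolist())) / 10
    print(f"nprobe={nprobe}: recall@10 = {recalls[nprobe]:.1f}")
```

Probing every cell (`nprobe == nlist`) degenerates to exact search; smaller values scan fewer candidates and can miss neighbors that fall in unprobed cells.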

ScaNN (Scalable Nearest Neighbors, Google): Optimized for maximum inner product search (MIPS) specifically. Uses anisotropic quantization that minimizes the error in the direction that matters most for ranking. Typically 30-50% faster than FAISS for MIPS tasks.

HNSW (Hierarchical Navigable Small World): Graph-based ANN. Builds a multi-layer graph of embeddings. Excellent recall at moderate latency. Good for smaller catalogs (up to ~50M items). Memory-intensive but no offline index build - supports real-time insertions.

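| Method | Build Time | Query Latency | Recall | Memory | Online Updates |
|---|---|---|---|---|---|
| Exact (FAISS Flat) | Fast | High ($O(n)$) | 100% | High | Yes |
| IVF-PQ | Medium | Low | 90-95% | Low | No |
| HNSW | Slow | Very Low | 95-99% | High (full vectors plus graph) | Yes |
| ScaNN | Medium | Very Low | 92-97% | Low | No |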

Code Examples

Two-Tower Model in PyTorch

import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Dict, List


class UserTower(nn.Module):
    """Encodes user context into a dense embedding vector."""

    def __init__(
        self,
        user_vocab_size: int,
        item_vocab_size: int,
        embed_dim: int = 64,
        output_dim: int = 128,
        dropout: float = 0.1,
    ):
        super().__init__()
        # User ID embedding
        self.user_embed = nn.Embedding(user_vocab_size, embed_dim, padding_idx=0)
        # Watch history item embeddings (same space as item tower input)
        self.history_embed = nn.Embedding(item_vocab_size, embed_dim, padding_idx=0)

        # MLP to produce final user vector
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim * 2, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(256, output_dim),
        )

    def forward(self, user_ids: torch.Tensor, watch_history: torch.Tensor) -> torch.Tensor:
        """
        Args:
            user_ids: (batch,) int tensor
            watch_history: (batch, seq_len) int tensor, padded with 0
        Returns:
            user_vec: (batch, output_dim) L2-normalized user embedding
        """
        user_vec = self.user_embed(user_ids)  # (batch, embed_dim)

        # Mean-pool over watch history, ignoring padding
        mask = (watch_history != 0).float()  # (batch, seq_len)
        history_embs = self.history_embed(watch_history)  # (batch, seq_len, embed_dim)
        history_vec = (history_embs * mask.unsqueeze(-1)).sum(dim=1)
        history_vec = history_vec / (mask.sum(dim=1, keepdim=True).clamp(min=1))

        combined = torch.cat([user_vec, history_vec], dim=-1)
        out = self.mlp(combined)
        return F.normalize(out, p=2, dim=-1)  # L2 normalize for cosine similarity


class ItemTower(nn.Module):
    """Encodes item features into a dense embedding vector."""

    def __init__(
        self,
        item_vocab_size: int,
        category_vocab_size: int,
        embed_dim: int = 64,
        output_dim: int = 128,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.item_embed = nn.Embedding(item_vocab_size, embed_dim, padding_idx=0)
        self.category_embed = nn.Embedding(category_vocab_size, embed_dim // 2, padding_idx=0)

        self.mlp = nn.Sequential(
            nn.Linear(embed_dim + embed_dim // 2 + 1, 256),  # +1 for upload_age
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(256, output_dim),
        )

    def forward(
        self,
        item_ids: torch.Tensor,
        category_ids: torch.Tensor,
        upload_age_days: torch.Tensor,
    ) -> torch.Tensor:
        """
        Returns:
            item_vec: (batch, output_dim) L2-normalized item embedding
        """
        item_vec = self.item_embed(item_ids)
        cat_vec = self.category_embed(category_ids)
        # Raw age in days, fed as a single scalar feature: (batch,) -> (batch, 1)
        age = upload_age_days.float().unsqueeze(-1)

        combined = torch.cat([item_vec, cat_vec, age], dim=-1)
        out = self.mlp(combined)
        return F.normalize(out, p=2, dim=-1)


class TwoTowerModel(nn.Module):
    """Full two-tower model with in-batch negative training."""

    def __init__(self, user_tower: UserTower, item_tower: ItemTower, temperature: float = 0.05):
        super().__init__()
        self.user_tower = user_tower
        self.item_tower = item_tower
        self.temperature = temperature

    def forward(
        self,
        user_ids: torch.Tensor,
        watch_history: torch.Tensor,
        item_ids: torch.Tensor,
        category_ids: torch.Tensor,
        upload_age_days: torch.Tensor,
    ) -> torch.Tensor:
        """
        Compute in-batch softmax loss.
        Positive pairs: (user[i], item[i])
        Negatives: all other items in the batch
        """
        user_vecs = self.user_tower(user_ids, watch_history)  # (B, d)
        item_vecs = self.item_tower(item_ids, category_ids, upload_age_days)  # (B, d)

        # Similarity matrix: (B, B) - diagonal is positive pairs
        logits = torch.matmul(user_vecs, item_vecs.T) / self.temperature  # (B, B)
        labels = torch.arange(len(user_ids), device=user_ids.device)

        # Cross-entropy: user[i] should be most similar to item[i]
        loss = F.cross_entropy(logits, labels)
        return loss

    # Call model.eval() before using the helpers below so that BatchNorm and
    # Dropout run in inference mode.
    def get_user_embedding(self, user_ids, watch_history):
        with torch.no_grad():
            return self.user_tower(user_ids, watch_history)

    def get_item_embedding(self, item_ids, category_ids, upload_age_days):
        with torch.no_grad():
            return self.item_tower(item_ids, category_ids, upload_age_days)

Building the FAISS Index

import faiss
import numpy as np
from tqdm import tqdm


def build_faiss_index(
    item_embeddings: np.ndarray,
    embedding_dim: int = 128,
    use_pq: bool = True,
    nlist: int = 4096,
    m_pq: int = 16,  # number of subquantizers
) -> faiss.Index:
    """
    Build an IVF-PQ index for billion-scale approximate nearest neighbor search.

    Args:
        item_embeddings: (N, d) float32 array of item embeddings
        embedding_dim: dimensionality of embeddings
        use_pq: if True, use product quantization to compress vectors
        nlist: number of Voronoi cells for IVF
        m_pq: number of PQ subquantizers (must divide embedding_dim)
    """
    n_items = item_embeddings.shape[0]

    # The coarse quantizer assigns vectors to cells by inner product.
    quantizer = faiss.IndexFlatIP(embedding_dim)
    if use_pq:
        # IVF-PQ: partitioned into nlist cells, each vector compressed to m_pq bytes.
        # The metric must be passed explicitly; IndexIVFPQ defaults to L2.
        index = faiss.IndexIVFPQ(
            quantizer, embedding_dim, nlist, m_pq, 8, faiss.METRIC_INNER_PRODUCT
        )
    else:
        # IVF-Flat: no compression, higher memory, higher recall
        index = faiss.IndexIVFFlat(quantizer, embedding_dim, nlist, faiss.METRIC_INNER_PRODUCT)

    # Move to GPU if available
    if faiss.get_num_gpus() > 0:
        res = faiss.StandardGpuResources()
        index = faiss.index_cpu_to_gpu(res, 0, index)

    # Train the index (learns the Voronoi cell centroids) on a random sample.
    # Taking the first rows would bias the centroids if the array is ordered,
    # and FAISS k-means warns when there are too few points per centroid.
    n_train = min(n_items, max(100_000, 64 * nlist))
    train_ids = np.random.default_rng(0).choice(n_items, size=n_train, replace=False)
    print(f"Training index on {n_train} samples...")
    index.train(np.ascontiguousarray(item_embeddings[train_ids], dtype=np.float32))

    # Add all item embeddings
    print(f"Adding {n_items} item embeddings to index...")
    batch_size = 100_000
    for i in tqdm(range(0, n_items, batch_size)):
        batch = item_embeddings[i : i + batch_size]
        index.add(np.ascontiguousarray(batch, dtype=np.float32))

    return index


def retrieve_top_k(
    index: faiss.Index,
    user_embedding: np.ndarray,
    k: int = 500,
    nprobe: int = 64,  # number of cells to search - higher = more accurate, slower
) -> tuple:
    """
    Retrieve top-k items for a user embedding.

    Returns:
        distances: (k,) similarity scores
        item_indices: (k,) indices into the item catalog
    """
    if hasattr(index, "nprobe"):
        index.nprobe = nprobe

    if user_embedding.ndim == 1:
        user_embedding = user_embedding[np.newaxis, :]  # Add batch dimension

    distances, indices = index.search(user_embedding, k)
    return distances[0], indices[0]

Offline Evaluation: Recall@K

from typing import Dict, List, Sequence

import faiss
import numpy as np


def evaluate_recall_at_k(
    model: TwoTowerModel,
    # [{"user_ids": ..., "watch_history": ..., "positive_items": [...]}]
    test_interactions: List[Dict],
    item_index: faiss.Index,
    k_values: Sequence[int] = (10, 50, 100, 500),  # tuple avoids a mutable default
) -> Dict[int, float]:
    """
    Compute Recall@K: fraction of positive items retrieved in top-K.

    For recommendation, this answers: "Of all the items a user would
    actually interact with, how many does the model surface in its
    top-K candidates?"
    """
    model.eval()
    recall_scores = {k: [] for k in k_values}
    max_k = max(k_values)

    for interaction in test_interactions:
        positive_set = set(interaction["positive_items"])
        if not positive_set:
            continue  # recall is undefined for a user with no positives

        user_embedding = model.get_user_embedding(
            interaction["user_ids"], interaction["watch_history"]
        ).cpu().numpy()  # .cpu() so this also works when the model is on GPU

        # Retrieve once at the largest K, then slice for the smaller values.
        _, retrieved_indices = retrieve_top_k(item_index, user_embedding, k=max_k)
        retrieved = retrieved_indices.tolist()  # ranked best-first

        for k in k_values:
            top_k_set = set(retrieved[:k])
            recall = len(positive_set & top_k_set) / len(positive_set)
            recall_scores[k].append(recall)

    return {k: float(np.mean(scores)) for k, scores in recall_scores.items()}

Production Engineering Notes

Embedding Index Update Strategy

Item embeddings must be kept fresh. New items should get embeddings within minutes of being added to the catalog, and embeddings should be recomputed as the model improves. Three update strategies:

Full rebuild (daily): Recompute all embeddings, rebuild the FAISS index from scratch. Simple, consistent. Works for catalogs up to ~50M items where rebuild takes under an hour. YouTube does this for less active catalog sections.

Incremental add: New items get embeddings computed immediately and added to the existing index. The index grows over time. Deletions require tombstoning (flag item as removed, filter from results). Requires a data structure that supports online inserts - HNSW is better than IVF-PQ for this.

Dual index with atomic swap: Maintain two indexes - serving and building. Build the new index offline. Swap atomically when complete. Zero-downtime updates. The standard approach for high-traffic systems.
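The dual-index pattern can be sketched in a few lines. This is a single-process illustration under stated assumptions: `SwappableIndex` and `ListIndex` are hypothetical names, and `ListIndex` stands in for a real ANN index that exposes a `search` method.

```python
import threading

class ListIndex:
    """Stand-in for a real ANN index: 'search' just returns the first k item ids."""
    def __init__(self, item_ids):
        self.item_ids = list(item_ids)

    def search(self, query, k):
        return self.item_ids[:k]

class SwappableIndex:
    """Serves from one index while a replacement is built elsewhere."""
    def __init__(self, index):
        self._index = index
        self._lock = threading.Lock()

    def search(self, query, k):
        index = self._index            # grab a reference; in-flight queries finish on it
        return index.search(query, k)

    def swap(self, new_index):
        with self._lock:               # serialize concurrent swaps
            old, self._index = self._index, new_index
        return old                     # caller can release the old index when idle

serving = SwappableIndex(ListIndex([1, 2, 3]))
before = serving.search(query=None, k=2)
old = serving.swap(ListIndex([7, 8, 9]))   # new index was built offline
after = serving.search(query=None, k=2)
print(before, after)
```

In a multi-host deployment the same idea is usually applied at the routing layer, switching traffic to replicas that have loaded the new index.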

Serving Architecture

User Tower as a service: The user tower runs as a separate microservice. It accepts user context (user ID, recent history, current query) and returns a single embedding vector. This service is stateless and horizontally scalable. Cache user embeddings with a short TTL (30-60 seconds) - user context doesn't change that fast.
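A short-TTL cache in front of the user tower might look like the sketch below. `TTLCache` is a hypothetical helper written for illustration; it has no eviction of stale keys, which a production cache would need.

```python
import time

class TTLCache:
    """Caches computed values per key and recomputes them after ttl_seconds."""
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock             # injectable so the expiry logic can be tested
        self._store = {}               # key -> (timestamp, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]              # still fresh: skip the user tower call
        value = compute(key)
        self._store[key] = (now, value)
        return value

# Usage with a fake clock and a stand-in for the user tower call.
fake_now = [0.0]
tower_calls = []

def run_user_tower(user_id):
    tower_calls.append(user_id)
    return [0.1, 0.2, 0.3]             # placeholder embedding

cache = TTLCache(ttl_seconds=30, clock=lambda: fake_now[0])
cache.get_or_compute("user-42", run_user_tower)
cache.get_or_compute("user-42", run_user_tower)   # served from cache
fake_now[0] = 31.0
cache.get_or_compute("user-42", run_user_tower)   # expired, recomputed
print(len(tower_calls))
```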

ANN Index as a service: The index runs on instances with large memory. At 800M items and $d=128$, IVF-PQ with 16 subquantizers of 8 bits each stores $800\text{M} \times 16$ bytes = 12.8 GB of codes. FAISS also keeps a 64-bit ID per vector in the inverted lists (about 6.4 GB more), so the total is close to 19 GB, which still fits on a single host with 32 GB RAM. Run multiple replicas behind a load balancer for availability.

Consistency tradeoff: User tower and item index may be out of sync. A model update retrains both towers simultaneously - but the item index rebuild takes hours while the user tower is deployed in minutes. During the transition window, user embeddings from the new model are queried against item embeddings from the old model. This mismatch degrades quality. Coordinate deployments: deploy user tower + new index atomically.

Hard Negative Mining in Practice

def generate_hard_negatives(
    model: TwoTowerModel,
    item_index: faiss.Index,
    user_interactions: List[Dict],
    k_hard: int = 200,
    hard_negative_ratio: float = 0.5,
) -> List[Dict]:
    """
    For each user, retrieve top-K items and use non-positive items as hard negatives.
    Mix with random negatives to avoid training instability.
    """
    model.eval()  # BatchNorm/Dropout must be in inference mode for single-user batches
    rng = np.random.default_rng()
    n_items = item_index.ntotal  # faiss.Index has no __len__; ntotal is the item count
    augmented_interactions = []

    for interaction in user_interactions:
        user_emb = model.get_user_embedding(
            interaction["user_ids"], interaction["watch_history"]
        ).cpu().numpy()

        _, top_k_indices = retrieve_top_k(item_index, user_emb, k=k_hard)
        positive_set = set(interaction["positive_items"])

        # Items in top-K that are NOT positive = hard negatives.
        # FAISS pads missing results with -1, so drop those as well.
        hard_negs = [
            idx for idx in top_k_indices.tolist()
            if idx != -1 and idx not in positive_set
        ]

        n_hard = int(len(hard_negs) * hard_negative_ratio)
        n_random = len(hard_negs) - n_hard
        selected_hard = hard_negs[:n_hard]
        # Uniform random items from the catalog, excluding known positives.
        sampled = rng.choice(n_items, size=n_random, replace=False).tolist()
        selected_random = [idx for idx in sampled if idx not in positive_set]

        augmented_interactions.append({
            **interaction,
            "hard_negatives": selected_hard,
            "random_negatives": selected_random,
        })

    return augmented_interactions

Common Mistakes

danger

Mistake: Sharing a single encoder between the user and item towers.

If you share weights between the user tower and the item tower, the model is forced to encode user context and item content with identical representations. Users and items live in different semantic spaces. A user vector should represent "what this user likes" while an item vector should represent "what this item is." They can live in the same geometric space for dot-product comparison, but they must be produced by separate neural networks.

danger

Mistake: Not normalizing embeddings and using raw dot product.

Without L2 normalization, the dot product is dominated by the magnitude of the embedding vectors, not their direction. Items with large embedding norms (often popular items with lots of gradient updates) will rank artificially high regardless of relevance. Always L2-normalize output vectors from both towers before training and before indexing.
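A two-dimensional toy example shows the effect. The vectors are made up for illustration: `relevant` points almost exactly where the user points but has a small norm, while `popular` points elsewhere with a large norm.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

user = np.array([1.0, 0.0])
relevant = np.array([0.9, 0.1])    # well aligned with the user, small norm
popular = np.array([3.0, 4.0])     # poorly aligned, large norm

raw = {"relevant": user @ relevant, "popular": user @ popular}
cos = {"relevant": unit(user) @ unit(relevant), "popular": unit(user) @ unit(popular)}

print("raw dot product:", raw)     # popular wins on magnitude alone
print("after L2 norm:  ", cos)     # relevant wins on direction
```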

warning

Mistake: Ignoring sampling bias in in-batch negatives.

Popular items appear more often in training batches because they appear more often in interaction data. Without sampling bias correction, the model learns to push popular items down for all users - exactly backwards from what you want. Implement the Yi et al. (2019) log-frequency correction or use stratified sampling to ensure uniform negative distribution.

warning

Mistake: Setting nprobe too low during ANN search.

The nprobe parameter controls how many Voronoi cells are searched during IVF retrieval. A low nprobe is fast but misses items that lie near cell boundaries. A typical mistake is tuning nprobe at development time on a small index (where it doesn't matter much) and forgetting to re-tune on the production index. For a 4096-cell IVF index, nprobe=64 gives good recall; nprobe=1 can drop recall from 95% to 60%.

tip

Tip: Apply temperature during training only, not at serving time.

The softmax temperature $\tau$ in training controls how sharply the model discriminates between positives and negatives. A lower $\tau$ creates a harder, more focused learning signal. At serving time, the raw dot product score (without temperature scaling) is used for ranking. Train with $\tau \in [0.05, 0.1]$ for sharp learning, and skip temperature scaling at inference: dividing every score by the same positive constant does not change the ranking order.
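A quick NumPy check of that claim, using random scores as a stand-in for user-item dot products: the softmax distribution changes sharply with $\tau$, but the ordering of items does not.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)     # stand-in for raw dot-product scores
tau = 0.05

ranking_raw = np.argsort(-scores)
ranking_scaled = np.argsort(-scores / tau)

print("same ranking:", np.array_equal(ranking_raw, ranking_scaled))
print("top probability, raw:   ", softmax(scores).max())
print("top probability, scaled:", softmax(scores / tau).max())
```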

Interview Q&A

Q: Why does a two-tower model need two separate neural networks instead of one network that takes both user and item features as input?

A: The separation is motivated by inference efficiency, not model quality. A single network that jointly processes user and item features (a "cross-encoder") can capture richer interactions - it can model how a specific feature of user A interacts with a specific attribute of item B. But at serving time, you cannot precompute anything: for each user query, you must run the full network against every item in the catalog. At 800M items and a 10 ms budget, this is infeasible. Two-tower trades some modeling quality for the ability to precompute item representations offline and perform only a dot product at query time. The dot product is a bilinear interaction - it captures that user vector dimension $i$ multiplies item vector dimension $i$ - which is weaker than a cross-encoder but sufficient for candidate generation. The rich cross-attention interactions are reserved for a subsequent ranking stage that operates on a small candidate set.

Q: How do you handle the cold start problem in a two-tower system?

A: Cold start manifests differently for users vs. items. For new items, the item tower takes content features (title, category, thumbnail visual embeddings) as input, so it produces a meaningful embedding even with zero interaction history. The embedding will be noisy but not zero. For new users, the user tower can fall back to demographic features (age bracket, location, device type) or a global popularity-based embedding when watch history is empty. A common production trick is to maintain a "new user" bucket - users with fewer than 10 interactions - and serve a lightweight popularity-plus-demographic model until sufficient history is accumulated to trust the personalized two-tower output.

Q: What is the difference between recall@K and precision@K, and which matters more for the retrieval stage?

A: Recall@K measures the fraction of relevant items that appear in the top-K results. Precision@K measures the fraction of the top-K results that are relevant. For a retrieval/candidate generation stage that feeds a downstream ranker, recall@K is the primary metric. The job of retrieval is to make sure relevant items are in the candidate set - the ranker will sort out the order. If recall@500 is 80%, the downstream ranker starts from a set where 80% of genuinely relevant items are present. If precision@500 is only 10% (50 relevant out of 500), that's fine - the ranker handles filtering. Optimizing precision at retrieval is premature; it limits recall unnecessarily.
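The two metrics differ only in the denominator, as this small pure-Python sketch shows. The function names and the toy lists are illustrative.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant items that appear in the top-k retrieved list."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved list that is relevant."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / k

retrieved = [5, 9, 2, 7, 1, 8]     # ranked candidates from the retrieval stage
relevant = {2, 7, 3}               # items the user actually interacted with

print(recall_at_k(retrieved, relevant, k=4))      # 2 of 3 relevant items found
print(precision_at_k(retrieved, relevant, k=4))   # 2 of 4 candidates are relevant
```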

Q: How do you evaluate a two-tower model offline before deploying it?

A: Standard offline evaluation computes recall@K and mean reciprocal rank (MRR) on a held-out test set of user-item interactions. The procedure: (1) split interactions chronologically - train on older interactions, test on newer ones; (2) for each test user, retrieve top-K items with the model; (3) check if the user's held-out positive items appear in the top-K. Important subtlety: restrict the held-out positives to items that were already in the training catalog. Otherwise the metric mixes in cold-start performance, which is better measured as its own segment. Also measure coverage - the fraction of catalog items that ever appear in any user's top-500 - to detect filter bubble effects.

Q: A two-tower model is deployed and performing well offline, but online A/B metrics show no improvement. What do you investigate?

A: Several hypotheses in priority order. First, check if the downstream ranker is the bottleneck - if the ranker reorders the top-500 candidates aggressively, a better candidate set may not translate to better final results. Second, check the recall@K of the old vs. new model on items that actually ended up being clicked - are the new model's mistakes in a different distribution than the old model's? Third, check for a position bias issue: if the ranker gives heavy weight to the position of items in the candidate list, and the new model changes ordering within the top-500, the ranker may penalize items it doesn't recognize. Fourth, look at the cold-start segment specifically - two-tower often improves cold-start quality, which shows up in user segments that the aggregated A/B metric dilutes.

Q: What are the tradeoffs between FAISS IVF-PQ and HNSW for production ANN search?

A: IVF-PQ is better when: the catalog is very large (more than 100M items), memory is constrained (PQ compresses vectors 8-32x), and the catalog is relatively static (IVF requires an offline build step; online adds degrade index quality). HNSW is better when: the catalog changes frequently (supports real-time inserts without full rebuild), recall requirements are very high (at 95% recall, HNSW is typically faster than IVF-PQ), and memory is available (HNSW stores the full vectors plus the graph structure). In practice, most large-scale production systems use IVF-PQ for the main catalog and HNSW for a fresh-content index covering recently-added items.

© 2026 EngineersOfAI. All rights reserved.