Two-Tower Models

The Day Matrix Factorization Broke YouTube

It is 2016 and the YouTube recommendations team is staring at a number they cannot explain. Their matrix factorization system - the same architecture that had served them well for five years - is struggling. The corpus has grown to over 800 million videos. Users are uploading 500 hours of video every minute. New videos have no watch history, so the factorization model gives them near-zero scores. Popular videos from two years ago are crowding out fresh content because their embedding vectors are well-trained from millions of interactions.

The deeper problem is architectural. Matrix factorization represents both users and items as latent vectors in a shared space, but it learns those vectors from the co-occurrence matrix of who watched what. You cannot add a new video to the matrix without retraining. You cannot incorporate rich side information - video title, thumbnail features, upload recency - without hacking the model in ways that hurt its core collaborative signal. The model is a snapshot of history, and YouTube is a real-time business.

The team builds a neural network with two separate sub-networks: one that encodes user context into a 256-dimensional vector, another that encodes video features into the same 256-dimensional space. Both towers are trained jointly so that videos a user would watch end up close to that user's vector in the shared embedding space. At serving time, the video embeddings are pre-computed and indexed in a nearest-neighbor structure. A user query takes 1 millisecond. The catalog has 800 million entries. The system serves recommendations in under 10 milliseconds end-to-end.

This is the two-tower model. It became the dominant architecture for large-scale retrieval at Google, Meta, Spotify, LinkedIn, and Airbnb within three years of that 2016 paper. Understanding it deeply - not just what it does but why each design decision exists - is the difference between copying an architecture and being able to design one from scratch.

The insight that made two-tower possible was simple but profound: decouple what you can precompute from what you must compute at query time. Item embeddings do not change between user requests. User context changes with every query. Separating these two computations into two towers lets you precompute one half of the dot product problem for the entire catalog and perform only the second half at inference.

Why This Exists

The Problem with Matrix Factorization

Classic collaborative filtering via matrix factorization (MF) decomposes the user-item interaction matrix $R \in \mathbb{R}^{m \times n}$ into two factor matrices:

$R \approx U \cdot V^T \quad \text{where } U \in \mathbb{R}^{m \times k}, V \in \mathbb{R}^{n \times k}$

This is elegant and it works remarkably well - MF variants were central to the winning Netflix Prize solutions. But MF has three structural failures at YouTube scale:

The cold start problem. A new video has no interaction history. Its column in $R$ is all zeros (items index the columns, since $V \in \mathbb{R}^{n \times k}$), so the corresponding row of the factor matrix $V$ has no gradient signal to learn from. You get meaningless embeddings for fresh content, and fresh content is YouTube's competitive advantage.

The feature poverty problem. MF only sees the interaction signal. It does not know that a video is about cooking, that it was uploaded yesterday, that it has a high click-through rate on its thumbnail, or that it is 47 seconds long. Incorporating side features requires auxiliary models or ugly hacks that break the elegant factorization structure.

The staleness problem. MF is a batch process. You train on historical data, deploy the model, and serve stale embeddings until the next training run. A video that goes viral at 9 AM is invisible to recommendations until the next day's training job completes.

Why Two-Tower Solves These Failures

Two-tower addresses each failure directly:

Cold start: The item tower takes video features as input. A new video with zero interactions still has a title, description, category, and thumbnail. The item tower produces a meaningful embedding immediately.
Feature richness: Both towers are neural networks. Feed them anything: user watch history, search queries, demographic features, item metadata, video transcripts, visual features. The architecture is agnostic to the input modality.
Freshness: Item embeddings can be recomputed incrementally. A new video gets its embedding computed within minutes of upload without retraining the full model.

Historical Context

The two-tower architecture appeared in the deep learning era of recommendation, formalized in the 2019 Google paper "Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations" (Yi et al., RecSys 2019), though the underlying dual encoder concept appeared earlier in information retrieval (Gillick et al., 2018 on MIPS for open-domain question answering).

A closely related architecture had already been published as DSSM (Deep Structured Semantic Model) by Microsoft Research in 2013 for learning query-document similarity, and the same pattern appears in Facebook's Neural Code Search (2018). By 2020, most major tech companies had published variants: Google's MIPS-optimized two-tower, Facebook's FAISS-backed EBR (Embedding Based Retrieval), LinkedIn's two-tower for job recommendations, and Airbnb's dual encoder for listing search.

The key insight that unified all these systems: maximum inner product search (MIPS) is the inference primitive that makes two-tower tractable. Once you reduce recommendations to a dot product between a query vector and a catalog of item vectors, you can use approximate nearest neighbor (ANN) search to find the top-k items in sub-linear time.
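As a minimal sketch of the primitive itself, the exact (brute-force) form of MIPS fits in a few lines of NumPy. The function name `top_k_inner_product` and the toy catalog are illustrative assumptions, not part of any library; an ANN index approximates this result without scanning every row.

```python
import numpy as np

def top_k_inner_product(query: np.ndarray, items: np.ndarray, k: int) -> np.ndarray:
    """Exact MIPS: indices of the k items with the largest <query, item>."""
    scores = items @ query                      # (n,) one dot product per item
    k = min(k, scores.shape[0])
    top = np.argpartition(-scores, k - 1)[:k]   # unordered top-k in O(n)
    return top[np.argsort(-scores[top])]        # sort only the k winners

rng = np.random.default_rng(0)
catalog = rng.normal(size=(10_000, 32)).astype(np.float32)  # precomputed item vectors
user_vec = rng.normal(size=32).astype(np.float32)           # computed at query time
best = top_k_inner_product(user_vec, catalog, k=5)
```

The scan is linear in the catalog size, which is exactly the cost an ANN index is built to avoid.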

Core Concepts

Architecture Overview

The Training Objective

Two-tower models are typically trained with a softmax cross-entropy loss over the full item catalog, or an approximation of it. Given a user $u$ and a positive item $v^+$ (an item the user interacted with), the model should assign a higher score to $v^+$ than to all negative items.

The score is the inner product (or cosine similarity) between the user and item vectors:

$s(u, v) = \langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^T \mathbf{v}$

The softmax loss over the full catalog is:

$\mathcal{L} = -\log \frac{\exp(s(u, v^+))}{\sum_{v \in \mathcal{V}} \exp(s(u, v))}$

The denominator sums over all $|\mathcal{V}|$ items - at YouTube scale, this is 800 million terms. Computing this exactly is intractable. In practice, this is approximated using in-batch negative sampling: the negatives for each positive pair are the other items in the same training batch.
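The in-batch approximation can be sketched in NumPy as follows. This is an illustrative version under simple assumptions (row $i$ of each matrix forms the positive pair, no temperature, no bias correction); the function name is hypothetical.

```python
import numpy as np

def in_batch_softmax_loss(user_vecs: np.ndarray, item_vecs: np.ndarray) -> float:
    """Mean of -log softmax over the batch; row i's positive is column i."""
    logits = user_vecs @ item_vecs.T                      # (B, B) matrix of s(u_i, v_j)
    logits = logits - logits.max(axis=1, keepdims=True)   # stabilise exp()
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
users = rng.normal(size=(8, 16))
aligned_items = users.copy()              # each item matches its user: easy positives
random_items = rng.normal(size=(8, 16))   # unrelated items

loss_aligned = in_batch_softmax_loss(users, aligned_items)
loss_random = in_batch_softmax_loss(users, random_items)
```

The denominator now has $B$ terms instead of $|\mathcal{V}|$, so the cost per step depends on the batch size rather than the catalog size.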

Sampling Bias Correction

In-batch negatives introduce a sampling bias. Popular items appear more frequently as negatives simply because they appear more often in training data. The model learns to push popular items away from all users, even users who would genuinely enjoy them. This is the popularity bias problem in negative sampling.

The Yi et al. (2019) correction adjusts the logit for item $v$ by subtracting the log of its sampling probability:

$s_{\text{corrected}}(u, v) = s(u, v) - \log p(v)$

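where $p(v)$ is the probability that item $v$ lands in a random batch, estimated from a streaming frequency counter. Frequently sampled items get a lower logit inside the training softmax, which offsets how often they are penalized as negatives. The correction applies only to the training loss; serving still ranks by the uncorrected $s(u, v)$.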
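A minimal sketch of the correction applied to an in-batch loss, assuming the sampling probabilities are already available in a dictionary. The probabilities, item IDs, and function name below are made up for illustration, and the streaming estimator itself is not shown.

```python
import numpy as np

def logq_corrected_loss(logits: np.ndarray, item_ids: np.ndarray,
                        sampling_prob: dict) -> float:
    """In-batch softmax loss with s(u, v) - log p(v) applied to every item column."""
    log_p = np.log([sampling_prob[int(i)] for i in item_ids])  # (B,) one value per column
    corrected = logits - log_p[None, :]                        # subtract log p(v_j) from column j
    corrected = corrected - corrected.max(axis=1, keepdims=True)
    log_probs = corrected - np.log(np.exp(corrected).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Hypothetical batch: item 7 is very popular, items 3 and 9 are rare.
item_ids = np.array([7, 3, 9])
p = {7: 0.30, 3: 0.01, 9: 0.01}     # assumed output of a frequency counter
logits = np.zeros((3, 3))           # identical raw scores, to isolate the correction

corrected = logits - np.log([p[int(i)] for i in item_ids])[None, :]
loss = logq_corrected_loss(logits, item_ids, p)
```

With equal raw scores, the popular item's column ends up with the smallest corrected logit, so it contributes least to each row's softmax denominator.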

Hard Negative Mining

Random negatives are easy. After a few epochs, the model trivially separates positive pairs from random negatives and stops learning useful representations. Hard negatives are items that score high for a user but are not positive interactions - items the user saw but did not click, or semantically similar but contextually wrong items.

Hard negative mining strategies:

Offline hard negatives: After each training epoch, run the current model on the full catalog and retrieve the top-K items for each user. Items in the top-K that are not positive interactions become hard negatives for the next epoch. This is expensive but effective.

Online hard negatives: Within a batch, the hardest negative for user $u$ is the item $v^-$ with the highest score that is not a positive:

$v^-_{\text{hard}} = \arg\max_{v \neq v^+} s(u, v)$

Semi-hard negatives (from FaceNet): Negatives that are farther than the positive but still within the margin - they provide gradient signal without being so hard that they destabilize training.
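The online and semi-hard rules can be sketched on an in-batch score matrix. This is an illustrative version, assuming the diagonal holds the positive pairs and translating the FaceNet distance rule into score space (a semi-hard negative scores below the positive but within `margin` of it); the function name and margin value are assumptions.

```python
import numpy as np

def mine_in_batch_negatives(scores: np.ndarray, margin: float = 0.2):
    """scores[i, j] = s(u_i, v_j); the diagonal holds the positive pairs.

    Returns (hardest, semi_hard):
      hardest[i]   column index of the highest-scoring non-positive item
      semi_hard    boolean mask, True where pos - margin < score < pos
    """
    b = scores.shape[0]
    pos = np.diag(scores)[:, None]                 # (B, 1) positive score per row
    off_diag = ~np.eye(b, dtype=bool)
    masked = np.where(off_diag, scores, -np.inf)   # never pick the positive itself
    hardest = masked.argmax(axis=1)
    semi_hard = off_diag & (scores < pos) & (scores > pos - margin)
    return hardest, semi_hard

scores = np.array([
    [0.90, 0.80, 0.10],   # user 0: item 1 is semi-hard (0.70 < 0.80 < 0.90)
    [0.20, 0.60, 0.95],   # user 1: item 2 outscores the positive (hard, not semi-hard)
    [0.05, 0.10, 0.70],   # user 2: every negative is easy
])
hardest, semi_hard = mine_in_batch_negatives(scores)
```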

warning

Hard negative mining too aggressively can destabilize training. If the model is not yet good, the "hardest" negatives may be nearly indistinguishable from positives, producing noisy gradient signals that hurt convergence. Start with random negatives, warm up the model, then introduce hard negatives after 2-3 epochs.

Embedding Dimensionality Tradeoffs

The embedding dimension $d$ controls the capacity of the shared representation space. Larger $d$ allows the model to represent more nuanced relationships, but it also increases:

Memory: A flat float32 index of 800M items at $d=256$ needs $800\text{M} \times 256 \times 4$ bytes $\approx$ 819 GB. At $d=64$, this drops to about 205 GB.
Latency: Inner product computation scales linearly with $d$. ANN search time scales with $d$ as well.
Index build time: Quantizer training and IVF assignment grow at least linearly in $n \cdot d$; k-means adds further factors for the number of centroids and iterations.
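The memory figures above are plain arithmetic, reproduced here as a small sketch. The helper name is hypothetical, and the numbers cover raw vector storage only, without index overhead.

```python
def flat_index_bytes(n_items: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw storage for n_items uncompressed vectors (float32 by default)."""
    return n_items * dim * bytes_per_value

N_ITEMS = 800_000_000
for dim in (64, 128, 256):
    gb = flat_index_bytes(N_ITEMS, dim) / 1e9   # decimal gigabytes
    print(f"d={dim}: {gb:.1f} GB")
# d=64: 204.8 GB
# d=128: 409.6 GB
# d=256: 819.2 GB
```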

Practical guidance from production deployments:

$d=64$: sufficient for most item retrieval tasks with simple categorical features
$d=128$: standard for dense retrieval, good quality-memory balance
$d=256$: used when items have rich multi-modal features (text + visual)
$d=512+$: research setting; rarely justified in production

Approximate Nearest Neighbor Search

Once item embeddings are computed, they are indexed in an ANN structure. The query is: given user vector $\mathbf{u}$, find the $k$ item vectors with the highest inner product score.

FAISS (Facebook AI Similarity Search): The dominant library. Supports:

IndexFlatIP: exact inner product search - correct but $O(n)$
IndexIVFFlat: inverted file index - partitions space into Voronoi cells, only searches nearest cells. Approximate. Controlled by nlist (cells) and nprobe (cells searched at query time).
IndexIVFPQ: adds product quantization to compress vectors - reduces memory by 8-32x with ~5-10% recall loss.
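To make the roles of nlist and nprobe concrete, here is a toy inverted-file index in plain NumPy. It is a sketch of the idea only, not how FAISS implements it: the centroids come from a few k-means-style iterations, cells are assigned by inner product, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
items = rng.normal(size=(5_000, 16))
items /= np.linalg.norm(items, axis=1, keepdims=True)   # unit vectors, as the towers emit
query = items[0] + 0.1 * rng.normal(size=16)            # a query near a known item

def build_ivf(x: np.ndarray, nlist: int, n_iter: int = 5):
    """Toy IVF: k-means-style centroids plus one inverted list of row ids per cell."""
    centroids = x[rng.choice(len(x), size=nlist, replace=False)].copy()
    for _ in range(n_iter):
        assign = (x @ centroids.T).argmax(axis=1)       # best cell by inner product
        for c in range(nlist):
            members = x[assign == c]
            if len(members):                            # keep the old centroid if the cell is empty
                centroids[c] = members.mean(axis=0)
    assign = (x @ centroids.T).argmax(axis=1)
    lists = [np.flatnonzero(assign == c) for c in range(nlist)]
    return centroids, lists

def ivf_search(q, x, centroids, lists, k: int, nprobe: int) -> np.ndarray:
    """Scan only the nprobe cells whose centroids score highest against q."""
    cells = np.argsort(-(centroids @ q))[:nprobe]
    cand = np.concatenate([lists[c] for c in cells])
    order = np.argsort(-(x[cand] @ q))[:k]
    return cand[order]

centroids, lists = build_ivf(items, nlist=64)
exact = set(np.argsort(-(items @ query))[:10].tolist())
recall = {
    nprobe: len(exact & set(ivf_search(query, items, centroids, lists, 10, nprobe).tolist())) / 10
    for nprobe in (1, 4, 16, 64)
}
```

Probing every cell (nprobe equal to nlist) recovers the exact result; smaller values scan fewer candidates and can miss neighbors that fall in unprobed cells.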

ScaNN (Scalable Nearest Neighbors, Google): Optimized for maximum inner product search (MIPS) specifically. Uses anisotropic quantization that minimizes the error in the direction that matters most for ranking. Typically 30-50% faster than FAISS for MIPS tasks.

HNSW (Hierarchical Navigable Small World): Graph-based ANN. Builds a multi-layer proximity graph over the embeddings. Excellent recall at low latency. Good for smaller catalogs (up to ~50M items). Memory-intensive, because it stores full vectors plus graph links, and slow to build, but it needs no separate training step and supports real-time insertions.

Method	Build Time	Query Latency	Recall	Memory	Online Updates
Exact (FAISS Flat)	Fast	High ($O(n)$)	100%	High	Yes
IVF-PQ	Medium	Low	90-95%	Low	Limited (adds only; centroids go stale)
HNSW	Slow	Very Low	95-99%	High	Yes (inserts)
ScaNN	Medium	Very Low	92-97%	Low	No

Code Examples

Two-Tower Model in PyTorch

import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Dict, List


class UserTower(nn.Module):
    """Encodes user context into a dense embedding vector."""

    def __init__(
        self,
        user_vocab_size: int,
        item_vocab_size: int,
        embed_dim: int = 64,
        output_dim: int = 128,
        dropout: float = 0.1,
    ):
        super().__init__()
        # User ID embedding
        self.user_embed = nn.Embedding(user_vocab_size, embed_dim, padding_idx=0)
        # Watch history item embeddings (same space as item tower input)
        self.history_embed = nn.Embedding(item_vocab_size, embed_dim, padding_idx=0)

        # MLP to produce final user vector
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim * 2, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(256, output_dim),
        )

    def forward(self, user_ids: torch.Tensor, watch_history: torch.Tensor) -> torch.Tensor:
        """
        Args:
            user_ids: (batch,) int tensor
            watch_history: (batch, seq_len) int tensor, padded with 0
        Returns:
            user_vec: (batch, output_dim) L2-normalized user embedding
        """
        user_vec = self.user_embed(user_ids)  # (batch, embed_dim)

        # Mean-pool over watch history, ignoring padding
        mask = (watch_history != 0).float()  # (batch, seq_len)
        history_embs = self.history_embed(watch_history)  # (batch, seq_len, embed_dim)
        history_vec = (history_embs * mask.unsqueeze(-1)).sum(dim=1)
        history_vec = history_vec / (mask.sum(dim=1, keepdim=True).clamp(min=1))

        combined = torch.cat([user_vec, history_vec], dim=-1)
        out = self.mlp(combined)
        return F.normalize(out, p=2, dim=-1)  # L2 normalize for cosine similarity


class ItemTower(nn.Module):
    """Encodes item features into a dense embedding vector."""

    def __init__(
        self,
        item_vocab_size: int,
        category_vocab_size: int,
        embed_dim: int = 64,
        output_dim: int = 128,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.item_embed = nn.Embedding(item_vocab_size, embed_dim, padding_idx=0)
        self.category_embed = nn.Embedding(category_vocab_size, embed_dim // 2, padding_idx=0)

        self.mlp = nn.Sequential(
            nn.Linear(embed_dim + embed_dim // 2 + 1, 256),  # +1 for upload_age
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(256, output_dim),
        )

    def forward(
        self,
        item_ids: torch.Tensor,
        category_ids: torch.Tensor,
        upload_age_days: torch.Tensor,
    ) -> torch.Tensor:
        """
        Returns:
            item_vec: (batch, output_dim) L2-normalized item embedding
        """
        item_vec = self.item_embed(item_ids)
        cat_vec = self.category_embed(category_ids)
        age = upload_age_days.float().unsqueeze(-1)

        combined = torch.cat([item_vec, cat_vec, age], dim=-1)
        out = self.mlp(combined)
        return F.normalize(out, p=2, dim=-1)


class TwoTowerModel(nn.Module):
    """Full two-tower model with in-batch negative training."""

    def __init__(self, user_tower: UserTower, item_tower: ItemTower, temperature: float = 0.05):
        super().__init__()
        self.user_tower = user_tower
        self.item_tower = item_tower
        self.temperature = temperature

    def forward(
        self,
        user_ids: torch.Tensor,
        watch_history: torch.Tensor,
        item_ids: torch.Tensor,
        category_ids: torch.Tensor,
        upload_age_days: torch.Tensor,
    ) -> torch.Tensor:
        """
        Compute in-batch softmax loss.
        Positive pairs: (user[i], item[i])
        Negatives: all other items in the batch
        """
        user_vecs = self.user_tower(user_ids, watch_history)  # (B, d)
        item_vecs = self.item_tower(item_ids, category_ids, upload_age_days)  # (B, d)

        # Similarity matrix: (B, B) - diagonal is positive pairs
        logits = torch.matmul(user_vecs, item_vecs.T) / self.temperature  # (B, B)
        labels = torch.arange(len(user_ids), device=user_ids.device)

        # Cross-entropy: user[i] should be most similar to item[i]
        loss = F.cross_entropy(logits, labels)
        return loss

    def get_user_embedding(self, user_ids, watch_history):
        with torch.no_grad():
            return self.user_tower(user_ids, watch_history)

    def get_item_embedding(self, item_ids, category_ids, upload_age_days):
        with torch.no_grad():
            return self.item_tower(item_ids, category_ids, upload_age_days)

Building the FAISS Index

import faiss
import numpy as np
from tqdm import tqdm


def build_faiss_index(
    item_embeddings: np.ndarray,
    embedding_dim: int = 128,
    use_pq: bool = True,
    nlist: int = 4096,
    m_pq: int = 16,  # number of subquantizers
) -> faiss.Index:
    """
    Build an IVF-PQ index for billion-scale approximate nearest neighbor search.

    Args:
        item_embeddings: (N, d) float32 array of item embeddings
        embedding_dim: dimensionality of embeddings
        use_pq: if True, use product quantization to compress vectors
        nlist: number of Voronoi cells for IVF
        m_pq: number of PQ subquantizers (must divide embedding_dim)
    """
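    n_items = item_embeddings.shape[0]
    # FAISS expects C-contiguous float32 input with the declared dimensionality.
    item_embeddings = np.ascontiguousarray(item_embeddings, dtype=np.float32)
    assert item_embeddings.shape[1] == embedding_dim, "embedding_dim mismatch"

    quantizer = faiss.IndexFlatIP(embedding_dim)
    if use_pq:
        # IVF-PQ: partitioned into nlist cells, each vector compressed to m_pq bytes.
        # Pass the metric explicitly: IndexIVFPQ defaults to L2, not inner product.
        index = faiss.IndexIVFPQ(
            quantizer, embedding_dim, nlist, m_pq, 8, faiss.METRIC_INNER_PRODUCT
        )
    else:
        # IVF-Flat: no compression, higher memory, higher recall
        index = faiss.IndexIVFFlat(quantizer, embedding_dim, nlist, faiss.METRIC_INNER_PRODUCT)

    # Move to GPU if available
    if faiss.get_num_gpus() > 0:
        res = faiss.StandardGpuResources()
        index = faiss.index_cpu_to_gpu(res, 0, index)

    # Train the index (learns the Voronoi cell centroids, plus PQ codebooks if use_pq).
    # Sample at random: the first rows of the array may not be representative.
    n_train = min(n_items, 100_000)
    print(f"Training index on {n_train} samples...")
    train_ids = np.random.default_rng(0).choice(n_items, size=n_train, replace=False)
    index.train(item_embeddings[train_ids])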

    # Add all item embeddings
    print(f"Adding {n_items} item embeddings to index...")
    batch_size = 100_000
    for i in tqdm(range(0, n_items, batch_size)):
        index.add(item_embeddings[i : i + batch_size])

    return index


def retrieve_top_k(
    index: faiss.Index,
    user_embedding: np.ndarray,
    k: int = 500,
    nprobe: int = 64,  # number of cells to search - higher = more accurate, slower
) -> tuple:
    """
    Retrieve top-k items for a user embedding.

    Returns:
        distances: (k,) similarity scores
        item_indices: (k,) indices into the item catalog
    """
    if hasattr(index, "nprobe"):
        index.nprobe = nprobe

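    # FAISS expects a 2-D, C-contiguous float32 query matrix.
    user_embedding = np.ascontiguousarray(user_embedding, dtype=np.float32)
    if user_embedding.ndim == 1:
        user_embedding = user_embedding[np.newaxis, :]  # Add batch dimension

    # Slots FAISS could not fill (too few candidates) come back with index -1.
    distances, indices = index.search(user_embedding, k)
    return distances[0], indices[0]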

Offline Evaluation: Recall@K

from typing import Dict, List, Sequence  # np, faiss, TwoTowerModel, retrieve_top_k as defined above


def evaluate_recall_at_k(
    model: TwoTowerModel,
    # [{"user_ids": tensor, "watch_history": tensor, "positive_items": [...]}]
    test_interactions: List[Dict],
    item_index: faiss.Index,
    k_values: Sequence[int] = (10, 50, 100, 500),  # tuple: avoids a mutable default
) -> Dict[int, float]:
    """
    Compute Recall@K: fraction of positive items retrieved in top-K.

    For recommendation, this answers: "Of all the items a user would
    actually interact with, how many does the model surface in its
    top-K candidates?"

    Assumes FAISS row indices are the same ids used in "positive_items".
    """
    model.eval()
    recall_scores = {k: [] for k in k_values}
    max_k = max(k_values)

    for interaction in test_interactions:
        positive_set = set(interaction["positive_items"])
        if not positive_set:
            continue  # recall is undefined for a user with no positives

        user_embedding = model.get_user_embedding(
            interaction["user_ids"], interaction["watch_history"]
        ).cpu().numpy()  # .cpu() so this also works when the model lives on a GPU

        # Search once at max_k; each smaller k reuses a prefix of the result.
        _, retrieved_indices = retrieve_top_k(item_index, user_embedding, k=max_k)
        retrieved = retrieved_indices.tolist()

        for k in k_values:
            top_k_set = set(retrieved[:k])
            recall = len(positive_set & top_k_set) / len(positive_set)
            recall_scores[k].append(recall)

    return {k: float(np.mean(s)) if s else 0.0 for k, s in recall_scores.items()}

Production Engineering Notes

Embedding Index Update Strategy

Item embeddings must be kept fresh. New items should get embeddings within minutes of being added to the catalog, and embeddings should be recomputed as the model improves. Three update strategies:

Full rebuild (daily): Recompute all embeddings, rebuild the FAISS index from scratch. Simple, consistent. Works for catalogs up to ~50M items where rebuild takes under an hour. YouTube does this for less active catalog sections.

Incremental add: New items get embeddings computed immediately and added to the existing index. The index grows over time. Deletions require tombstoning (flag item as removed, filter from results). Requires a data structure that supports online inserts - HNSW is better than IVF-PQ for this.

Dual index with atomic swap: Maintain two indexes - serving and building. Build the new index offline. Swap atomically when complete. Zero-downtime updates. The standard approach for high-traffic systems.
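The dual-index pattern can be sketched with a small wrapper that holds a reference to the serving index. This is an illustrative outline, not a production design: `SwappableIndex`, `ToyIndex`, and `build_fn` are hypothetical names, and the only assumption about the index object is that it exposes a `search(query, k)` method.

```python
import threading
from typing import Callable

class SwappableIndex:
    """Serves from one index while a replacement is built, then swaps the reference."""

    def __init__(self, initial_index):
        self._index = initial_index
        self._lock = threading.Lock()
        self.version = 0

    def search(self, query, k: int):
        index = self._index          # take one reference; a swap cannot change it mid-query
        return index.search(query, k)

    def rebuild_and_swap(self, build_fn: Callable[[], object]) -> None:
        new_index = build_fn()       # slow part runs while the old index keeps serving
        with self._lock:             # serialise concurrent swaps
            self._index = new_index
            self.version += 1

class ToyIndex:
    """Stand-in for a real ANN index; returns the first k stored items."""
    def __init__(self, items):
        self.items = list(items)
    def search(self, query, k):
        return self.items[:k]

serving = SwappableIndex(ToyIndex(["a", "b", "c"]))
before = serving.search(None, 2)
serving.rebuild_and_swap(lambda: ToyIndex(["x", "y", "z"]))
after = serving.search(None, 2)
```

Queries that started before the swap finish against the old index, which is why the old index should only be freed once in-flight requests have drained.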

Serving Architecture

User Tower as a service: The user tower runs as a separate microservice. It accepts user context (user ID, recent history, current query) and returns a single embedding vector. This service is stateless and horizontally scalable. Cache user embeddings with a short TTL (30-60 seconds) - user context doesn't change that fast.
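A short-TTL cache in front of the user tower might look like the sketch below. The class, the `compute_fn` callback, and the injectable clock are illustrative assumptions; a real deployment would more likely use a shared cache service with eviction.

```python
import time
from typing import Callable, Dict, Hashable, Tuple

class TTLEmbeddingCache:
    """Caches user embeddings for ttl_seconds; recomputes on miss or expiry."""

    def __init__(self, compute_fn: Callable[[Hashable], list],
                 ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._compute = compute_fn
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[Hashable, Tuple[float, list]] = {}

    def get(self, user_id: Hashable) -> list:
        now = self._clock()
        hit = self._store.get(user_id)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]                      # still fresh
        embedding = self._compute(user_id)     # e.g. a call to the user-tower service
        self._store[user_id] = (now, embedding)
        return embedding

# Demo with a fake clock and a fake user tower that counts its calls.
calls = []
fake_now = [0.0]

def fake_user_tower(user_id):
    calls.append(user_id)
    return [float(len(calls))]

cache = TTLEmbeddingCache(fake_user_tower, ttl_seconds=30.0, clock=lambda: fake_now[0])
first = cache.get("u1")     # miss: computes
second = cache.get("u1")    # hit: served from cache
fake_now[0] = 31.0
third = cache.get("u1")     # expired: recomputes
```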

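ANN Index as a service: The index runs on instances with large memory. At 800M items and $d=128$, IVF-PQ with 16 subquantizers at 8 bits each stores 16 bytes of codes per vector: $800\text{M} \times 16$ bytes = 12.8 GB. FAISS IVF indexes also keep an 8-byte ID per vector, which adds about 6.4 GB, so budget roughly 19 GB plus centroids and overhead. That still fits on a single host with 32 GB RAM. Run multiple replicas behind a load balancer for availability.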

Consistency tradeoff: User tower and item index may be out of sync. A model update retrains both towers simultaneously - but the item index rebuild takes hours while the user tower is deployed in minutes. During the transition window, user embeddings from the new model are queried against item embeddings from the old model. This mismatch degrades quality. Coordinate deployments: deploy user tower + new index atomically.

Hard Negative Mining in Practice

def generate_hard_negatives(
    model: TwoTowerModel,
    item_index: faiss.Index,
    user_interactions: List[Dict],
    k_hard: int = 200,
    hard_negative_ratio: float = 0.5,
) -> List[Dict]:
    """
    For each user, retrieve top-K items and use non-positive items as hard negatives.
    Mix with random negatives to avoid training instability.
    """
    model.eval()  # disable dropout and use BatchNorm running statistics
    rng = np.random.default_rng()
    n_items = item_index.ntotal  # faiss.Index has no __len__; ntotal is the item count
    augmented_interactions = []

    for interaction in user_interactions:
        user_emb = model.get_user_embedding(
            interaction["user_ids"], interaction["watch_history"]
        ).cpu().numpy()

        _, top_k_indices = retrieve_top_k(item_index, user_emb, k=k_hard)
        positive_set = set(interaction["positive_items"])

        # Items in top-K that are NOT positive = hard negatives (-1 marks an empty slot)
        hard_negs = [
            int(idx) for idx in top_k_indices
            if idx >= 0 and int(idx) not in positive_set
        ]

        n_hard = int(len(hard_negs) * hard_negative_ratio)
        n_random = len(hard_negs) - n_hard
        selected_hard = hard_negs[:n_hard]
        # Sample with replacement: cheap even for huge catalogs. Then drop positives.
        random_ids = rng.integers(0, n_items, size=n_random).tolist()
        selected_random = [i for i in random_ids if i not in positive_set]

        augmented_interactions.append({
            **interaction,
            "hard_negatives": selected_hard,
            "random_negatives": selected_random,
        })

    return augmented_interactions

Common Mistakes

danger

Mistake: Sharing one set of encoder weights between the user tower and the item tower.

If you share weights between the user tower and the item tower, the model is forced to encode user context and item content with the same function, even though the two take different input features and mean different things. A user vector should represent "what this user likes" while an item vector should represent "what this item is." The outputs must live in the same geometric space for dot-product comparison, but they should be produced by separate neural networks.

danger

Mistake: Not normalizing embeddings and using raw dot product.

Without L2 normalization, the dot product is dominated by the magnitude of the embedding vectors, not their direction. Items with large embedding norms (often popular items with lots of gradient updates) will rank artificially high regardless of relevance. Always L2-normalize output vectors from both towers before training and before indexing.
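A two-item toy example shows the effect. The vectors are made up for illustration: item 0 points almost exactly where the user points but has a small norm, while item 1 points elsewhere with a large norm.

```python
import numpy as np

user = np.array([1.0, 0.0])
items = np.array([
    [0.9, 0.1],   # item 0: well aligned with the user, small norm
    [3.0, 4.0],   # item 1: poorly aligned, norm 5
])

raw_scores = items @ user                                         # magnitude leaks into the score
unit_items = items / np.linalg.norm(items, axis=1, keepdims=True)
cosine_scores = unit_items @ user                                 # direction only

raw_winner = int(raw_scores.argmax())        # item 1 wins on norm alone
cosine_winner = int(cosine_scores.argmax())  # item 0 wins on alignment
```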

warning

Mistake: Ignoring sampling bias in in-batch negatives.

Popular items appear more often in training batches because they appear more often in interaction data. Without sampling bias correction, the model learns to push popular items down for all users - exactly backwards from what you want. Implement the Yi et al. (2019) log-frequency correction or use stratified sampling to ensure uniform negative distribution.

warning

Mistake: Setting nprobe too low during ANN search.

The nprobe parameter controls how many Voronoi cells are searched during IVF retrieval. A low nprobe is fast but misses items that lie near cell boundaries. A typical mistake is tuning nprobe at development time on a small index (where it doesn't matter much) and forgetting to re-tune on the production index. As an illustration, on a 4096-cell IVF index nprobe=64 usually gives good recall, while nprobe=1 can cut recall sharply (for example, from 95% to 60%). Measure the curve on your own index.

tip

Tip: Treat temperature as a training-only knob.

The softmax temperature $\tau$ in training controls how sharply the model discriminates between positives and negatives. A lower $\tau$ creates a harder, more focused learning signal. At serving time, the raw dot product score (without temperature scaling) is used for ranking. Train with $\tau \in [0.05, 0.1]$ for sharp learning, and skip temperature scaling at inference: dividing every score by the same positive constant does not change the ranking order.
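A quick numerical check of that claim, using random scores as a stand-in for real model output (the scores and the value of `tau` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.uniform(-1.0, 1.0, size=1000)   # cosine scores for 1,000 candidate items
tau = 0.05

rank_raw = np.argsort(-scores)
rank_scaled = np.argsort(-(scores / tau))    # dividing by tau > 0 is monotonic

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

top_prob_raw = softmax(scores).max()
top_prob_sharp = softmax(scores / tau).max() # same winner, far more peaked distribution
```

The ranking is identical with and without scaling; only the softmax distribution used by the training loss changes shape.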

Interview Q&A

Q: Why does a two-tower model need two separate neural networks instead of one network that takes both user and item features as input?

A: The separation is motivated by inference efficiency, not model quality. A single network that jointly processes user and item features (a "cross-encoder") can capture richer interactions - it can model how a specific feature of user A interacts with a specific attribute of item B. But at serving time, you cannot precompute anything: for each user query, you must run the full network against every item in the catalog. At 800M items and 10ms budget, this is physically impossible. Two-tower trades some modeling quality for the ability to precompute item representations offline and perform only a dot product at query time. The dot product is a bilinear interaction - it captures that user vector dimension $i$ multiplies item vector dimension $i$ - which is weaker than a cross-encoder but sufficient for candidate generation. The rich cross-attention interactions are reserved for a subsequent ranking stage that operates on a small candidate set.

Q: How do you handle the cold start problem in a two-tower system?

A: Cold start manifests differently for users vs. items. For new items, the item tower takes content features (title, category, thumbnail visual embeddings) as input, so it produces a meaningful embedding even with zero interaction history. The embedding will be noisy but not zero. For new users, the user tower can fall back to demographic features (age bracket, location, device type) or a global popularity-based embedding when watch history is empty. A common production trick is to maintain a "new user" bucket - users with fewer than 10 interactions - and serve a lightweight popularity-plus-demographic model until sufficient history is accumulated to trust the personalized two-tower output.

Q: What is the difference between recall@K and precision@K, and which matters more for the retrieval stage?

A: Recall@K measures the fraction of relevant items that appear in the top-K results. Precision@K measures the fraction of the top-K results that are relevant. For a retrieval/candidate generation stage that feeds a downstream ranker, recall@K is the primary metric. The job of retrieval is to make sure relevant items are in the candidate set - the ranker will sort out the order. If recall@500 is 80%, the downstream ranker starts from a set where 80% of genuinely relevant items are present. If precision@500 is only 10% (50 relevant out of 500), that's fine - the ranker handles filtering. Optimizing precision at retrieval is premature; it limits recall unnecessarily.
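The two metrics differ only in the denominator, as this small sketch shows. The function names and the toy user are illustrative.

```python
from typing import Sequence, Set

def recall_at_k(retrieved: Sequence[int], relevant: Set[int], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: Sequence[int], relevant: Set[int], k: int) -> float:
    """Fraction of the top-k that is relevant."""
    if k <= 0:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / k

# Hypothetical user: 5 relevant items, 4 of them surfaced among 500 candidates.
relevant = {3, 42, 77, 120, 9_999}
retrieved = list(range(500))      # candidate ids 0..499; item 9_999 was missed

r = recall_at_k(retrieved, relevant, k=500)      # 4 / 5
p = precision_at_k(retrieved, relevant, k=500)   # 4 / 500
```

Recall is high while precision is under 1%, which is the expected profile for a candidate generator feeding a ranker.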

Q: How do you evaluate a two-tower model offline before deploying it?

A: Standard offline evaluation computes recall@K and mean reciprocal rank (MRR) on a held-out test set of user-item interactions. The procedure: (1) split interactions chronologically - train on older interactions, test on newer ones; (2) for each test user, retrieve top-K items with the model; (3) check if the user's held-out positive items appear in the top-K. Important subtlety: restrict the main test set to items that were present in the training catalog; otherwise the metric mixes in cold-start performance, which is better measured separately. Also measure coverage - the fraction of catalog items that ever appear in any user's top-500 - to detect filter bubble effects.
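MRR and coverage can be sketched as below. The data structures (a dict of ranked lists per user and a dict of held-out positives) are assumptions made for the example.

```python
from typing import Dict, List, Set

def mean_reciprocal_rank(ranked: Dict[str, List[int]],
                         held_out: Dict[str, Set[int]]) -> float:
    """Average of 1/rank of the first held-out positive; 0 when none is retrieved."""
    total = 0.0
    for user, items in ranked.items():
        positives = held_out.get(user, set())
        rank = next((i + 1 for i, item in enumerate(items) if item in positives), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(ranked) if ranked else 0.0

def catalog_coverage(ranked: Dict[str, List[int]], catalog_size: int) -> float:
    """Fraction of the catalog that shows up in at least one user's list."""
    seen = set()
    for items in ranked.values():
        seen.update(items)
    return len(seen) / catalog_size

ranked = {"u1": [5, 2, 9], "u2": [2, 7, 1], "u3": [4, 5, 6]}
held_out = {"u1": {2}, "u2": {2}, "u3": {8}}

mrr = mean_reciprocal_rank(ranked, held_out)           # (1/2 + 1/1 + 0) / 3
coverage = catalog_coverage(ranked, catalog_size=10)   # 7 distinct items out of 10
```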

Q: A two-tower model is deployed and performing well offline, but online A/B metrics show no improvement. What do you investigate?

A: Several hypotheses in priority order. First, check if the downstream ranker is the bottleneck - if the ranker reorders the top-500 candidates aggressively, a better candidate set may not translate to better final results. Second, check the recall@K of the old vs. new model on items that actually ended up being clicked - are the new model's mistakes in a different distribution than the old model's? Third, check for a position bias issue: if the ranker gives heavy weight to the position of items in the candidate list, and the new model changes ordering within the top-500, the ranker may penalize items it doesn't recognize. Fourth, look at the cold-start segment specifically - two-tower often improves cold-start quality, which shows up in user segments that the aggregated A/B metric dilutes.

Q: What are the tradeoffs between FAISS IVF-PQ and HNSW for production ANN search?

A: IVF-PQ is better when: the catalog is very large (more than 100M items), memory is constrained (PQ compresses vectors 8-32x), and the catalog is relatively static (IVF requires an offline build step; online adds degrade index quality). HNSW is better when: the catalog changes frequently (supports real-time inserts without full rebuild), recall requirements are very high (at around 95% recall, HNSW is typically faster than IVF-PQ), and memory is available (HNSW stores the full vectors plus the graph structure). In practice, many large-scale production systems use IVF-PQ for the main catalog and HNSW for a fresh-content index covering recently-added items.
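When a main index and a fresh-content index are queried side by side, their results have to be merged before ranking. The sketch below assumes each index returns `(score, item_id)` pairs and that the scores are comparable; with PQ the main index's scores are approximate, so a real system may re-score the merged candidates with exact vectors. All names are illustrative.

```python
import heapq
from typing import Dict, List, Tuple

Hit = Tuple[float, int]   # (score, item_id)

def merge_top_k(main_hits: List[Hit], fresh_hits: List[Hit], k: int) -> List[Hit]:
    """Merge two result lists by score, keeping the best score per item id."""
    best: Dict[int, float] = {}
    for score, item_id in main_hits + fresh_hits:
        if item_id not in best or score > best[item_id]:
            best[item_id] = score
    return heapq.nlargest(k, ((s, i) for i, s in best.items()))

main_hits = [(0.91, 10), (0.85, 11), (0.60, 12)]       # from the large, static index
fresh_hits = [(0.88, 900), (0.85, 11), (0.40, 901)]    # from the small, recent-items index
merged = merge_top_k(main_hits, fresh_hits, k=3)
```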
