Personalization at Scale
The Invisible Merchandiser
Walk into a physical Target store in Austin and walk into one in Boston. The same brand, the same layout, the same products. The merchandising team made decisions months ago about what goes where. Everyone gets the same store.
Open the Target app. You see something entirely different from the person next to you.
That difference is built from hundreds of signals: what you clicked three weeks ago, what you searched but did not buy, what items you added to a list but never checked out, what the 200 people most similar to you purchased last weekend. The "invisible merchandiser" working inside the app knows more about your preferences than any human retail associate could.
Amazon reports that 35% of its revenue comes from its recommendation engine. Netflix credits its recommendation system with $1 billion per year in retained subscription revenue through reduced churn. Spotify's "Discover Weekly" - a recommendation product - was the fastest-growing feature in the company's history, reaching 40 million listeners within months of launch. The financial case for personalization at scale is not theoretical.
But building these systems is not as simple as "train a collaborative filter and ship it." The core technical challenges - retrieval at billion-item scale, real-time feature freshness, cold start for new users, privacy constraints post-GDPR - require an entire stack of carefully designed systems. This lesson walks through that stack end to end.
Why This Exists
The catalog problem: modern retailers carry hundreds of thousands to millions of items. A user visiting the homepage has maybe 5 seconds of attention. The system must surface the items most likely to convert among millions of candidates - in real time.
No human can curate this. Rule-based systems ("show popular items in user's most purchased category") capture maybe 20% of the relevant signal. The remaining 80% - the subtle interaction between your past behavior, current context, and item attributes - requires learning from data.
The financial argument is direct. Personalized recommendations increase average order value by 15-25%. They reduce bounce rate (users who find relevant content faster stay longer). They increase repeat purchase rate (users who discover items they love come back). For a retailer with $1.5B in annual revenue, even a one-point improvement in any of these metrics is worth millions of dollars a year.
The technical case is about navigating the retrieval-ranking tradeoff. You cannot run a deep neural network on every item in the catalog for every user request - the latency would be seconds, not milliseconds. You need a two-phase approach: fast retrieval of hundreds of candidates, then slower ranking of those candidates. The architecture of modern recommendation systems flows directly from this constraint.
Historical Context
Recommendation systems have a well-documented history driven by public competitions.
The Netflix Prize (2006-2009) was the defining event. Netflix offered $1 million to any team that could improve their existing recommendation system's RMSE by 10%. The winning solution - BellKor's Pragmatic Chaos - used an ensemble of over 100 models including matrix factorization variants. The competition produced Singular Value Decomposition++ (SVD++) and sparked a decade of academic research into collaborative filtering.
Yehuda Koren's matrix factorization work from this era (particularly his 2009 paper "Matrix Factorization Techniques for Recommender Systems") became the standard. The key insight: instead of computing user-item similarity in the original item space, learn low-dimensional latent embeddings for both users and items such that their dot product predicts rating.
Alternating Least Squares (ALS) made this scalable. Instead of joint optimization (non-convex), fix user vectors and optimize item vectors (least squares, convex), then swap. This was parallelizable and became the workhorse of Spark MLlib.
The deep learning revolution hit recommendations around 2016-2018. Google's Wide and Deep (2016) combined a deep neural network with a linear model for app recommendations. YouTube's two-tower architecture (Covington et al., 2016) set the pattern still used today. Facebook's DLRM (2019) showed how to handle massive categorical feature spaces with embeddings.
The current state-of-the-art for retail specifically: dense retrieval with two-tower models (fast approximate nearest neighbor search), followed by a ranking model using gradient boosting or a transformer-based model trained on historical click-through and purchase data.
Core Concepts
The Two-Phase Architecture
Every large-scale recommendation system separates retrieval from ranking. This is not a choice - it is an engineering necessity.
Phase 1 - Retrieval (Candidate Generation):
- Input: current user context
- Goal: narrow 10 million items to 100-500 candidates
- Latency budget: 10-50ms
- Method: approximate nearest neighbor (ANN) search on learned embeddings
Phase 2 - Ranking:
- Input: 100-500 candidates + rich user and item features
- Goal: score each candidate and return top-K
- Latency budget: 50-200ms
- Method: gradient boosted trees or neural network with full feature set
The two phases have different technical requirements, different training objectives, and often different teams. Understanding both is essential.
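In code, the two phases compose into a single request path. A minimal sketch of that composition - `retrieve_candidates` and `rank` are placeholder stubs standing in for the components built later in this lesson:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    item_id: str
    score: float

# Placeholder stages - stand-ins for the ANN retrieval and ranking
# components implemented later in this lesson.
def retrieve_candidates(user_id: str, context: dict, n: int) -> List[Candidate]:
    return [Candidate(item_id=f'item_{i}', score=0.0) for i in range(n)]

def rank(candidates: List[Candidate], user_id: str, context: dict) -> List[Candidate]:
    return candidates  # a real ranker re-scores candidates with the full feature set

def recommend(user_id: str, context: dict, top_k: int = 20) -> List[Candidate]:
    candidates = retrieve_candidates(user_id, context, n=300)  # phase 1: ~10-50ms
    scored = rank(candidates, user_id, context)                # phase 2: ~50-200ms
    return sorted(scored, key=lambda c: c.score, reverse=True)[:top_k]
```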
Collaborative Filtering: The Foundation
Collaborative filtering is built on a simple premise: users who agreed in the past tend to agree in the future. You do not need to understand why a user likes something - you just need to find similar users or similar items.
Matrix Factorization formulation:
Given a user-item interaction matrix $R \in \mathbb{R}^{m \times n}$ (ratings, or implicit signals like clicks), find low-rank matrices $U \in \mathbb{R}^{m \times k}$ (user embeddings) and $V \in \mathbb{R}^{n \times k}$ (item embeddings) such that:

$$\hat{r}_{ui} = u_u^\top v_i \approx r_{ui}$$

For implicit feedback (binary interactions - purchased or not), use Weighted Matrix Factorization:

$$\min_{U, V} \sum_{u,i} c_{ui} \left( p_{ui} - u_u^\top v_i \right)^2 + \lambda \left( \lVert U \rVert_F^2 + \lVert V \rVert_F^2 \right)$$

where $c_{ui}$ is a confidence weight (items the user interacted with multiple times get higher weight) and $p_{ui} \in \{0, 1\}$ indicates whether user $u$ interacted with item $i$.
ALS optimization: Fix $V$, solve for $U$ in closed form (least squares). Fix $U$, solve for $V$. Repeat until convergence. This alternation makes the non-convex joint problem tractable.
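A minimal numpy sketch of one ALS half-step for the weighted objective above - illustrative only, not the optimized Spark implementation, which exploits sparsity instead of materializing a per-user diagonal confidence matrix:

```python
import numpy as np

def als_user_step(R: np.ndarray, C: np.ndarray, V: np.ndarray, reg: float = 0.1) -> np.ndarray:
    """Solve for all user vectors in closed form with item factors V fixed.

    R: (m, n) binary preference matrix, C: (m, n) confidence weights,
    V: (n, k) item factors. Returns the updated (m, k) user factors.
    The symmetric item half-step swaps the roles of U and V.
    """
    m, n = R.shape
    k = V.shape[1]
    U = np.zeros((m, k))
    for u in range(m):
        Cu = np.diag(C[u])                      # per-item confidence for user u
        A = V.T @ Cu @ V + reg * np.eye(k)      # (k, k) normal equations
        b = V.T @ Cu @ R[u]                     # (k,)
        U[u] = np.linalg.solve(A, b)            # closed-form least squares
    return U
```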
Two-Tower Architecture
The two-tower model is the industry standard for scalable retrieval. The key insight: represent users and items in the same embedding space so retrieval becomes a nearest-neighbor search.
Query Tower (User Tower):
- Input: user_id embedding + recent interaction history + current context (device, time of day, current cart)
- Architecture: MLP or Transformer over interaction sequence
- Output: $d$-dimensional user embedding
Item Tower:
- Input: item_id embedding + item attributes (category, price, brand, image embedding)
- Architecture: MLP over item features
- Output: $d$-dimensional item embedding
Training objective: Given a user query and a positive item (one they clicked/purchased), maximize the inner product with the positive item while minimizing it with negative samples (items they did not interact with).
The training trick is in-batch negatives: for a batch of (user, positive_item) pairs, treat all other batch items as negatives for each user. This dramatically increases negative example efficiency.
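A sketch of the loss this produces - the TFRS Retrieval task used in the implementation below computes essentially this (plus optional corrections) internally:

```python
import tensorflow as tf

def in_batch_softmax_loss(user_emb: tf.Tensor, item_emb: tf.Tensor,
                          temperature: float = 0.07) -> tf.Tensor:
    """logits[i, j] = <user_i, item_j> / T for every pair in the batch.
    Row i's positive item sits on the diagonal; the other B-1 columns
    in that row serve as in-batch negatives."""
    logits = tf.matmul(user_emb, item_emb, transpose_b=True) / temperature
    labels = tf.range(tf.shape(logits)[0])  # positive item index = row index
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
    )
```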
At serving time:
- Compute user embedding with the query tower (real-time, takes current context)
- Look up pre-computed item embeddings (all items, computed offline)
- Use FAISS or ScaNN for approximate nearest neighbor search
- Return top-K items by inner product similarity
This separation is powerful: item embeddings are computed once and indexed. User embeddings are computed per-request but only require running the query tower.
Real-Time vs. Batch Features
Personalization quality depends critically on feature freshness. A user who just added a winter coat to their cart probably does not want another recommendation for a winter coat - but a recommendation for gloves, boots, or a scarf is highly relevant. This signal exists only in real-time.
Batch features (updated daily or hourly):
- User purchase history (last 30 days)
- User category preferences (derived from rolling history)
- Item popularity scores
- Item average rating
Real-time features (updated per-session or per-event):
- Current session items viewed (last 10 minutes)
- Current cart contents
- Current search query
- Time of day, device type
The feature store sits at the intersection. It pre-computes and caches expensive features (user purchase history) while serving freshly computed real-time features from an in-memory store like Redis. Tools like Feast, Tecton, and Vertex AI Feature Store manage this complexity.
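A minimal sketch of the serving-side merge, assuming batch features arrive from a precomputed lookup and session features live in Redis under an illustrative key scheme (this is not the Feast or Tecton API):

```python
import json
import redis

r = redis.Redis(host='localhost', port=6379)  # assumed real-time store

def assemble_features(user_id: str, session_id: str, batch_features: dict) -> dict:
    """Merge daily/hourly batch features with per-event session features.
    The 'session:<id>' key scheme is hypothetical."""
    raw = r.get(f'session:{session_id}')
    realtime = json.loads(raw) if raw else {}
    return {**batch_features, **realtime}  # fresher real-time values win on conflict
```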
Cold Start Solutions
New users have no interaction history. Cold start is not an edge case - every retailer acquires new users constantly, and first-session personalization determines whether those users return.
Level 1 - Context-only personalization: No history. Use what you have: device, location, time of day, referral source. A user arriving from a "summer dresses" Pinterest pin gets very different recommendations than one arriving directly on a mobile device at 11 PM.
Level 2 - Session context: As the user clicks or searches within their first session, rapidly update recommendations. After 2-3 clicks, you have enough signal for basic collaborative filtering: "users who also viewed these three items went on to buy..."
Level 3 - Explicit preference capture: Onboarding flows asking "What are you shopping for?" or "What brands do you like?" provide cold-start signal with minimal friction. Pinterest's onboarding interest selection is the canonical example.
Level 4 - Cross-platform signals: With user consent, purchase history from loyalty programs, email engagement, or mobile app behavior can be transferred to cold-start web sessions. This requires a customer identity resolution layer.
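For Level 2, a minimal sketch of session-based co-occurrence - recommend items that appeared alongside the current session's items in other users' sessions (the column names are assumptions):

```python
from collections import Counter
import pandas as pd

def cooccurrence_recs(session_items: list, sessions_df: pd.DataFrame,
                      top_k: int = 10) -> list:
    """sessions_df is assumed to have columns (session_id, item_id)."""
    # Sessions that touched any of the current session's items
    hits = sessions_df[sessions_df['item_id'].isin(session_items)]['session_id'].unique()
    # Everything else those sessions interacted with
    co_items = sessions_df[sessions_df['session_id'].isin(hits)]['item_id']
    counts = Counter(co_items[~co_items.isin(session_items)])
    return [item for item, _ in counts.most_common(top_k)]
```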
A/B Testing Personalization
The statistical challenge with A/B testing personalization is novelty effect: users in the treatment group (getting personalized recommendations) experience something new and engage more simply because it is different - regardless of its quality. This effect fades after a few sessions.
Best practice: run personalization experiments for at least 2-4 weeks with full user cohorts (not just session-level randomization), and use business metrics (revenue, retention, repeat purchase rate) rather than just click-through rate as the primary metric. CTR is easy to optimize superficially.
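User-level cohort assignment is usually a deterministic hash, so the same user stays in the same arm across sessions for the life of the experiment. A minimal sketch:

```python
import hashlib

def assign_cohort(user_id: str, experiment: str,
                  treatment_fraction: float = 0.5) -> str:
    """Same user + experiment always hashes to the same arm - which is
    what lets you observe novelty decay across a user's later sessions."""
    digest = hashlib.sha256(f'{experiment}:{user_id}'.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform in [0, 1]
    return 'treatment' if bucket < treatment_fraction else 'control'
```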
Privacy-Preserving Personalization
Post-GDPR and with third-party cookies deprecated, traditional cross-site user tracking is no longer viable. This creates a real constraint: you cannot build user profiles from behavior across multiple domains.
First-party data strategy: Focus entirely on signals from your own properties. Authenticated users are your most valuable asset - their behavior is consented, persisted, and legally usable. Build loyalty programs that incentivize login.
Federated learning: Train personalization models on-device. User embeddings stay on the device and are never sent to the server. Only gradient updates (or differentially private gradients) are aggregated centrally. Apple's on-device recommendation for App Store uses variants of this approach.
Contextual fallback: For anonymous users, use purely contextual signals (category page, search query, cart contents, time of day). Modern contextual models using transformer-based text understanding of search queries can recover substantial personalization quality without user history.
Practical Implementation
Two-Tower Model with TensorFlow Recommenders
```python
import tensorflow as tf
import tensorflow_recommenders as tfrs
import numpy as np
import pandas as pd
from typing import Dict, Text
# ============================================================
# 1. Data Preparation
# ============================================================
def prepare_interaction_data(
    interactions_df: pd.DataFrame,
    min_interactions_per_user: int = 5
) -> tf.data.Dataset:
    """
    interactions_df expected columns:
        user_id, item_id, interaction_type (click/purchase/cart),
        timestamp, session_id, category_id, price
    (category_id and price come from joining interactions with the item
     catalog - the item tower needs them at training time.)
    """
    # Filter cold users
    user_counts = interactions_df.groupby('user_id')['item_id'].count()
    active_users = user_counts[user_counts >= min_interactions_per_user].index
    df = interactions_df[interactions_df['user_id'].isin(active_users)].copy()
    # Weight by interaction type (purchase > cart > click)
    weight_map = {'purchase': 3.0, 'cart': 2.0, 'click': 1.0}
    df['weight'] = df['interaction_type'].map(weight_map).fillna(1.0)
    return tf.data.Dataset.from_tensor_slices({
        'user_id': df['user_id'].values.astype(str),
        'item_id': df['item_id'].values.astype(str),
        'category_id': df['category_id'].values.astype(np.int32),
        'log_price': np.log1p(df['price'].values).astype(np.float32),
        'weight': df['weight'].values.astype(np.float32),
    })
# ============================================================
# 2. User and Item Tower Definitions
# ============================================================
class UserTower(tf.keras.Model):
"""
Query tower: maps user_id + context to a dense embedding.
"""
def __init__(
self,
user_ids: list,
embedding_dim: int = 64,
hidden_units: list = [256, 128]
):
super().__init__()
self.embedding_dim = embedding_dim
# Learnable user embedding table
self.user_embedding = tf.keras.Sequential([
tf.keras.layers.StringLookup(
vocabulary=user_ids,
mask_token=None
),
tf.keras.layers.Embedding(
len(user_ids) + 1,
embedding_dim
),
])
# MLP to project to output space
layers = []
for units in hidden_units:
layers.extend([
tf.keras.layers.Dense(units, activation='relu'),
tf.keras.layers.Dropout(0.1),
])
layers.append(tf.keras.layers.Dense(embedding_dim))
self.mlp = tf.keras.Sequential(layers)
    def call(self, inputs):
        # Only the user ID is embedded here for brevity; a production query
        # tower would also encode the session context listed in the docstring.
        user_emb = self.user_embedding(inputs['user_id'])
        return self.mlp(user_emb)
class ItemTower(tf.keras.Model):
"""
Item tower: maps item_id + attributes to a dense embedding.
"""
def __init__(
self,
item_ids: list,
num_categories: int,
embedding_dim: int = 64,
hidden_units: list = [256, 128]
):
super().__init__()
self.embedding_dim = embedding_dim
# Item ID embedding
self.item_embedding = tf.keras.Sequential([
tf.keras.layers.StringLookup(
vocabulary=item_ids,
mask_token=None
),
tf.keras.layers.Embedding(
len(item_ids) + 1,
embedding_dim
),
])
# Category embedding
self.category_embedding = tf.keras.layers.Embedding(
num_categories + 1,
16
)
# MLP for combined features
layers = []
for units in hidden_units:
layers.extend([
tf.keras.layers.Dense(units, activation='relu'),
tf.keras.layers.Dropout(0.1),
])
layers.append(tf.keras.layers.Dense(embedding_dim))
self.mlp = tf.keras.Sequential(layers)
def call(self, inputs):
item_emb = self.item_embedding(inputs['item_id'])
cat_emb = self.category_embedding(inputs['category_id'])
# Concatenate item embedding with category and price features
price_feature = tf.expand_dims(inputs['log_price'], axis=-1)
combined = tf.concat([item_emb, cat_emb, price_feature], axis=-1)
return self.mlp(combined)
# ============================================================
# 3. Two-Tower Model with In-Batch Negatives
# ============================================================
class TwoTowerRetrievalModel(tfrs.Model):
"""
Two-tower model using in-batch negatives for training.
Uses TFRS Retrieval task for efficient batch training.
"""
def __init__(
self,
user_tower: UserTower,
item_tower: ItemTower,
item_dataset: tf.data.Dataset,
temperature: float = 0.07
):
super().__init__()
self.user_tower = user_tower
self.item_tower = item_tower
self.temperature = temperature
        # TFRS retrieval task: in-batch sampled softmax + top-K metrics.
        # Pass the temperature through so it actually scales the logits.
        self.task = tfrs.tasks.Retrieval(
            metrics=tfrs.metrics.FactorizedTopK(
                candidates=item_dataset.batch(128).map(item_tower)
            ),
            temperature=temperature
        )
    def compute_loss(self, features: Dict[Text, tf.Tensor], training=False):
        user_embeddings = self.user_tower(features)
        item_embeddings = self.item_tower(features)
        return self.task(
            user_embeddings,
            item_embeddings,
            sample_weight=features['weight'],  # purchase > cart > click
            compute_metrics=not training  # metrics are expensive; eval only
        )
# ============================================================
# 4. Training Pipeline
# ============================================================
def train_two_tower_model(
interactions_df: pd.DataFrame,
item_catalog_df: pd.DataFrame,
embedding_dim: int = 64,
epochs: int = 10,
batch_size: int = 2048
) -> TwoTowerRetrievalModel:
"""
Full training pipeline for the two-tower model.
"""
user_ids = interactions_df['user_id'].unique().astype(str).tolist()
item_ids = item_catalog_df['item_id'].unique().astype(str).tolist()
num_categories = item_catalog_df['category_id'].nunique()
# Build towers
user_tower = UserTower(
user_ids=user_ids,
embedding_dim=embedding_dim
)
item_tower = ItemTower(
item_ids=item_ids,
num_categories=num_categories,
embedding_dim=embedding_dim
)
# Item candidate dataset (for TopK metrics)
item_dataset = tf.data.Dataset.from_tensor_slices({
'item_id': item_catalog_df['item_id'].values.astype(str),
'category_id': item_catalog_df['category_id'].values.astype(np.int32),
'log_price': np.log1p(item_catalog_df['price'].values).astype(np.float32),
})
# Full model
model = TwoTowerRetrievalModel(
user_tower=user_tower,
item_tower=item_tower,
item_dataset=item_dataset
)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
    # Training dataset: join item attributes so the item tower has its inputs
    merged = interactions_df.merge(
        item_catalog_df[['item_id', 'category_id', 'price']],
        on='item_id',
        how='inner'
    )
    train_dataset = prepare_interaction_data(merged)
    train_dataset = train_dataset.shuffle(100_000, reshuffle_each_iteration=True).batch(batch_size)
model.fit(train_dataset, epochs=epochs)
return model
# ============================================================
# 5. ANN Index Building with FAISS
# ============================================================
import faiss
def build_faiss_index(
item_tower: ItemTower,
item_catalog_df: pd.DataFrame,
embedding_dim: int = 64
) -> tuple:
"""
Build a FAISS index for fast approximate nearest neighbor retrieval.
Returns (index, item_ids_array).
"""
# Generate all item embeddings in batches
all_embeddings = []
item_ids_ordered = []
batch_size = 1024
for start in range(0, len(item_catalog_df), batch_size):
batch = item_catalog_df.iloc[start:start + batch_size]
batch_inputs = {
'item_id': batch['item_id'].values.astype(str),
'category_id': batch['category_id'].values.astype(np.int32),
'log_price': np.log1p(batch['price'].values).astype(np.float32),
}
embeddings = item_tower(batch_inputs).numpy()
all_embeddings.append(embeddings)
item_ids_ordered.extend(batch['item_id'].values.tolist())
all_embeddings = np.vstack(all_embeddings).astype(np.float32)
# L2-normalize for cosine similarity via inner product
faiss.normalize_L2(all_embeddings)
# Build IVF index (faster for large catalogs, slight accuracy tradeoff)
quantizer = faiss.IndexFlatIP(embedding_dim) # Inner product (= cosine after L2 norm)
n_clusters = min(int(np.sqrt(len(item_catalog_df))), 1024)
index = faiss.IndexIVFFlat(quantizer, embedding_dim, n_clusters, faiss.METRIC_INNER_PRODUCT)
# Train the quantizer on embeddings
index.train(all_embeddings)
index.add(all_embeddings)
# Set nprobe for retrieval (higher = more accurate, slower)
index.nprobe = 10
return index, np.array(item_ids_ordered)
# ============================================================
# 6. Online Recommendation Serving
# ============================================================
class RecommendationServer:
"""
Lightweight serving wrapper combining query tower + FAISS retrieval.
"""
def __init__(
self,
user_tower: UserTower,
faiss_index,
item_ids: np.ndarray,
item_metadata: pd.DataFrame
):
self.user_tower = user_tower
self.index = faiss_index
self.item_ids = item_ids
self.item_metadata = item_metadata.set_index('item_id')
def get_recommendations(
self,
user_id: str,
session_context: dict,
top_k: int = 50,
exclude_item_ids: list = None
) -> pd.DataFrame:
"""
Get top-k recommendations for a user.
session_context: dict with session-level features (current_cart, device_type, etc.)
"""
# Build query tower input
query_inputs = {
'user_id': tf.constant([user_id]),
**{k: tf.constant([v]) for k, v in session_context.items()}
}
# Get user embedding
user_embedding = self.user_tower(query_inputs).numpy().astype(np.float32)
faiss.normalize_L2(user_embedding)
# ANN search - retrieve more candidates for post-filtering
distances, indices = self.index.search(user_embedding, top_k * 2)
# Map indices to item IDs
candidate_item_ids = self.item_ids[indices[0]]
candidate_scores = distances[0]
# Build result dataframe
results = pd.DataFrame({
'item_id': candidate_item_ids,
'retrieval_score': candidate_scores
})
# Filter excluded items (already purchased, in cart, etc.)
if exclude_item_ids:
results = results[~results['item_id'].isin(exclude_item_ids)]
# Join metadata
results = results.merge(
self.item_metadata.reset_index(),
on='item_id',
how='left'
)
return results.head(top_k)
# ============================================================
# 7. Ranking Stage (Post-Retrieval)
# ============================================================
import lightgbm as lgb
def build_ranking_features(
candidates_df: pd.DataFrame,
user_features: dict,
interaction_history: pd.DataFrame
) -> pd.DataFrame:
"""
Build features for the ranking stage.
Combines retrieval score, user features, item features, and context.
"""
df = candidates_df.copy()
# User-level features
for key, value in user_features.items():
df[f'user_{key}'] = value
# Category affinity: user's historical purchase rate in this category
if not interaction_history.empty:
cat_affinity = (
interaction_history
.groupby('category_id')['interaction_type']
.apply(lambda x: (x == 'purchase').sum() / len(x))
.to_dict()
)
df['user_category_affinity'] = df['category_id'].map(cat_affinity).fillna(0.0)
# Days since last interaction with this category
last_interaction = (
interaction_history
.groupby('category_id')['timestamp']
.max()
.to_dict()
)
df['days_since_category_interaction'] = df['category_id'].map(
lambda c: (pd.Timestamp.now() - last_interaction.get(c, pd.Timestamp.now())).days
)
    return df
```
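The ranking model that consumes these features is not shown above. A hedged sketch of training one with LightGBM's ranker, assuming a labeled dataset with one row per (slate, candidate), a `slate_id` grouping column, and a binary `label` - all hypothetical names:

```python
import lightgbm as lgb
import pandas as pd

def train_ranker(train_df: pd.DataFrame, feature_cols: list) -> lgb.LGBMRanker:
    """LambdaRank objective over retrieval slates: each slate is one query
    group, so the model learns to order candidates within a slate."""
    train_df = train_df.sort_values('slate_id')
    group_sizes = train_df.groupby('slate_id', sort=False).size().tolist()
    ranker = lgb.LGBMRanker(
        objective='lambdarank',
        n_estimators=500,
        learning_rate=0.05,
    )
    ranker.fit(train_df[feature_cols], train_df['label'], group=group_sizes)
    return ranker
```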
Architecture Diagrams
(Diagrams not reproduced here: Two-Tower Retrieval Architecture; Feature Store Architecture.)
Production Engineering Notes
Index Freshness
The FAISS index contains item embeddings computed from the item tower. New products added to the catalog need to be embedded and added to the index. The question is how frequently to rebuild.
Incremental indexing: FAISS IndexIVFFlat supports index.add() for new vectors without full rebuild. For a catalog growing by thousands of items daily, incremental addition is preferable to full nightly rebuilds. The tradeoff: over time, incremental additions may degrade IVF clustering quality. Schedule a full rebuild weekly.
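A minimal sketch of the incremental path, reusing the item tower and IVF index built above (assumes `new_items_df` carries the same catalog columns):

```python
import numpy as np
import faiss

def add_new_items(index: faiss.Index, item_tower, new_items_df,
                  item_ids: np.ndarray) -> np.ndarray:
    """Embed newly cataloged items and append them to the live IVF index.
    No re-training: the new vectors are assigned to existing centroids."""
    inputs = {
        'item_id': new_items_df['item_id'].values.astype(str),
        'category_id': new_items_df['category_id'].values.astype(np.int32),
        'log_price': np.log1p(new_items_df['price'].values).astype(np.float32),
    }
    emb = item_tower(inputs).numpy().astype(np.float32)
    faiss.normalize_L2(emb)
    index.add(emb)
    # Keep the id mapping aligned with the index's internal ordering
    return np.concatenate([item_ids, new_items_df['item_id'].values])
```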
Index deployment: The index is typically large (a million 64-dim float32 vectors = 256MB). Use blue-green deployment: build new index, validate retrieval quality on test queries, then swap atomically.
Model Staleness
User preferences shift. Item trends shift. The embedding space learned 3 months ago may not represent current user intent accurately. Best practice: retrain two-tower models weekly on recent interaction data (last 90 days with recency weighting). Monitor "embedding drift" - the cosine similarity between old and new user embeddings for the same user as a distribution. A significant shift signals the model has learned substantially different representations.
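A minimal sketch of the drift check, assuming you score the same users with both model versions and compare row-aligned embedding matrices:

```python
import numpy as np

def embedding_drift(old_emb: np.ndarray, new_emb: np.ndarray) -> np.ndarray:
    """Per-user cosine similarity between old and new embeddings (rows
    aligned by user). Track this distribution across retrains; a sharp
    drop in the median signals a substantially different space."""
    old_n = old_emb / np.linalg.norm(old_emb, axis=1, keepdims=True)
    new_n = new_emb / np.linalg.norm(new_emb, axis=1, keepdims=True)
    return np.sum(old_n * new_n, axis=1)
```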
Diversity in Recommendations
Optimizing purely for relevance score produces filter bubbles: users who bought Item A keep getting Item A-like recommendations. Diversity matters for long-term engagement and discovery.
Maximal Marginal Relevance (MMR): Iteratively select items that maximize relevance minus similarity to already-selected items.
Categorical diversity: Ensure recommended items span at least N distinct categories. Simple but effective.
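A minimal greedy MMR sketch over the retrieval output (`lam` trades relevance against diversity; 1.0 recovers pure relevance ranking):

```python
import numpy as np

def mmr_rerank(item_emb: np.ndarray, scores: np.ndarray, k: int,
               lam: float = 0.7) -> list:
    """item_emb: (n, d) candidate embeddings; scores: (n,) relevance.
    Greedily pick the item maximizing
    lam * relevance - (1 - lam) * max_similarity_to_selected."""
    norm = item_emb / np.linalg.norm(item_emb, axis=1, keepdims=True)
    sim = norm @ norm.T                          # pairwise cosine similarity
    selected = [int(np.argmax(scores))]
    while len(selected) < min(k, len(scores)):
        rest = [i for i in range(len(scores)) if i not in selected]
        mmr = [lam * scores[i] - (1 - lam) * sim[i, selected].max() for i in rest]
        selected.append(rest[int(np.argmax(mmr))])
    return selected
```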
Common Mistakes
:::danger Popularity Bias in Training Data Interaction data is biased toward popular items: they appear more in training because they have more historical interactions. If you train naively, your model learns to recommend popular items regardless of user preference. Two mitigations: (1) use inverse popularity weighting in your loss function so interactions with rare items count more; (2) add explicit "popularity" as a feature in your ranking model so the model can distinguish between "user likes this item" and "everyone clicks on this item." :::
:::danger Position Bias in Click Data Clicks are not uniform feedback - items shown at position 1 get clicked more than items at position 7, regardless of quality. If you train on click data without correcting for position bias, your model learns "show things at position 1" not "show relevant things." Use Inverse Propensity Scoring (IPS): weight each click by 1/P(shown at that position) to get an unbiased relevance signal. :::
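A minimal sketch of the IPS weighting, assuming position examination propensities have already been estimated (for example, from result randomization):

```python
import numpy as np

def ips_weights(positions: np.ndarray, propensity: np.ndarray,
                clip: float = 50.0) -> np.ndarray:
    """Weight each click by 1 / P(examined | position). propensity[p] is
    the estimated examination probability at display position p; clipping
    caps the variance contributed by tiny propensities."""
    return np.clip(1.0 / propensity[positions], None, clip)
```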
:::warning Offline-Online Metric Gap High offline AUC or NDCG does not guarantee online CTR improvement. A model that is great at predicting held-out clicks may fail online because: (1) it optimizes for the past not the future; (2) training distribution differs from serving distribution; (3) business constraints (diversity, newness, margins) matter online but not in offline eval. Always A/B test; never ship based solely on offline metrics. :::
:::warning Session Isolation in Validation When splitting your data for validation, split by time, not randomly. If you randomly sample (user, item) pairs for validation, the model can memorize user patterns from training interactions that appear in both splits. Use a strict temporal cutoff: all interactions before date T for training, all interactions after date T for validation. :::
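A minimal sketch of the temporal cutoff split (assumes `timestamp` is already a datetime column):

```python
import pandas as pd

def temporal_split(interactions_df: pd.DataFrame, cutoff: str):
    """Everything strictly before the cutoff trains; everything at or
    after it validates. No user's future leaks into their training data."""
    cutoff_ts = pd.Timestamp(cutoff)
    train = interactions_df[interactions_df['timestamp'] < cutoff_ts]
    valid = interactions_df[interactions_df['timestamp'] >= cutoff_ts]
    return train, valid
```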
Interview Questions and Answers
Q1: How does a two-tower model differ from matrix factorization, and when would you use each?
A: Both learn user and item embeddings such that dot product predicts interaction probability. The difference is in flexibility and input types. Matrix factorization (ALS, SVD++) uses only the interaction matrix - user ID and item ID, nothing else. Two-tower models can incorporate arbitrary features in each tower: user demographics, browsing history, session context; item attributes, text descriptions, images. This makes two-tower strictly more expressive. In practice, use MF when your catalog is relatively small and stable (under 100K items), you have sparse interaction data, and latency requirements are tight. Use two-tower for large catalogs with rich attributes, cold start challenges, or when you need real-time session context incorporated into the query.
Q2: Explain in-batch negatives for training two-tower models. What are their limitations?
A: In-batch negatives reuse examples already in the training batch as negative examples for each query. For a batch of 2048 (user, positive_item) pairs, each user treats the other 2047 positive items in the batch as negatives. This is computationally efficient - you do not need to sample additional negatives. The limitation is sampling bias: popular items appear more frequently in batches (because they have more interactions in training data). This means popular items are disproportionately used as negatives. The model ends up learning "this item is not as popular as a popular item" rather than "this item is not relevant to this user." Mitigation: logQ correction - subtract log(sampling probability) from the logit for each negative. This de-biases the loss by accounting for item popularity in sampling.
Q3: A new user visits your e-commerce site for the first time. Walk me through your personalization strategy across their first 10 interactions.
A: Interaction 0 (landing): No signals yet. Serve contextual recommendations based on entry point (which ad they clicked, which email campaign brought them, or organic search query). Also use device type, time of day, and geographic location for coarse personalization. Interactions 1-2 (first category view or search): Identify the category or intent. Shift recommendations toward items in that category, weighted by popularity within the category for users with similar entry context. Interaction 3-5 (first clicks): Now use session-level collaborative filtering - "users who viewed these items also clicked on..." This works even without user history because you're leveraging the population. Interaction 6-8 (cart add or second category): Price sensitivity signal - are they clicking budget or premium? Adjust price tier of recommendations. Interaction 9-10 (potential purchase): If they've added to cart, personalize for complementary items (cross-sell). If they've abandoned, show the item again with social proof or urgency signals. After purchase or account creation, they move out of cold start and into the standard personalization flow.
Q4: How would you design an A/B test to measure the true impact of your recommendation system, accounting for network effects?
A: The critical problem with standard A/B testing recommendations is interference: if I recommend item X to treatment users and they buy it, I change item X's availability, price, and inventory signal - which affects control users too. This violates the SUTVA (Stable Unit Treatment Value Assumption) required for valid A/B tests. For recommendations, user-level randomization is safer than session-level because purchase decisions are more independent across users than across sessions. But for viral or social items, even user-level independence breaks down. Best approach: geo-level randomization for experiments measuring broad impact - assign entire cities or DMAs to treatment/control. For measuring algorithmic differences with less contamination, use switchback designs (time-based randomization) or holdout sets where a fraction of users see no personalization at all (pure control). Primary metrics: 30-day repeat purchase rate, revenue per user, category discovery rate. Avoid CTR as a primary metric - it is gameable.
Q5: What is approximate nearest neighbor (ANN) search and why is exact nearest neighbor impractical for recommendation retrieval?
A: Exact nearest neighbor search requires computing the dot product or distance between the query embedding and every item in the index. For a catalog of 1 million items with 64-dimensional embeddings, that's 1 million dot products per query. At 64-dim float32, a single dot product is about 128 floating point operations. 1 million of them is 128M FLOPs. On a modern CPU with ~100 GFLOPs peak, that's about 1.3ms - borderline acceptable for one query. But at 10,000 queries per second on one machine, that is roughly 1.3 TFLOPs of compute per second - more than ten times the machine's peak - which is not feasible. ANN methods trade a small amount of recall for orders of magnitude speedup. FAISS IVF (Inverted File Index) clusters item embeddings into k clusters (Voronoi partitions), then at query time only searches the nearest nprobe clusters. With nprobe=10 out of 1024 clusters, you search about 1% of the index but recover 90-95% of the true nearest neighbors. HNSW (Hierarchical Navigable Small World graphs) is another approach that achieves sub-millisecond retrieval with 99%+ recall at scale. Production systems at Spotify, Pinterest, and Airbnb all use HNSW variants.
Q6: How would you handle the recommendation problem for a retailer that sells both everyday goods (frequent, low-involvement purchases like detergent) and high-involvement purchases (furniture, appliances)?
A: These two purchase types require fundamentally different recommendation models. Everyday goods are characterized by high purchase frequency, strong brand loyalty, and replenishment patterns - "this user buys Tide pods every 3 weeks" is more predictive than any collaborative filter. Model this as a subscription/replenishment prediction problem: when is this user likely to run out? Surface the replenishment reminder at the right time. High-involvement purchases are characterized by long consideration cycles, multiple research sessions, high price sensitivity, and significant post-purchase regret. Collaborative filtering works better here because users research broadly before buying. The recommendation goal shifts from "predict next purchase" to "surface options that will end the consideration journey." In practice, maintain two separate recommendation systems and use a request-time classifier to determine which system to invoke based on the category the user is currently browsing. Share user embeddings between systems for transfer learning benefit, but optimize objectives separately.
