The Cold Start Problem - When Your Recommender Knows Nothing
Reading time: ~32 minutes | Level: Recommender Systems | Role: MLE, Data Scientist, AI Engineer
The Day Pinterest Had to Guess
It is 2012. Pinterest is growing by tens of thousands of new users every day. The company's collaborative filtering system - trained on the interaction history of its most engaged users - is beginning to show real power for those users. Recommendations are surfacing pins they love. Engagement is climbing. The team is proud of what they have built.
And then someone looks at the new user funnel.
Every new user who signs up sees the same 20 pins. Not 20 pins chosen for them - 20 pins that are globally popular, the same 20 as every other new user who signed up that day. A 45-year-old man interested in woodworking gets the same first experience as a 22-year-old woman interested in fashion and a 30-year-old chef looking for recipes. The collaborative filtering system has nothing to say about any of them, so it defaults to the only signal it has: global popularity. The pins that get the most clicks from everyone become the pins shown to everyone who is new.
The consequences are severe and non-obvious. First, new user retention is poor - users who see content irrelevant to their interests in their first session often do not return. Second, the popular pins get even more clicks from new users, entrenching their popularity and making it even harder for new content to surface. Third, power users who join later see the same stale "popular" content as complete beginners - a particularly bad experience that drives away exactly the high-value users you most want to retain. The recommender system is stuck in a loop: popular items are shown because they are popular, and they become more popular because they are shown.
The solution Pinterest developed over the following years had several components. An onboarding flow asked new users to pick five topics they cared about, giving the system immediate signal about interests even before any pins had been saved or clicked. A content-based fallback used pin image features and text descriptions to find items similar to those in a user's chosen interest categories. A blending function gradually increased the weight of collaborative filtering as the user accumulated interaction history - 10% CF at 2 interactions, 50% CF at 20 interactions, 90% CF at 100 interactions. And the entire cold start experience was tracked as a separate metric, not buried in aggregate recommendation quality numbers.
This is the cold start problem in production. It is not an edge case you can defer to later. Every user was once a new user. Every item was once a new item. The cold start problem is the permanent state for a significant fraction of your most important traffic, and solving it well is often what separates a recommendation system that retains users from one that drives them away in the first session.
Why This Exists
Collaborative filtering - the engine behind most recommendation systems - is fundamentally a pattern matching exercise. It finds users who behaved similarly to you in the past and recommends items those users liked. It finds items that were liked by the same users who liked things you like, and recommends those. Both approaches require one prerequisite: historical interaction data.
Without interaction data, collaborative filtering has no signal. The user-item interaction matrix has an empty row for new users and an empty column for new items. Matrix factorization has no latent factors to place a new user in embedding space. Neural collaborative filtering has no interaction history to pass through its embedding layers. Two-tower models have no personalized embedding for a user they have never seen before.
This is not a bug in these systems - it is a fundamental consequence of how they work. Collaborative filtering is powerful precisely because it learns from aggregate patterns across millions of users and items. A new user has not contributed to those patterns yet. A new item has not been evaluated by those users yet.
The gap between the new user/item experience and the personalized experience can be dramatic. As an illustrative example, a mature recommendation system trained on years of data might deliver NDCG@10 of 0.85 or higher to its best-served users, while new users receiving popularity-based recommendations might see NDCG@10 of 0.40. That gap translates directly into lower engagement, lower retention, and lower lifetime value from exactly the users you are spending the most money to acquire.
Getting cold start right is not about perfecting an edge case. It is about ensuring your system performs adequately for an important and permanent segment of your users.
Three Types of Cold Start
The cold start problem appears in three distinct forms, each requiring a different solution strategy.
New user cold start is the most commonly encountered form. A user registers, but the system has no interaction history for them. They may have provided some profile information (age, location, stated preferences from an onboarding flow), but the rich behavioral signal that powers collaborative filtering is absent. This state is temporary - it ends once the user has accumulated enough interactions - but that transition from cold to warm to hot takes time, and the user may churn before it happens.
New item cold start affects every item that is added to a catalog. A new product listed on an e-commerce site, a new song uploaded to a music platform, a new video published on YouTube - all of these lack interaction history. Content-based features (text description, image, audio spectrogram, metadata) are available, but the collaborative signal (who bought this alongside what else? who listened to this after what?) does not exist yet. This is particularly acute for new items that are genuinely novel - they may not be similar to existing items, making content-based fallback also imprecise.
System cold start occurs when you are bootstrapping an entirely new recommendation system with no historical data. This is the rarest form and the hardest to solve elegantly. Without data, there is little you can do beyond editorial curation (hand-picking items to show), trending/popularity signals (at least showing items that are being actively engaged with elsewhere), and transfer learning from related domains.
Solving New User Cold Start
Onboarding Surveys: Getting Explicit Signal Fast
The simplest and most direct solution to new user cold start is to ask. A short onboarding survey - selecting 3–5 topics of interest, rating a few example items, or indicating preferences across broad categories - gives the system immediate signal to work with before any implicit behavior has accumulated.
The design of this onboarding flow matters more than most engineers appreciate. It must be:
Short: users will abandon onboarding if it takes more than 30–60 seconds. 3–5 choices is usually the right length. Spotify's genre selector shows approximately 25 genres with visual album art and lets users pick their favorites in seconds. Pinterest's topic picker shows interest categories with representative pins. The interaction is fast and feels useful, not like a form.
Immediately rewarding: the recommendation shown immediately after onboarding should be visibly personalized based on what the user just told you. If a user picks "photography" and "travel" and then sees generic popular content, the onboarding was pointless and they know it. Show items that are clearly related to their stated interests, even if the quality of those recommendations is not yet as high as fully personalized CF would produce.
Low-stakes: users should feel they can answer quickly and honestly without fear of being "locked in." Show them they can change preferences later. This removes the paralysis of over-thinking answers.
Translated into training signal: onboarding survey answers must be connected to your recommendation model's feature space. If a user says they like "jazz," that needs to map to a cluster of items in your item embedding space that the recommendation model understands. The engineering work of building this bridge between survey answers and model features is often underestimated.
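One way to build that bridge is sketched below. It assumes each onboarding topic has a hand-curated list of seed items and that item embeddings from the recommendation model are available as a NumPy matrix. The names `topic_seed_items`, `user_vector_from_topics`, and `top_k_items` are hypothetical, not part of any particular library.

```python
import numpy as np


def user_vector_from_topics(selected_topics, topic_seed_items, item_embeddings):
    """Average the embeddings of seed items for the topics a user picked.

    selected_topics: topic names chosen during onboarding
    topic_seed_items: dict topic -> list of item row indices (hypothetical curation)
    item_embeddings: (n_items, dim) array from the recommendation model
    """
    centroids = []
    for topic in selected_topics:
        seeds = topic_seed_items.get(topic, [])
        if seeds:
            centroids.append(item_embeddings[seeds].mean(axis=0))
    if not centroids:
        return None  # caller falls back to contextual or popularity signals
    vec = np.mean(centroids, axis=0)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def top_k_items(user_vec, item_embeddings, k=5):
    """Rank items by cosine similarity to the onboarding-derived user vector."""
    norms = np.linalg.norm(item_embeddings, axis=1, keepdims=True)
    unit = item_embeddings / np.clip(norms, 1e-12, None)
    scores = unit @ user_vec
    return np.argsort(-scores)[:k].tolist()
```

The resulting vector can stand in for the missing user embedding in the first session, and be replaced or blended once real interactions arrive.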
Demographic and Contextual Fallback
When no explicit onboarding signal is available (users who skip the survey, API access without onboarding, etc.), you can use available contextual information as a weak substitute:
- Referral source: a user who arrived via a specific ad campaign, social media post, or partner site likely has interests aligned with the content of that referral
- Geographic location: coarse personalization by location (country, language, cultural context)
- Device type: mobile vs. desktop vs. app users have systematically different engagement patterns
- Time of day / day of week: content consumption patterns vary by time
These are weak signals and produce coarse personalization at best. But "coarse personalization" is better than "show the same 20 popular items to everyone," particularly for items where demographic correlations with preferences are strong (local news, local events, language-specific content).
The key is to represent this contextual information as features in your recommendation model rather than as a separate, rule-based system. A single model that takes user context features (location, referral, device, time) and outputs a ranked list is easier to maintain and can be improved continuously as you collect more data.
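As a minimal sketch of turning these signals into model features, the snippet below one-hot encodes the categorical context and encodes hour of day cyclically. The vocabularies and the `encode_context` helper are assumptions for illustration. A real system would derive vocabularies from logged data and might learn embeddings instead.

```python
import math
import numpy as np

# Hypothetical vocabularies; a real system would derive these from logged data.
REFERRALS = ["organic", "ad_campaign", "social", "partner"]
COUNTRIES = ["US", "DE", "JP", "BR"]
DEVICES = ["mobile", "desktop", "app"]


def one_hot(value, vocab):
    """One-hot encode with an extra trailing slot for unseen values."""
    vec = np.zeros(len(vocab) + 1)
    idx = vocab.index(value) if value in vocab else len(vocab)
    vec[idx] = 1.0
    return vec


def encode_context(referral, country, device, hour_of_day):
    """Concatenate context signals into one feature vector for a ranking model."""
    angle = 2 * math.pi * (hour_of_day % 24) / 24
    # Cyclic encoding keeps 23:00 close to 00:00 in feature space
    hour = np.array([math.sin(angle), math.cos(angle)])
    return np.concatenate([
        one_hot(referral, REFERRALS),
        one_hot(country, COUNTRIES),
        one_hot(device, DEVICES),
        hour,
    ])
```

Because the output is a plain feature vector, the same ranking model can consume it for cold users and simply add behavioral features as they become available.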
Exploration Budget
Even with onboarding survey data and demographic signals, new user recommendations carry high uncertainty. The right response to uncertainty is exploration - showing a diverse range of content to learn the user's actual preferences quickly.
Set aside 20–30% of recommendations for new users as an explicit exploration budget: items chosen to cover different content categories, styles, or content types rather than items predicted to be maximally relevant. When a user engages with an item from the exploration set, that is a much stronger signal than an engagement with an item from the "obviously relevant based on onboarding" set - because you are learning something unexpected about the user's preferences.
The tradeoff: exploration reduces immediate relevance (the user may not like all exploratory recommendations), but it accelerates learning (you discover their preferences faster), which improves recommendations sooner and increases the probability of long-term retention.
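The exploration budget can be sketched as a slate builder that fills most slots by predicted relevance and reserves the rest for categories the relevance picks did not cover. The function name, the 30% default, and the category-coverage heuristic are assumptions chosen for illustration; other diversity criteria would work equally well.

```python
def build_slate(ranked_items, item_category, n_slots=10, explore_fraction=0.3):
    """Build a recommendation slate with an explicit exploration budget.

    ranked_items: item ids sorted by predicted relevance, best first
    item_category: dict item_id -> category label
    """
    n_explore = round(n_slots * explore_fraction)
    n_exploit = n_slots - n_explore

    exploit = ranked_items[:n_exploit]
    covered = {item_category[i] for i in exploit}

    # Exploration: best-ranked item from each category not yet covered
    explore = []
    for item in ranked_items[n_exploit:]:
        if len(explore) == n_explore:
            break
        category = item_category[item]
        if category not in covered:
            explore.append(item)
            covered.add(category)

    # Backfill with next-best items if too few uncovered categories exist
    for item in ranked_items[n_exploit:]:
        if len(explore) == n_explore:
            break
        if item not in explore:
            explore.append(item)

    return exploit + explore
```

Logging which slots were exploratory makes it possible to weight engagement on those items more heavily when updating the user's profile.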
Meta-Learning / Few-Shot Adaptation
The most sophisticated approach to new user cold start is to train a model that explicitly learns to adapt quickly from a small number of interactions. This is inspired by model-agnostic meta-learning (MAML, Finn et al., 2017), applied to the recommendation domain.
The core idea: instead of training a single recommendation model on all users uniformly, train a model that, given interactions from a new user, can quickly fine-tune to that user's preferences with a small number of gradient steps. The outer training loop optimizes for fast adaptation, not just for accuracy on users with many interactions.
In practice, this is implemented as:
- Sample a "task" - a user with many interactions
- Simulate cold start by holding out most of that user's interactions and keeping only the first few as the "cold" signal
- Fine-tune the model on those few retained interactions (inner loop)
- Evaluate on the held-out interactions (outer loop)
- Meta-update the base model to minimize the outer loop loss
This is computationally expensive but produces a model that genuinely improves faster from sparse interactions than a standard model would. Companies like Spotify and Netflix have published work on variants of this approach for recommendation.
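The loop above can be sketched with a deliberately small stand-in: a linear scorer over item features, trained with the first-order approximation of MAML (the outer update uses the query-set gradient at the adapted parameters and ignores second derivatives). The linear model, the function names, and the learning rates are all assumptions for illustration, not a production recipe.

```python
import numpy as np


def mse_grad(theta, X, y):
    """Gradient of mean squared error for a linear scorer X @ theta."""
    return 2.0 * X.T @ (X @ theta - y) / len(y)


def adapt(theta, X_support, y_support, inner_lr=0.05, steps=1):
    """Inner loop: a few gradient steps on the user's handful of interactions."""
    for _ in range(steps):
        theta = theta - inner_lr * mse_grad(theta, X_support, y_support)
    return theta


def meta_train(tasks, dim, outer_lr=0.01, epochs=200, n_support=3, seed=0):
    """Outer loop (first-order MAML).

    tasks: list of (X, y) pairs, one per data-rich user, where X holds item
           features and y holds that user's observed ratings.
    Returns a shared initialization that adapts well from n_support examples.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)
    for _ in range(epochs):
        X, y = tasks[rng.integers(len(tasks))]    # sample a "task" (user)
        idx = rng.permutation(len(y))
        s, q = idx[:n_support], idx[n_support:]   # simulate cold start
        adapted = adapt(theta, X[s], y[s])        # inner loop
        # Meta-update: query-set gradient evaluated at the adapted parameters
        theta = theta - outer_lr * mse_grad(adapted, X[q], y[q])
    return theta
```

At serving time, a new user's first few interactions would be passed to `adapt` starting from the meta-trained `theta`. Full MAML would also backpropagate through the inner step, which this sketch omits.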
Solving New Item Cold Start
Content-Based Embedding: Warm Start from Features
For a new item with no interaction history, the most reliable approach is to compute an embedding from its content features and place it in the same embedding space as existing items. This allows the recommendation model to treat the new item like a known item with a similar profile.
For text-heavy items (news articles, job listings, product descriptions):
A sentence transformer (e.g., sentence-transformers/all-mpnet-base-v2) produces a 768-dimensional embedding of the item's text. A learned projection head maps this into the recommendation model's embedding space, where items cluster by behavioral similarity.
For visual items (fashion, art, food):
CLIP's vision encoder produces semantically meaningful image embeddings that can be projected into the recommendation embedding space.
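The projection MLP $f_\theta$ is trained on existing items: given item $i$ with known interaction-based embedding $\mathbf{e}_i$, train the MLP to predict $\mathbf{e}_i$ from the item's content features $\mathbf{x}_i$:
$$\min_\theta \sum_i \left\| f_\theta(\mathbf{x}_i) - \mathbf{e}_i \right\|^2$$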
After training, you can call this MLP on any new item and immediately get a reasonable embedding, without waiting for interaction data to accumulate.
Warm Start via Similar Item Averaging
A simpler alternative to the projection MLP is to initialize a new item's embedding as the average embedding of its most similar existing items:
$$\mathbf{e}_{\text{new}} = \frac{1}{|\mathcal{N}_k|} \sum_{j \in \mathcal{N}_k} \mathbf{e}_j$$
where $\mathcal{N}_k$ is a set of $k$ nearest neighbors in content feature space (found via cosine similarity on text or image embeddings). This warm start embedding can then be refined as interactions accumulate, via online gradient updates or periodic batch retraining.
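The neighbor averaging described above can be sketched in a few lines of NumPy. The function name and the brute-force similarity search are assumptions for illustration; at catalog scale an approximate nearest-neighbor index would typically replace the full matrix product.

```python
import numpy as np


def warm_start_embedding(new_content, content_matrix, cf_matrix, k=5):
    """Average the CF embeddings of the k items closest in content space.

    new_content: (content_dim,) content features of the new item
    content_matrix: (n_items, content_dim) content features of existing items
    cf_matrix: (n_items, cf_dim) interaction-based embeddings of existing items
    """
    def unit(m):
        norms = np.linalg.norm(m, axis=-1, keepdims=True)
        return m / np.clip(norms, 1e-12, None)

    sims = unit(content_matrix) @ unit(new_content)  # cosine similarity per item
    neighbors = np.argsort(-sims)[:k]
    return cf_matrix[neighbors].mean(axis=0), neighbors
```

Returning the neighbor indices alongside the embedding makes it easy to inspect which existing items a new item is being treated like.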
Bandit Exploration for New Items
Beyond getting the initial embedding right, you need to actively surface new items to users so that interaction data can accumulate. This is an exploration problem: you know little about the new item's true quality, but you need to show it to some users to find out.
The Upper Confidence Bound (UCB) algorithm provides a principled approach. Model the click-through rate (or any engagement metric) for item $i$ as an uncertain estimate with confidence bounds:
$$\text{UCB}_i = \hat{\mu}_i + c \sqrt{\frac{\ln t}{n_i}}$$
where $\hat{\mu}_i$ is the estimated engagement rate for item $i$, $t$ is the total number of recommendation events so far, $n_i$ is the number of times item $i$ has been shown, and $c$ is a constant controlling the exploration-exploitation tradeoff.
The UCB bonus is large when $n_i$ is small (item has been shown rarely - high uncertainty, high exploration value) and decreases as $n_i$ grows (item has been shown often - lower uncertainty, less need to explore). This naturally directs exploration toward items that have been shown the least, gradually building up the interaction data needed for collaborative filtering.
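A minimal sketch of this scoring rule, assuming the common form in which the bonus is `c * sqrt(ln t / n_i)` and never-shown items are tried first. The function names and the `(clicks, shows)` bookkeeping are illustrative assumptions.

```python
import math


def ucb_score(clicks, shows, total_events, c=1.0):
    """Upper confidence bound on an item's engagement rate.

    total_events must be at least 1 so that the logarithm is defined.
    """
    if shows == 0:
        return float("inf")  # never-shown items are tried first
    mean = clicks / shows
    bonus = c * math.sqrt(math.log(total_events) / shows)
    return mean + bonus


def pick_item(stats, total_events, c=1.0):
    """stats: dict item_id -> (clicks, shows). Returns the item with the highest UCB."""
    return max(stats, key=lambda i: ucb_score(*stats[i], total_events, c))
```

With `stats = {"old": (300, 1000), "new": (1, 5)}` and 1005 total events, the new item wins despite its lower observed rate, because its bonus is far larger.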
Thompson Sampling is an alternative that often outperforms UCB in practice. Instead of adding a deterministic confidence bonus, maintain a Beta distribution over the true engagement rate for each item and sample from it:
$$\tilde{\theta}_i \sim \text{Beta}(\alpha_i, \beta_i)$$
where $\alpha_i$ counts positive interactions (clicks, purchases) and $\beta_i$ counts non-interactions. New items start with a prior (e.g., $\alpha_i = \beta_i = 1$, which is a uniform prior). After each show/click observation, update $\alpha_i \leftarrow \alpha_i + 1$ or $\beta_i \leftarrow \beta_i + 1$ accordingly. At recommendation time, sample from each item's Beta distribution and rank by sampled value.
Thompson Sampling has a natural property: items with fewer observations (high uncertainty) have wider Beta distributions and are therefore more likely to be sampled high, driving exploration. Items with many observations converge to their true mean and are ranked stably.
The key design decision is which users to expose to new items during exploration. Showing new items to all users indiscriminately hurts overall recommendation quality. A better approach: reserve exploration for a subset of users who are explicitly opted in (e.g., users who have toggled "show me new things"), users with high engagement who can tolerate occasional misses, or users whose taste profile seems most similar to the predicted profile of the new item.
The Transition: From Cold to Warm to Personalized
The cold start problem has a natural resolution: as interaction data accumulates, the system should gradually increase the weight of collaborative filtering and decrease the weight of content-based and demographic fallbacks.
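A simple blending function:
$$\text{score}(u, i) = \alpha(n_u)\,\text{score}_{\text{CF}}(u, i) + \bigl(1 - \alpha(n_u)\bigr)\,\text{score}_{\text{CB}}(u, i)$$
where $n_u$ is the number of interactions user $u$ has had, and:
$$\alpha(n_u) = \min\!\left(\frac{n_u}{N}, 1\right)$$
This linearly increases the CF weight from 0 (at 0 interactions) to 1 (at $N$ interactions, typically 50–100). Below $N$, the system is in hybrid mode, blending content-based and collaborative signals.
A more principled version models uncertainty explicitly. Bayesian approaches compute a posterior over user preference given limited observations and use the uncertainty of that posterior to weight CF vs. CB: the less certain the posterior, the more weight stays on the content-based signal.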
In practice, the linear blending function with a tuned threshold is usually good enough and much simpler to implement and debug.
Code: Thompson Sampling for New Item Exploration
import numpy as np
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ItemStats:
    """Tracks Beta distribution parameters for Thompson Sampling."""
    item_id: str
    alpha: float = 1.0  # prior: 1 pseudo-click
    beta: float = 1.0   # prior: 1 pseudo-non-click
    n_shown: int = 0
    is_new: bool = True
    graduation_threshold: int = 30  # shows needed to leave "new" status

    @property
    def mean_estimate(self) -> float:
        """Expected value of the Beta distribution."""
        return self.alpha / (self.alpha + self.beta)

    @property
    def uncertainty(self) -> float:
        """Variance of the Beta distribution - higher = more uncertain."""
        a, b = self.alpha, self.beta
        return (a * b) / ((a + b) ** 2 * (a + b + 1))

    def sample(self) -> float:
        """Sample engagement rate from posterior Beta distribution."""
        return float(np.random.beta(self.alpha, self.beta))

    def update(self, clicked: bool) -> None:
        """Update posterior after observing a click or non-click."""
        if clicked:
            self.alpha += 1
        else:
            self.beta += 1
        self.n_shown += 1
        # Graduate from "new" status after enough observations
        if self.n_shown >= self.graduation_threshold:
            self.is_new = False


class BanditItemExplorer:
    """
    Manages exploration of new items using Thompson Sampling.
    Integrates with a main recommender that scores established items.
    """

    def __init__(
        self,
        exploration_fraction: float = 0.2,
        new_item_threshold: int = 30,
    ):
        if not 0 < exploration_fraction <= 1:
            raise ValueError("exploration_fraction must be in (0, 1]")
        self.exploration_fraction = exploration_fraction
        self.new_item_threshold = new_item_threshold
        self.item_stats: Dict[str, ItemStats] = {}

    def register_new_item(self, item_id: str) -> None:
        """Add a newly created item to the exploration pool."""
        self.item_stats[item_id] = ItemStats(
            item_id=item_id,
            graduation_threshold=self.new_item_threshold,
        )
        print(f"Registered new item {item_id} for exploration")

    def record_outcome(self, item_id: str, clicked: bool) -> None:
        """Record whether a shown item was clicked."""
        if item_id in self.item_stats:
            self.item_stats[item_id].update(clicked)

    def get_new_items(self) -> List[ItemStats]:
        """Return all items still in exploration phase."""
        return [s for s in self.item_stats.values() if s.is_new]

    def recommend(
        self,
        n_recommendations: int,
        established_scores: Dict[str, float],
        user_content_affinity: Optional[Dict[str, float]] = None,
    ) -> List[str]:
        """
        Produce a ranked recommendation list blending:
        - Established items (ranked by main recommender scores)
        - New items (selected by Thompson Sampling)

        Args:
            n_recommendations: total items to return
            established_scores: {item_id: score} from main recommender
            user_content_affinity: optional {item_id: content_similarity_score}
        """
        n_explore = max(1, int(n_recommendations * self.exploration_fraction))

        # --- Exploitation: top items from main recommender ---
        # Keep n_recommendations candidates so that a shortfall in the
        # exploration pool can be backfilled with established items.
        ranked_established = sorted(
            established_scores.items(), key=lambda x: -x[1]
        )
        exploit_ids = [
            item_id for item_id, _ in ranked_established[:n_recommendations]
        ]

        # --- Exploration: Thompson Sampling over new items ---
        new_items = self.get_new_items()
        if not new_items:
            # No new items to explore - return established items only
            return exploit_ids

        # Sample from each new item's posterior
        sampled_scores = []
        for stats in new_items:
            ts_score = stats.sample()
            # Optionally blend with content affinity to seed exploration
            if user_content_affinity and stats.item_id in user_content_affinity:
                content_score = user_content_affinity[stats.item_id]
                # Weight: mostly TS, small content boost for initial cold start
                blended = 0.8 * ts_score + 0.2 * content_score
            else:
                blended = ts_score
            sampled_scores.append((stats.item_id, blended))

        # Select top new items by sampled score
        explore_items = sorted(sampled_scores, key=lambda x: -x[1])[:n_explore]
        explore_ids = [item_id for item_id, _ in explore_items]

        # Interleave: don't put all exploratory items at the bottom
        combined = []
        exploit_idx, explore_idx = 0, 0
        interval = max(1, int(1 / self.exploration_fraction))
        for i in range(n_recommendations):
            # Every `interval`-th slot goes to an exploratory item if one is left
            if (i + 1) % interval == 0 and explore_idx < len(explore_ids):
                combined.append(explore_ids[explore_idx])
                explore_idx += 1
                continue
            if exploit_idx < len(exploit_ids):
                combined.append(exploit_ids[exploit_idx])
                exploit_idx += 1
        return combined
# Simulation: new items accumulating interactions over time
def simulate_new_item_exploration():
    explorer = BanditItemExplorer(exploration_fraction=0.2)

    # Register 10 new items with unknown true CTRs
    true_ctrs = {}
    for i in range(10):
        item_id = f"new_item_{i}"
        true_ctrs[item_id] = np.random.beta(2, 5)  # Beta(2, 5) has mean 2/7, about 0.29
        explorer.register_new_item(item_id)

    # 300 established items with known scores
    established_scores = {
        f"established_{i}": np.random.uniform(0.3, 0.9)
        for i in range(300)
    }

    # Simulate 1000 recommendation events
    total_clicks = 0
    total_shows = 0
    new_item_clicks = {item_id: 0 for item_id in true_ctrs}
    new_item_shows = {item_id: 0 for item_id in true_ctrs}

    for event in range(1000):
        recommendations = explorer.recommend(
            n_recommendations=10,
            established_scores=established_scores,
        )
        for item_id in recommendations:
            # Only new items have a simulated click outcome here
            if item_id in true_ctrs:
                clicked = np.random.random() < true_ctrs[item_id]
                explorer.record_outcome(item_id, clicked)
                new_item_shows[item_id] += 1
                if clicked:
                    new_item_clicks[item_id] += 1
                    total_clicks += 1
            total_shows += 1

    # Report: how well did we learn the true CTRs?
    print("\nNew Item Learning Results:")
    print(f"{'Item':<15} {'True CTR':<12} {'Estimated CTR':<15} {'Shows':<8}")
    print("-" * 50)
    for item_id in sorted(true_ctrs.keys()):
        stats = explorer.item_stats[item_id]
        estimated = stats.mean_estimate
        true_ctr = true_ctrs[item_id]
        print(
            f"{item_id:<15} {true_ctr:<12.3f} {estimated:<15.3f} "
            f"{new_item_shows[item_id]:<8}"
        )

    # Items leave the exploration pool after 30 shows, so show counts are
    # capped at the graduation threshold. Compare estimated and true rankings.
    best_item = max(true_ctrs, key=lambda x: true_ctrs[x])
    print(f"\nTrue best item: {best_item} (CTR={true_ctrs[best_item]:.3f})")
    best_estimated = max(
        explorer.item_stats, key=lambda x: explorer.item_stats[x].mean_estimate
    )
    print(
        f"Highest estimated item: {best_estimated} "
        f"(estimate={explorer.item_stats[best_estimated].mean_estimate:.3f})"
    )


simulate_new_item_exploration()
Code: Content-Based Warm Start and Transition Blending
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional


class ContentEmbeddingProjector(nn.Module):
    """
    Projects content features (text/image embeddings) into the
    collaborative filtering embedding space.

    Trained on existing items where both content features and
    CF embeddings are available.
    """

    def __init__(self, content_dim: int, cf_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.projector = nn.Sequential(
            nn.Linear(content_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, cf_dim),
        )

    def forward(self, content_embedding: torch.Tensor) -> torch.Tensor:
        return self.projector(content_embedding)


def train_content_projector(
    content_embeddings: np.ndarray,  # (n_items, content_dim), e.g. from a sentence transformer
    cf_embeddings: np.ndarray,       # (n_items, cf_dim)
    n_epochs: int = 50,
    lr: float = 1e-3,
) -> ContentEmbeddingProjector:
    """Train the projection from content space to CF embedding space."""
    content_dim = content_embeddings.shape[1]
    cf_dim = cf_embeddings.shape[1]
    model = ContentEmbeddingProjector(content_dim, cf_dim)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=n_epochs)

    X = torch.as_tensor(content_embeddings, dtype=torch.float32)
    Y = torch.as_tensor(cf_embeddings, dtype=torch.float32)

    # Full-batch training: fine for a sketch, use mini-batches for large catalogs
    for epoch in range(n_epochs):
        model.train()
        optimizer.zero_grad()
        Y_pred = model(X)
        # Cosine term aligns direction; MSE term also matches magnitude
        cos_loss = 1 - F.cosine_similarity(Y_pred, Y, dim=1).mean()
        mse_loss = F.mse_loss(Y_pred, Y)
        loss = 0.5 * cos_loss + 0.5 * mse_loss
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1:3d} | Loss: {loss.item():.4f}")
    return model
class HybridRecommender:
    """
    Blends content-based and collaborative filtering recommendations
    based on the number of interactions a user has had.

    - Cold users (few interactions): mostly content-based
    - Warm users (approaching the threshold): blend
    - Hot users (>= threshold interactions): pure collaborative filtering
    """

    def __init__(
        self,
        cf_model,  # collaborative filtering model
        cb_model,  # content-based model
        transition_threshold: int = 50,
    ):
        self.cf_model = cf_model
        self.cb_model = cb_model
        self.transition_threshold = transition_threshold

    def blend_weight(self, n_interactions: int) -> float:
        """
        Returns CF weight alpha in [0, 1].
        alpha = 0 means pure content-based.
        alpha = 1 means pure collaborative filtering.
        """
        return min(n_interactions / self.transition_threshold, 1.0)

    def recommend(
        self,
        user_id: str,
        n_interactions: int,
        candidate_items: list,
        n_recommendations: int = 10,
        user_content_profile: Optional[np.ndarray] = None,
    ) -> list:
        """
        Produce recommendations for a user given their interaction count.
        """
        alpha = self.blend_weight(n_interactions)

        # Get CF scores (may be zero/noise for cold users)
        cf_scores = self.cf_model.score(user_id, candidate_items)
        # Get CB scores from content profile
        cb_scores = self.cb_model.score(user_content_profile, candidate_items)

        # Normalize both score vectors to [0, 1]
        def normalize(scores):
            s = np.array(scores, dtype=float)
            rng = s.max() - s.min()
            return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)

        cf_norm = normalize(cf_scores)
        cb_norm = normalize(cb_scores)

        # Blend: alpha controls CF weight
        blended = alpha * cf_norm + (1 - alpha) * cb_norm

        # Rank by blended score
        ranked_indices = np.argsort(blended)[::-1]
        return [candidate_items[i] for i in ranked_indices[:n_recommendations]]

    def explain_blend(self, n_interactions: int) -> str:
        alpha = self.blend_weight(n_interactions)
        cf_pct = int(alpha * 100)
        cb_pct = 100 - cf_pct
        return (
            f"With {n_interactions} interactions: "
            f"{cf_pct}% collaborative filtering, {cb_pct}% content-based"
        )


# Demonstrate the transition (models are not needed for explain_blend)
rec = HybridRecommender(cf_model=None, cb_model=None)
for n in [0, 5, 10, 25, 50, 100]:
    print(rec.explain_blend(n))

# Output:
# With 0 interactions: 0% collaborative filtering, 100% content-based
# With 5 interactions: 10% collaborative filtering, 90% content-based
# With 10 interactions: 20% collaborative filtering, 80% content-based
# With 25 interactions: 50% collaborative filtering, 50% content-based
# With 50 interactions: 100% collaborative filtering, 0% content-based
# With 100 interactions: 100% collaborative filtering, 0% content-based
Production Engineering Notes
How Airbnb Handles New Listing Cold Start
Airbnb's new listing cold start problem is particularly interesting because listings vary enormously in quality, and a bad early experience destroys a listing's long-term viability on the platform. A new host who gets few bookings in their first month may list the property on other platforms permanently.
Airbnb's approach (described in their 2018 KDD paper and subsequent engineering blog posts) involves several components:
Content feature embedding: new listings are embedded based on their property type, amenities, photos (using image features), location, price point, and host profile. This embedding is initialized in the same space as similar existing listings using a learned projection.
Similar listing warm start: the new listing's initial embedding is the average of its nearest neighbors in content feature space. These neighbors have rich interaction histories, so the new listing inherits a reasonable estimate of expected booking rates, typical guest demographics, and optimal search ranking positions.
Expedited data collection: Airbnb explicitly routes a small number of guests who match the predicted guest profile to new listings, giving those listings early interactions to bootstrap from. This is equivalent to a targeted exploration policy: show new listings to users most likely to provide useful signal.
Separate cold start metrics: Airbnb tracks "time to first booking" as a key metric for new listings. A listing that fails to get a booking within its first 30 days of listing is flagged and may receive additional ranking boosts or host support outreach.
How Duolingo Handles New Lesson Cold Start
Duolingo's cold start problem is unusual: every new lesson they release must be learned by their adaptive learning system, which personalizes which lesson to show each user based on their skill level and learning history.
Their approach leverages the curriculum structure: new lessons have known prerequisite relationships with existing lessons. A new lesson on "Spanish subjunctive" is known to be appropriate for users who have mastered "Spanish present tense." The initial difficulty estimate comes from the lesson's position in the curriculum and expert judgments from content creators, not from user performance data.
As users attempt the new lesson, the adaptive system rapidly updates its model of the lesson's true difficulty and the types of users for whom it is appropriate. After approximately 100 user attempts, the system has enough data to remove the content-based prior and rely on observed performance.
Measuring Cold Start Performance
The most important engineering discipline around cold start is tracking it as a separate metric. Aggregate recommendation quality metrics (mean NDCG@10 across all users, mean click-through rate) are dominated by experienced users, who are the most engaged and generate most of the interactions. Cold start failures are therefore invisible in aggregate metrics.
Track separately:
- NDCG@10 for users with fewer than 5 interactions (ultra-cold)
- NDCG@10 for users with 5–50 interactions (warming)
- 7-day retention rate by user cohort (new users who saw good first-session recommendations are more likely to return)
- New item coverage: what fraction of new items (added in the last 30 days) receive at least 100 impressions within 7 days of being listed?
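A minimal sketch of the stratified tracking listed above, using only the standard library. It assumes each logged session carries the user's interaction count and the graded relevance of the items shown, in ranked order. The ideal ranking is taken from the shown list itself, which is a common simplification. The bucket labels and function names are hypothetical.

```python
import math
from collections import defaultdict


def ndcg_at_k(relevances, k=10):
    """NDCG@k for one ranked list; relevances are in ranked order."""
    def dcg(rels):
        return sum(r / math.log2(pos + 2) for pos, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def bucket(n_interactions):
    """Assign a user to a cold start segment by interaction count."""
    if n_interactions < 5:
        return "ultra-cold (<5)"
    if n_interactions <= 50:
        return "warming (5-50)"
    return "established (>50)"


def ndcg_by_bucket(sessions, k=10):
    """sessions: iterable of (n_interactions, relevances) pairs."""
    sums, counts = defaultdict(float), defaultdict(int)
    for n_interactions, relevances in sessions:
        b = bucket(n_interactions)
        sums[b] += ndcg_at_k(relevances, k)
        counts[b] += 1
    return {b: sums[b] / counts[b] for b in sums}


def new_item_coverage(impressions_first_7_days, min_impressions=100):
    """Fraction of new items that reached min_impressions within 7 days.

    impressions_first_7_days: dict item_id -> impression count
    """
    if not impressions_first_7_days:
        return 0.0
    reached = sum(1 for n in impressions_first_7_days.values() if n >= min_impressions)
    return reached / len(impressions_first_7_days)
```

Reporting these per-segment numbers next to the aggregate makes a cold start regression visible even when the overall metric barely moves.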
A/B test cold start solutions separately from the main recommendation system. The treatment effect for cold start improvements is often much larger than for experienced user improvements, but it requires intentional stratification to see.
Common Mistakes
Defaulting all new users to "most popular" recommendations. This is the easiest implementation and the worst outcome. It creates a self-reinforcing popularity bias: popular items get shown to new users, new users click popular items (because they are often genuinely good), those clicks reinforce the items' popularity, making them even more dominant in the popular list. Meanwhile, genuinely good items that are newer or less well-known never surface. New users see the same experience regardless of their interests, reducing the probability of finding content that creates genuine engagement. The first session is the most important session - it determines whether a user returns. A generic popular-item experience is a wasted opportunity that directly harms retention.
Using a training dataset that mixes cold and hot users without stratification. If you train a recommendation model on a combined dataset of users with few interactions and users with many interactions, the model learns primarily from the experienced users (who dominate by interaction count). The cold user experience is then effectively determined by how the model generalizes from experienced users - which is usually poor. Always stratify your training data and evaluation metrics by user interaction count. Consider training separate models for cold, warm, and hot phases, or at minimum evaluating them separately.
Transitioning from content-based to collaborative filtering too abruptly. A hard switch from CB to CF at a fixed interaction threshold creates a jarring discontinuity in recommendation quality. If CF is not yet reliable at 50 interactions, the switch produces a sudden degradation. Use gradual blending with a smooth transition function rather than a hard cutoff. Tune the transition threshold empirically: plot recommendation quality vs. interaction count and find where CF quality reliably exceeds CB quality.
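As a minimal sketch of gradual blending, the function below computes a CF weight that rises smoothly with interaction count. The names `cf_weight` and `blend_scores`, the logistic-in-log-count shape, and the default `midpoint` of 20 are illustrative assumptions, not values from a production system; the midpoint is exactly the quantity the paragraph above suggests tuning empirically.

```python
def cf_weight(n_interactions: int, midpoint: float = 20.0, steepness: float = 1.0) -> float:
    """Weight given to collaborative filtering, in [0, 1).

    Equivalent to a logistic curve in log(interaction count):
    0 with no history, exactly 0.5 at `midpoint`, approaching 1 as history grows.
    `midpoint` and `steepness` are hypothetical knobs to tune on real data.
    """
    if n_interactions <= 0:
        return 0.0
    ratio = (n_interactions / midpoint) ** steepness
    return ratio / (1.0 + ratio)


def blend_scores(cb_score: float, cf_score: float, n_interactions: int) -> float:
    """Convex combination of content-based and CF scores for one item."""
    w = cf_weight(n_interactions)
    return (1.0 - w) * cb_score + w * cf_score


if __name__ == "__main__":
    for n in (0, 2, 20, 100, 1000):
        print(n, round(cf_weight(n), 3))
```

Because the weight is continuous in the interaction count, there is no single interaction at which a user's recommendations change character abruptly.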
Not giving new items enough exploration budget to accumulate data. A recommender system that only surfaces items it is already confident about will never learn about new items. If new items are shown only when they happen to be predicted as relevant by the content embedding (which is imprecise), they will remain in a low-data state indefinitely. Budget explicit exploration for new items - 10–20% of recommendation slots dedicated to items added in the last 30 days - and track new item coverage as a primary metric.
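One simple way to enforce an exploration budget is to reserve fixed positions in each recommendation list. The sketch below is illustrative only: `mix_in_new_items` is a hypothetical helper, and it takes new items in the order given, where a real system would more likely let a bandit choose which new items fill the reserved slots.

```python
from typing import List


def mix_in_new_items(ranked: List[str], new_items: List[str],
                     n_slots: int = 10, explore_frac: float = 0.2) -> List[str]:
    """Fill n_slots, reserving roughly explore_frac of them for new items.

    Reserved positions are spread evenly through the list rather than
    stacked at the bottom, where they would rarely be seen.
    """
    n_explore = min(len(new_items), round(n_slots * explore_frac))
    fresh = new_items[:n_explore]
    step = n_slots // n_explore if n_explore else 0
    explore_positions = {step * (i + 1) - 1 for i in range(n_explore)}

    fresh_iter = iter(fresh)
    # Skip any regular candidate that is already placed in a reserved slot.
    regular_iter = (item for item in ranked if item not in fresh)

    result: List[str] = []
    for pos in range(n_slots):
        source = fresh_iter if pos in explore_positions else regular_iter
        item = next(source, None)
        if item is not None:
            result.append(item)
    return result


if __name__ == "__main__":
    ranked = [f"old{i}" for i in range(20)]
    print(mix_in_new_items(ranked, ["new0", "new1", "new2"]))
```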
Track cold start separately from overall recommendation metrics. Aggregate NDCG@10 or click-through rate will always be dominated by your experienced users, who have better recommendations and more interactions. Cold start problems are invisible in these aggregates until they become bad enough to affect your overall new user retention numbers - by which time significant damage has been done. Build dashboards that specifically track recommendation quality for users with under 10, 10–50, and 50–200 interactions, and set separate improvement targets for each segment.
YouTube Resources
| Video | Channel | Description |
|---|---|---|
| Cold Start Problem in Recommender Systems | Machine Learning Mastery | Overview of cold start types and solution approaches |
| Multi-Armed Bandits | Mutual Information | UCB and Thompson Sampling explained with intuition and math |
| Meta-Learning for Recommendations | ICML | Few-shot recommendation with MAML-style meta-learning |
| Exploration vs Exploitation | David Silver RL | The explore-exploit tradeoff - foundations from reinforcement learning |
Interview Q&A
Q1: Describe all three types of cold start and your recommended solution for each.
A: There are three distinct cold start scenarios, each requiring a different approach.
New user cold start - the user has no interaction history. Solutions in priority order: (1) Onboarding survey: ask users to select 3–5 topics or rate a few example items. This gives immediate, explicit signal before any implicit behavior has accumulated. (2) Contextual fallback: use referral source, location, device type, and time of day as weak personalization signals when no explicit preference data is available. (3) Content-based recommendations: show items from the user's stated interest categories using content features, not collaborative filtering. (4) Exploration budget: reserve 20–30% of the new user's recommendations for diverse exploratory content to learn their preferences quickly. (5) Gradually blend in CF as interactions accumulate using a smooth transition function.
New item cold start - the item has no interaction history. Solutions: (1) Content embedding projection: compute a content embedding (text, image, metadata) and project it into the collaborative filtering embedding space using a learned projection MLP trained on existing items. (2) Similar item warm start: initialize the new item's CF embedding as the average of its nearest neighbors in content space - items with similar content tend to have similar interaction patterns. (3) Bandit exploration: use Thompson Sampling or UCB to actively surface new items to users, building up the interaction data needed for CF. Reserve 10–20% of recommendation slots for new items.
System cold start - no historical data exists. Solutions: (1) Editorial curation: manually select a diverse set of high-quality items to bootstrap the system. (2) Trending signals: use signals from outside the platform (social media trends, news, partner data) to identify items likely to be popular. (3) Transfer learning: if a related system or domain exists, transfer embeddings or preferences from that system.
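The similar-item warm start mentioned for new items can be sketched in a few lines of NumPy. This assumes you already have content embeddings and trained CF embeddings for existing items; the function name `warm_start_embedding` and the choice of cosine similarity with a plain (unweighted) mean are illustrative assumptions.

```python
import numpy as np


def warm_start_embedding(new_content: np.ndarray,
                         content_embs: np.ndarray,
                         cf_embs: np.ndarray,
                         k: int = 5) -> np.ndarray:
    """Initialize a new item's CF embedding from its k nearest content neighbors.

    new_content:  (d_content,) content embedding of the new item
    content_embs: (n_items, d_content) content embeddings of existing items
    cf_embs:      (n_items, d_cf) trained CF embeddings of the same items
    """
    # Cosine similarity between the new item and every existing item.
    norms = np.linalg.norm(content_embs, axis=1) * np.linalg.norm(new_content)
    sims = content_embs @ new_content / np.maximum(norms, 1e-12)
    nearest = np.argsort(-sims)[:k]
    # Average the neighbors' CF embeddings as the starting point.
    return cf_embs[nearest].mean(axis=0)
```

The result is only an initialization: once the item collects real interactions, its embedding should be updated by normal CF training.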
Q2: Compare UCB and Thompson Sampling for new item exploration. When would you prefer each?
A: Both algorithms solve the explore-exploit problem for new items - balancing the need to show items with uncertain quality (exploration) against the need to show items with high predicted quality (exploitation).
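UCB (Upper Confidence Bound) adds a deterministic bonus to each item's estimated engagement rate:
$$\text{UCB}_i = \hat{\mu}_i + c\sqrt{\frac{\ln t}{n_i}}$$
where $\hat{\mu}_i$ is item $i$'s observed engagement rate, $n_i$ is the number of times the item has been shown, $t$ is the total number of impressions across all items, and $c$ is a constant that scales the amount of exploration.
The bonus decreases as $n_i$ grows (more observations reduce uncertainty). UCB is deterministic given the current state, which makes it easier to debug and audit. The main drawback: the exploration bonus is not calibrated to the true uncertainty in a probabilistic sense - it is a heuristic upper bound.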
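Thompson Sampling maintains a probability distribution (Beta distribution for binary outcomes) over each item's true engagement rate and samples from it:
$$\tilde{\theta}_i \sim \text{Beta}(\alpha_i, \beta_i)$$
where $\alpha_i$ and $\beta_i$ count item $i$'s positive and negative outcomes (each plus a prior pseudo-count, commonly 1). The item with the highest sampled $\tilde{\theta}_i$ is shown.
Thompson Sampling tends to outperform UCB empirically across many domains. Its exploration is naturally calibrated - items with few observations have wide distributions and are often sampled high, driving exploration, while items with many observations have narrow distributions that concentrate near the true mean. It is also computationally simple: each decision needs only one Beta draw per item, with no confidence-bound calculation.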
When to prefer UCB: when you need deterministic, auditable decisions (regulatory or business contexts where randomness in recommendations is problematic); when you want tight theoretical regret guarantees (UCB's regret bounds are better understood theoretically).
When to prefer Thompson Sampling: in most practical settings. It tends to reach good exploration-exploitation balance faster, handles uncertainty more naturally, and is simpler to implement correctly.
Both algorithms require choosing the right granularity for the item representation. Exploring per-item is appropriate when items are distinct enough that each needs its own data. Exploring per-category or per-cluster reduces variance at the cost of assuming items within a cluster have similar quality - often a reasonable assumption for new items from a known category.
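To make the comparison concrete, here is a minimal per-item sketch of both scoring rules using only the standard library. The function names, the exploration constant `c`, and the uniform Beta(1, 1) prior are illustrative assumptions; the UCB bonus uses the common `c * sqrt(ln t / n)` form.

```python
import math
import random


def ucb_scores(clicks, impressions, c=1.0):
    """UCB-style score: observed rate plus a bonus that shrinks with impressions."""
    total = sum(impressions)
    scores = []
    for k, n in zip(clicks, impressions):
        if n == 0:
            scores.append(float("inf"))  # never-shown items are tried first
        else:
            scores.append(k / n + c * math.sqrt(math.log(total) / n))
    return scores


def thompson_scores(clicks, impressions, rng=random):
    """One posterior draw per item from Beta(1 + clicks, 1 + non-clicks)."""
    return [rng.betavariate(1 + k, 1 + (n - k))
            for k, n in zip(clicks, impressions)]


def pick(scores):
    """Index of the item with the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    clicks, impressions = [50, 1, 0], [1000, 10, 0]
    print("UCB picks item", pick(ucb_scores(clicks, impressions)))
    print("Thompson picks item", pick(thompson_scores(clicks, impressions)))
```

Given the same counts, `ucb_scores` always returns the same ranking, while `thompson_scores` varies from call to call; that difference is the auditability trade-off described above.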
Q3: How would Spotify handle the cold start problem for a new song just uploaded by a known artist?
A: A new song from a known artist is actually a fairly favorable cold start scenario because the artist context provides strong prior information.
Step 1 - Content features immediately: compute audio features (tempo, key, energy, acousticness, valence) using a model like those underlying Spotify's audio analysis API. Compute a text embedding of the song's lyrics (if available) and title. Use these to place the song in the same audio feature space as existing songs.
Step 2 - Leverage artist graph: the song immediately inherits signal from the artist's existing catalog. Users who follow the artist, who have played the artist's other songs repeatedly, who have the artist in playlists - all of these represent strong candidates for early exposure. Routing the new song's early plays to this existing audience gives you the highest-quality exploration signal: these users are most likely to engage, and their engagement (or lack thereof) is most informative about the song's true quality.
Step 3 - Related artist listeners: beyond the direct artist fanbase, Spotify's artist similarity graph identifies artists whose listeners have substantial overlap. These listeners are a secondary exploration audience - they are less certain to like the new song, but there are more of them, providing faster data accumulation.
Step 4 - Content-based playlist placement: Spotify's automated playlists (Discover Weekly, Daily Mixes, Release Radar) can place the new song based on audio feature similarity to existing playlist content. Release Radar in particular is designed specifically for this use case - it surfaces new releases from artists a user follows.
Step 5 - Blend into CF as data accumulates: after the song has received 500–1000 plays and has meaningful play/skip data, it can be treated as a normal item in the collaborative filtering system. The warm start embedding from audio features, combined with early interaction data, provides a strong initialization.
The key insight: for new items from known entities (artists, authors, brands), the cold start problem is substantially easier because you can inherit signal from the entity's history and direct exploration toward the most informative audience.
Q4: How would you measure cold start performance? What metrics matter most?
A: Cold start performance requires specialized metrics because standard aggregate recommendation metrics are dominated by experienced users and hide cold start failures.
Primary metrics - segment by interaction count:
- NDCG@10 for users with 0–4 interactions (cold phase)
- NDCG@10 for users with 5–50 interactions (warming phase)
- NDCG@10 for users with more than 50 interactions (hot phase, baseline)
If cold phase NDCG@10 is 0.35 and hot phase NDCG@10 is 0.82, you have a significant cold start gap to close.
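A segmented NDCG@10 report can be sketched as follows. The bucket boundaries mirror the segments listed above; the input format (one tuple of interaction count and ranked relevance labels per user) and the function names are assumptions made for the example.

```python
import math
from collections import defaultdict


def ndcg_at_k(relevances, k=10):
    """NDCG@k for one ranked list of graded (or 0/1) relevance labels."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def segment(n_interactions):
    """Map an interaction count to a lifecycle bucket."""
    if n_interactions < 5:
        return "cold"
    if n_interactions <= 50:
        return "warming"
    return "hot"


def ndcg_by_segment(rows, k=10):
    """rows: iterable of (n_interactions, ranked_relevances), one per user."""
    buckets = defaultdict(list)
    for n, rels in rows:
        buckets[segment(n)].append(ndcg_at_k(rels, k))
    return {name: sum(vals) / len(vals) for name, vals in buckets.items()}


if __name__ == "__main__":
    rows = [(0, [0, 1, 0]), (3, [1, 0, 0]), (20, [1, 1, 0]), (400, [1, 0, 1])]
    print(ndcg_by_segment(rows))
```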
Retention funnel metrics:
- Day-7 retention rate stratified by users' interaction count in session 1
- Day-30 retention rate for users who had at least 1 positive interaction in session 1 vs. 0 positive interactions
This is often the most important business metric: users who find content they like in their first session are far more likely to return. Measuring the correlation between first-session recommendation quality and long-term retention directly quantifies the business value of cold start improvements.
New item coverage metrics:
- Fraction of items added in the last 30 days that received at least 100 impressions within 7 days of listing
- Fraction of new items that received at least 1 positive interaction within 7 days
Low new item coverage means your system is failing to explore new inventory, which harms long-term catalog health.
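The impression-based coverage metric is straightforward to compute from an impression log. The sketch below assumes each item is a dict with a `listed_at` timestamp and a list of `impression_times`; that schema and the function name are hypothetical.

```python
from datetime import datetime, timedelta


def new_item_coverage(items, min_impressions=100, window_days=7):
    """Fraction of items with at least min_impressions within window_days of listing.

    items: list of dicts with keys 'listed_at' (datetime) and
           'impression_times' (list of datetimes). Hypothetical schema.
    """
    if not items:
        return 0.0
    covered = 0
    for item in items:
        deadline = item["listed_at"] + timedelta(days=window_days)
        early = sum(1 for t in item["impression_times"]
                    if item["listed_at"] <= t <= deadline)
        if early >= min_impressions:
            covered += 1
    return covered / len(items)


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1)
    fast = {"listed_at": t0, "impression_times": [t0 + timedelta(hours=1)] * 100}
    slow = {"listed_at": t0, "impression_times": [t0 + timedelta(days=10)] * 100}
    print(new_item_coverage([fast, slow]))
```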
Time-to-personalization:
- Median number of interactions needed before NDCG@10 for a user reaches 90% of the hot phase benchmark
A shorter time-to-personalization means your cold start solutions are effectively accelerating the learning process. Improvements to onboarding surveys, content-based fallbacks, and exploration budgets should all reduce this number.
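Time-to-personalization can be estimated from per-user quality curves. In this sketch, `ndcg_curve[i]` is assumed to hold a user's NDCG@10 after `i` interactions; both function names are illustrative.

```python
from statistics import median


def interactions_to_personalize(ndcg_curve, hot_benchmark, target_frac=0.9):
    """First interaction count at which NDCG reaches target_frac of the hot benchmark.

    Returns None if the user never reaches the threshold.
    """
    threshold = target_frac * hot_benchmark
    for i, score in enumerate(ndcg_curve):
        if score >= threshold:
            return i
    return None


def time_to_personalization(curves, hot_benchmark, target_frac=0.9):
    """Median over the users who reach the threshold.

    Users who never reach it are excluded here, which biases the median
    downward; report the excluded fraction alongside this number.
    """
    reached = [n for n in (interactions_to_personalize(c, hot_benchmark, target_frac)
                           for c in curves) if n is not None]
    return median(reached) if reached else None


if __name__ == "__main__":
    curves = [[0.30, 0.50, 0.75, 0.80], [0.40, 0.74], [0.10, 0.20]]
    print(time_to_personalization(curves, hot_benchmark=0.82))
```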
Q5: Design an onboarding flow for a new professional networking platform (similar to LinkedIn) that minimizes the cold start problem.
A: A professional network has different cold start characteristics than a consumer platform. Users arrive with a clear intent (find jobs, build connections, learn) and are often willing to provide more profile information than on entertainment platforms.
Step 1 - Mandatory basics (30 seconds): current role, industry, years of experience, location, and one career goal (find a job / grow network / learn skills / hire). These five data points immediately segment the user into a meaningful cluster - "junior software engineer in NYC looking for a job" is a cluster with known content preferences.
Step 2 - Skill tag selection (45 seconds): show 40–60 skill tags with checkboxes. The user selects all that apply. This produces an explicit, high-dimensional preference vector that directly feeds content-based recommendations (relevant articles, courses, people with these skills). Present tags grouped by category (programming languages, frameworks, soft skills, industries) to make selection fast.
Step 3 - One follow immediately: after skills, show 5 people the user might know based on skills and location (use second-degree connections from similar profiles, not random). Getting even one follow in the first session creates the first interaction that bootstraps CF.
Step 4 - First personalized feed: the home feed on first login should visibly reflect the stated skills and role. Show articles about their industry, posts from professionals in their role, and job listings matching their stated goal. If a user said they are a "junior engineer looking for a job," the first thing they see should be relevant job listings and career advice posts - not generic viral content.
Step 5 - Progressive onboarding: surface additional profile-completion prompts over the first 5 sessions rather than all at once. Each completed prompt (adding past employers, adding education, uploading a resume) provides additional signal and strengthens recommendations. Gamify this with visible progress indicators.
Measurement: track first-session follow rate (did the user follow at least one person?), first-session content engagement rate (did the user click at least one article or job listing?), and 7-day return rate. These are the leading indicators of onboarding success.
Advanced Techniques: Meta-Learning for Cold Start
The onboarding survey and content-based fallback approaches work reasonably well but share a fundamental limitation: they do not learn how to learn. Each new user starts fresh, and the system adapts to them using the same fixed content-based or demographic signals regardless of how much experience the system has accumulated across millions of previous cold starts.
Meta-learning changes this. The core idea: train a model not to make good recommendations for users with many interactions, but to make good recommendations for users with very few interactions by learning to adapt quickly.
MAML-Style Few-Shot Recommendation
Model-Agnostic Meta-Learning (MAML, Finn et al., 2017) provides the theoretical framework. Adapted for recommendation:
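The setup: you have a large population of users, each with many interactions. Simulate cold start by treating each user $u$ as a "task": hold out most of their interactions and treat the first $k$ (say, $k = 3$) as the "cold" support set $S_u$; the held-out remainder forms the query set $Q_u$. The objective is to recommend well on the user's remaining interactions given only the 3-interaction support set.
Inner loop (adaptation): for each task (user), take gradient steps from the shared parameters $\theta$ using only the support interactions (shown here for a single step with inner learning rate $\alpha$):
$$\theta_u' = \theta - \alpha \nabla_\theta \mathcal{L}_{S_u}(\theta)$$
Outer loop (meta-optimization): update the base model parameters $\theta$ to minimize the query set loss across all tasks after inner-loop adaptation, with outer learning rate $\beta$:
$$\theta \leftarrow \theta - \beta \nabla_\theta \sum_u \mathcal{L}_{Q_u}(\theta_u')$$
The outer loop trains $\theta$ to be a good initialization point - one from which the model can quickly adapt to new users with few gradient steps.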
In practice, full MAML is expensive: the outer gradient differentiates through the inner-loop gradient steps, which requires second-order derivatives. Approximations like Reptile (Nichol et al., 2018) or first-order MAML are more practical for recommendation-scale systems. Several production systems (Spotify, Meitu) have published results showing that meta-learning approaches meaningfully improve cold start quality compared to content-based fallbacks alone.
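The following is a toy sketch of Reptile, the first-order approximation named above, on fully synthetic data. Everything in it is an assumption made for illustration: users are linear preference vectors clustered around `SHARED_TASTE`, "interactions" are random item feature vectors with noiseless ratings, and the learning rates and task count are arbitrary. It is meant to show the inner/outer loop structure, not a production recommender.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, K_SUPPORT = 5, 3
# Synthetic population-level preference (hypothetical, for the demo only).
SHARED_TASTE = np.array([1.0, -2.0, 0.5, 1.5, -1.0])


def sample_user():
    """A synthetic user whose preference vector lies near the shared taste."""
    return SHARED_TASTE + 0.1 * rng.normal(size=DIM)


def sample_interactions(w_user, n):
    """n items with feature vectors x and ratings y = x . w_user."""
    x = rng.normal(size=(n, DIM))
    return x, x @ w_user


def adapt(theta, x, y, steps=3, lr=0.05):
    """Inner loop: a few gradient steps on mean squared error over the support set."""
    w = theta.copy()
    for _ in range(steps):
        w -= lr * 2.0 * x.T @ (x @ w - y) / len(y)
    return w


def reptile(n_tasks=2000, meta_lr=0.1):
    """Outer loop: nudge the shared initialization toward each adapted model."""
    theta = np.zeros(DIM)
    for _ in range(n_tasks):
        x_s, y_s = sample_interactions(sample_user(), K_SUPPORT)
        theta += meta_lr * (adapt(theta, x_s, y_s) - theta)
    return theta


def cold_start_error(theta, n_users=50):
    """Mean squared error on held-out interactions after K_SUPPORT-shot adaptation."""
    errors = []
    for _ in range(n_users):
        w_user = sample_user()
        x_s, y_s = sample_interactions(w_user, K_SUPPORT)
        x_q, y_q = sample_interactions(w_user, 50)
        w = adapt(theta, x_s, y_s)
        errors.append(np.mean((x_q @ w - y_q) ** 2))
    return float(np.mean(errors))


theta_meta = reptile()

if __name__ == "__main__":
    print("meta-learned init:", cold_start_error(theta_meta))
    print("zero init:        ", cold_start_error(np.zeros(DIM)))
```

In this toy setting the meta-learned initialization ends up close to the shared taste, so three support interactions are enough to adapt, while the zero initialization cannot recover a 5-dimensional preference from three examples.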
Context-Aware Initialization
A simpler alternative to full meta-learning is context-aware user initialization: given a new user's onboarding context (stated preferences, demographics, referral), retrieve the most similar historical user profiles from a learned user embedding space, and initialize the new user's representation as a weighted combination of those similar historical users.
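$$\mathbf{e}_{\text{new}} = \sum_{j \in \mathcal{N}} w_j \, \mathbf{e}_j$$
where $\mathcal{N}$ are the users whose onboarding context most closely matches the new user, $\mathbf{e}_j$ are their learned embeddings, and $w_j$ are similarity weights (typically normalized to sum to 1). This effectively says: "this new user is probably similar to these existing users, so start with their collaborative signal." As the new user accumulates interactions, their embedding drifts toward their true position in the latent space.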
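A minimal NumPy sketch of this initialization is shown below. It assumes onboarding context has already been encoded as a vector per user; the function name, the cosine similarity, and the softmax weighting with a `temperature` parameter are illustrative choices rather than a prescribed method.

```python
import numpy as np


def init_from_similar_users(new_context, contexts, user_embs, k=3, temperature=0.1):
    """Weighted average of the CF embeddings of the k most context-similar users.

    new_context: (d_ctx,) encoded onboarding context of the new user
    contexts:    (n_users, d_ctx) encoded contexts of historical users
    user_embs:   (n_users, d_cf) learned CF embeddings of those users
    Returns (embedding, neighbor_indices, weights).
    """
    a = new_context / np.linalg.norm(new_context)
    b = contexts / np.linalg.norm(contexts, axis=1, keepdims=True)
    sims = b @ a                      # cosine similarity to every historical user
    top = np.argsort(-sims)[:k]       # indices of the k most similar users
    logits = sims[top] / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()          # softmax weights, sum to 1
    return weights @ user_embs[top], top, weights
```

A lower `temperature` concentrates the weight on the single closest user; a higher one approaches a plain average of the neighbors.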
The Explore-Exploit Tradeoff Across the User Lifecycle
Cold start is not a binary problem that resolves after a fixed number of interactions. It is better understood as a spectrum of uncertainty that evolves as the user accumulates history.
At any point in a user's lifecycle, the recommendation system faces the classic explore-exploit tradeoff:
- Exploit: show items the system is confident the user will like, maximizing immediate engagement
- Explore: show items outside the system's confident predictions to learn more about the user's preferences, potentially sacrificing immediate engagement for faster long-term learning
For a brand new user, uncertainty is maximal. The optimal strategy is heavily exploratory - you have so little information that the marginal value of learning from a new interaction is very high. Show diverse content, even at the cost of some relevance.
For a user with 10,000 interactions, uncertainty is minimal. The system knows this user's preferences in detail. Heavy exploitation is appropriate - show highly relevant content, with only a small exploration budget maintained to track preference drift over time.
The transition between these extremes should be gradual and driven by measured uncertainty, not fixed thresholds. A principled uncertainty estimate comes from the variance of the recommendation model's predictions for this user: high variance means the model is uncertain and more exploration is warranted; low variance means the model is confident and more exploitation is appropriate.
In production, implement this by maintaining a per-user exploration rate that starts high and decays as a function of interaction count $n$, for example exponentially:
$$\epsilon(n) = \epsilon_{\min} + (\epsilon_{\max} - \epsilon_{\min}) \, e^{-\lambda n}$$
where $\epsilon_{\min}$ (say, 0.05) is the minimum exploration rate even for highly experienced users (to handle preference drift), $\epsilon_{\max}$ (say, 0.4) is the maximum exploration rate for brand new users, and $\lambda$ controls how quickly the rate decays.
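As a small sketch, a decaying exploration rate can be written as a single function. Exponential decay and the `decay` value of 0.02 are assumptions chosen for illustration; the 0.05 and 0.4 bounds are the example values from the text.

```python
import math


def exploration_rate(n_interactions: int,
                     eps_min: float = 0.05,
                     eps_max: float = 0.4,
                     decay: float = 0.02) -> float:
    """Exploration rate that starts at eps_max and decays toward eps_min.

    `decay` is a hypothetical knob: larger values shift users to
    exploitation after fewer interactions.
    """
    return eps_min + (eps_max - eps_min) * math.exp(-decay * n_interactions)


if __name__ == "__main__":
    for n in (0, 10, 50, 200, 10_000):
        print(n, round(exploration_rate(n), 3))
```

In a serving path, this value would typically be used as the probability that a given slot is filled from the exploratory pool rather than from the top of the ranked list.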
Industry Case Studies
Netflix: New Member Experience
Netflix's cold start engineering is among the most studied in the industry. When a new member signs up, Netflix shows a preference selection screen: "Rate a few titles to get better recommendations." Users rate titles they have seen (thumbs up / thumbs down). The minimum required is zero - users can skip entirely - but most rate 5–15 titles.
These explicit ratings immediately populate the user's rating history and provide enough signal for a weak collaborative filtering recommendation. Netflix does not rely on demographics or content features alone - they have found that even 5 explicit ratings from a user produce better predictions than any demographic proxy.
For users who skip the rating step entirely, Netflix falls back to a ranked list of popular titles within a genre predicted from the user's country, account language, and device type. This is a far weaker signal, and Netflix's internal data shows these users have significantly lower 30-day retention than users who rate even a few titles.
The lesson: even minimal explicit onboarding data (5 ratings) dramatically outperforms any implicit or demographic fallback. The value of getting even a tiny amount of explicit signal from a new user is enormous.
TikTok: No Explicit Onboarding
TikTok is notable for taking the opposite approach to onboarding: it asks for almost no explicit preferences and immediately starts the explore-exploit process using implicit signals.
A new TikTok user opens the app and sees a stream of videos. After each video, TikTok records watch time (did they watch the full video? Did they rewatch? Did they scroll past after 2 seconds?), likes, comments, and shares. Within 10–20 video interactions, TikTok's recommendation system has a surprisingly accurate model of the user's preferences.
This approach works for TikTok because the unit of consumption (a short video, typically 15–60 seconds) is fast. Collecting 20 implicit signals takes only a few minutes of browsing. The content diversity of the initial video stream is deliberately engineered to be high - showing content across many categories to learn the user's preferences quickly.
For platforms where each item consumes more time (long-form articles, 20-minute podcast episodes, full movies), this approach is less viable. You cannot show 20 diverse full movies to a new user in a single session to learn their preferences. TikTok's approach works specifically because of the short-form, low-commitment nature of its content.
Key Takeaways
Cold start is not an edge case - it is a permanent condition of the system. Every new user and every new item starts cold, so a significant fraction of your most important traffic is affected at all times.
There are three distinct cold start problems, each requiring a different solution: new users (onboarding + content-based fallback + gradual CF blending), new items (content embedding + warm start + bandit exploration), and system bootstrap (editorial curation + transfer learning).
Explicit onboarding signal, however minimal, beats demographic proxies. Even asking a user to pick 3 topics produces better cold start recommendations than inferring preferences from location, device type, or referral source. Make onboarding easy, fast, and immediately rewarding.
Bandit algorithms (UCB, Thompson Sampling) are the right tool for new item exploration. They provide a principled approach to balancing the need to learn (exploration) against the need to serve relevant content (exploitation), with natural uncertainty quantification built in.
The transition from content-based to collaborative filtering must be gradual and measurable. Use a smooth blending function tuned empirically, not a hard cutoff, and measure recommendation quality separately for each interaction-count segment.
Track cold start with its own metrics. Aggregate recommendation quality is always dominated by experienced users. Cold start failures are invisible in aggregate numbers until they become catastrophic. Build dashboards specifically for users with fewer than 10, 10–50, and 50–200 interactions, and set explicit improvement targets for each segment.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Embedding Space Explorer demo on the EngineersOfAI Playground - no code required.
:::
