Feedback Loops and Data Flywheels
The Recommendation System That Made Itself Worse
In 2017, a major music streaming service launched a new recommendation engine. The model was good - offline metrics were the best they had ever seen. In the first month of production, user satisfaction scores went up. Streams per user increased. The team celebrated.
By month three, something strange was happening. The "Discover Weekly" playlist - their flagship personalized playlist - was getting narrower and narrower. Not in track count, but in diversity. Users who liked hip-hop were getting only hip-hop. Users who listened to classical were getting only classical. The long-tail artists - the independent musicians who had once been the platform's differentiator - were disappearing from recommendations entirely. A user who had liked jazz and electronic music in alternating weeks was now being served only jazz, because that week they happened to stream more jazz, and the model had narrowed its representation of that user accordingly.
The model was not broken. It was working exactly as designed. It optimized for streams per session, so it learned to recommend what each user was currently listening to most. More streams → stronger signal → stronger recommendation → more streams. The feedback loop was complete, and it was self-reinforcing. Popular content got recommended more, got streamed more, generated stronger training signal, got recommended even more. Niche content that users would have enjoyed if they had been exposed to it never received the exposure needed to generate the training signal that would have led to its recommendation.
This is the fundamental pathology of ML systems that learn from their own outputs: the system's past decisions determine what data the system learns from, which shapes the system's future decisions. Without deliberate intervention, this feedback loop produces systems that are increasingly narrow, increasingly confident, and increasingly wrong about the world outside their own recommendation bubble.
This lesson is about understanding this loop mathematically, detecting it in production, and breaking it with counterfactual methods and deliberate exploration.
Why This Exists
The Exposure Bias Problem
Standard supervised learning assumes that training data is independently and identically distributed (i.i.d.) - drawn randomly from the true distribution of user preferences. Recommendation systems violate this assumption fundamentally: training data consists of interactions with items that the system chose to show. The system has never observed interactions with items it chose not to show.
This creates exposure bias: the model learns from a biased sample of user-item interactions, skewed toward items the current (or previous) system preferred. The model then uses this biased knowledge to make future recommendations, which generates more biased training data, which further skews the model's knowledge.
Mathematically, the observed click probability differs from the true click probability because observation is conditioned on exposure:
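$$P(\text{click}_{u,i} = 1) = p_{u,i} \cdot P(\text{relevant}_{u,i} = 1), \qquad p_{u,i} = p(o = 1 \mid u, i)$$

where $p_{u,i}$ is the propensity - the probability that item $i$ was shown to user $u$ under the current recommendation policy. Items with low propensity (rarely shown) contribute little to training even if users would enjoy them.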
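To make the effect of exposure on observed clicks concrete, here is a minimal simulation sketch. The two items, their relevance values, and their propensities are made-up numbers chosen for illustration; the only point is that the raw click rate understates the appeal of a rarely shown item, and that dividing by a known propensity recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values for two items (not taken from any real system).
true_relevance = np.array([0.30, 0.30])  # users like both items equally
propensity = np.array([0.90, 0.10])      # but the second item is rarely shown

n_requests = 200_000
# exposed[t, i] is True if item i was shown on request t.
exposed = rng.random((n_requests, 2)) < propensity
# A click requires both exposure and relevance.
clicked = exposed & (rng.random((n_requests, 2)) < true_relevance)

# Naive estimate: clicks per request, ignoring whether the item was shown.
naive_rate = clicked.mean(axis=0)
# Propensity-corrected estimate: divide by the probability of exposure.
corrected_rate = naive_rate / propensity

print("naive:    ", naive_rate.round(3))      # close to propensity * relevance
print("corrected:", corrected_rate.round(3))  # close to the true relevance
```

In a real system the propensities are not known exactly and must be logged or estimated, which is where most of the practical difficulty lies.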
Types of Feedback Loops
Popularity bias: The system recommends popular items. Popular items get more exposure. More exposure generates more interactions. More interactions strengthen the training signal. The popular gets more popular; the niche gets less visible. This is the Matthew effect in recommendation systems.
Filter bubble: The system learns a narrow representation of each user based on what it has shown them. It recommends items that confirm this narrow representation. Users never encounter content outside their historical pattern. Their "interest profile" never expands.
Confirmation bias in labels: The system shows item A, the user clicks. This is recorded as a positive signal. But did the user click because they genuinely preferred A, or because A was the most prominent item shown? Position bias - the tendency to click items shown at the top - corrupts the training signal.
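The popularity loop can be reproduced in a few lines. The sketch below is a toy model under stated assumptions: every item has the same true appeal, and a hypothetical policy shows items with probability proportional to their past clicks. The function names and parameter values are illustrative only. Even though no item is actually better than another, the policy's exposure distribution becomes steadily more concentrated.

```python
import numpy as np

def gini(x: np.ndarray) -> float:
    """Gini coefficient of a non-negative vector (0 = perfectly equal)."""
    x = np.sort(x.astype(float))
    n, total = x.size, x.sum()
    if n == 0 or total == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return float((2 * np.sum(ranks * x) - (n + 1) * total) / (n * total))

def simulate_popularity_loop(n_items=200, slate_size=10, n_rounds=2000, seed=0):
    rng = np.random.default_rng(seed)
    appeal = 0.2                # identical true click probability for every item
    clicks = np.ones(n_items)   # flat starting history
    gini_history = []
    for _ in range(n_rounds):
        # Policy: exposure probability proportional to historical clicks.
        show_prob = clicks / clicks.sum()
        gini_history.append(gini(show_prob))
        slate = rng.choice(n_items, size=slate_size, replace=False, p=show_prob)
        # Only shown items can collect clicks, which feeds the next round.
        clicks[slate] += rng.random(slate_size) < appeal
    return np.array(gini_history)

history = simulate_popularity_loop()
print(f"Gini at start: {history[0]:.3f}, at end: {history[-1]:.3f}")
```

Because all items are equally appealing by construction, any concentration that appears is produced by the loop itself rather than by user preference.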
Historical Context
The feedback loop problem in recommendation systems was described academically in the 2000s (Fleder and Hosanagar, "Blockbuster Culture's Next Rise or Fall," Management Science, 2009), but became an active engineering problem with the rise of deep learning-based recommendation in 2014-2016.
Propensity scoring as a solution was imported from causal inference (Rosenbaum and Rubin, 1983) into recommendation systems by Schnabel et al. in the 2016 paper "Recommendations as Treatments: Debiasing Learning and Evaluation." Importance-weighted counterfactual learning had earlier been popularized for computational advertising by Bottou et al. (2013), and Inverse Propensity Weighting (IPW) was applied to production recommendation by the Spotify, Netflix, and Google teams in the 2017-2019 period.
The exploration-exploitation framing came from the bandit literature: Thompson sampling and UCB (Upper Confidence Bound) algorithms from the 1933 Thompson paper and the 2002 Auer et al. paper were adapted for recommendation by Microsoft, Google, and Yahoo in the 2010-2014 period.
Core Concepts
Detecting Feedback Loops
Before you can break a feedback loop, you need to confirm it exists and measure its severity.
Popularity concentration metric: Compute the Gini coefficient of item exposure distribution over time. If the Gini coefficient is increasing, the system is concentrating exposure on fewer items - a signature of the popularity feedback loop.
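$$G = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} |x_i - x_j|}{2n \sum_{i=1}^{n} x_i}$$

where $x_i$ is the number of times item $i$ was recommended and $n$ is the number of items. A Gini of 0 means perfectly equal distribution; a Gini near 1 (its maximum is $(n-1)/n$) means all recommendations go to one item.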
Long-tail coverage decay: Track what fraction of the catalog receives at least one recommendation per week. In a healthy system, this stays stable or grows. In a feedback-loop-driven system, this fraction shrinks over time as the recommendation distribution narrows.
Catalog diversity index (CDI): The fraction of the catalog's items that were recommended to at least 1% of users. Decreasing CDI indicates increasing filter bubble severity.
```python
from collections import Counter
from typing import Dict, List

import numpy as np
from scipy.stats import spearmanr


def compute_gini_coefficient(item_exposures: Dict[str, int]) -> float:
    """
    Compute Gini coefficient of item exposure distribution.
    Values near 1 indicate highly concentrated (potentially biased) recommendations.
    Include never-recommended items with a count of 0, or concentration is understated.
    """
    counts = np.sort(np.array(list(item_exposures.values()), dtype=float))
    n = len(counts)
    total = counts.sum()
    if n == 0 or total == 0:
        return 0.0  # nothing was exposed, so there is no concentration to measure
    ranks = np.arange(1, n + 1)  # 1-based ranks of the ascending-sorted counts
    return float((2 * np.sum(ranks * counts) - (n + 1) * total) / (n * total))


def compute_catalog_coverage(
    recommendations: List[List[str]],  # list of per-user recommendation lists
    catalog_size: int,
    min_exposure_fraction: float = 0.01,  # item must reach this fraction of users
) -> float:
    """
    Fraction of catalog items that reached at least min_exposure_fraction of users.
    """
    n_users = len(recommendations)
    # Count each item at most once per user so repeats in one list do not inflate reach.
    exposure_counts = Counter(
        item for rec_list in recommendations for item in set(rec_list)
    )
    min_exposure = n_users * min_exposure_fraction
    items_above_threshold = sum(
        1 for count in exposure_counts.values() if count >= min_exposure
    )
    return items_above_threshold / catalog_size


def detect_popularity_bias(
    item_exposures: Dict[str, int],
    item_popularity: Dict[str, int],  # pre-existing popularity (before system)
) -> float:
    """
    Compute Spearman correlation between item exposure and item popularity.
    High correlation (> 0.8) indicates strong popularity bias.
    """
    # Compare only items present in both dictionaries.
    items = list(set(item_exposures) & set(item_popularity))
    exposures = [item_exposures[i] for i in items]
    popularities = [item_popularity[i] for i in items]
    corr, _ = spearmanr(exposures, popularities)
    return corr
```
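The long-tail coverage decay metric described earlier (the fraction of the catalog that receives at least one recommendation per week) is not covered by the functions above. A minimal sketch follows; the function name `weekly_catalog_coverage`, the week labels, and the toy data are all hypothetical.

```python
from typing import Dict, List, Set

def weekly_catalog_coverage(
    weekly_recs: Dict[str, List[List[str]]],  # {week label: per-user recommendation lists}
    catalog: Set[str],
) -> Dict[str, float]:
    """Fraction of the catalog recommended at least once in each week."""
    if not catalog:
        raise ValueError("catalog must not be empty")
    coverage = {}
    for week, rec_lists in weekly_recs.items():
        shown = {item for rec_list in rec_lists for item in rec_list}
        # Intersect with the catalog so retired or unknown items are ignored.
        coverage[week] = len(shown & catalog) / len(catalog)
    return coverage

# Toy data: the same three users over three weeks, with narrowing recommendations.
catalog = {"a", "b", "c", "d", "e"}
weekly_recs = {
    "week_1": [["a", "b"], ["c", "d"], ["e", "a"]],
    "week_2": [["a", "b"], ["a", "c"], ["b", "a"]],
    "week_3": [["a", "b"], ["a", "b"], ["a", "b"]],
}
coverage = weekly_catalog_coverage(weekly_recs, catalog)
print(coverage)  # {'week_1': 1.0, 'week_2': 0.6, 'week_3': 0.4}
```

A steadily falling series like this one is the pattern to alert on; what counts as a meaningful drop depends on catalog size and traffic.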
Inverse Propensity Weighting (IPW)
IPW corrects for exposure bias by weighting each training example by the inverse probability that it was shown. Rare-exposure examples receive high weights; common-exposure examples receive low weights.
The propensity-weighted loss replaces the standard training loss:
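$$\mathcal{L}_{\text{IPW}} = \frac{1}{|\mathcal{U}|\,|\mathcal{I}|} \sum_{(u,i)\,:\,o_{u,i}=1} \frac{\ell(y_{u,i}, \hat{y}_{u,i})}{p_{u,i}}$$

where $p_{u,i} = p(o = 1 \mid u, i)$ is the propensity - the probability item $i$ was shown to user $u$ under the logging policy, $\ell$ is the per-example loss, and the sum runs over the user-item pairs that were actually shown. The division by propensity up-weights rare observations and down-weights common ones, producing an unbiased estimate of the loss over all user-item pairs, shown or not, provided the propensities are accurate and nonzero.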
Estimating propensity: The propensity score must be estimated from the logging policy. If the logging system is deterministic (always shows the same items to the same users), IPW correction is impossible for the items it never shows: their propensity is exactly 0, they never appear in the logs, and we cannot divide by 0. This is why every production recommendation system should include some randomization.
Common propensity estimators:
- Position propensity model: propensity is a function of position in the recommendation list. For example, items shown at position 1 might have propensity ~0.9 (almost always seen), while items at position 10 might have propensity ~0.3.
- Policy propensity: if you have a logged softmax policy (e.g., items sampled proportionally to their score), propensity is the softmax probability.
- Causal forest: non-parametric propensity estimation using covariates.
```python
from typing import Dict

import torch
import torch.nn as nn


class IPWLoss(nn.Module):
    """
    Inverse Propensity Weighted binary cross-entropy loss.
    Corrects for exposure bias in logged feedback data.
    """

    def __init__(self, clip_min: float = 0.01, clip_max: float = 10.0):
        """
        clip_min: minimum propensity (avoid division by near-zero)
        clip_max: maximum importance weight (clip extreme weights for variance reduction)
        """
        super().__init__()
        self.clip_min = clip_min
        self.clip_max = clip_max

    def forward(
        self,
        logits: torch.Tensor,        # (batch,) model predictions
        labels: torch.Tensor,        # (batch,) binary labels
        propensities: torch.Tensor,  # (batch,) propensity scores p(o=1|u,i)
    ) -> torch.Tensor:
        # Clamp propensities to avoid extreme weights
        propensities_clamped = propensities.clamp(min=self.clip_min)
        # Importance weights
        weights = (1.0 / propensities_clamped).clamp(max=self.clip_max)
        # Normalize weights to reduce variance (this trades a little bias for stability)
        weights = weights / weights.mean()
        # Weighted binary cross-entropy
        bce = nn.functional.binary_cross_entropy_with_logits(
            logits, labels.float(), reduction="none"
        )
        return (weights * bce).mean()


def estimate_position_propensity(
    position_ctr: Dict[int, float],  # {position: observed CTR}
    reference_position: int = 0,
) -> Dict[int, float]:
    """
    Estimate position propensity from observed CTR at each position.
    Assumes CTR at position p = relevance × propensity(p), with the same average
    relevance at every position (e.g. CTRs measured under randomized placement).
    Propensity is estimated relative to position 0 (reference).
    Returns propensity scores for each position.
    """
    ref_ctr = position_ctr[reference_position]
    if ref_ctr <= 0:
        raise ValueError("reference position must have a positive CTR")
    return {pos: ctr / ref_ctr for pos, ctr in position_ctr.items()}
```
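The "policy propensity" option from the list of estimators is the simplest to get right, because the serving code already knows the probability it sampled with. The sketch below assumes a single recommendation slot filled by sampling from a softmax over model scores; `softmax` and `sample_with_propensity` are illustrative helper names, not part of any library. For multi-item slates the probability of a whole list is more involved and is not covered here.

```python
import numpy as np

def softmax(scores, temperature: float = 1.0) -> np.ndarray:
    """Numerically stable softmax over item scores."""
    z = np.asarray(scores, dtype=float) / temperature
    z = z - z.max()  # shift so the largest exponent is 0
    e = np.exp(z)
    return e / e.sum()

def sample_with_propensity(scores, temperature: float, rng: np.random.Generator):
    """Sample one item and return (item index, probability it was chosen)."""
    probs = softmax(scores, temperature)
    item = int(rng.choice(len(probs), p=probs))
    # The sampling probability is exactly the propensity to log with the event.
    return item, float(probs[item])

rng = np.random.default_rng(0)
scores = np.array([2.0, 1.0, 0.0])
item, propensity = sample_with_propensity(scores, temperature=1.0, rng=rng)
print(f"showed item {item} with logged propensity {propensity:.3f}")
```

Logging the propensity at serving time avoids having to re-estimate it later from data, which is both harder and noisier.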
Exploration Strategies
To collect unbiased data, you need to show users items they would not normally see under the greedy recommendation policy. This is exploration. The challenge is balancing exploration (gathering information about under-exposed items) with exploitation (recommending what you know users like).
Epsilon-greedy: With probability $\epsilon$, show a random item. With probability $1 - \epsilon$, show the greedy recommendation. Simple but inefficient - random items are often completely irrelevant.
Thompson Sampling: Maintain a distribution over item scores. Sample from this distribution to select items. Items with high uncertainty have wide distributions and are sampled into recommendation slots more often than their expected score would suggest. This is Bayesian exploration.
UCB (Upper Confidence Bound): Score items by $\hat{\mu}_i + \sqrt{2 \ln N / n_i}$ (the UCB1 form), where $\hat{\mu}_i$ is the estimated reward, $n_i$ is the number of times item $i$ has been shown, and $N$ is the total number of impressions. Items with few impressions receive a bonus that promotes exploration.
Boltzmann exploration: Sample items proportionally to $\exp(s_i / \tau)$, where $s_i$ is the model's score and $\tau$ is the temperature. High temperature → more uniform sampling → more exploration. The sampling probabilities are differentiable in the scores, so this can be integrated into the recommendation policy.
```python
from typing import List, Optional

import numpy as np


class ThompsonSamplingRecommender:
    """
    Thompson Sampling for recommendation with Beta distribution priors.
    Models click probability for each item as Beta(alpha, beta).
    """

    def __init__(self, n_items: int, alpha_prior: float = 1.0, beta_prior: float = 1.0):
        self.alpha = np.full(n_items, alpha_prior)  # successes (clicks) + prior
        self.beta = np.full(n_items, beta_prior)    # failures (no-clicks) + prior

    def recommend(self, n: int = 10, exclude: Optional[List[int]] = None) -> List[int]:
        """
        Sample from each item's Beta distribution and return top-n items.
        Items with fewer observations have wider distributions → more exploration.
        """
        sampled_probs = np.random.beta(self.alpha, self.beta)
        if exclude:
            sampled_probs[exclude] = -1  # below any valid probability, so never selected first
        return np.argsort(sampled_probs)[::-1][:n].tolist()

    def update(self, item_id: int, clicked: bool):
        """Update Beta distribution parameters based on observed outcome."""
        if clicked:
            self.alpha[item_id] += 1
        else:
            self.beta[item_id] += 1

    def get_uncertainty(self, item_id: int) -> float:
        """Variance of Beta distribution - measure of uncertainty."""
        a, b = self.alpha[item_id], self.beta[item_id]
        return (a * b) / ((a + b) ** 2 * (a + b + 1))


class BoltzmannExploration:
    """Temperature-controlled exploration via softmax sampling."""

    def __init__(self, temperature: float = 0.5):
        self.temperature = temperature

    def sample_recommendations(
        self,
        scores: np.ndarray,
        n: int = 10,
    ) -> List[int]:
        """Sample n distinct items with probability proportional to exp(score / temperature)."""
        scaled = np.asarray(scores, dtype=float) / self.temperature
        # Subtract the max before exponentiating so large scores cannot overflow.
        probs = np.exp(scaled - scaled.max())
        probs = probs / probs.sum()
        return np.random.choice(len(probs), size=n, replace=False, p=probs).tolist()
```
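The other two strategies described in this section, epsilon-greedy and UCB, can be sketched just as briefly. This is a minimal illustration for a context-free setting with one recommendation slot; the class name `UCB1Recommender`, the helper `epsilon_greedy_pick`, and the simulated click rates are assumptions made for the example.

```python
import numpy as np

def epsilon_greedy_pick(scores: np.ndarray, epsilon: float, rng: np.random.Generator) -> int:
    """With probability epsilon pick a uniformly random item, else the top-scored one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(scores)))
    return int(np.argmax(scores))

class UCB1Recommender:
    """UCB1: estimated click rate plus a bonus that shrinks as impressions grow."""

    def __init__(self, n_items: int):
        self.clicks = np.zeros(n_items)
        self.impressions = np.zeros(n_items)

    def scores(self) -> np.ndarray:
        # Items never shown get an infinite score so each is tried at least once.
        scores = np.full(self.clicks.shape, np.inf)
        seen = self.impressions > 0
        total = max(self.impressions.sum(), 1.0)
        mean = self.clicks[seen] / self.impressions[seen]
        bonus = np.sqrt(2.0 * np.log(total) / self.impressions[seen])
        scores[seen] = mean + bonus
        return scores

    def recommend(self) -> int:
        return int(np.argmax(self.scores()))

    def update(self, item: int, clicked: bool) -> None:
        self.impressions[item] += 1
        self.clicks[item] += float(clicked)

# Simulated environment with made-up click rates; item 1 is the best.
rng = np.random.default_rng(0)
true_ctr = np.array([0.1, 0.5, 0.2])
ucb = UCB1Recommender(len(true_ctr))
for _ in range(5000):
    item = ucb.recommend()
    ucb.update(item, rng.random() < true_ctr[item])
print("impressions per item:", ucb.impressions)
```

Unlike epsilon-greedy, the UCB bonus directs exploration toward items with few impressions rather than spreading it uniformly, so weaker items keep being sampled occasionally while most traffic goes to the best one.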
Counterfactual Evaluation
Standard A/B testing measures the effect of the current system. Counterfactual evaluation asks: "How would a different policy have performed on the same users?" This allows offline policy evaluation without deploying the new policy and exposing users to potentially worse recommendations.
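Inverse Propensity Scoring (IPS) estimator for offline policy evaluation:

$$\hat{V}_{\text{IPS}}(\pi) = \frac{1}{n} \sum_{k=1}^{n} \frac{\pi(a_k \mid x_k)}{\pi_0(a_k \mid x_k)}\, r_k$$

This estimates the value (e.g., expected clicks) of the new policy $\pi$ using data collected under the logging policy $\pi_0$. Here $x_k$ is the context (user) of logged event $k$, $a_k$ is the item shown, and $r_k$ is the observed reward.
Doubly Robust estimator (combines IPS with a direct model for lower variance):

$$\hat{V}_{\text{DR}}(\pi) = \frac{1}{n} \sum_{k=1}^{n} \left[ \hat{V}_{\text{DM}}(x_k) + \frac{\pi(a_k \mid x_k)}{\pi_0(a_k \mid x_k)} \big( r_k - \hat{r}(x_k, a_k) \big) \right]$$

where $\hat{V}_{\text{DM}}(x_k) = \sum_a \pi(a \mid x_k)\, \hat{r}(x_k, a)$ is the direct model estimate and $\hat{r}$ is a reward prediction model. This estimator is consistent as long as at least one of the two models, the propensity model or the reward model, is correctly specified.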
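A minimal NumPy sketch of both estimators, in the simplest possible setting: three items, no user context, a uniform logging policy, and a target policy that favors the best item. All numbers, and the function names `ips_estimate` and `doubly_robust_estimate`, are assumptions for illustration. The reward model is deliberately crude (it predicts 0.3 for every item) to show that the doubly robust estimate still lands near the true value when the propensities are exact.

```python
import numpy as np

def ips_estimate(rewards, target_probs, logging_probs) -> float:
    """Average reward reweighted by target / logging probability of the logged action."""
    weights = target_probs / logging_probs
    return float(np.mean(weights * rewards))

def doubly_robust_estimate(rewards, target_probs, logging_probs,
                           predicted_logged, direct_value) -> float:
    """Direct-model value plus an importance-weighted correction of its residuals."""
    weights = target_probs / logging_probs
    return float(np.mean(direct_value + weights * (rewards - predicted_logged)))

rng = np.random.default_rng(0)
n = 100_000
true_reward = np.array([0.1, 0.2, 0.6])         # made-up click probabilities
logging_policy = np.array([1 / 3, 1 / 3, 1 / 3])
target_policy = np.array([0.1, 0.1, 0.8])

# Logged data: actions drawn from the logging policy, rewards from the environment.
actions = rng.choice(3, size=n, p=logging_policy)
rewards = (rng.random(n) < true_reward[actions]).astype(float)

# Deliberately misspecified reward model: predicts 0.3 for every item.
reward_model = np.array([0.3, 0.3, 0.3])
direct_value = np.full(n, target_policy @ reward_model)

v_ips = ips_estimate(rewards, target_policy[actions], logging_policy[actions])
v_dr = doubly_robust_estimate(rewards, target_policy[actions], logging_policy[actions],
                              reward_model[actions], direct_value)
true_value = float(target_policy @ true_reward)
print(f"true={true_value:.3f}  IPS={v_ips:.3f}  DR={v_dr:.3f}")
```

The direct model alone would report 0.3 here, so the correction term is doing all of the work. With a better reward model the correction shrinks and the variance of the estimate drops.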
Production Engineering Notes
The Data Flywheel
A healthy data flywheel is a positive feedback loop:
Better model → Better recommendations → More user engagement → More data → Better model
The unhealthy version is the same loop but with "better" replaced by "more popular" and "more engagement" replaced by "more feedback on popular content." The difference is whether the feedback loop is driven by genuine user preference or by the system's prior recommendation decisions.
To maintain a healthy flywheel:
- Include randomized exploration in production (even at 1-5% of traffic)
- Apply propensity weighting to all training data
- Monitor long-tail coverage monthly
- Track Gini coefficient of recommendation distribution
- Run periodic "diversity injection" campaigns - deliberately recommend diverse content and measure user response
Logging Requirements
Counterfactual evaluation requires accurate propensity logging. Every recommendation event must be logged with:
- User ID
- Item IDs shown
- Positions of each item
- The policy's score for each item
- The probability of showing each item (the propensity)
- User response (click, no-click, watch time, etc.)
If your system does not log propensities, you cannot perform counterfactual evaluation and cannot apply IPW correction. Build propensity logging into the recommendation serving infrastructure from day one - retrofitting it is painful.
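One way to make these requirements hard to forget is to give the logged event a fixed schema that refuses to serialize without propensities. The sketch below is an assumption about how such a record could look; the class name `RecommendationEvent` and its field names are hypothetical and not taken from any particular serving stack.

```python
import json
from dataclasses import asdict, dataclass
from typing import List

@dataclass
class RecommendationEvent:
    user_id: str
    item_ids: List[str]         # items shown, in display order
    positions: List[int]        # display position of each item
    policy_scores: List[float]  # the policy's score for each item
    propensities: List[float]   # probability each item was shown
    responses: List[str]        # e.g. "click" or "no_click" per item

    def validate(self) -> None:
        n = len(self.item_ids)
        per_item = [self.positions, self.policy_scores, self.propensities, self.responses]
        if any(len(values) != n for values in per_item):
            raise ValueError("every per-item field needs one entry per item shown")
        # A logged item was shown, so its propensity must be strictly positive.
        if any(not (0.0 < p <= 1.0) for p in self.propensities):
            raise ValueError("propensities must lie in (0, 1]")

    def to_json(self) -> str:
        self.validate()
        return json.dumps(asdict(self))

event = RecommendationEvent(
    user_id="u42",
    item_ids=["item_a", "item_b"],
    positions=[0, 1],
    policy_scores=[2.1, 0.7],
    propensities=[0.8, 0.2],
    responses=["click", "no_click"],
)
print(event.to_json())
```

Validating at write time means a serving bug that drops propensities fails loudly instead of silently producing logs that cannot be used for IPW or counterfactual evaluation.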
Common Mistakes
Mistake: Assuming that offline metrics predict online performance in a feedback-loop system.
In a system with strong feedback loops, offline evaluation uses data collected by the current policy. A new policy that would break the feedback loop (by exploring more, recommending diverse content) will look worse on offline metrics because it diverges from the historical data distribution. This is the classic case where offline evaluation misleads. Always supplement offline evaluation with counterfactual IPS estimates and always validate with online A/B or interleaving experiments.
Mistake: Setting exploration rate to zero after initial deployment.
Teams often set a relatively high $\epsilon$ during initial deployment for data collection, then reduce it to a very small value or zero once the model is "good enough." This is the exact moment the feedback loop begins to strengthen. Maintain a minimum exploration rate (1-5%) permanently. The cost in short-term engagement metrics is small; the long-term benefit in data diversity and model coverage is large.
Mistake: Using position-naive click logs as training signal without position bias correction.
Items shown at position 1 can have 3-5x higher click-through rates than items shown at position 5 at equal relevance. If you train on raw click data without accounting for position, your model learns that "position 1 items are relevant" - and recommends items it would have shown at position 1 anyway. Always apply position bias correction (propensity weighting by position, or pair-wise training that compares items shown at the same position).
Tip: Use interleaving experiments to measure recommendation diversity at low cost.
Interleaving (mixing items from two policies in the same recommendation list) lets you compare two policies on the same user session simultaneously, with much smaller required sample sizes than standard A/B testing. Use interleaving experiments specifically to measure whether a de-biasing intervention (higher exploration, IPW training) produces meaningfully more diverse recommendations without hurting engagement. For ranking comparisons, interleaving is often reported to need 10-100x fewer samples than A/B testing to reach the same statistical confidence.
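The tip does not name a specific interleaving method. One common choice is team-draft interleaving, sketched below under that assumption: the two policies take turns picking their best remaining item, with a coin flip deciding who picks first whenever they are tied, and each click is credited to the policy that contributed the clicked item. The function names and the toy rankings are illustrative.

```python
import random
from typing import Dict, List, Tuple

def team_draft_interleave(
    ranking_a: List[str], ranking_b: List[str], length: int, rng: random.Random
) -> Tuple[List[str], Dict[str, str]]:
    """Return the interleaved list and a map from item to contributing team."""
    rankings = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    interleaved: List[str] = []
    team_of: Dict[str, str] = {}
    while len(interleaved) < length:
        # The team with fewer picks goes next; a coin flip breaks ties.
        if picks["A"] != picks["B"]:
            team = "A" if picks["A"] < picks["B"] else "B"
        else:
            team = rng.choice(["A", "B"])
        candidate = next((x for x in rankings[team] if x not in team_of), None)
        if candidate is None:
            # This team has nothing left to offer; fall back to the other one.
            team = "B" if team == "A" else "A"
            candidate = next((x for x in rankings[team] if x not in team_of), None)
            if candidate is None:
                break  # both rankings are exhausted
        interleaved.append(candidate)
        team_of[candidate] = team
        picks[team] += 1
    return interleaved, team_of

def credit_clicks(clicked: List[str], team_of: Dict[str, str]) -> Dict[str, int]:
    """Count clicks per team; the team with more credit wins the session."""
    credit = {"A": 0, "B": 0}
    for item in clicked:
        if item in team_of:
            credit[team_of[item]] += 1
    return credit

rng = random.Random(0)
interleaved, team_of = team_draft_interleave(
    ["a", "b", "c", "d"], ["c", "e", "a", "f"], length=4, rng=rng
)
credit = credit_clicks(["c"], team_of)
print(interleaved, credit)
```

Aggregating per-session wins across many sessions gives the preference signal. For a diversity question, the same credit assignment can be applied to clicks on long-tail items only.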
Interview Q&A
Q: What is a feedback loop in ML systems and what makes it harmful?
A: A feedback loop occurs when a system's outputs become part of its future training inputs. In recommendation: the system recommends items → users interact with recommended items → interactions become training data → model learns to recommend those items more strongly. This is harmful because it creates a self-reinforcing bias: the system's prior decisions, rather than true user preferences, determine what future data is collected and what the model learns. The system becomes increasingly confident about an increasingly narrow view of user preferences, amplifying whatever biases existed in the initial deployment. Popular items get more popular not because users prefer them to alternatives, but because they receive more exposure and therefore generate more training signal.
Q: Explain inverse propensity weighting and how it corrects for exposure bias.
A: IPW corrects for the fact that different items have different probabilities of being shown (different propensities). Standard training gives equal weight to all observed interactions, which overcounts interactions with frequently-shown items and undercounts interactions with rarely-shown items. IPW corrects this by weighting each training example by the inverse of its propensity: a rare interaction (propensity 0.1) receives weight 10; a common interaction (propensity 0.9) receives weight 1.1. This makes the weighted training distribution approximate the unbiased distribution of true user preferences. Mathematically, the IPW estimator is an unbiased estimator of the policy value under the true user preference distribution, given accurate propensity estimates. The practical requirement: every serving decision must log the probability that the item was shown, which requires propensity-aware serving infrastructure.
Q: How would you measure whether a recommendation system has a filter bubble problem?
A: Three metrics. First, intra-user diversity over time: for each user, compute the semantic diversity (average pairwise distance between item embeddings) of items they interact with in month 1 vs month 6. A shrinking diversity index signals filter bubble development. Second, long-tail coverage: what fraction of the catalog receives at least one recommendation per week? A declining coverage indicates the system is concentrating on popular items. Third, counterfactual user preference estimation: periodically show users items outside their recommendation bubble (using exploration) and measure their response. If users interact positively with out-of-bubble items at rates comparable to in-bubble items, the filter bubble is suppressing genuine diverse preferences.
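The first metric in this answer, average pairwise distance between item embeddings, can be sketched as follows. Cosine distance is an assumption here, since the answer does not fix a distance function, and the random embeddings stand in for a real item-embedding table.

```python
import numpy as np

def intra_list_diversity(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine distance between the rows of `embeddings`."""
    n = embeddings.shape[0]
    if n < 2:
        return 0.0  # a single item has no pairs to compare
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    similarity = unit @ unit.T
    upper = np.triu_indices(n, k=1)  # each unordered pair counted once
    return float(np.mean(1.0 - similarity[upper]))

# Hypothetical data: month 1 interactions are spread out, month 6 are clustered.
rng = np.random.default_rng(0)
month_1 = rng.normal(size=(20, 16))
center = rng.normal(size=16)
month_6 = center + 0.1 * rng.normal(size=(20, 16))

print(f"month 1 diversity: {intra_list_diversity(month_1):.3f}")
print(f"month 6 diversity: {intra_list_diversity(month_6):.3f}")
```

Computed per user and tracked over time, a consistent drop in this value across many users is the signal the answer describes.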
Q: A recommendation system's diversity metrics have been declining for six months. What do you do?
A: Diagnose first. Is the diversity decline concentrated in specific user segments (new users? power users?) or catalog segments (certain categories)? What changed six months ago - was there a model update, a business rule change, a change in the recommendation surface? Then address with a combination of interventions. Short-term: increase exploration rate to 10-15% temporarily to inject diversity into recommendations and training data. Apply IPW correction to existing training data to reduce the weight of overexposed items. Medium-term: retrain the model with diversity-aware objectives - add a diversity regularization term to the ranking loss. Add a re-ranking stage that enforces catalog coverage constraints. Long-term: redesign the training data pipeline with permanent propensity weighting, minimum exploration floor at 3-5%, and regular diversity audits that trigger alerts if Gini coefficient exceeds a threshold.
Q: What is the difference between Thompson Sampling and epsilon-greedy for exploration in recommendation?
A: Epsilon-greedy selects a uniformly random item with probability $\epsilon$, ignoring all information about the item. This is highly inefficient - a random item is usually irrelevant, producing poor user experience and noisy training signal. Thompson Sampling samples from each item's posterior distribution over its quality. Items with high uncertainty (few observations) have wide posterior distributions, making them more likely to sample a high value and win the recommendation slot. Thompson Sampling is substantially more efficient than epsilon-greedy because exploration is concentrated on items that are genuinely uncertain rather than uniformly random. In practice, Thompson Sampling requires maintaining posterior distributions (or approximations) for all items, which is computationally expensive at catalog sizes of millions of items. Common approximations: Laplace approximation, ensemble uncertainty, or using a simpler count-based upper confidence bound.
