
Interleaving Experiments

Why Search Teams Can Iterate 10x Faster

In 2009, the Netflix Prize concluded after nearly three years of competition. The winning team improved RMSE by 10.06%. Netflix later wrote in a blog post that the full Prize-winning ensemble was never deployed in production: the measured accuracy gains did not justify the engineering effort needed to ship it. The episode is a reminder that an offline metric only matters once it has been validated online, and when every online validation is a lengthy A/B test, iteration is extremely slow.

Around the same time, researchers at Cornell, Microsoft Research, and Yahoo were refining a technique called interleaving that would fundamentally change how search and recommendation systems are evaluated online. Rather than showing each user a result list from either model A or model B, interleaving shows each user a single list that blends items from both models. A click on an item is attributed back to the model that contributed it. The result: experiments reported to need 10–25x fewer users than A/B tests to reach the same statistical conclusions.

This sensitivity advantage means a search team can validate a ranking change with 50,000 users instead of 1 million (a 20x reduction). An experiment that would take 3 weeks with A/B testing can finish in about 2 days with interleaving. This is a large part of why search quality teams can run far more ranking experiments per month than teams that rely on A/B testing alone.



Why A/B Tests Are Insensitive for Ranking

The fundamental problem with A/B testing for ranking models: you are not measuring user preference between two orderings. You are measuring click rates on two different item sets.

Consider search result ranking. Model A ranks [item1, item2, item3, item4, item5]. Model B ranks [item2, item4, item1, item5, item3]. The items are the same, but in different order. In an A/B test:

  • Control users see model A's ranking and click on some items
  • Treatment users see model B's ranking and click on some items
  • You compare click rates

But click rates depend heavily on which items happen to be visible and on which users land in each group, not just on how well the list is ordered. If one model happens to surface a viral item at rank 1, its click rate spikes whether or not its overall ranking quality improved. The item effect can dominate the ranking effect.

Interleaving largely removes this confound. Both models rank the same item pool. The user sees one merged list. Clicks reveal a direct preference between the two orderings.


Team Draft Interleaving

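Team Draft Interleaving (TDI) is the most widely used method, introduced by Radlinski, Kurup, and Joachims (2008). It works like picking teams in a playground game. The steps below use a single coin flip per request; the original algorithm re-flips the coin in every round in which both models have claimed the same number of items.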

  1. Flip a coin to decide which model picks first (model A or B)
  2. The first model picks its top unselected item, adds it to the merged list, and "claims" that item
  3. The second model picks its top unselected item from remaining items
  4. Alternate until the merged list is full
  5. Track which model claimed each position

When a user clicks an item, the click is attributed to whichever model claimed that item. At the end of the experiment, compare total clicks attributed to model A vs model B.

import numpy as np
from dataclasses import dataclass
from typing import List, Tuple, Dict
from collections import defaultdict
from scipy import stats


@dataclass
class InterleaveResult:
    item_id: str
    position: int
    contributing_model: str  # "A" or "B"


def team_draft_interleave(
    ranking_a: List[str],
    ranking_b: List[str],
    k: int = 10
) -> Tuple[List[InterleaveResult], str]:
    """
    Team Draft Interleaving: produce merged ranking from two model rankings.

    Args:
        ranking_a: Ordered list of item IDs from model A (best first)
        ranking_b: Ordered list of item IDs from model B (best first)
        k: Number of results in merged list

    Returns:
        (merged_list, first_picker): merged results with attribution, and which model picked first
    """
    # Randomly determine which model picks first (balanced across requests)
    first_picker = str(np.random.choice(["A", "B"]))
    pickers = [first_picker, "B" if first_picker == "A" else "A"]

    rankings = {"A": ranking_a, "B": ranking_b}
    idx = {"A": 0, "B": 0}  # next candidate index in each model's ranking
    merged = []
    seen = set()

    while len(merged) < k:
        added = False
        for picker in pickers:
            if len(merged) >= k:
                break

            # The picker takes its highest-ranked item not yet in the merged list
            ranking = rankings[picker]
            while idx[picker] < len(ranking) and ranking[idx[picker]] in seen:
                idx[picker] += 1
            if idx[picker] < len(ranking):
                item = ranking[idx[picker]]
                merged.append(InterleaveResult(
                    item_id=item,
                    position=len(merged),
                    contributing_model=picker
                ))
                seen.add(item)
                idx[picker] += 1
                added = True

        if not added:
            # Both rankings are exhausted before reaching k items:
            # stop here instead of looping forever
            break

    return merged, first_picker


def simulate_user_click(
    merged_list: List[InterleaveResult],
    item_relevance: Dict[str, float],
    position_bias: List[float]
) -> List[str]:
    """
    Simulate user clicking on items in the merged list.

    Users click based on item relevance × position bias (examination model).
    position_bias: probability of examining position i (1.0 at position 0, decays)
    """
    clicks = []
    for result in merged_list:
        if result.position < len(position_bias):
            exam_prob = position_bias[result.position]
            relevance = item_relevance.get(result.item_id, 0.0)
            click_prob = exam_prob * relevance
            if np.random.random() < click_prob:
                clicks.append(result.item_id)
    return clicks


def analyze_interleaving(
    experiment_sessions: List[Dict],
) -> Dict:
    """
    Aggregate interleaving experiment results across sessions.

    experiment_sessions: list of {
        "merged_list": List[InterleaveResult],
        "clicks": List[str]  # item IDs clicked
    }
    """
    wins_a = 0  # sessions where A got more attributed clicks
    wins_b = 0  # sessions where B got more attributed clicks
    ties = 0

    total_clicks_a = 0
    total_clicks_b = 0
    total_sessions = len(experiment_sessions)

    for session in experiment_sessions:
        item_to_model = {r.item_id: r.contributing_model for r in session["merged_list"]}
        session_clicks_a = sum(1 for click in session["clicks"] if item_to_model.get(click) == "A")
        session_clicks_b = sum(1 for click in session["clicks"] if item_to_model.get(click) == "B")

        total_clicks_a += session_clicks_a
        total_clicks_b += session_clicks_b

        if session_clicks_a > session_clicks_b:
            wins_a += 1
        elif session_clicks_b > session_clicks_a:
            wins_b += 1
        else:
            ties += 1

    # Two-sided sign test: does the number of B-wins differ significantly
    # from the number of A-wins? (ties are excluded)
    decisive_sessions = wins_a + wins_b
    if decisive_sessions > 0:
        # Under H0 (no preference), P(B wins) = 0.5.
        # Doubling the smaller tail can exceed 1, so clamp the result.
        p_value = min(1.0, 2 * min(
            stats.binom.cdf(wins_b, decisive_sessions, 0.5),
            1 - stats.binom.cdf(wins_b - 1, decisive_sessions, 0.5)
        ))
    else:
        p_value = 1.0

    winner = "B" if wins_b > wins_a else ("A" if wins_a > wins_b else "Tie")

    return {
        "total_sessions": total_sessions,
        "wins_a": wins_a,
        "wins_b": wins_b,
        "ties": ties,
        "decisive_sessions": decisive_sessions,
        "clicks_a": total_clicks_a,
        "clicks_b": total_clicks_b,
        "click_ratio_b": total_clicks_b / max(total_clicks_a + total_clicks_b, 1),
        "p_value": p_value,
        "significant": p_value < 0.05,
        "winner": winner
    }


# ===== Full simulation: comparing two search ranking models =====
print("=== Interleaving Experiment Simulation ===\n")

# 20 items, each with a relevance score
items = [f"item_{i:02d}" for i in range(20)]

# True relevance: items 0-4 are highly relevant, rest have decreasing relevance
true_relevance = {item: max(0, 1.0 - 0.08 * i) for i, item in enumerate(items)}

# Model A: good ranking but not perfect - puts items in roughly correct order.
# reverse=True puts the highest noisy score first (rankings are best first).
model_a_ranking = sorted(
    items, key=lambda x: true_relevance[x] + np.random.normal(0, 0.15), reverse=True
)

# Model B: slightly better ranking - less noise
model_b_ranking = sorted(
    items, key=lambda x: true_relevance[x] + np.random.normal(0, 0.08), reverse=True
)

# Position bias: users examine top results more
position_bias = [1.0 / (1 + i * 0.5) for i in range(10)]

# Run experiment
n_sessions = 500
sessions = []

for _ in range(n_sessions):
    merged, first_picker = team_draft_interleave(model_a_ranking, model_b_ranking, k=10)
    clicks = simulate_user_click(merged, true_relevance, position_bias)
    sessions.append({"merged_list": merged, "clicks": clicks, "first_picker": first_picker})

results = analyze_interleaving(sessions)

print(f"Sessions: {results['total_sessions']}")
print(f"Model A wins: {results['wins_a']} ({results['wins_a']/results['total_sessions']:.1%})")
print(f"Model B wins: {results['wins_b']} ({results['wins_b']/results['total_sessions']:.1%})")
print(f"Ties: {results['ties']}")
print(f"Total attributed clicks - A: {results['clicks_a']}, B: {results['clicks_b']}")
print(f"Click ratio (B/total): {results['click_ratio_b']:.3f}")
print(f"P-value: {results['p_value']:.4f}")
print(f"Significant: {results['significant']}")
print(f"Winner: {results['winner']}")
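The `team_draft_interleave` function above flips one coin per request and then alternates strictly. The originally published algorithm instead lets whichever team is smaller pick next and flips a fresh coin whenever the teams are tied. A minimal sketch of that variant follows; the function name `team_draft_per_round` and the injectable `rng` parameter are assumptions made here for illustration and testability, not part of the original code.

```python
import random
from typing import List, Optional, Tuple


def team_draft_per_round(
    ranking_a: List[str],
    ranking_b: List[str],
    k: int = 10,
    rng: Optional[random.Random] = None,
) -> List[Tuple[str, str]]:
    """Return [(item_id, team)] using a fresh coin flip whenever teams are tied."""
    rng = rng or random.Random()
    rankings = {"A": ranking_a, "B": ranking_b}
    team_size = {"A": 0, "B": 0}
    merged: List[Tuple[str, str]] = []
    seen = set()

    def next_unseen(model: str) -> Optional[str]:
        # Highest-ranked item of this model that is not in the merged list yet
        return next((item for item in rankings[model] if item not in seen), None)

    while len(merged) < k:
        # The smaller team picks next; a coin flip breaks ties.
        if team_size["A"] != team_size["B"]:
            picker = "A" if team_size["A"] < team_size["B"] else "B"
        else:
            picker = rng.choice(["A", "B"])
        item = next_unseen(picker)
        if item is None:
            # This model has nothing left; let the other model pick instead.
            picker = "B" if picker == "A" else "A"
            item = next_unseen(picker)
            if item is None:
                break  # both rankings are exhausted
        merged.append((item, picker))
        seen.add(item)
        team_size[picker] += 1
    return merged


example = team_draft_per_round(
    ["a", "b", "c", "d"], ["d", "c", "b", "a"], k=4, rng=random.Random(0)
)
print(example)
```

Re-flipping each round means neither model systematically holds the odd or the even slots within a request, at the cost of slightly more bookkeeping.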

Why Interleaving Is More Sensitive

The sensitivity advantage comes from within-session comparison. In a standard A/B test:

  • A user in the control group sees model A's list and clicks 2 items
  • A different user in the treatment group sees model B's list and clicks 3 items
  • You are comparing behaviors across different users - who have different preferences, different contexts, different intent

This between-user comparison has high variance because user-level engagement differs dramatically. Some users click a lot, others click nothing. This noise swamps the signal from ranking quality.

In interleaving:

  • The same user sees items from both models in the same session
  • Their clicks directly reveal preference between the two orderings for their specific query and intent
  • User-level heterogeneity is cancelled out - each user is their own control

This within-user comparison eliminates the largest source of variance in ranking experiments, producing 10–25x sensitivity improvement in practice (Chapelle et al., 2012).
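The variance argument can be illustrated with a toy simulation. This is a sketch under loud assumptions, not a model of real traffic: per-user activity is drawn from a lognormal distribution, model B is assumed to be 10% better than model A, and the "interleaving" arm is reduced to a per-user B-minus-A difference. All names and constants below are hypothetical.

```python
import random
import statistics


def z_unpaired(xs, ys):
    """z-score for a difference in means between two independent groups."""
    se = (statistics.variance(xs) / len(xs) + statistics.variance(ys) / len(ys)) ** 0.5
    return (statistics.mean(ys) - statistics.mean(xs)) / se


def z_paired(diffs):
    """z-score for the mean of per-user differences."""
    se = (statistics.variance(diffs) / len(diffs)) ** 0.5
    return statistics.mean(diffs) / se


rng = random.Random(7)
n_users = 2000
quality_a, quality_b = 1.00, 1.10  # assumption: B is 10% better than A

# Per-user activity varies widely (heavy clickers vs. users who rarely click).
activity = [rng.lognormvariate(0.0, 1.0) for _ in range(n_users)]

# A/B test: each user sees only one model, so activity differences stay in the data.
control = [u * quality_a + rng.gauss(0, 0.5) for u in activity[: n_users // 2]]
treatment = [u * quality_b + rng.gauss(0, 0.5) for u in activity[n_users // 2:]]

# Interleaving-style: each user yields a B-minus-A difference, so the user's
# baseline activity cancels and only scales the size of the effect.
diffs = [u * (quality_b - quality_a) + rng.gauss(0, 0.5) for u in activity]

print(f"between-user z: {z_unpaired(control, treatment):.2f}")
print(f"within-user  z: {z_paired(diffs):.2f}")
```

With the same number of users, the within-user statistic comes out several times larger than the between-user one, which is the mechanism behind the sensitivity gain described above.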


Balanced Interleaving

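Team Draft Interleaving can still suffer from position bias within a single request: the first-picking model places its top item at position 1, where examination probability is highest. Randomizing the first picker cancels this only on average. Balanced Interleaving (Joachims, 2002) is the older method that Team Draft was designed to improve on; it merges the two lists so that any prefix of the merged list covers nearly equal-length prefixes of both rankings. The function below is a simplified strict-alternation sketch rather than Joachims's exact algorithm: as long as neither ranking runs out, each model fills the same number of positions (to within one), and clicks are still attributed to the model that placed the item.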

def balanced_interleave(
    ranking_a: List[str],
    ranking_b: List[str],
    k: int = 10
) -> Tuple[List[InterleaveResult], str]:
    """
    Strict-alternation interleaving (a simplified stand-in for Balanced
    Interleaving): each model fills alternating slots, and the model that
    takes the first slot is chosen at random per request.
    """
    first_picker = str(np.random.choice(["A", "B"]))
    second_picker = "B" if first_picker == "A" else "A"

    merged = []
    seen = set()
    idx = {"A": 0, "B": 0}
    rankings = {"A": ranking_a, "B": ranking_b}

    for slot in range(k):
        # Alternate picks: first_picker takes even slots, second takes odd
        picker = first_picker if slot % 2 == 0 else second_picker

        # Find next item from this picker not already in merged list
        ranking = rankings[picker]
        while idx[picker] < len(ranking) and ranking[idx[picker]] in seen:
            idx[picker] += 1

        if idx[picker] < len(ranking):
            item = ranking[idx[picker]]
            merged.append(InterleaveResult(
                item_id=item,
                position=len(merged),  # actual index, even if a slot was skipped
                contributing_model=picker
            ))
            seen.add(item)
            idx[picker] += 1

    return merged, first_picker
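The `balanced_interleave` function above alternates picks, which is closer to Team Draft than to the procedure Joachims described. For comparison, here is a minimal sketch of the pointer-based construction and its click-scoring rule. The function names and the explicit `a_first` argument (standing in for the per-request coin flip) are assumptions chosen here to keep the example deterministic.

```python
from typing import List, Set


def balanced_interleave_joachims(
    ranking_a: List[str], ranking_b: List[str], k: int, a_first: bool
) -> List[str]:
    """Merge two rankings with rank pointers; a_first is the per-request coin flip."""
    merged: List[str] = []
    ka = kb = 0
    while ka < len(ranking_a) and kb < len(ranking_b) and len(merged) < k:
        if ka < kb or (ka == kb and a_first):
            if ranking_a[ka] not in merged:
                merged.append(ranking_a[ka])
            ka += 1
        else:
            if ranking_b[kb] not in merged:
                merged.append(ranking_b[kb])
            kb += 1
    return merged


def balanced_preference(
    ranking_a: List[str], ranking_b: List[str], merged: List[str], clicked: Set[str]
) -> str:
    """Return "A", "B", or "tie" for one session's clicks."""
    clicked_positions = [i for i, item in enumerate(merged) if item in clicked]
    if not clicked_positions:
        return "tie"
    lowest = merged[max(clicked_positions)]  # lowest-ranked clicked item
    # Smallest prefix length of either ranking that already contains that item
    depth = min(r.index(lowest) + 1 for r in (ranking_a, ranking_b) if lowest in r)
    hits_a = len(clicked & set(ranking_a[:depth]))
    hits_b = len(clicked & set(ranking_b[:depth]))
    if hits_a == hits_b:
        return "tie"
    return "A" if hits_a > hits_b else "B"


rank_a = ["a", "b", "c", "d"]
rank_b = ["b", "c", "d", "a"]
merged = balanced_interleave_joachims(rank_a, rank_b, k=4, a_first=True)
print(merged)
for item in merged:
    print(item, balanced_preference(rank_a, rank_b, merged, {item}))
```

In this example a single click on `b`, `c`, or `d` is scored for B and only a click on `a` is scored for A, so a user clicking at random would favor B three times out of four. This kind of bias in the balanced scoring rule is what Team Draft's explicit per-item attribution avoids.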

Multileaving: Comparing Multiple Models Simultaneously

For teams running many model variants, multileaving generalizes interleaving to 3+ models simultaneously:

def team_draft_multileave(
    rankings: Dict[str, List[str]],  # model_name -> ranked items (best first)
    k: int = 10
) -> List[Tuple[str, str]]:  # (item_id, contributing_model)
    """
    Team Draft Multileaving: extend interleaving to multiple models.
    Each model alternately picks items, claiming the items it contributes.
    """
    model_names = list(rankings.keys())

    # Shuffle order of models for this session
    pick_order = model_names.copy()
    np.random.shuffle(pick_order)

    merged = []
    seen = set()
    idx = {m: 0 for m in model_names}

    while len(merged) < k:
        added = False
        for picker in pick_order:
            if len(merged) >= k:
                break
            ranking = rankings[picker]
            while idx[picker] < len(ranking) and ranking[idx[picker]] in seen:
                idx[picker] += 1
            if idx[picker] < len(ranking):
                item = ranking[idx[picker]]
                merged.append((item, picker))
                seen.add(item)
                idx[picker] += 1
                added = True
        if not added:
            # Every ranking is exhausted: stop instead of looping forever
            break

    return merged


# Example: comparing 4 search model variants simultaneously
print("=== Multileaving: 4-Model Comparison ===\n")

items = [f"doc_{i:02d}" for i in range(30)]
true_relevance = {item: 1.0 / (1 + i * 0.1) for i, item in enumerate(items)}

# 4 models with different noise levels (simulating different quality)
noise_levels = {"baseline": 0.25, "model_v2": 0.18, "model_v3": 0.12, "model_v4": 0.08}

n_sessions = 1000
model_clicks = defaultdict(int)

position_bias = [1.0 / (1 + i * 0.6) for i in range(10)]

for _ in range(n_sessions):
    # Regenerate rankings each session (simulate query-level ranking).
    # reverse=True puts the highest noisy score first (rankings are best first).
    session_models = {
        name: sorted(
            items,
            key=lambda x, sigma=sigma: true_relevance[x] + np.random.normal(0, sigma),
            reverse=True,
        )
        for name, sigma in noise_levels.items()
    }

    merged = team_draft_multileave(session_models, k=10)

    for i, (item, model) in enumerate(merged):
        if i < len(position_bias) and np.random.random() < position_bias[i] * true_relevance[item]:
            model_clicks[model] += 1

print(f"{'Model':>12} | {'Clicks':>8} | {'Click Share':>12} | {'Rank'}")
print("-" * 45)
total_clicks = sum(model_clicks.values())
for rank, (model, clicks) in enumerate(sorted(model_clicks.items(), key=lambda x: -x[1]), 1):
    share = clicks / total_clicks
    print(f"{model:>12} | {clicks:>8} | {share:>12.1%} | {rank}")
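Total click share is a coarse summary. One possible refinement, sketched below, is to count per-session wins for every pair of models so that each pair can be checked with the same sign-test logic used for two-model interleaving. The input format (one dict of attributed click counts per session) and the function name are assumptions; the simulation above only keeps running totals.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def pairwise_session_wins(
    sessions: List[Dict[str, int]], models: List[str]
) -> Dict[Tuple[str, str], Tuple[int, int]]:
    """Map each model pair (x, y) to (sessions x won, sessions y won)."""
    wins: Counter = Counter()
    for clicks in sessions:
        for x, y in combinations(models, 2):
            cx, cy = clicks.get(x, 0), clicks.get(y, 0)
            if cx > cy:
                wins[(x, y)] += 1
            elif cy > cx:
                wins[(y, x)] += 1
            # equal counts are ties and are not recorded
    return {(x, y): (wins[(x, y)], wins[(y, x)]) for x, y in combinations(models, 2)}


# Hypothetical attributed-click counts for three sessions
example_sessions = [
    {"baseline": 0, "model_v2": 1, "model_v4": 2},
    {"baseline": 1, "model_v4": 1},
    {"model_v2": 1},
]
for pair, (x_wins, y_wins) in pairwise_session_wins(
    example_sessions, ["baseline", "model_v2", "model_v4"]
).items():
    print(pair, x_wins, y_wins)
```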

When Interleaving Falls Short

Interleaving is powerful but not universal. It has real limitations:

Only works for ranking/recommendation: Interleaving requires that both models rank the same item pool. It does not work for models that retrieve different item pools, for models that generate new content (LLMs), or for non-ranking decisions (pricing, fraud, targeting).

Measures engagement, not long-term quality: Interleaving captures which model's items get clicked in this session. It does not measure whether clicking those items led to satisfaction, return visits, or conversion. A model that surfaces clickbait will win interleaving experiments while destroying long-term engagement.

Susceptible to position bias despite mitigation: Even with balanced interleaving, model A's top item being at position 1 vs position 2 (depending on who picked first) creates position bias. This must be accounted for in analysis.

Cannot measure safety or diversity: A model that shows maximally clickable but homogeneous items wins interleaving. This cannot detect filter bubbles, representation failures, or content safety issues.


Production Engineering Notes

Log everything for offline analysis: Store the full merged list, item-to-model mapping, click positions, and session metadata. You will need this for debugging, for computing sensitivity comparisons, and for training future models.
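As a rough illustration of what one logged record could contain, here is a sketch of a serializable log entry. The class name, field names, and JSON layout are assumptions for illustration, not a prescribed schema.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class InterleavingLogRecord:
    session_id: str
    query: str
    first_picker: str                 # "A" or "B"
    merged_items: List[str]           # item IDs in displayed order
    item_to_model: Dict[str, str]     # item ID -> contributing model
    clicked_positions: List[int] = field(default_factory=list)
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        """Serialize to one JSON line for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)


record = InterleavingLogRecord(
    session_id="s-001",
    query="running shoes",
    first_picker="B",
    merged_items=["doc_2", "doc_1", "doc_3"],
    item_to_model={"doc_2": "B", "doc_1": "A", "doc_3": "B"},
    clicked_positions=[0],
)
print(record.to_json())
```

Storing the displayed order and the attribution map together means click attribution can be recomputed later without re-running either model.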

Ensure determinism in ranking inputs: If model A and model B query the same item database but at slightly different times, they may see different item availability. Snapshot the candidate pool at request time and pass the same snapshot to both models.

Handle model latency asymmetry: If model A takes 20ms and model B takes 200ms, the interleaved request takes 200ms, degrading user experience. Call both models asynchronously with a request-level timeout. If the slower candidate model times out, fall back to the live model's ranking and exclude that request from the experiment.
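One way to sketch the timeout-and-fallback pattern is with `asyncio`. The ranker interface (an async callable from query to ranked item IDs), the helper names, and the delays below are assumptions for illustration only.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Tuple

Ranker = Callable[[str], Awaitable[List[str]]]


async def rank_with_timeout(
    live_ranker: Ranker, candidate_ranker: Ranker, query: str, timeout_s: float
) -> Tuple[List[str], Optional[List[str]]]:
    """Return (live_ranking, candidate_ranking), with None if the candidate timed out."""
    # Start both calls so they run concurrently.
    live_task = asyncio.create_task(live_ranker(query))
    candidate_task = asyncio.create_task(candidate_ranker(query))
    try:
        # wait_for cancels the candidate task if the timeout expires.
        candidate_ranking = await asyncio.wait_for(candidate_task, timeout=timeout_s)
    except asyncio.TimeoutError:
        candidate_ranking = None
    live_ranking = await live_task
    return live_ranking, candidate_ranking


def make_fake_ranker(delay_s: float, ranking: List[str]) -> Ranker:
    """Hypothetical stand-in for a model call with a fixed latency."""
    async def ranker(query: str) -> List[str]:
        await asyncio.sleep(delay_s)
        return ranking
    return ranker


live = make_fake_ranker(0.01, ["d1", "d2"])
slow_candidate = make_fake_ranker(0.50, ["d2", "d1"])
print(asyncio.run(rank_with_timeout(live, slow_candidate, "query", timeout_s=0.05)))
```

When the second element is `None`, the caller would serve the live ranking as-is and skip interleaving for that request.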

Balance first-picker assignment: Record which model picks first for each session and verify it is balanced 50/50 over time. Imbalance creates systematic position bias that inflates the first-picker's click attribution.
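A simple balance check is an exact two-sided binomial test on the count of sessions in which model A picked first. The sketch below uses only the standard library; the function name and the example counts are hypothetical, and for very large session counts a normal approximation would be cheaper than summing exact terms.

```python
from math import comb


def two_sided_binomial_p(successes: int, n: int) -> float:
    """Exact two-sided p-value for H0: P(success) = 0.5."""
    if n == 0:
        return 1.0
    # Under p = 0.5 the distribution is symmetric, so double the smaller tail.
    tail = min(successes, n - successes)
    p_tail = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, 2 * p_tail)


# Hypothetical log counts: sessions in which model A picked first
for a_first, total in [(505, 1000), (540, 1000)]:
    p = two_sided_binomial_p(a_first, total)
    status = "looks balanced" if p >= 0.05 else "investigate assignment"
    print(f"A first in {a_first}/{total} sessions: p = {p:.4f} ({status})")
```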


Common Mistakes

:::danger Using Interleaving for Non-Ranking Models
Interleaving only works when both models rank the same set of candidates. If model A retrieves candidates A1–A10 and model B retrieves candidates B1–B10 with partial overlap, interleaving confounds item quality with model quality. The model with better candidate retrieval will win interleaving regardless of ranking quality. Reserve interleaving for pure ranking comparisons where the candidate pool is fixed.
:::

:::warning Treating Interleaving Wins as Full Causal Evidence
Interleaving tells you which model users prefer in the moment of the search. It does not tell you which model produces better long-term outcomes. Use interleaving for rapid iteration (does this ranking change improve user preference?), but follow up winning interleaving results with A/B tests measuring downstream business metrics (purchase completion, subscription retention) before fully committing to the new model.
:::

:::warning Ignoring the Sign Test in Favor of Click Ratio
The click ratio (B clicks / total clicks) is a continuous metric that seems intuitive but ignores within-session variance. The sign test (does B win more sessions than A, excluding ties?) is more statistically robust because it is not affected by outlier sessions where one model got many more attributed clicks due to favorable position luck. Use the sign test as your primary statistical test for interleaving experiments.
:::
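A small made-up example shows how the two metrics can disagree. The session data below is hypothetical: model A wins seven sessions with one click each, while model B wins three, one of which is an outlier session with twelve attributed clicks.

```python
from typing import List, Tuple


def click_ratio_b(sessions: List[Tuple[int, int]]) -> float:
    """Share of all attributed clicks that went to model B."""
    total_a = sum(a for a, _ in sessions)
    total_b = sum(b for _, b in sessions)
    return total_b / max(total_a + total_b, 1)


def session_wins(sessions: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Count sessions won by A and by B (ties are ignored)."""
    wins_a = sum(1 for a, b in sessions if a > b)
    wins_b = sum(1 for a, b in sessions if b > a)
    return wins_a, wins_b


# Each tuple is (clicks attributed to A, clicks attributed to B) for one session
sessions = [(1, 0)] * 7 + [(0, 1)] * 2 + [(0, 12)]

print(f"click ratio for B: {click_ratio_b(sessions):.3f}")  # 14 of 21 clicks -> 0.667
print(f"session wins (A, B): {session_wins(sessions)}")      # (7, 3)
```

The click ratio points to B because of a single heavy session, while the per-session win counts that feed the sign test point to A.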


Interview Q&A

Q: What is interleaving and why is it more sensitive than A/B testing for ranking models?

A: Interleaving merges rankings from two models into a single result list and attributes clicks back to the contributing model. When a user clicks item X, and item X was placed in the list by model B, that click is credited to model B. At the end of the experiment, you compare attributed clicks: did model A or model B earn more user preference signals? The sensitivity advantage comes from within-session comparison. In a standard A/B test, you compare different users who saw different lists, and user heterogeneity creates enormous variance. In interleaving, the same user sees items from both models in the same session, so each user's clicks directly reveal their preference between the two rankings. This removes the between-user variance that dominates A/B tests for ranking, which is why interleaving is reported to need 10–25x fewer users to reach the same statistical power.

Q: Describe how Team Draft Interleaving works step by step.

A: Team Draft Interleaving works like picking teams in a sports draft. Step 1: flip a coin to determine which model (A or B) picks first - this is random per session and balanced across sessions. Step 2: the first model selects its highest-ranked item not yet in the merged list and "claims" it. Step 3: the second model selects its highest-ranked item not yet claimed and claims it. Step 4: alternate until the merged list has k results. Each item in the merged list is labeled with which model claimed it. When users click, each click is attributed to the claiming model. At experiment end, you compare total attributed clicks using a sign test: in what fraction of sessions did model B earn more attributed clicks than model A, excluding tied sessions?

Q: What are the limitations of interleaving?

A: Several important limitations. First, it only works for ranking - both models must rank the same candidate pool. It cannot compare retrieval systems, generative models, or non-ranking decisions. Second, it measures short-term engagement preference, not long-term quality. A model that ranks clickbait first wins interleaving while harming user satisfaction. Third, it cannot measure secondary effects: safety violations, diversity failures, filter bubbles, or content quality. Fourth, position bias persists despite mitigation - within any single request, the first-picking model's top item lands at position 1 unless both models agree on it. Fifth, sessions with zero clicks, or with clicks split evenly between the two models, produce no signal. Despite these limitations, interleaving is extremely valuable for rapid ranking iteration - use it to quickly filter model variants, then validate winners with A/B tests measuring business outcomes.

Q: When would you use multileaving instead of interleaving?

A: Multileaving extends interleaving to simultaneously compare 3 or more model variants in a single experiment. You would use it when you have many candidate model variants to screen - for example, a hyperparameter search that produced 8 ranking model variants. Standard approach: run pairwise interleaving experiments, which means 8 × 7 / 2 = 28 experiments to compare every pair, or 7 if each variant is only compared against a shared baseline. Multileaving approach: run all 8 variants simultaneously, attributing clicks to contributing models, and rank all 8 by attributed click share in one experiment. This replaces many sequential experiments with one, although each model contributes fewer items per list, so each session carries less signal per model. The tradeoff: multileaving is harder to analyze (click attribution is noisier with more models competing for positions) and requires more sophisticated statistical analysis. Use multileaving for rapid screening of many variants, then confirm the winner with a focused pairwise interleaving or A/B test.

© 2026 EngineersOfAI. All rights reserved.