Multi-Armed Bandits
The Cost of Equal Splits
An e-commerce platform was running A/B tests on product recommendation algorithms. Each test ran for 14 days with a 50/50 traffic split. The logic was sound: equal splits maximize statistical power per unit of traffic.
The problem became apparent when the team added up the numbers. They had run 24 experiments in the past year. Of those, 18 had been won convincingly by the treatment - the new model outperformed the baseline by 2–8%. But for every one of those 18 winning experiments, 50% of traffic had received the inferior model for the entire 14-day duration. That was 7 days × 50% of traffic × 18 experiments = 63 experiment-days worth of users receiving demonstrably worse recommendations.
For a platform with 2 million daily users and average revenue per user of 220M in foregone revenue annually - revenue that would have been captured if the platform had adaptively shifted traffic toward the winning model as evidence accumulated.
This is the exploration-exploitation tradeoff. A/B tests are pure exploration - they learn at the expense of not exploiting what you already know. Multi-armed bandits balance both simultaneously.
:::tip 🎮 Interactive Playground Visualize this concept: Try the Multi-Armed Bandit Explorer demo on the EngineersOfAI Playground - no code required. :::
The Exploration-Exploitation Tradeoff
Imagine you are at a casino with 10 slot machines (the "arms"). Each machine has a fixed but unknown payout probability. Your goal is to maximize total winnings over 1000 pulls.
The pure exploration strategy: pull each machine 100 times (equal split), learn the true payout rates, then only play the best machine. But you spent 900 pulls on machines that were probably inferior.
The pure exploitation strategy: after one pull of each machine, only pull the machine that paid out first. But you learned nothing useful - one pull is not enough to estimate a payout rate.
The optimal strategy balances these: explore enough to learn reliably, exploit early when evidence is strong. Multi-armed bandit algorithms operationalize this tradeoff.
In ML experimentation:
- Each "arm" is a model version or configuration
- Each "pull" is a user request routed to one model
- The "reward" is a business metric (click, purchase, engagement)
- The goal is to maximize total reward during the experiment, not just learn the optimal arm at the end
Epsilon-Greedy: The Baseline
The simplest bandit algorithm. With probability ε, explore (random arm). With probability 1−ε, exploit (best known arm).
import numpy as np
from dataclasses import dataclass, field
from typing import List, Dict, Optional
import matplotlib.pyplot as plt
@dataclass
class ArmStats:
name: str
n_pulls: int = 0
n_rewards: int = 0
@property
def estimated_rate(self) -> float:
if self.n_pulls == 0:
return 0.0
return self.n_rewards / self.n_pulls
@property
def conversion_rate(self) -> float:
return self.estimated_rate
class EpsilonGreedyBandit:
"""
Epsilon-greedy multi-armed bandit.
Explores randomly with probability epsilon,
exploits the best known arm with probability 1-epsilon.
"""
def __init__(self, arm_names: List[str], epsilon: float = 0.1):
self.arms = {name: ArmStats(name) for name in arm_names}
self.epsilon = epsilon
def select_arm(self) -> str:
"""Select which arm to pull."""
if np.random.random() < self.epsilon:
# Explore: random arm
return np.random.choice(list(self.arms.keys()))
else:
# Exploit: best known arm
return max(self.arms.keys(), key=lambda k: self.arms[k].estimated_rate)
def update(self, arm_name: str, reward: int):
"""Update arm statistics after observing reward."""
self.arms[arm_name].n_pulls += 1
self.arms[arm_name].n_rewards += reward
def get_traffic_allocations(self) -> Dict[str, float]:
"""Current traffic allocation percentages."""
total_pulls = sum(a.n_pulls for a in self.arms.values())
if total_pulls == 0:
return {name: 1.0/len(self.arms) for name in self.arms}
return {name: arm.n_pulls / total_pulls for name, arm in self.arms.items()}
# Simulate epsilon-greedy on a 3-arm problem
# True conversion rates: arm A=0.05, arm B=0.08, arm C=0.06
true_rates = {"model_A": 0.05, "model_B": 0.08, "model_C": 0.06}
best_arm = max(true_rates, key=true_rates.get)
bandit = EpsilonGreedyBandit(list(true_rates.keys()), epsilon=0.1)
rewards_per_round = []
arm_selections = []
for t in range(10000):
arm = bandit.select_arm()
reward = int(np.random.random() < true_rates[arm])
bandit.update(arm, reward)
rewards_per_round.append(reward)
arm_selections.append(arm)
print("=== Epsilon-Greedy Results (epsilon=0.10) ===")
print(f"True best arm: {best_arm} (rate={true_rates[best_arm]:.2%})\n")
for name, arm in bandit.arms.items():
print(f"{name}: pulls={arm.n_pulls:,}, estimated_rate={arm.estimated_rate:.3f}, "
f"true_rate={true_rates[name]:.3f}")
cumulative_reward = np.cumsum(rewards_per_round)
# Optimal reward: always play best arm
optimal_reward = np.cumsum([true_rates[best_arm]] * 10000)
cumulative_regret = optimal_reward - cumulative_reward
print(f"\nTotal regret after 10,000 pulls: {cumulative_regret[-1]:.1f}")
print(f"Fraction of time on best arm: {arm_selections.count(best_arm)/10000:.1%}")
Weakness of epsilon-greedy: it explores at the same rate throughout the experiment, even when one arm is obviously superior. It also ignores uncertainty - it treats an arm with 10 pulls the same as one with 1000 pulls when exploiting.
Thompson Sampling: Probabilistic Exploration
Thompson Sampling is the state-of-the-art bandit algorithm for binary rewards. It works by maintaining a probability distribution over each arm's true conversion rate, sampling from those distributions, and selecting the arm with the highest sample.
The key insight: uncertainty drives exploration. An arm with few pulls has a wide distribution - it might be sampled high even if its empirical rate is low. An arm with many pulls has a narrow distribution - it samples near its true rate consistently. The algorithm naturally explores uncertain arms and exploits confident leaders.
For binary rewards (click/no-click, convert/no-convert), the Beta distribution is the conjugate prior:
Starting with Beta(1,1) = Uniform(0,1) as the prior, each observed success or failure updates the distribution.
@dataclass
class BetaArmStats:
name: str
alpha: float = 1.0 # prior: 1 success (uniform prior)
beta: float = 1.0 # prior: 1 failure (uniform prior)
def sample(self) -> float:
"""Sample from the posterior Beta distribution."""
return np.random.beta(self.alpha, self.beta)
def update(self, reward: int):
"""Update posterior with observed reward (1=success, 0=failure)."""
self.alpha += reward
self.beta += (1 - reward)
@property
def posterior_mean(self) -> float:
return self.alpha / (self.alpha + self.beta)
@property
def posterior_std(self) -> float:
a, b = self.alpha, self.beta
n = a + b
return np.sqrt(a * b / (n ** 2 * (n + 1)))
@property
def n_observations(self) -> int:
return int(self.alpha + self.beta - 2) # subtract the prior
class ThompsonSamplingBandit:
"""
Thompson Sampling multi-armed bandit using Beta-Binomial model.
Optimal (in the Bayes regret sense) for binary reward problems.
Naturally balances exploration and exploitation through posterior sampling.
"""
def __init__(self, arm_names: List[str]):
self.arms = {name: BetaArmStats(name) for name in arm_names}
def select_arm(self) -> str:
"""Sample from each arm's posterior and select the highest sample."""
samples = {name: arm.sample() for name, arm in self.arms.items()}
return max(samples, key=samples.get)
def update(self, arm_name: str, reward: int):
self.arms[arm_name].update(reward)
def get_probability_best(self, n_simulations: int = 10000) -> Dict[str, float]:
"""
Estimate the probability that each arm is the best,
using Monte Carlo simulation over the posterior.
"""
win_counts = {name: 0 for name in self.arms}
for _ in range(n_simulations):
samples = {name: arm.sample() for name, arm in self.arms.items()}
winner = max(samples, key=samples.get)
win_counts[winner] += 1
return {name: count / n_simulations for name, count in win_counts.items()}
def get_expected_loss(self) -> Dict[str, float]:
"""
Expected loss from choosing each arm vs the best arm.
When expected_loss < threshold, you can declare a winner.
"""
n_sim = 10000
best_samples = []
arm_samples = {name: [] for name in self.arms}
for _ in range(n_sim):
samples = {name: arm.sample() for name, arm in self.arms.items()}
best = max(samples.values())
best_samples.append(best)
for name, s in samples.items():
arm_samples[name].append(s)
expected_loss = {}
for name in self.arms:
losses = [b - a for b, a in zip(best_samples, arm_samples[name])]
expected_loss[name] = np.mean(losses)
return expected_loss
# Compare Thompson Sampling vs Epsilon-Greedy
print("=== Thompson Sampling vs Epsilon-Greedy ===\n")
true_rates = {"model_A": 0.05, "model_B": 0.08, "model_C": 0.06}
best_arm = max(true_rates, key=true_rates.get)
n_rounds = 10000
# Thompson Sampling
ts_bandit = ThompsonSamplingBandit(list(true_rates.keys()))
ts_rewards = []
ts_selections = []
for t in range(n_rounds):
arm = ts_bandit.select_arm()
reward = int(np.random.random() < true_rates[arm])
ts_bandit.update(arm, reward)
ts_rewards.append(reward)
ts_selections.append(arm)
# Epsilon-Greedy baseline
eg_bandit = EpsilonGreedyBandit(list(true_rates.keys()), epsilon=0.1)
eg_rewards = []
eg_selections = []
for t in range(n_rounds):
arm = eg_bandit.select_arm()
reward = int(np.random.random() < true_rates[arm])
eg_bandit.update(arm, reward)
eg_rewards.append(reward)
eg_selections.append(arm)
# Results
optimal_total = sum(true_rates[best_arm] for _ in range(n_rounds))
ts_total = sum(ts_rewards)
eg_total = sum(eg_rewards)
print(f"Optimal (always best arm): {optimal_total:.0f} expected rewards")
print(f"Thompson Sampling: {ts_total} rewards "
f"(regret: {optimal_total - ts_total:.0f})")
print(f"Epsilon-Greedy (eps=0.10): {eg_total} rewards "
f"(regret: {optimal_total - eg_total:.0f})")
print(f"\nFraction of time on best arm ({best_arm}):")
print(f" Thompson Sampling: {ts_selections.count(best_arm)/n_rounds:.1%}")
print(f" Epsilon-Greedy: {eg_selections.count(best_arm)/n_rounds:.1%}")
print(f"\nThompson Sampling posterior after {n_rounds} rounds:")
for name, arm in ts_bandit.arms.items():
print(f" {name}: mean={arm.posterior_mean:.4f} ± {arm.posterior_std:.4f} "
f"(true: {true_rates[name]:.4f}, n={arm.n_observations:,})")
print(f"\nProbability each arm is best:")
prob_best = ts_bandit.get_probability_best()
for name, prob in sorted(prob_best.items(), key=lambda x: -x[1]):
print(f" {name}: {prob:.1%}")
UCB (Upper Confidence Bound): Optimism Under Uncertainty
UCB takes a different approach: be optimistic about arms you are uncertain about. Select the arm with the highest upper confidence bound on its estimated rate:
Where is the estimated rate, is the total number of pulls, and is pulls on arm . The second term decreases as you pull arm more, reducing its attractiveness once well-explored.
class UCBBandit:
"""
UCB1 (Upper Confidence Bound) bandit algorithm.
Theoretically optimal regret bound: O(log T).
"""
def __init__(self, arm_names: List[str], c: float = 2.0):
self.arms = {name: ArmStats(name) for name in arm_names}
self.c = c # exploration constant
self.total_pulls = 0
def select_arm(self) -> str:
"""Select arm with highest UCB score."""
# Pull each arm at least once
for name, arm in self.arms.items():
if arm.n_pulls == 0:
return name
ucb_scores = {}
for name, arm in self.arms.items():
exploitation = arm.estimated_rate
exploration = np.sqrt(self.c * np.log(self.total_pulls) / arm.n_pulls)
ucb_scores[name] = exploitation + exploration
return max(ucb_scores, key=ucb_scores.get)
def update(self, arm_name: str, reward: int):
self.arms[arm_name].n_pulls += 1
self.arms[arm_name].n_rewards += reward
self.total_pulls += 1
UCB is deterministic (given the same data, always selects the same arm) and has provably optimal regret bounds. Thompson Sampling tends to outperform UCB in practice on small datasets, while UCB has stronger theoretical guarantees.
Stopping Criteria: When to Declare a Winner
A/B tests have fixed duration. Bandits need principled stopping rules.
For Thompson Sampling: Stop when the expected loss from committing to the best arm falls below a business threshold (e.g., $0.001 in revenue per user):
def should_stop_experiment(
bandit: ThompsonSamplingBandit,
stopping_threshold: float = 0.001,
min_observations: int = 1000
) -> dict:
"""
Decide whether to stop the bandit experiment and declare a winner.
stopping_threshold: maximum acceptable expected loss from committing to best arm
min_observations: minimum pulls before any arm is eligible to be declared winner
"""
# Check minimum observations requirement
for name, arm in bandit.arms.items():
if arm.n_observations < min_observations:
return {
"should_stop": False,
"reason": f"arm {name} has only {arm.n_observations} observations (min: {min_observations})"
}
expected_loss = bandit.get_expected_loss()
best_arm = min(expected_loss, key=expected_loss.get)
min_loss = expected_loss[best_arm]
prob_best = bandit.get_probability_best()
if min_loss <= stopping_threshold:
return {
"should_stop": True,
"winner": best_arm,
"expected_loss": min_loss,
"probability_best": prob_best[best_arm],
"reason": f"Expected loss {min_loss:.5f} below threshold {stopping_threshold}"
}
return {
"should_stop": False,
"winner": None,
"expected_loss": min_loss,
"probability_best": prob_best.get(best_arm),
"reason": f"Expected loss {min_loss:.5f} above threshold {stopping_threshold}"
}
# Simulate progressive stopping decisions
ts = ThompsonSamplingBandit(["model_A", "model_B", "model_C"])
true_rates = {"model_A": 0.05, "model_B": 0.08, "model_C": 0.06}
checkpoints = [100, 250, 500, 1000, 2000, 5000]
checkpoint_idx = 0
for t in range(5001):
arm = ts.select_arm()
reward = int(np.random.random() < true_rates[arm])
ts.update(arm, reward)
if checkpoint_idx < len(checkpoints) and t + 1 == checkpoints[checkpoint_idx]:
decision = should_stop_experiment(ts, stopping_threshold=0.002, min_observations=200)
total_obs = sum(a.n_observations for a in ts.arms.values())
print(f"\nAfter {total_obs} total observations:")
print(f" Stop? {decision['should_stop']} - {decision['reason']}")
if decision["should_stop"]:
print(f" Winner: {decision['winner']} (P(best)={decision['probability_best']:.1%})")
checkpoint_idx += 1
Contextual Bandits: Personalization at Scale
Standard multi-armed bandits select the globally best arm. But the best recommendation model for a new user might differ from the best for a power user. Contextual bandits extend MABs by conditioning arm selection on features of the current context (user, query, time).
from sklearn.linear_model import SGDClassifier
import numpy as np
class LinUCBContextualBandit:
"""
Linear UCB contextual bandit (simplified implementation).
Maintains a linear model per arm predicting reward from context.
Selects arm with highest UCB: predicted_reward + uncertainty_bonus.
"""
def __init__(self, n_arms: int, n_features: int, alpha: float = 1.0):
self.n_arms = n_arms
self.alpha = alpha # exploration parameter
# Per-arm: A matrix (n_features x n_features) and b vector (n_features)
self.A = [np.identity(n_features) for _ in range(n_arms)]
self.b = [np.zeros(n_features) for _ in range(n_arms)]
def select_arm(self, context: np.ndarray) -> int:
"""Select arm with highest UCB score given context."""
ucb_scores = []
for arm_idx in range(self.n_arms):
A_inv = np.linalg.inv(self.A[arm_idx])
theta = A_inv @ self.b[arm_idx] # parameter estimate
predicted_reward = theta @ context
uncertainty = self.alpha * np.sqrt(context @ A_inv @ context)
ucb_scores.append(predicted_reward + uncertainty)
return int(np.argmax(ucb_scores))
def update(self, arm_idx: int, context: np.ndarray, reward: float):
"""Update arm model with observed (context, reward) pair."""
self.A[arm_idx] += np.outer(context, context)
self.b[arm_idx] += reward * context
# Example: contextual bandit for recommendation model selection
# Context: [is_new_user, engagement_score, mobile_indicator, time_of_day_normalized]
print("=== Contextual Bandit for Personalized Model Selection ===\n")
n_arms = 3 # 3 recommendation model variants
n_features = 4
bandit_ctx = LinUCBContextualBandit(n_arms=n_arms, n_features=n_features, alpha=0.5)
def true_reward(arm: int, context: np.ndarray) -> float:
"""Simulate true reward: different arms work better for different users."""
is_new = context[0]
engagement = context[1]
# Model 0: better for new users (simple exploration)
# Model 1: better for engaged users (deep personalization)
# Model 2: balanced
rewards = [
0.05 + 0.03 * is_new - 0.01 * engagement, # model 0
0.04 - 0.01 * is_new + 0.04 * engagement, # model 1
0.055 + 0.005 * engagement, # model 2
]
return max(0, rewards[arm] + np.random.normal(0, 0.02))
n_rounds = 2000
total_reward = 0
optimal_reward = 0
for t in range(n_rounds):
# Sample a user context
context = np.array([
np.random.binomial(1, 0.3), # is_new_user (30% new)
np.random.beta(2, 5), # engagement score (0-1)
np.random.binomial(1, 0.6), # mobile (60% mobile)
np.random.uniform(0, 1) # normalized time of day
])
arm = bandit_ctx.select_arm(context)
reward = true_reward(arm, context)
bandit_ctx.update(arm, context, reward)
total_reward += reward
optimal_reward += max(true_reward(a, context) for a in range(n_arms))
print(f"After {n_rounds} rounds:")
print(f" Total reward: {total_reward:.1f}")
print(f" Optimal reward: {optimal_reward:.1f}")
print(f" Regret: {optimal_reward - total_reward:.1f} ({(optimal_reward-total_reward)/optimal_reward:.1%})")
Bandits vs A/B Tests: When to Use Each
Use A/B tests when: you need a defensible causal estimate (regulatory requirements, high-stakes decisions), the experiment runs quickly enough that regret is not a concern, or you need a clean separation of learning and exploitation phases for statistical rigor.
Use bandits when: you are testing many variants (bandit efficiency increases with more arms), user harm from suboptimal arms is costly, or you want to deploy the winner progressively as you learn rather than waiting for a fixed duration.
Production Engineering Notes
Delayed rewards: Bandits require prompt reward signals to adapt traffic allocation. If your reward metric takes 24–48 hours (e.g., subscription renewal), bandit algorithms cannot adapt quickly. Use bandits for metrics with immediate rewards (click, add-to-cart) and A/B tests for delayed metrics (LTV, retention).
Reward poisoning: In production, not all "rewards" are real. Bots, ad fraud, and return purchases can poison the reward signal. Filter your reward signal carefully before feeding it to the bandit update step.
Arm disappearance: In production, a model version can fail (OOM, deployment error) at any time. Your bandit system needs to detect arm unavailability and route traffic away from failed arms, then reintroduce the arm's statistics when it recovers - or treat the recovered arm as a fresh arm to avoid rewarding past performance before the failure.
Logging for causal inference: Bandit logs cannot be naively used for offline policy evaluation because the logging policy was adaptive. Store the propensity scores (probability of selecting each arm at each timestep) alongside the reward to enable inverse propensity score correction.
Common Mistakes
:::danger Comparing Long-Running Bandits Against A/B Test Results Bandit traffic allocations are not random - early on, traffic is more uniform; later, traffic concentrates on the leading arm. This means bandit experiments have lower statistical power for detecting small effects than equivalent-length A/B tests. Do not use bandit data to compute a p-value as if it were from a 50/50 A/B test - the selection bias will produce invalid statistics. Use IPS-corrected estimators for causal inference from bandit logs. :::
:::danger Using Bandits When Legal or Regulatory Compliance Requires Equal Treatment Bandits intentionally give users different experiences based on observed outcomes. In regulated domains (credit, hiring, healthcare), this adaptive differentiation may constitute discrimination or violate equal treatment requirements. A/B tests with pre-specified equal splits are more defensible in regulatory contexts. Consult your legal team before using bandits in regulated product areas. :::
:::warning Cold Start: Bandits Perform Poorly for New Arms When a new arm is added to an existing bandit experiment, it starts with no observations and gets a wide uncertainty interval. Thompson Sampling will initially over-explore the new arm, potentially directing significant traffic away from the currently leading arm. Warm-start new arms with informative priors (based on offline evaluation) to reduce cold start exploration cost. :::
:::warning Bandits Are Not a Replacement for Feature Flags and Staged Rollouts A bandit selects among pre-validated model variants. It does not replace shadow mode testing (validating correctness), canary deployment (catching regressions with small blast radius), or feature flags (kill switch capability). Run shadow mode and canary before introducing a new arm into a bandit experiment. :::
Interview Q&A
Q: Explain the exploration-exploitation tradeoff in the context of ML model selection.
A: During an A/B test or bandit experiment, you must balance two competing goals: exploring (routing traffic to arms you know little about, to learn their true quality) and exploiting (routing traffic to the arm you currently believe is best, to maximize reward during the experiment). Pure exploration (equal splits like A/B testing) maximizes learning but incurs regret - users receive suboptimal recommendations while you learn. Pure exploitation (always serve the current best model) stops learning and gets stuck if early estimates are wrong. Bandits formalize this tradeoff: Thompson Sampling explores arms proportionally to their probability of being the best, naturally transitioning from exploration to exploitation as evidence accumulates. The cost of suboptimal arm selections over time is called regret, and bandit algorithms aim to minimize it.
Q: How does Thompson Sampling work?
A: Thompson Sampling maintains a probability distribution over each arm's true reward rate (the posterior), samples once from each arm's distribution, and selects the arm with the highest sample. For binary rewards, the Beta distribution is the natural posterior: starting with Beta(1,1) as a uniform prior, each success adds 1 to the alpha parameter and each failure adds 1 to beta. The algorithm naturally explores uncertain arms (wide distributions produce high samples with meaningful probability) and exploits confident leaders (narrow distributions near the true rate produce consistently high samples only if the true rate is high). It is theoretically optimal in terms of Bayes regret and outperforms UCB empirically on small datasets. The implementation is extremely simple: it is essentially just sampling from Beta distributions and taking the argmax.
Q: When would you choose a bandit over an A/B test for an ML experiment?
A: Bandits are preferable when: (1) the cost of routing traffic to a suboptimal model is high - for a revenue-generating recommendation system, every user receiving the inferior model costs money, and bandits minimize this; (2) you are comparing many variants - with 10+ model variants, a bandit converges to the winner much faster than a fixed-split A/B test by focusing exploration on the top contenders; (3) you want progressive deployment - bandits naturally shift traffic to winners as evidence accumulates, eliminating the abrupt 0-100% switch. Stick with A/B tests when: regulatory requirements mandate equal treatment, you need a clean p-value with known type I error control (bayesian bandit stopping rules are harder to audit), or the reward signal is delayed (bandits cannot adapt without timely feedback).
Q: What is a contextual bandit and how does it differ from a standard MAB?
A: A standard MAB selects a single globally optimal arm - the same arm for every user, regardless of context. A contextual bandit conditions arm selection on the observed context (user features, query, time of day) and learns a mapping from context to optimal arm. Concretely: a standard MAB might learn that "model B is best overall." A contextual bandit might learn that "model B is best for high-engagement users, but model A is best for new users." LinUCB is the canonical contextual bandit algorithm: it maintains a linear model per arm predicting reward from context, and selects the arm with the highest UCB score computed from the linear prediction plus an uncertainty term. Neural contextual bandits (like NeuralUCB or neural Thompson Sampling) handle non-linear context-reward relationships but require more data and computation.
Q: How do you handle the cold start problem in bandit experiments?
A: The cold start problem occurs when a new arm has no observations - it gets a maximally uncertain prior, causing the bandit to over-explore it, potentially routing significant traffic away from well-validated existing arms. Solutions: (1) warm-start the prior using offline evaluation - if your A/B test data or holdout evaluation suggests the new arm has an estimated rate of 0.07 with a standard deviation corresponding to 500 effective observations, initialize the arm's Beta distribution as Beta(35, 465) rather than Beta(1,1); (2) use a separate exploration budget - route a fixed 5% of traffic to new arms regardless of the bandit's allocation until they accumulate sufficient observations; (3) require shadow mode validation before introducing an arm into the live bandit - this accumulates observations safely before the arm affects production traffic allocation.
