Counterfactual Evaluation
The Content Platform That Could Not Afford to Be Wrong
A content recommendation platform had a problem. Their existing recommendation model had been running for three years. It generated 40% of all user engagement. They were building a new model that promised to improve session depth by 15%. But they could not just run an A/B test.
The platform had experienced a catastrophic A/B test failure eighteen months earlier. A new model variant had been live for six hours before anyone noticed it was recommending increasingly sensational content to users who engaged with political news. By the time the rollout was stopped, thousands of users had been served a content diet that the company's trust and safety team described as "a radicalization pipeline." The reputational damage took months to recover from.
Now, any new recommendation model required months of offline evaluation before any user would see it. The offline evaluation system had to answer: "If we had deployed this model instead of the existing model for the past 90 days, how would user engagement have differed?" Answering this question without running an A/B test is the problem of counterfactual evaluation.
The challenge: you have logs from the old model. You want to estimate what would have happened under the new model. But the new model's recommendations are a distribution over items, and your logs only recorded what actually happened - not the counterfactual.
The Potential Outcomes Framework
Counterfactual evaluation builds on the potential outcomes framework from causal statistics (Rubin, 1974; Imbens & Rubin, 2015).
For each user-item interaction, define:
- $Y(a)$: the outcome (engagement, click, revenue) if action $a$ is taken
- The fundamental problem of causal inference: you can only observe $Y(a)$ for the action $a$ that was actually taken, not $Y(a')$ for any other action $a'$
In recommendation, the "action" is which item to show. The existing model (logging policy $\pi_0$) showed item $a$ to a user with context $x$. The new model (target policy $\pi_1$) would have shown item $a'$. You observe the engagement $Y(a)$ when item $a$ was shown. You want to estimate what engagement would have been under $\pi_1$.
You cannot directly observe this. But if you know the probability that the logging policy chose each action (the propensity score), you can reweight observed outcomes to estimate counterfactual expectations.
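The reweighting idea can be checked by hand on a tiny example. The sketch below uses a hypothetical two-article setup (all probabilities and click rates are made up for illustration) and computes exact expectations rather than sampling, so it shows the bias of the naive logged average and how weighting by the ratio of target to logging probability removes it.

```python
# Hypothetical two-article example; every number here is an illustrative assumption.
logging_probs = {"A": 0.8, "B": 0.2}   # how often the old model showed each article
target_probs = {"A": 0.2, "B": 0.8}    # how often the new model would show it
click_rate = {"A": 0.05, "B": 0.10}    # true expected reward per article (unknown in practice)

# What the logs show on average: rewards are collected under the logging policy.
naive_logged_mean = sum(logging_probs[a] * click_rate[a] for a in click_rate)

# What we actually want: the expected reward under the target policy.
true_target_value = sum(target_probs[a] * click_rate[a] for a in click_rate)

# Expected value of the reweighted logs: each logged reward is scaled by target / logging.
reweighted_mean = sum(
    logging_probs[a] * (target_probs[a] / logging_probs[a]) * click_rate[a]
    for a in click_rate
)

print(f"naive logged mean: {naive_logged_mean:.3f}")
print(f"true target value: {true_target_value:.3f}")
print(f"reweighted mean:   {reweighted_mean:.3f}")
```

In this toy setup the naive average (about 0.06) understates the target policy's value (about 0.09), while the reweighted average matches it.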
Inverse Propensity Scoring (IPS)
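IPS is the foundational method for counterfactual evaluation. The key idea: if the logging policy showed item $a$ to a user with context $x$ with probability $\pi_0(a \mid x)$, we can correct for this selection bias by weighting each outcome by $\pi_1(a \mid x) / \pi_0(a \mid x)$.
$$\hat{V}_{\text{IPS}}(\pi_1) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi_1(a_i \mid x_i)}{\pi_0(a_i \mid x_i)} \, r_i$$
This is an unbiased estimator of the value of $\pi_1$ given the logged data from $\pi_0$, under the assumption that $\pi_0(a \mid x) > 0$ whenever $\pi_1(a \mid x) > 0$ (overlap assumption).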
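Why the reweighting is unbiased can be seen in one line. The derivation below is a sketch under assumed notation: $x$ is the context, $\pi_0$ and $\pi_1$ are the logging and target policies, and $r(x, a)$ is the expected reward of action $a$ in context $x$. The cancellation in the middle step is exactly where the overlap assumption is needed.

```latex
\mathbb{E}_{a \sim \pi_0(\cdot \mid x)}\!\left[ \frac{\pi_1(a \mid x)}{\pi_0(a \mid x)} \, r(x, a) \right]
  = \sum_{a} \pi_0(a \mid x) \, \frac{\pi_1(a \mid x)}{\pi_0(a \mid x)} \, r(x, a)
  = \sum_{a} \pi_1(a \mid x) \, r(x, a)
  = \mathbb{E}_{a \sim \pi_1(\cdot \mid x)}\!\left[ r(x, a) \right]
```

If some action has $\pi_1(a \mid x) > 0$ but $\pi_0(a \mid x) = 0$, it never appears in the logs, its term is missing from the first sum, and the equality breaks.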
import numpy as np
import pandas as pd
from scipy import stats
from typing import Callable, Dict, List, Tuple
def ips_estimator(
    logged_data: pd.DataFrame,
    target_policy: Callable[[dict], Dict[str, float]],
    outcome_col: str = "reward",
    action_col: str = "action",
    context_col: str = "context",
    logging_prob_col: str = "logging_prob",
    clip_weight: float = 10.0  # Cap importance weights to reduce variance
) -> Dict:
    """
    Inverse Propensity Scoring (IPS) estimator for off-policy evaluation.
    logged_data: DataFrame with one row per logged interaction.
        Must contain: action, reward, context, logging_prob (propensity of logged action)
    target_policy: function mapping context to action probability distribution
    clip_weight: maximum importance weight (clipped IPS reduces variance at cost of bias)
    Returns: dict with estimated value, SE, and diagnostics
    """
    weights = []
    rewards = []
    for _, row in logged_data.iterrows():
        context = row[context_col]
        action = row[action_col]
        reward = row[outcome_col]
        logging_prob = row[logging_prob_col]
        # Probability that target policy would have taken the logged action
        target_probs = target_policy(context)
        target_prob = target_probs.get(action, 0.0)
        if logging_prob > 0:
            # IPS weight: how much more/less likely is target policy to take this action?
            weight = target_prob / logging_prob
            weight = min(weight, clip_weight)  # clip to reduce variance
        else:
            weight = 0.0
        weights.append(weight)
        rewards.append(reward)
    weights = np.array(weights)
    rewards = np.array(rewards)
    # IPS estimate
    ips_value = np.mean(weights * rewards)
    # Standard error via bootstrap
    n = len(rewards)
    bootstrap_values = []
    for _ in range(1000):
        idx = np.random.randint(0, n, n)
        bootstrap_values.append(np.mean(weights[idx] * rewards[idx]))
    se = np.std(bootstrap_values)
    ci_95 = np.percentile(bootstrap_values, [2.5, 97.5])
    return {
        "ips_estimate": ips_value,
        "std_error": se,
        "ci_95": tuple(ci_95),
        "mean_weight": weights.mean(),
        "max_weight": weights.max(),
        "weight_clipped_fraction": (weights >= clip_weight).mean(),
        "effective_sample_size": (weights.sum() ** 2) / (weights ** 2).sum(),
    }
# Generate synthetic logged data
np.random.seed(42)
n_interactions = 10_000
actions = ["article_A", "article_B", "article_C", "article_D"]
def logging_policy(context: dict) -> Dict[str, float]:
    """Old policy: slightly biased toward article_A for all users."""
    return {"article_A": 0.50, "article_B": 0.20, "article_C": 0.20, "article_D": 0.10}
def target_policy_v1(context: dict) -> Dict[str, float]:
    """New policy v1: more balanced, favors article_B for engaged users."""
    if context.get("engagement_score", 0) > 0.5:
        return {"article_A": 0.20, "article_B": 0.50, "article_C": 0.20, "article_D": 0.10}
    return {"article_A": 0.30, "article_B": 0.30, "article_C": 0.30, "article_D": 0.10}
# True reward rates per article (the counterfactual truth we want to estimate)
true_reward_rates = {
    "article_A": 0.05,
    "article_B": 0.08,  # better article
    "article_C": 0.06,
    "article_D": 0.04,
}
# Simulate logged interactions under the old policy
logged_rows = []
for i in range(n_interactions):
    context = {"engagement_score": np.random.beta(2, 5), "user_id": i}
    log_probs = logging_policy(context)
    # Sample action from logging policy
    action = np.random.choice(list(log_probs.keys()), p=list(log_probs.values()))
    logging_prob = log_probs[action]
    # Observe reward under this action
    reward = int(np.random.random() < true_reward_rates[action])
    logged_rows.append({
        "user_id": i,
        "context": context,
        "action": action,
        "reward": reward,
        "logging_prob": logging_prob,
    })
logged_data = pd.DataFrame(logged_rows)
# Estimate value of target policy using IPS
result = ips_estimator(logged_data, target_policy_v1)
# True value of target policy (computed analytically for this simulation).
# engagement_score ~ Beta(2, 5), so P(score > 0.5) = 7/64 (about 11%), not 50%.
p_engaged = stats.beta.sf(0.5, 2, 5)
value_engaged = sum(
    p * true_reward_rates[a] for a, p in target_policy_v1({"engagement_score": 1.0}).items()
)
value_other = sum(
    p * true_reward_rates[a] for a, p in target_policy_v1({"engagement_score": 0.0}).items()
)
true_value_v1 = p_engaged * value_engaged + (1 - p_engaged) * value_other
print("=== IPS Evaluation Results ===\n")
print(f"True value of target policy: {true_value_v1:.4f}")
print(f"IPS estimate: {result['ips_estimate']:.4f}")
print(f"95% CI: [{result['ci_95'][0]:.4f}, {result['ci_95'][1]:.4f}]")
print(f"\nDiagnostics:")
print(f" Mean importance weight: {result['mean_weight']:.3f}")
print(f" Max importance weight: {result['max_weight']:.3f}")
print(f" Fraction clipped: {result['weight_clipped_fraction']:.1%}")
print(f" Effective sample size: {result['effective_sample_size']:.0f} / {n_interactions}")
The Variance Problem: When IPS Fails
IPS is unbiased but can have extremely high variance when the target policy differs substantially from the logging policy. If $\pi_1$ assigns high probability to actions that $\pi_0$ rarely took, those observations get enormous weights, making the estimator noisy.
The effective sample size (ESS) quantifies this:
$$\text{ESS} = \frac{\left(\sum_{i=1}^{n} w_i\right)^2}{\sum_{i=1}^{n} w_i^2}, \qquad w_i = \frac{\pi_1(a_i \mid x_i)}{\pi_0(a_i \mid x_i)}$$
An ESS of 100 out of 10,000 interactions means you effectively have only 100 independent observations - 99% of your data is irrelevant for evaluating the target policy.
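To get a feel for the ESS formula, it helps to evaluate it on two extreme weight vectors. The sketch below is a plain-Python helper (the function name and both weight vectors are illustrative assumptions): uniform weights keep the full sample, while a single dominant weight collapses the ESS to a handful of observations.

```python
def effective_sample_size(weights):
    """Effective sample size: (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return (total * total) / total_sq if total_sq > 0 else 0.0

# Target policy identical to logging policy: every weight is 1.
uniform_weights = [1.0] * 10_000

# One logged interaction carries almost all of the weight (illustrative numbers).
skewed_weights = [0.01] * 9_999 + [100.0]

ess_uniform = effective_sample_size(uniform_weights)
ess_skewed = effective_sample_size(skewed_weights)

print(f"uniform weights: ESS = {ess_uniform:.1f}")
print(f"skewed weights:  ESS = {ess_skewed:.1f}")
```

With these numbers the uniform case keeps all 10,000 observations and the skewed case drops to an ESS of about 4, even though both logs contain the same number of rows.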
def diagnose_ips_quality(logged_data: pd.DataFrame, target_policy: Callable) -> None:
    """
    Diagnose whether IPS will produce reliable estimates.
    High weight variance = unreliable estimates.
    """
    weights = []
    for _, row in logged_data.iterrows():
        target_probs = target_policy(row["context"])
        target_prob = target_probs.get(row["action"], 0.0)
        if row["logging_prob"] > 0:
            weights.append(target_prob / row["logging_prob"])
    weights = np.array(weights)
    ess = (weights.sum() ** 2) / (weights ** 2).sum()
    ess_fraction = ess / len(weights)
    print("=== IPS Quality Diagnosis ===")
    print(f"n interactions: {len(weights):,}")
    print(f"Effective sample size: {ess:.0f} ({ess_fraction:.1%})")
    print(f"Weight distribution:")
    print(f" mean={weights.mean():.3f}, std={weights.std():.3f}")
    print(f" max={weights.max():.1f} ({(weights > 5).mean():.1%} above 5)")
    print(f" fraction zero: {(weights == 0).mean():.1%}")
    if ess_fraction < 0.1:
        print("\nWARNING: Low ESS. IPS estimates will be unreliable.")
        print("Consider: doubly robust estimator, DM estimator, or narrowing target policy.")
    elif ess_fraction < 0.3:
        print("\nCAUTION: Moderate ESS. Use clipped IPS and report wide confidence intervals.")
    else:
        print("\nOK: Adequate ESS for reliable IPS estimates.")
diagnose_ips_quality(logged_data, target_policy_v1)
Doubly Robust Estimators: Combining IPS with Direct Modeling
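The doubly robust (DR) estimator combines IPS with a direct model (DM) that predicts reward from context and action. It is "doubly robust" because it is consistent if either the importance weights or the direct model is correctly specified - but not necessarily both.
$$\hat{V}_{\text{DR}}(\pi_1) = \frac{1}{n} \sum_{i=1}^{n} \left[ \sum_{a} \pi_1(a \mid x_i) \, \hat{Q}(x_i, a) + \frac{\pi_1(a_i \mid x_i)}{\pi_0(a_i \mid x_i)} \left( r_i - \hat{Q}(x_i, a_i) \right) \right]$$
where $\hat{Q}(x, a)$ is a learned reward model (Q-function).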
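The "either one correct" property can be verified with exact expectations on a toy problem. The sketch below assumes a context-free bandit with two actions; the helper name `expected_dr` and all numbers are illustrative assumptions. It computes the expected DR estimate when the propensities are right but the reward model is wrong, when the reward model is right but the propensities are wrong, and when both are wrong.

```python
def expected_dr(target, logging_true, logging_assumed, true_reward, q_model):
    """Exact expectation of the DR estimate for a context-free bandit.

    Actions are drawn from logging_true, but weights use logging_assumed,
    which lets us simulate misspecified propensities.
    """
    dm_term = sum(target[a] * q_model[a] for a in target)
    correction = sum(
        logging_true[a] * (target[a] / logging_assumed[a]) * (true_reward[a] - q_model[a])
        for a in target
    )
    return dm_term + correction

# Illustrative numbers only.
target = {"A": 0.2, "B": 0.8}
logging_true = {"A": 0.8, "B": 0.2}
true_reward = {"A": 0.05, "B": 0.10}

wrong_logging = {"A": 0.5, "B": 0.5}   # misspecified propensities
wrong_q = {"A": 0.20, "B": 0.02}       # misspecified reward model

true_value = sum(target[a] * true_reward[a] for a in target)
good_weights_bad_q = expected_dr(target, logging_true, logging_true, true_reward, wrong_q)
bad_weights_good_q = expected_dr(target, logging_true, wrong_logging, true_reward, true_reward)
both_bad = expected_dr(target, logging_true, wrong_logging, true_reward, wrong_q)

print(f"true value:              {true_value:.4f}")
print(f"good weights, bad Q:     {good_weights_bad_q:.4f}")
print(f"bad weights, good Q:     {bad_weights_good_q:.4f}")
print(f"bad weights and bad Q:   {both_bad:.4f}")
```

In this toy case the first two variants recover the true value exactly, while the variant with both components wrong does not, which is the caveat behind "but not necessarily both".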
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import LabelEncoder
def doubly_robust_estimator(
    logged_data: pd.DataFrame,
    target_policy: Callable,
    clip_weight: float = 10.0
) -> Dict:
    """
    Doubly Robust estimator combining IPS weights with a direct reward model.
    More accurate than pure IPS when direct model is reasonable.
    More robust than pure direct modeling when policy overlap is poor.
    """
    # Step 1: Fit a reward model Q(context, action) -> expected reward
    # Encode features for the ML model
    le = LabelEncoder()
    logged_data = logged_data.copy()
    logged_data["action_encoded"] = le.fit_transform(logged_data["action"])
    logged_data["engagement"] = logged_data["context"].apply(
        lambda c: c.get("engagement_score", 0.0)
    )
    X = logged_data[["action_encoded", "engagement"]].values
    y = logged_data["reward"].values
    # Fit reward model (fit and scored on the same rows here; cross-fitting would be safer)
    q_model = GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=42)
    q_model.fit(X, y)
    actions = list(le.classes_)
    n = len(logged_data)
    engagement = logged_data["engagement"].to_numpy(dtype=float)
    logged_code = logged_data["action_encoded"].to_numpy()
    rewards = logged_data["reward"].to_numpy(dtype=float)
    logging_prob = logged_data["logging_prob"].to_numpy(dtype=float)
    # Step 2: Q(x_i, a) for every row and every action, one batched predict per action
    # (avoids calling predict once per row, which is very slow on large logs)
    q_all = np.column_stack([
        q_model.predict(np.column_stack([np.full(n, code, dtype=float), engagement]))
        for code in range(len(actions))
    ])
    # Target policy probabilities pi1(a | x_i), columns in the same order as `actions`
    pi1_all = np.array([
        [probs.get(a, 0.0) for a in actions]
        for probs in logged_data["context"].map(target_policy)
    ])
    # Direct model term: E_{a~pi1}[Q(x, a)]
    dm_values = (pi1_all * q_all).sum(axis=1)
    # IPS residual term: correct for difference between DM prediction and actual reward
    row_idx = np.arange(n)
    target_prob = pi1_all[row_idx, logged_code]
    safe_logging_prob = np.where(logging_prob > 0, logging_prob, 1.0)
    weights = np.where(
        logging_prob > 0,
        np.minimum(target_prob / safe_logging_prob, clip_weight),
        0.0,
    )
    q_logged_action = q_all[row_idx, logged_code]
    # DR value for each observation
    dr_values = dm_values + weights * (rewards - q_logged_action)
    estimate = dr_values.mean()
    se = dr_values.std(ddof=1) / np.sqrt(n)
    return {
        "dr_estimate": estimate,
        "std_error": se,
        "ci_95": (estimate - 1.96 * se, estimate + 1.96 * se),
    }
dr_result = doubly_robust_estimator(logged_data, target_policy_v1)
ips_result = ips_estimator(logged_data, target_policy_v1)
print("=== Estimator Comparison ===\n")
print(f"{'Method':>15} | {'Estimate':>10} | {'CI Lower':>10} | {'CI Upper':>10}")
print("-" * 55)
print(f"{'IPS':>15} | {ips_result['ips_estimate']:>10.4f} | "
f"{ips_result['ci_95'][0]:>10.4f} | {ips_result['ci_95'][1]:>10.4f}")
print(f"{'Doubly Robust':>15} | {dr_result['dr_estimate']:>10.4f} | "
f"{dr_result['ci_95'][0]:>10.4f} | {dr_result['ci_95'][1]:>10.4f}")
Counterfactual Evaluation Pipeline
Propensity Score Matching for Causal ML Evaluation
When you cannot run an A/B test (historical data only, or a quasi-experiment), propensity score matching creates pseudo-control groups by matching treated units to untreated units with similar propensity scores.
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
def propensity_score_matching(
    df: pd.DataFrame,
    treatment_col: str,
    outcome_col: str,
    feature_cols: List[str],
    caliper: float = 0.1  # maximum allowed propensity difference for matching
) -> Dict:
    """
    Estimate a treatment effect using greedy 1:1 propensity score matching.
    Each treated unit is matched to a control, so the result is the average effect
    on the matched treated units (ATT), not a population-wide ATE.
    Used when treatment assignment is observational (not randomized),
    e.g., evaluating effect of organic model deployment on user cohorts.
    """
    # Step 1: Estimate propensity scores P(treatment=1 | features)
    X = df[feature_cols].values
    t = df[treatment_col].values
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    lr = LogisticRegression(random_state=42)
    lr.fit(X_scaled, t)
    df = df.copy()
    df["propensity_score"] = lr.predict_proba(X_scaled)[:, 1]
    # Step 2: Match treated units to control units by propensity score
    treated = df[df[treatment_col] == 1]
    control = df[df[treatment_col] == 0]
    control_ps = control["propensity_score"].to_numpy()
    control_y = control[outcome_col].to_numpy()
    available = np.ones(len(control), dtype=bool)  # controls not yet matched
    matches = []
    for ps_t, y_t in zip(treated["propensity_score"], treated[outcome_col]):
        if not available.any():
            break
        # Find nearest unused control unit by propensity score
        ps_diff = np.where(available, np.abs(control_ps - ps_t), np.inf)
        j = int(np.argmin(ps_diff))
        if ps_diff[j] <= caliper:
            matches.append({
                "treated_outcome": y_t,
                "control_outcome": control_y[j],
                "ps_treated": ps_t,
                "ps_control": control_ps[j],
            })
            available[j] = False  # match without replacement
    if not matches:
        return {"error": "No valid matches found within caliper"}
    matches_df = pd.DataFrame(matches)
    pair_diff = matches_df["treated_outcome"] - matches_df["control_outcome"]
    att = pair_diff.mean()
    se = pair_diff.std() / np.sqrt(len(matches_df))
    return {
        "att": att,
        "std_error": se,
        "ci_95": (att - 1.96 * se, att + 1.96 * se),
        "n_matched": len(matches_df),
        "n_unmatched_treated": len(treated) - len(matches_df),
        "mean_propensity_difference": (matches_df["ps_treated"] - matches_df["ps_control"]).abs().mean()
    }
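A small deterministic check helps confirm that matching removes confounding. The sketch below is plain Python and rests on simplifying assumptions that are not part of the function above: a single discrete confounder `x`, propensities that are known exactly instead of fitted with logistic regression, and noise-free outcomes with a true effect of 2.0. All values are hypothetical.

```python
# Hypothetical strata: (confounder x, known treatment propensity, n_treated, n_control)
strata = [(0, 0.2, 2, 8), (1, 0.5, 5, 5), (2, 0.8, 8, 2)]
TRUE_EFFECT = 2.0

units = []
for x, propensity, n_treated, n_control in strata:
    for treated in [1] * n_treated + [0] * n_control:
        # x raises the outcome regardless of treatment, so it confounds the comparison
        outcome = TRUE_EFFECT * treated + 3.0 * x
        units.append({"treated": treated, "propensity": propensity, "outcome": outcome})

def mean(values):
    return sum(values) / len(values)

# Naive comparison: treated units have higher x on average, which inflates the difference.
naive_diff = mean([u["outcome"] for u in units if u["treated"]]) - mean(
    [u["outcome"] for u in units if not u["treated"]]
)

# Greedy 1:1 nearest-neighbour matching on propensity, without replacement.
caliper = 0.05
controls = [u for u in units if not u["treated"]]
used = set()
pair_diffs = []
for t in (u for u in units if u["treated"]):
    best = None  # (propensity gap, control index)
    for j, c in enumerate(controls):
        if j in used:
            continue
        gap = abs(c["propensity"] - t["propensity"])
        if gap <= caliper and (best is None or gap < best[0]):
            best = (gap, j)
    if best is not None:
        used.add(best[1])
        pair_diffs.append(t["outcome"] - controls[best[1]]["outcome"])

matched_diff = mean(pair_diffs)
print(f"naive difference:   {naive_diff:.2f}")
print(f"matched difference: {matched_diff:.2f} from {len(pair_diffs)} pairs")
```

Here the naive difference is more than twice the true effect, while the matched pairs recover it. Six treated units in the high-propensity stratum stay unmatched because no control falls within the caliper, which is the same overlap limitation that affects IPS.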
Production Engineering Notes
Log propensities at serving time: For IPS to work, you must log the probability that the logging policy assigned to each action taken. This must be logged at serving time - reconstructing it afterward from a model that may have been updated is error-prone and often impossible. Include propensity logging in your model serving infrastructure from day one.
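As a minimal sketch of what serving-time propensity logging can look like, the snippet below samples an action from a stochastic policy and writes the probability of that action into the same log record. The function `serve_and_log`, the record fields, and the `policy_version` tag are hypothetical names chosen for illustration, not an existing API.

```python
import json
import random

def serve_and_log(context, policy, rng):
    """Sample an action from the policy and return a log record that includes its propensity."""
    probs = policy(context)
    actions = list(probs)
    action = rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]
    return {
        "context": context,
        "action": action,
        "logging_prob": probs[action],   # propensity captured at decision time
        "policy_version": "v1-example",  # hypothetical tag tying the log to a model version
    }

def example_policy(context):
    """Illustrative stochastic policy; probabilities are made up."""
    return {"article_A": 0.5, "article_B": 0.3, "article_C": 0.2}

rng = random.Random(0)
record = serve_and_log({"user_id": 123}, example_policy, rng)
print(json.dumps(record))
```

The reward is typically joined onto this record later, once the user's engagement is known. The propensity is the one field that cannot be recovered after the fact if the policy changes.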
Overlap is not guaranteed: Counterfactual evaluation only works when the target policy has meaningful overlap with the logging policy. If your new model always recommends article X, but the logging policy never recommended article X, IPS has no data to estimate the counterfactual outcome under article X. Check ESS before trusting any IPS estimate.
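One way to surface overlap problems before running any estimator is to measure how much target-policy probability sits on actions the logging policy almost never took. The sketch below assumes both policies can be queried for full action distributions; the function name and the `min_logging_prob` threshold are illustrative assumptions.

```python
def unsupported_target_mass(contexts, logging_policy, target_policy, min_logging_prob=0.01):
    """Average target-policy probability on actions the logging policy (almost) never took."""
    total = 0.0
    for context in contexts:
        log_probs = logging_policy(context)
        tgt_probs = target_policy(context)
        total += sum(
            p for action, p in tgt_probs.items()
            if log_probs.get(action, 0.0) < min_logging_prob
        )
    return total / len(contexts)

def old_policy(context):
    return {"article_A": 0.7, "article_B": 0.3}

def new_policy(context):
    # article_X was never shown by the old policy, so the logs say nothing about it
    return {"article_A": 0.5, "article_B": 0.2, "article_X": 0.3}

contexts = [{"user_id": i} for i in range(5)]
gap = unsupported_target_mass(contexts, old_policy, new_policy)
print(f"target mass without logging support: {gap:.0%}")
```

In this example 30% of what the new policy would do has no support in the logs, so no reweighting of the existing data can estimate that part of its value.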
Counterfactual evaluation is not a replacement for A/B testing: IPS is an unbiased estimator, but it has higher variance than a direct experiment on the same number of interactions. Use it for screening many model variants offline, then confirm top candidates with A/B tests. Think of it as: offline eval narrows from 100 candidates to 3, A/B test validates the final 3.
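The variance gap between IPS and a direct experiment can be computed exactly on a toy problem. The sketch below assumes Bernoulli rewards and a context-free two-action setup with made-up probabilities. It compares the per-sample variance of an on-policy measurement with the per-sample variance of the IPS-weighted reward.

```python
# Illustrative numbers only.
logging_probs = {"A": 0.8, "B": 0.2}
target_probs = {"A": 0.2, "B": 0.8}
click_rate = {"A": 0.05, "B": 0.10}   # Bernoulli reward probability per action

value = sum(target_probs[a] * click_rate[a] for a in click_rate)

# Direct experiment: each sample is a Bernoulli(value) draw under the target policy.
var_on_policy = value * (1 - value)

# IPS: each sample is w(a) * r with a drawn from the logging policy.
# For Bernoulli r, E[(w * r)^2] = sum over a of logging_prob * w(a)^2 * click_rate.
second_moment = sum(
    logging_probs[a] * (target_probs[a] / logging_probs[a]) ** 2 * click_rate[a]
    for a in click_rate
)
var_ips = second_moment - value ** 2

ratio = var_ips / var_on_policy
print(f"on-policy variance per sample: {var_on_policy:.4f}")
print(f"IPS variance per sample:       {var_ips:.4f}")
print(f"ratio:                         {ratio:.2f}")
```

With these numbers the IPS estimate needs roughly 3.8 times as many logged interactions as an A/B test arm to reach the same precision, and the ratio grows quickly as the two policies diverge.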
Reward model quality matters for DR: The doubly robust estimator's "double robustness" only holds if one of the two components (importance weights or reward model) is approximately correct. In practice, both are approximate, so DR is better than pure IPS or pure DM but not guaranteed to be unbiased. Cross-validate your reward model carefully.
Common Mistakes
:::danger Confusing Counterfactual Evaluation with Causal Identification IPS provides an unbiased estimate of what would have happened under the target policy, conditional on the assumption that the logging policy's propensities are correctly specified and that overlap holds. It does not identify causal effects in the presence of unmeasured confounders. If user characteristics that affect both which article is shown AND engagement are not captured in your logs, IPS is biased. This is not a statistical problem - it is a data problem, and no amount of weighting fixes it. :::
:::warning Trusting High-Variance IPS Estimates An unbiased estimator with a confidence interval of ±50% relative effect is not useful for decision-making. Always compute ESS before reporting IPS results. If ESS is below 10% of your sample size, the estimate is unreliable regardless of what the point estimate says. Consider whether the target policy is too different from the logging policy for offline evaluation to be informative. :::
:::warning Selecting the Target Policy on the Same Logs Used to Evaluate It If you use logged data to tune your target policy (e.g., try 50 versions of the new model and pick the one with the best IPS estimate on historical data), you are overfitting to the logged data. The selected policy will appear to win in offline eval but may not win online. Use separate logged data for model development and for final evaluation, or use hold-out validation for counterfactual eval. :::
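A simple way to keep model selection and final evaluation apart is to split the logs by user before any tuning happens. The sketch below is one possible approach, assuming each log row carries a `user_id`. The helper name and the hashing scheme are illustrative choices.

```python
import hashlib

def split_by_user(rows, eval_fraction=0.5):
    """Deterministically assign each user to the selection or evaluation set by hashing user_id."""
    selection, evaluation = [], []
    for row in rows:
        digest = hashlib.sha256(str(row["user_id"]).encode()).hexdigest()
        bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform value in [0, 1]
        (evaluation if bucket < eval_fraction else selection).append(row)
    return selection, evaluation

# Hypothetical logs: 200 interactions from 50 users.
rows = [{"user_id": i % 50, "reward": i % 2} for i in range(200)]
selection_logs, evaluation_logs = split_by_user(rows)

print(f"selection rows:  {len(selection_logs)}")
print(f"evaluation rows: {len(evaluation_logs)}")
```

Candidate policies are compared on `selection_logs` only, and the single chosen policy is then scored once on `evaluation_logs`. Hashing by user keeps all of a user's interactions on one side of the split.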
Interview Q&A
Q: What is counterfactual evaluation and why is it useful for ML?
A: Counterfactual evaluation answers: "What would have happened if we had deployed model B instead of model A, using data from when model A was running?" It allows you to evaluate a new model on historical production data without running a live experiment. This is useful when: running an A/B test is too costly or risky (safety-critical systems, low-traffic products), when you want to screen many model variants quickly before committing to A/B testing, or when you need to evaluate on historical scenarios that cannot be replicated (e.g., last year's holiday traffic). The fundamental challenge is selection bias: the logging policy chose actions non-uniformly, so naive averages over logged outcomes are biased estimators of what a different policy would have achieved.
Q: Explain Inverse Propensity Scoring (IPS) and when it can fail.
A: IPS reweights logged outcomes by the ratio of the target policy's probability to the logging policy's probability for each observed action: $\hat{V}_{\text{IPS}} = \frac{1}{n} \sum_{i} \frac{\pi_1(a_i \mid x_i)}{\pi_0(a_i \mid x_i)} r_i$. The intuition: if the logging policy selected action A with 10% probability but the target policy assigns 50% probability to action A, we upweight those observations by 5x to correct for their underrepresentation. IPS is unbiased under the overlap assumption (logging policy assigns nonzero probability to every action the target policy might take). It fails when: (1) overlap is poor - target policy recommends items the logging policy rarely showed, producing high-variance weights and unreliable estimates; (2) propensities are misspecified - if you reconstruct logging probabilities from a model that was updated after logging, the propensities will be wrong and IPS will be biased; (3) the target and logging policies are very different, causing the variance to explode even with technically nonzero overlap.
Q: What is the doubly robust estimator and why is it better than pure IPS?
A: The doubly robust (DR) estimator combines IPS with a direct model that predicts rewards from context and action: $\hat{V}_{\text{DR}} = \frac{1}{n} \sum_{i} \left[ \sum_{a} \pi_1(a \mid x_i) \hat{Q}(x_i, a) + \frac{\pi_1(a_i \mid x_i)}{\pi_0(a_i \mid x_i)} \left( r_i - \hat{Q}(x_i, a_i) \right) \right]$. The "doubly robust" property means the estimator is consistent if either the importance weights or the reward model is correctly specified - but not necessarily both. In practice, this means DR is more robust to misspecification than pure IPS (which relies entirely on correct propensities) or pure direct modeling (which relies entirely on correct reward predictions). DR also has lower variance than IPS when the reward model is good, because the IPS term only needs to correct for residuals from the model, not for the entire reward. Use DR when: IPS shows high variance (ESS between 20-40%), the reward signal is predictable from context and action (your Q-model fits well), and you have enough data to fit a good reward model.
Q: How do you validate the quality of a counterfactual evaluation before trusting its results?
A: Several validation steps. First, compute Effective Sample Size (ESS = (Σw)²/(Σw²)) and compare it to your sample size: as a rule of thumb, below about 10% the estimate is too noisy to trust, and below about 30% it should be reported with caution and wide intervals. Second, check the weight distribution: extreme weights (above 10x) indicate poor overlap and should be clipped, but clipping introduces bias. Third, validate on historical holdout data: use logs from period T-1 to predict what would have happened in period T under the existing model (for which you have ground truth), and compare the IPS estimate against the true value. If IPS recovers the known value accurately, it is likely to work for the target policy evaluation. Fourth, for doubly robust, cross-validate your Q-model on held-out logged data and check its RMSE and calibration - a poorly calibrated Q-model makes DR unreliable. Finally, triangulate with other methods: if IPS and DR agree, and the estimates are consistent with offline model metrics, you can be more confident in the offline evaluation.
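A cheap version of the "recover a known value" check is to run IPS with the logging policy itself as the target. Every weight should then be exactly 1 and the estimate should equal the observed mean reward, so any gap points to mis-logged propensities. The sketch below uses a small hand-built log with illustrative numbers; the helper `ips_value` is a simplified stand-in, not the estimator defined earlier.

```python
def ips_value(rows, target_policy):
    """Plain (unclipped) IPS estimate over a list of logged rows."""
    total = 0.0
    for row in rows:
        weight = target_policy(row["context"]).get(row["action"], 0.0) / row["logging_prob"]
        total += weight * row["reward"]
    return total / len(rows)

def logging_policy(context):
    return {"A": 0.8, "B": 0.2}

def make_logs(recorded_probs):
    """80 rows of action A (4 clicks) and 20 rows of action B (2 clicks)."""
    rows = []
    for action, count, clicks in [("A", 80, 4), ("B", 20, 2)]:
        for i in range(count):
            rows.append({
                "context": {},
                "action": action,
                "reward": 1 if i < clicks else 0,
                "logging_prob": recorded_probs[action],
            })
    return rows

correct_logs = make_logs({"A": 0.8, "B": 0.2})   # propensities logged at serving time
stale_logs = make_logs({"A": 0.5, "B": 0.5})     # propensities reconstructed incorrectly

observed_mean = sum(r["reward"] for r in correct_logs) / len(correct_logs)
check_correct = ips_value(correct_logs, logging_policy)
check_stale = ips_value(stale_logs, logging_policy)

print(f"observed mean reward:        {observed_mean:.3f}")
print(f"IPS with correct propensity: {check_correct:.3f}")
print(f"IPS with stale propensity:   {check_stale:.3f}")
```

With correct propensities the check reproduces the observed mean. With the stale propensities it drifts away from it, which is a signal to fix the logging before trusting any estimate for a new policy.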
