
Counterfactual Evaluation

The Content Platform That Could Not Afford to Be Wrong

A content recommendation platform had a problem. Their existing recommendation model had been running for three years. It generated 40% of all user engagement. They were building a new model that promised to improve session depth by 15%. But they could not just run an A/B test.

The platform had experienced a catastrophic A/B test failure eighteen months earlier. A new model variant had been live for six hours before anyone noticed it was recommending increasingly sensational content to users who engaged with political news. By the time the rollout was stopped, thousands of users had been served a content diet that the company's trust and safety team described as "a radicalization pipeline." The reputational damage took months to recover from.

Now, any new recommendation model required months of offline evaluation before any user would see it. The offline evaluation system had to answer: "If we had deployed this model instead of the existing model for the past 90 days, how would user engagement have differed?" Answering this question without running an A/B test is the problem of counterfactual evaluation.

The challenge: you have logs from the old model. You want to estimate what would have happened under the new model. But the new model's recommendations are a distribution over items, and your logs only recorded what actually happened - not the counterfactual.



The Potential Outcomes Framework

Counterfactual evaluation builds on the potential outcomes framework from causal statistics (Rubin, 1974; Imbens & Rubin, 2015).

For each user-item interaction, define:

  • $Y(a)$: the outcome (engagement, click, revenue) if action $a$ is taken
  • The fundamental problem of causal inference: you can only observe $Y(a_{\text{observed}})$, not $Y(a_{\text{counterfactual}})$

In recommendation, the "action" is which item to show. The existing model (logging policy $\pi_0$) showed item $a_0$ to user $u$. The new model (target policy $\pi_1$) would have shown item $a_1$. You observe the engagement when item $a_0$ was shown. You want to estimate what engagement would have been under $\pi_1$.

You cannot directly observe this. But if you know the probability that the logging policy chose each action (the propensity score), you can reweight observed outcomes to estimate counterfactual expectations.


Inverse Propensity Scoring (IPS)

IPS is the foundational method for counterfactual evaluation. The key idea: if the logging policy showed item $a$ to user $u$ with probability $\pi_0(a|u)$, we can correct for this selection bias by weighting each outcome by $\frac{\pi_1(a|u)}{\pi_0(a|u)}$.

$$\hat{V}_{IPS}(\pi_1) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi_1(a_i | x_i)}{\pi_0(a_i | x_i)} \cdot r_i$$

Here $x_i$ is the logged context (the user and any request features), $a_i$ is the action the logging policy took, and $r_i$ is the observed reward. This is an unbiased estimator of the value of $\pi_1$ given the logged data from $\pi_0$, under the assumption that $\pi_0(a|x) > 0$ whenever $\pi_1(a|x) > 0$ (overlap assumption).
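To see why the reweighting works, the expectation of a single IPS term can be expanded. This is a sketch under stated assumptions: actions are drawn from $\pi_0$, the logged propensities are exact, overlap holds, and no weight clipping is applied. The symbol $V(\pi_1)$ for the true value of the target policy is introduced here for illustration.

$$
\begin{aligned}
\mathbb{E}_{x,\, a \sim \pi_0,\, r}\left[\frac{\pi_1(a|x)}{\pi_0(a|x)}\, r\right]
&= \mathbb{E}_x\left[\sum_a \pi_0(a|x)\, \frac{\pi_1(a|x)}{\pi_0(a|x)}\, \mathbb{E}[r \mid x, a]\right] \\
&= \mathbb{E}_x\left[\sum_a \pi_1(a|x)\, \mathbb{E}[r \mid x, a]\right] \\
&= V(\pi_1)
\end{aligned}
$$

The $\pi_0$ factors cancel only for actions with $\pi_0(a|x) > 0$, which is exactly where the overlap assumption is needed.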

```python
import numpy as np
import pandas as pd
from scipy import stats
from typing import Callable, Dict, List, Tuple

def ips_estimator(
    logged_data: pd.DataFrame,
    target_policy: Callable[[dict], Dict[str, float]],
    outcome_col: str = "reward",
    action_col: str = "action",
    context_col: str = "context",
    logging_prob_col: str = "logging_prob",
    clip_weight: float = 10.0,  # Cap importance weights to reduce variance
) -> Dict:
    """
    Inverse Propensity Scoring (IPS) estimator for off-policy evaluation.

    logged_data: DataFrame with one row per logged interaction.
        Must contain: action, reward, context, logging_prob (propensity of logged action)
    target_policy: function mapping context to action probability distribution
    clip_weight: maximum importance weight (clipped IPS reduces variance at cost of bias)

    Returns: dict with estimated value, SE, and diagnostics
    """
    weights = []
    rewards = []

    for _, row in logged_data.iterrows():
        context = row[context_col]
        action = row[action_col]
        reward = row[outcome_col]
        logging_prob = row[logging_prob_col]

        # Probability that target policy would have taken the logged action
        target_probs = target_policy(context)
        target_prob = target_probs.get(action, 0.0)

        if logging_prob > 0:
            # IPS weight: how much more/less likely is target policy to take this action?
            weight = target_prob / logging_prob
            weight = min(weight, clip_weight)  # clip to reduce variance
        else:
            weight = 0.0

        weights.append(weight)
        rewards.append(reward)

    weights = np.array(weights)
    rewards = np.array(rewards)

    # IPS estimate
    ips_value = np.mean(weights * rewards)

    # Standard error via bootstrap (resample interactions with replacement)
    n = len(rewards)
    bootstrap_values = []
    for _ in range(1000):
        idx = np.random.randint(0, n, n)
        bootstrap_values.append(np.mean(weights[idx] * rewards[idx]))

    se = np.std(bootstrap_values)
    ci_95 = np.percentile(bootstrap_values, [2.5, 97.5])

    return {
        "ips_estimate": ips_value,
        "std_error": se,
        "ci_95": tuple(ci_95),
        "mean_weight": weights.mean(),
        "max_weight": weights.max(),
        "weight_clipped_fraction": (weights >= clip_weight).mean(),
        "effective_sample_size": (weights.sum() ** 2) / (weights ** 2).sum(),
    }


# Generate synthetic logged data
np.random.seed(42)
n_interactions = 10_000
actions = ["article_A", "article_B", "article_C", "article_D"]

def logging_policy(context: dict) -> Dict[str, float]:
    """Old policy: biased toward article_A for all users."""
    return {"article_A": 0.50, "article_B": 0.20, "article_C": 0.20, "article_D": 0.10}

def target_policy_v1(context: dict) -> Dict[str, float]:
    """New policy v1: more balanced, favors article_B for engaged users."""
    if context.get("engagement_score", 0) > 0.5:
        return {"article_A": 0.20, "article_B": 0.50, "article_C": 0.20, "article_D": 0.10}
    return {"article_A": 0.30, "article_B": 0.30, "article_C": 0.30, "article_D": 0.10}

# True reward rates per article (the counterfactual truth we want to estimate)
true_reward_rates = {
    "article_A": 0.05,
    "article_B": 0.08,  # better article
    "article_C": 0.06,
    "article_D": 0.04,
}

# Simulate logged interactions under the old policy
logged_rows = []
for i in range(n_interactions):
    context = {"engagement_score": np.random.beta(2, 5), "user_id": i}
    log_probs = logging_policy(context)

    # Sample action from logging policy
    action = np.random.choice(list(log_probs.keys()), p=list(log_probs.values()))
    logging_prob = log_probs[action]

    # Observe reward under this action
    reward = int(np.random.random() < true_reward_rates[action])

    logged_rows.append({
        "user_id": i,
        "context": context,
        "action": action,
        "reward": reward,
        "logging_prob": logging_prob,
    })

logged_data = pd.DataFrame(logged_rows)

# Estimate value of target policy using IPS
result = ips_estimator(logged_data, target_policy_v1)

# True value of target policy (computed analytically for this simulation).
# engagement_score ~ Beta(2, 5), so P(engagement_score > 0.5) = 7/64 (about 0.109),
# not 0.5: the two branches of target_policy_v1 must be weighted accordingly.
p_engaged = stats.beta.sf(0.5, 2, 5)
engaged_probs = target_policy_v1({"engagement_score": 1.0})
casual_probs = target_policy_v1({"engagement_score": 0.0})
true_value_v1 = sum(
    (p_engaged * engaged_probs[a] + (1 - p_engaged) * casual_probs[a]) * true_reward_rates[a]
    for a in actions
)

print("=== IPS Evaluation Results ===\n")
print(f"True value of target policy: {true_value_v1:.4f}")
print(f"IPS estimate: {result['ips_estimate']:.4f}")
print(f"95% CI: [{result['ci_95'][0]:.4f}, {result['ci_95'][1]:.4f}]")
print(f"\nDiagnostics:")
print(f" Mean importance weight: {result['mean_weight']:.3f}")
print(f" Max importance weight: {result['max_weight']:.3f}")
print(f" Fraction clipped: {result['weight_clipped_fraction']:.1%}")
print(f" Effective sample size: {result['effective_sample_size']:.0f} / {n_interactions}")
```

The Variance Problem: When IPS Fails

IPS is unbiased but can have extremely high variance when the target policy differs substantially from the logging policy. If $\pi_1$ assigns high probability to actions that $\pi_0$ rarely took, those observations get enormous weights, making the estimator noisy.

The effective sample size (ESS) quantifies this:

$$ESS = \frac{\left(\sum_i w_i\right)^2}{\sum_i w_i^2}$$

where $w_i = \pi_1(a_i|x_i) / \pi_0(a_i|x_i)$ is the importance weight of interaction $i$. An ESS of 100 out of 10,000 interactions means the estimate is roughly as precise as one built from only 100 equally weighted observations - the other 99% of your data contributes almost nothing to evaluating the target policy.
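A small worked example of the ESS formula may help. The weight lists below are hypothetical and chosen only to contrast a target policy identical to the logging policy with one whose weights are heavy-tailed; the helper name `effective_sample_size` is introduced for this sketch.

```python
def effective_sample_size(weights):
    """ESS = (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return (total * total) / total_sq if total_sq > 0 else 0.0

# Case 1: target policy equals logging policy, so every weight is 1.
uniform = [1.0] * 1000

# Case 2 (hypothetical): 10 interactions carry weight 50, the other 990 carry 0.5.
skewed = [50.0] * 10 + [0.5] * 990

print(effective_sample_size(uniform))           # 1000.0
print(round(effective_sample_size(skewed), 1))  # 39.2
```

In the second case, ten heavily weighted interactions dominate the sum, so 1,000 logged rows behave like roughly 39.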

```python
def diagnose_ips_quality(logged_data: pd.DataFrame, target_policy: Callable) -> None:
    """
    Diagnose whether IPS will produce reliable estimates.
    High weight variance = unreliable estimates.
    """
    weights = []
    for _, row in logged_data.iterrows():
        target_probs = target_policy(row["context"])
        target_prob = target_probs.get(row["action"], 0.0)
        if row["logging_prob"] > 0:
            weights.append(target_prob / row["logging_prob"])

    weights = np.array(weights)
    ess = (weights.sum() ** 2) / (weights ** 2).sum()
    ess_fraction = ess / len(weights)

    print("=== IPS Quality Diagnosis ===")
    print(f"n interactions: {len(weights):,}")
    print(f"Effective sample size: {ess:.0f} ({ess_fraction:.1%})")
    print(f"Weight distribution:")
    print(f" mean={weights.mean():.3f}, std={weights.std():.3f}")
    print(f" max={weights.max():.1f} ({(weights > 5).mean():.1%} above 5)")
    print(f" fraction zero: {(weights == 0).mean():.1%}")

    if ess_fraction < 0.1:
        print("\nWARNING: Low ESS. IPS estimates will be unreliable.")
        print("Consider: doubly robust estimator, DM estimator, or narrowing target policy.")
    elif ess_fraction < 0.3:
        print("\nCAUTION: Moderate ESS. Use clipped IPS and report wide confidence intervals.")
    else:
        print("\nOK: Adequate ESS for reliable IPS estimates.")


diagnose_ips_quality(logged_data, target_policy_v1)
```

Doubly Robust Estimators: Combining IPS with Direct Modeling

The doubly robust (DR) estimator combines IPS with a direct model (DM) that predicts reward from context and action. It is "doubly robust" because it is consistent if either the importance weights or the direct model is correctly specified - it does not need both.

$$\hat{V}_{DR}(\pi_1) = \frac{1}{n}\sum_{i=1}^{n} \left[\mathbb{E}_{a \sim \pi_1(\cdot|x_i)}\left[\hat{Q}(x_i, a)\right] + \frac{\pi_1(a_i|x_i)}{\pi_0(a_i|x_i)}\left(r_i - \hat{Q}(x_i, a_i)\right)\right]$$

Where $\hat{Q}(x, a)$ is a learned reward model (Q-function). The first term averages the model's predictions over the actions the target policy would take in context $x_i$; the second term is an IPS correction applied to the model's residual on the logged action $a_i$.
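A short sketch of why the estimator earns its name, writing $Q(x, a) = \mathbb{E}[r \mid x, a]$ for the true expected reward (a symbol introduced here for illustration) and ignoring weight clipping. For a fixed context $x$ and exact propensities, the expectation of the correction term over logged actions and rewards is:

$$
\mathbb{E}_{a \sim \pi_0,\, r}\left[\frac{\pi_1(a|x)}{\pi_0(a|x)}\left(r - \hat{Q}(x, a)\right)\right]
= \sum_a \pi_1(a|x)\left(Q(x, a) - \hat{Q}(x, a)\right)
$$

Adding the direct-model term $\sum_a \pi_1(a|x)\,\hat{Q}(x, a)$ leaves $\sum_a \pi_1(a|x)\,Q(x, a)$, the true value in context $x$, no matter how wrong $\hat{Q}$ is. Conversely, if $\hat{Q} = Q$, the residual $r - \hat{Q}(x, a)$ has zero mean given $(x, a)$, so the correction term vanishes in expectation even when the weights are wrong.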

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import LabelEncoder

def doubly_robust_estimator(
    logged_data: pd.DataFrame,
    target_policy: Callable,
    clip_weight: float = 10.0,
) -> Dict:
    """
    Doubly Robust estimator combining IPS weights with a direct reward model.

    More accurate than pure IPS when direct model is reasonable.
    More robust than pure direct modeling when policy overlap is poor.
    """
    # Step 1: Fit a reward model Q(context, action) -> expected reward
    # Encode features for the ML model
    le = LabelEncoder()
    logged_data = logged_data.copy()
    logged_data["action_encoded"] = le.fit_transform(logged_data["action"])
    logged_data["engagement"] = logged_data["context"].apply(
        lambda c: c.get("engagement_score", 0.0)
    )

    X = logged_data[["action_encoded", "engagement"]].values
    y = logged_data["reward"].values

    # Fit reward model (fit and scored on the same rows here, for brevity)
    q_model = GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=42)
    q_model.fit(X, y)

    # Step 2: Predict Q(x_i, a) for every logged context and every known action.
    # One batched predict() per action replaces several predict() calls per row.
    actions = list(le.classes_)  # classes_[k] is encoded as integer k
    engagement = logged_data["engagement"].values
    q_hat = {
        a: q_model.predict(np.column_stack([np.full(len(engagement), code), engagement]))
        for code, a in enumerate(actions)
    }

    # Step 3: Combine the direct-model term with the weighted residual per row
    dr_values = []
    for i, (_, row) in enumerate(logged_data.iterrows()):
        action = row["action"]
        logging_prob = row["logging_prob"]

        # Target policy distribution
        target_probs = target_policy(row["context"])

        # Direct model term: E_{a~pi1}[Q(x, a)]
        dm_value = sum(target_probs.get(a, 0.0) * q_hat[a][i] for a in actions)

        # IPS residual term: correct for difference between DM prediction and actual reward
        target_prob = target_probs.get(action, 0.0)
        if logging_prob > 0:
            weight = min(target_prob / logging_prob, clip_weight)
        else:
            weight = 0.0

        # DR value for this observation
        dr_values.append(dm_value + weight * (row["reward"] - q_hat[action][i]))

    dr_values = np.array(dr_values)
    dr_estimate = np.mean(dr_values)
    se = np.std(dr_values) / np.sqrt(len(dr_values))
    return {
        "dr_estimate": dr_estimate,
        "std_error": se,
        "ci_95": (dr_estimate - 1.96 * se, dr_estimate + 1.96 * se),
    }


dr_result = doubly_robust_estimator(logged_data, target_policy_v1)
ips_result = ips_estimator(logged_data, target_policy_v1)

print("=== Estimator Comparison ===\n")
print(f"{'Method':>15} | {'Estimate':>10} | {'CI Lower':>10} | {'CI Upper':>10}")
print("-" * 55)
print(f"{'IPS':>15} | {ips_result['ips_estimate']:>10.4f} | "
      f"{ips_result['ci_95'][0]:>10.4f} | {ips_result['ci_95'][1]:>10.4f}")
print(f"{'Doubly Robust':>15} | {dr_result['dr_estimate']:>10.4f} | "
      f"{dr_result['ci_95'][0]:>10.4f} | {dr_result['ci_95'][1]:>10.4f}")
```

Counterfactual Evaluation Pipeline


Propensity Score Matching for Causal ML Evaluation

When you cannot run an A/B test (historical data only, or a quasi-experiment), propensity score matching creates pseudo-control groups by matching treated units to untreated units with similar propensity scores.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def propensity_score_matching(
    df: pd.DataFrame,
    treatment_col: str,
    outcome_col: str,
    feature_cols: List[str],
    caliper: float = 0.1,  # maximum allowed propensity difference for matching
) -> Dict:
    """
    Estimate a treatment effect using greedy 1:1 propensity score matching
    without replacement.

    Each treated unit is matched to a control unit, so the estimand is the
    average treatment effect on the (matched) treated, ATT, rather than the
    population-wide ATE.

    Used when treatment assignment is observational (not randomized),
    e.g., evaluating effect of organic model deployment on user cohorts.
    """
    # Step 1: Estimate propensity scores P(treatment=1 | features)
    X = df[feature_cols].values
    t = df[treatment_col].values

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    lr = LogisticRegression(random_state=42)
    lr.fit(X_scaled, t)
    propensity_scores = lr.predict_proba(X_scaled)[:, 1]

    df = df.copy()
    df["propensity_score"] = propensity_scores

    # Step 2: Match treated units to control units by propensity score
    treated = df[df[treatment_col] == 1].copy()
    control = df[df[treatment_col] == 0].copy()

    matches = []
    used_control_idx = set()  # assumes df has a unique index

    for _, treated_row in treated.iterrows():
        # Find nearest unused control unit by propensity score
        candidates = control[~control.index.isin(used_control_idx)].copy()
        if len(candidates) == 0:
            break
        candidates["ps_diff"] = abs(candidates["propensity_score"] - treated_row["propensity_score"])
        nearest = candidates.nsmallest(1, "ps_diff")

        # Accept the match only if it falls within the caliper
        if nearest["ps_diff"].values[0] <= caliper:
            matches.append({
                "treated_outcome": treated_row[outcome_col],
                "control_outcome": nearest[outcome_col].values[0],
                "ps_treated": treated_row["propensity_score"],
                "ps_control": nearest["propensity_score"].values[0],
            })
            used_control_idx.add(nearest.index[0])

    if not matches:
        return {"error": "No valid matches found within caliper"}

    # Step 3: Average the within-pair outcome differences
    matches_df = pd.DataFrame(matches)
    pair_diffs = matches_df["treated_outcome"] - matches_df["control_outcome"]
    att = pair_diffs.mean()
    se = pair_diffs.std() / np.sqrt(len(matches_df))

    return {
        "att": att,
        "std_error": se,
        "ci_95": (att - 1.96 * se, att + 1.96 * se),
        "n_matched": len(matches_df),
        "n_unmatched_treated": len(treated) - len(matches_df),
        "mean_propensity_difference": matches_df["ps_treated"].subtract(matches_df["ps_control"]).abs().mean(),
    }
```

Production Engineering Notes

Log propensities at serving time: For IPS to work, you must log the probability that the logging policy assigned to each action taken. This must be logged at serving time - reconstructing it afterward from a model that may have been updated is error-prone and often impossible. Include propensity logging in your model serving infrastructure from day one.
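A minimal sketch of what logging the propensity at serving time could look like. The function `serve_and_log`, the record fields, and the `model_version` value are all hypothetical names chosen for illustration, not part of any particular serving framework; the policy mirrors the fixed logging policy used in the simulation earlier.

```python
import json
import random
import time

def serve_and_log(context, policy, rng, model_version="hypothetical-v1"):
    """Sample one action from the serving policy and build its log record."""
    probs = policy(context)  # dict: action -> probability
    actions = list(probs)
    action = rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]
    record = {
        "timestamp": time.time(),
        "model_version": model_version,  # which policy produced this propensity
        "context": context,
        "action": action,
        "logging_prob": probs[action],   # propensity of the action actually shown
    }
    return action, record

def example_policy(context):
    # Hypothetical fixed policy, same probabilities as the earlier logging policy.
    return {"article_A": 0.50, "article_B": 0.20, "article_C": 0.20, "article_D": 0.10}

rng = random.Random(0)
action, record = serve_and_log({"user_id": 1, "engagement_score": 0.3}, example_policy, rng)
print(json.dumps(record))
```

The point of the design is that `logging_prob` is captured from the same probability vector that was sampled from, in the same call, so it cannot drift if the model is later retrained. The reward would typically be joined to this record afterward.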

Overlap is not guaranteed: Counterfactual evaluation only works when the target policy has meaningful overlap with the logging policy. If your new model always recommends article X, but the logging policy never recommended article X, IPS has no data to estimate the counterfactual outcome under article X. Check ESS before trusting any IPS estimate.
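One way to make the overlap check concrete is to measure, for a given context, how much of the target policy's probability mass falls on actions the logging policy never or almost never took. The helper `overlap_report`, its field names, and the `min_propensity` threshold of 0.01 are assumptions made for this sketch.

```python
def overlap_report(logging_probs, target_probs, min_propensity=0.01):
    """Compare logging and target action distributions for one context."""
    # Target mass on actions the logging policy never took: IPS cannot see these.
    unsupported = sum(
        p for a, p in target_probs.items() if logging_probs.get(a, 0.0) == 0.0
    )
    # Target mass on actions the logging policy took only very rarely.
    thin = sum(
        p for a, p in target_probs.items()
        if 0.0 < logging_probs.get(a, 0.0) < min_propensity
    )
    # Largest importance weight among supported actions.
    max_weight = max(
        (p / logging_probs[a] for a, p in target_probs.items()
         if logging_probs.get(a, 0.0) > 0.0),
        default=0.0,
    )
    return {"unsupported_mass": unsupported, "thin_support_mass": thin, "max_weight": max_weight}

logging_probs = {"article_A": 0.50, "article_B": 0.20, "article_C": 0.20, "article_D": 0.10}
always_x = {"article_X": 1.0}  # hypothetical new model that always recommends article X
balanced = {"article_A": 0.25, "article_B": 0.25, "article_C": 0.25, "article_D": 0.25}

print(overlap_report(logging_probs, always_x))
print(overlap_report(logging_probs, balanced))
```

For the always-X policy, all of the target mass is unsupported, which is the situation described above: the logs hold no evidence about article X at all.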

Counterfactual evaluation is not a replacement for A/B testing: IPS is an unbiased estimator, but it has higher variance than direct experiment. Use it for screening many model variants offline, then confirm top candidates with A/B tests. Think of it as: offline eval narrows from 100 candidates to 3, A/B test validates the final 3.

Reward model quality matters for DR: The doubly robust estimator's "double robustness" only holds if one of the two components (importance weights or reward model) is approximately correct. In practice, both are approximate, so DR is better than pure IPS or pure DM but not guaranteed to be unbiased. Cross-validate your reward model carefully.


Common Mistakes

:::danger Confusing Counterfactual Evaluation with Causal Identification
IPS provides an unbiased estimate of what would have happened under the target policy, conditional on the assumption that the logging policy's propensities are correctly specified and that overlap holds. It does not identify causal effects in the presence of unmeasured confounders. If user characteristics that affect both which article is shown AND engagement are not captured in your logs, IPS is biased. This is not a statistical problem - it is a data problem, and no amount of weighting fixes it.
:::

:::warning Trusting High-Variance IPS Estimates
An unbiased estimator with a confidence interval of ±50% relative effect is not useful for decision-making. Always compute ESS before reporting IPS results. If ESS is below 10% of your sample size, the estimate is unreliable regardless of what the point estimate says. Consider whether the target policy is too different from the logging policy for offline evaluation to be informative.
:::

:::warning Selecting the Target Policy on the Same Logs Used to Evaluate It
If you use logged data to tune your target policy (e.g., try 50 versions of the new model and pick the one with the best IPS estimate on historical data), you are overfitting to the logged data. The selected policy will appear to win in offline eval but may not win online. Use separate logged data for model development and for final evaluation, or use hold-out validation for counterfactual eval.
:::
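The selection effect can be illustrated with a small simulation. Everything here is a hypothetical setup for the sketch: every action has the same true reward rate, so all 50 randomly generated candidate policies have exactly the same true value, and any apparent winner on the development logs is picked on noise alone.

```python
import random

rng = random.Random(7)
ACTIONS = ["article_A", "article_B", "article_C", "article_D"]
LOGGING = {"article_A": 0.50, "article_B": 0.20, "article_C": 0.20, "article_D": 0.10}
TRUE_RATE = 0.05  # identical for every action, so every policy has true value 0.05

def simulate_logs(n):
    """Draw n (action, reward) pairs under the logging policy."""
    logs = []
    for _ in range(n):
        a = rng.choices(ACTIONS, weights=[LOGGING[x] for x in ACTIONS], k=1)[0]
        logs.append((a, 1 if rng.random() < TRUE_RATE else 0))
    return logs

def random_policy():
    """A random context-free distribution over the four actions."""
    raw = [rng.random() for _ in ACTIONS]
    total = sum(raw)
    return {a: r / total for a, r in zip(ACTIONS, raw)}

def ips(logs, policy):
    """Unclipped IPS estimate of the policy's value on the given logs."""
    return sum(policy[a] / LOGGING[a] * r for a, r in logs) / len(logs)

dev_logs, holdout_logs = simulate_logs(2000), simulate_logs(2000)
candidates = [random_policy() for _ in range(50)]

# Pick the candidate with the best IPS estimate on the development logs.
dev_scores = [ips(dev_logs, p) for p in candidates]
best = max(range(len(candidates)), key=lambda i: dev_scores[i])

print(f"true value of every candidate: {TRUE_RATE:.4f}")
print(f"best dev-set IPS estimate:     {dev_scores[best]:.4f}")
print(f"same policy on holdout logs:   {ips(holdout_logs, candidates[best]):.4f}")
```

Because the winner is chosen as the maximum of 50 noisy estimates, its development-set score tends to sit above the true value, while the holdout estimate of the same policy is not inflated by the selection step.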


Interview Q&A

Q: What is counterfactual evaluation and why is it useful for ML?

A: Counterfactual evaluation answers: "What would have happened if we had deployed model B instead of model A, using data from when model A was running?" It allows you to evaluate a new model on historical production data without running a live experiment. This is useful when: running an A/B test is too costly or risky (safety-critical systems, low-traffic products), when you want to screen many model variants quickly before committing to A/B testing, or when you need to evaluate on historical scenarios that cannot be replicated (e.g., last year's holiday traffic). The fundamental challenge is selection bias: the logging policy chose actions non-uniformly, so naive averages over logged outcomes are biased estimators of what a different policy would have achieved.

Q: Explain Inverse Propensity Scoring (IPS) and when it can fail.

A: IPS reweights logged outcomes by the ratio of the target policy's probability to the logging policy's probability for each observed action: $\hat{V}_{IPS} = \frac{1}{n}\sum_i \frac{\pi_1(a_i|x_i)}{\pi_0(a_i|x_i)} r_i$. The intuition: if the logging policy selected action A with 10% probability but the target policy assigns 50% probability to action A, we upweight those observations by 5x to correct for their underrepresentation. IPS is unbiased under the overlap assumption (logging policy assigns nonzero probability to every action the target policy might take). It fails when: (1) overlap is violated - the target policy recommends items the logging policy never showed, so the logs carry no information about them and the estimate is biased; (2) propensities are misspecified - if you reconstruct logging probabilities from a model that was updated after logging, the propensities will be wrong and IPS will be biased; (3) the target and logging policies are very different, causing the variance to explode even with technically nonzero overlap.

Q: What is the doubly robust estimator and why is it better than pure IPS?

A: The doubly robust (DR) estimator combines IPS with a direct model that predicts rewards from context and action: $\hat{V}_{DR} = \mathbb{E}_{a \sim \pi_1}[\hat{Q}(x, a)] + \text{IPS correction}$. The "doubly robust" property means the estimator is consistent if either the importance weights or the reward model is correctly specified - but not necessarily both. In practice, this means DR is more robust to misspecification than pure IPS (which relies entirely on correct propensities) or pure direct modeling (which relies entirely on correct reward predictions). DR also has lower variance than IPS when the reward model is good, because the IPS term only needs to correct for residuals from the model, not for the entire reward. Use DR when: IPS shows high variance (ESS between 20-40%), the reward signal is predictable from context and action (your Q-model fits well), and you have enough data to fit a good reward model.

Q: How do you validate the quality of a counterfactual evaluation before trusting its results?

A: Several validation steps. First, compute Effective Sample Size (ESS = (Σw)²/(Σw²)) and check it is at least 10% of your sample size, and preferably above 30% - below 10%, estimates are too noisy to trust. Second, check the weight distribution: extreme weights (above 10x) indicate poor overlap and should be clipped, but clipping introduces bias. Third, validate on historical holdout data: use logs from period T-1 to predict what would have happened in period T under the existing model (for which you have ground truth), and compare the IPS estimate against the true value. If IPS recovers the known value accurately, it is likely to work for the target policy evaluation. Fourth, for doubly robust, cross-validate your Q-model on held-out logged data and check its RMSE and calibration - a poorly calibrated Q-model makes DR unreliable. Finally, triangulate with other methods: if IPS and DR agree, and the estimates are consistent with offline model metrics, you can be more confident in the offline evaluation.
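Two inexpensive sanity checks in the spirit of these steps can be sketched as follows. The simulated logs, the policies, and the helper `ips_with_weights` are hypothetical and only for illustration. The first check evaluates the logging policy against its own logs, where IPS should return the observed mean reward. The second relies on the fact that, with exact propensities and full overlap, unclipped importance weights average to 1 in expectation.

```python
import random

rng = random.Random(11)
LOGGING = {"article_A": 0.50, "article_B": 0.20, "article_C": 0.20, "article_D": 0.10}
TARGET = {"article_A": 0.20, "article_B": 0.50, "article_C": 0.20, "article_D": 0.10}
REWARD_RATE = {"article_A": 0.05, "article_B": 0.08, "article_C": 0.06, "article_D": 0.04}

# Simulate logs as (action, reward, logging_prob) tuples under the logging policy.
actions = list(LOGGING)
logs = []
for _ in range(5000):
    a = rng.choices(actions, weights=[LOGGING[x] for x in actions], k=1)[0]
    logs.append((a, 1 if rng.random() < REWARD_RATE[a] else 0, LOGGING[a]))

def ips_with_weights(logs, target_policy):
    """Return the unclipped IPS estimate and the list of importance weights."""
    weights = [target_policy[a] / p for a, _, p in logs]
    estimate = sum(w * r for w, (_, r, _) in zip(weights, logs)) / len(logs)
    return estimate, weights

# Check 1: the logging policy evaluated on its own logs.
self_estimate, self_weights = ips_with_weights(logs, LOGGING)
observed_mean = sum(r for _, r, _ in logs) / len(logs)
print(f"self-evaluation: {self_estimate:.4f} vs observed mean {observed_mean:.4f}")

# Check 2: mean importance weight under the target policy should be close to 1.
_, target_weights = ips_with_weights(logs, TARGET)
mean_weight = sum(target_weights) / len(target_weights)
print(f"mean importance weight: {mean_weight:.3f}")
```

A mean weight far from 1 on real logs is a hint that the recorded propensities are wrong or that the target policy places mass outside the logged support.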

© 2026 EngineersOfAI. All rights reserved.