Causal Inference Basics: The Science of What Actually Works

Reading time: ~40 min | Interview relevance: Very High | Target roles: MLE, AI Engineer, Data Scientist, Applied Research

The Production Scenario

Your recommendation system's offline metrics look excellent. Your model trained on last quarter's data achieves NDCG@10 = 0.51, up from 0.47. You ship it. Two weeks later, you check your A/B test results - conversion rate has not changed at all. Not even a rounding error.

What happened? Your offline evaluation measured whether the model correctly predicted what users clicked on in the past. But users clicked on those items because those items were shown to them - by the old model, with its own recommendation biases. You were evaluating your model's ability to replicate the old model's selection, not its ability to find better items.

This is the feedback loop problem in recommendation systems, and it is fundamentally a causal inference problem. The observed data does not tell you what would have happened if a different recommendation had been shown. To know that, you need a randomised experiment - an A/B test.

Causal inference is the formal framework for asking "what would have happened if we had done something different?" It is arguably the most important concept in ML evaluation, because every claim that a model change improved outcomes is a causal claim.

Correlation vs Causation: Concrete ML Examples

The first rule of causal inference: correlation does not imply causation. But in ML, this rule is violated constantly.

Example 1: Ice Cream and Drowning

Ice cream sales are positively correlated with drowning deaths. Does ice cream cause drowning? No - both are caused by hot weather (the confounder).
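
A minimal simulation can make the confounder visible. The sketch below assumes a made-up data-generating process (the coefficients and the `residualise` helper are illustrative, not taken from any real dataset): temperature drives both variables, so the raw correlation is strong, but it disappears once the linear effect of temperature is removed from each.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical process: hot weather drives both variables, and neither affects the other.
temperature = rng.normal(20, 8, n)
ice_cream_sales = 2.0 * temperature + rng.normal(0, 5, n)
drownings = 0.3 * temperature + rng.normal(0, 2, n)

# Raw correlation is strongly positive despite there being no causal link.
raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residualise(y, x):
    """Remove the best linear fit of x from y."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation of what is left after accounting for temperature (a partial correlation).
partial_corr = np.corrcoef(
    residualise(ice_cream_sales, temperature),
    residualise(drownings, temperature),
)[0, 1]

print(f"Raw correlation:                   {raw_corr:.3f}")
print(f"Correlation given temperature:     {partial_corr:.3f}")
```

Adjusting for the confounder works here only because temperature is observed and its effect is linear by construction. Real confounders are often unmeasured.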

Example 2: The Recommendation Feedback Loop

Round 1: Model M1 recommends items {A, B, C}
         Users click on A and C
         We record: (user, item A, click=1), (user, item B, click=0), (user, item C, click=1)

Round 2: Train new model M2 on this data
         M2 learns: "A and C are relevant for this user"
         But M2 has never seen items D, E, F - which might be much better

The "correlation" between M2's recommendations and clicks tells us:
  M2 can predict what M1 showed that users clicked on.
  NOT: M2 finds the best items for users.

Example 3: User Engagement and Premium Features

A study finds that users who use Feature X are 40% more likely to renew their subscription. Marketing pushes to add Feature X to the onboarding flow. Subscriptions do not improve.

Why? Users who seek out Feature X are already highly engaged power users - most of them would have renewed anyway. Feature X usage and renewal are both caused by high engagement (the confounder), so the 40% gap does not show that Feature X causes renewal.

Correlation story:        Causal story:
Feature X → Renewal       High Engagement → Feature X
                          High Engagement → Renewal
(spurious)                (correct)
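
The two stories can be told apart in a simulation. The sketch below is built on assumed numbers (renewal probability `0.2 + 0.6 * engagement`, and a feature with exactly zero causal effect). Self-selected usage shows a clear renewal gap, while randomly forcing the feature on half the users shows essentially none.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical confounder: each user's underlying engagement in [0, 1].
engagement = rng.beta(2, 5, n)

# Observational world: engaged users are more likely to discover Feature X.
uses_feature = rng.random(n) < engagement
# Renewal depends ONLY on engagement; the feature has no causal effect here.
renews = rng.random(n) < 0.2 + 0.6 * engagement
observational_gap = renews[uses_feature].mean() - renews[~uses_feature].mean()

# Experimental world: Feature X is forced into onboarding for a random half.
forced = rng.random(n) < 0.5
renews_rct = rng.random(n) < 0.2 + 0.6 * engagement
experimental_gap = renews_rct[forced].mean() - renews_rct[~forced].mean()

print(f"Renewal gap, self-selected users: {observational_gap:+.3f}")
print(f"Renewal gap, randomised rollout:  {experimental_gap:+.3f}")
```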

The Potential Outcomes Framework (Rubin Causal Model)

The potential outcomes framework (also called the Rubin Causal Model or Neyman-Rubin model) gives us the formal language to define causality.

For each unit $i$ (user, example, query), define:

$Y_i(1)$ : the outcome if unit $i$ receives the treatment
$Y_i(0)$ : the outcome if unit $i$ does not receive the treatment

The Individual Treatment Effect (ITE) for unit $i$ :

$\tau_i = Y_i(1) - Y_i(0)$

The Average Treatment Effect (ATE):

$\text{ATE} = \mathbb{E}[\tau_i] = \mathbb{E}[Y_i(1) - Y_i(0)] = \mathbb{E}[Y_i(1)] - \mathbb{E}[Y_i(0)]$

The Fundamental Problem of Causal Inference

For any individual $i$ , we observe either $Y_i(1)$ or $Y_i(0)$ , but never both. The unobserved value is called the counterfactual.

User 001: Shown new recommendation → clicked (Y=1 under treatment)
          Never shown old recommendation → Y(0) is UNOBSERVED
          ITE = Y(1) - Y(0) = 1 - ??? = ???

User 002: Shown old recommendation → did not click (Y=0 under control)
          Never shown new recommendation → Y(1) is UNOBSERVED
          ITE = Y(1) - Y(0) = ??? - 0 = ???

We cannot observe individual causal effects. We can only estimate the average treatment effect using the right experimental design.
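
The missing-data nature of the problem is easy to see in code. This is a toy sketch with eight simulated users. Both potential outcomes exist only because we generate them ourselves; the `*_seen` arrays show what an analyst would actually have.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8

# In a simulation we can generate BOTH potential outcomes for every user.
y0 = rng.integers(0, 2, n)                   # click under the old recommendation
y1 = np.maximum(y0, rng.integers(0, 2, n))   # click under the new recommendation
true_ite = y1 - y0                           # knowable only inside a simulation

# In reality each user is assigned to exactly one arm.
t = rng.integers(0, 2, n)
y_obs = np.where(t == 1, y1, y0)

# What the analyst sees: the counterfactual outcome is missing (NaN).
y1_seen = np.where(t == 1, y1, np.nan)
y0_seen = np.where(t == 0, y0, np.nan)
ite_seen = y1_seen - y0_seen                 # NaN for every single user

for i in range(n):
    print(f"user {i}: T={t[i]}  Y(1)={y1_seen[i]}  Y(0)={y0_seen[i]}  ITE={ite_seen[i]}")
```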

Why Randomised Experiments (A/B Tests) Work

In a randomised controlled trial (RCT), we randomly assign each unit to treatment (T=1) or control (T=0). The key property of randomisation:

$T \perp (Y(0), Y(1))$

Treatment assignment is independent of potential outcomes. This means:

$\mathbb{E}[Y_i(1) | T_i=1] = \mathbb{E}[Y_i(1)]$
$\mathbb{E}[Y_i(0) | T_i=0] = \mathbb{E}[Y_i(0)]$

And therefore the naive comparison of observed outcomes gives the ATE:

$\text{ATE} = \mathbb{E}[Y_i(1)] - \mathbb{E}[Y_i(0)]$
$= \mathbb{E}[Y_i(1) | T_i=1] - \mathbb{E}[Y_i(0) | T_i=0]$
$= \mathbb{E}[Y_i | T_i=1] - \mathbb{E}[Y_i | T_i=0]$
$= \text{(observed treatment mean)} - \text{(observed control mean)}$

This is exactly what an A/B test computes. Random assignment makes the treatment and control groups comparable in expectation on every characteristic, observed and unobserved. Any difference in outcomes beyond what chance variation explains can therefore be attributed to the treatment.

import numpy as np
import scipy.stats as stats

np.random.seed(42)
n = 10_000

# Simulate potential outcomes
# Users have a "natural engagement" level (unobserved confounder)
natural_engagement = np.random.beta(2, 5, n)  # 0 to 1

# Potential outcomes depend on engagement
# Y(1) = outcome under new rec model
# Y(0) = outcome under old rec model
# Average treatment effect is +0.05 (independent noise in Y1 and Y0 makes individual effects vary)
Y1 = natural_engagement + 0.05 + np.random.normal(0, 0.1, n)
Y0 = natural_engagement + np.random.normal(0, 0.1, n)
true_ate = np.mean(Y1 - Y0)  # should be ~0.05

# Scenario 1: Randomised experiment (A/B test)
treatment_flag = np.random.choice([0, 1], n, p=[0.5, 0.5])
Y_observed = np.where(treatment_flag == 1, Y1, Y0)
naive_diff = np.mean(Y_observed[treatment_flag==1]) - np.mean(Y_observed[treatment_flag==0])
t, p = stats.ttest_ind(Y_observed[treatment_flag==1], Y_observed[treatment_flag==0])
print("Randomised Experiment (A/B Test):")
print(f"  True ATE:          {true_ate:.4f}")
print(f"  Estimated ATE:     {naive_diff:.4f}  (unbiased estimate!)")
print(f"  p-value:           {p:.4f}")

# Scenario 2: Observational study (high-engagement users self-select into new model)
# Users with higher engagement are more likely to opt into treatment
treatment_observational = (np.random.rand(n) < natural_engagement).astype(int)
Y_obs2 = np.where(treatment_observational == 1, Y1, Y0)
biased_diff = np.mean(Y_obs2[treatment_observational==1]) - np.mean(Y_obs2[treatment_observational==0])
print("\nObservational Study (Self-Selection):")
print(f"  True ATE:          {true_ate:.4f}")
print(f"  Observed diff:     {biased_diff:.4f}  BIASED! Confounded by engagement!")
print(f"  Bias:              {biased_diff - true_ate:.4f}")

The simulation above shows why self-selection (observational data) leads to biased estimates. High-engagement users who self-select into a new feature already have better outcomes regardless of the feature. The A/B test, by randomising, eliminates this bias.

Selection Bias and Confounders

Selection bias occurs when the probability of being in treatment depends on the potential outcomes.

Mathematically, a confounder is a variable $Z$ that:

Affects treatment assignment: $Z \to T$
Affects the outcome: $Z \to Y$
Is not on the causal path from $T$ to $Y$

When we ignore $Z$ and just compare treated vs untreated:

$\mathbb{E}[Y|T=1] - \mathbb{E}[Y|T=0]$

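This does not generally equal the ATE. Writing the observed means in terms of potential outcomes, then adding and subtracting $\mathbb{E}[Y(0)|T=1]$, gives:

$\underbrace{\mathbb{E}[Y|T=1] - \mathbb{E}[Y|T=0]}_{\text{Naive estimator}} = \underbrace{\mathbb{E}[Y(1) - Y(0)|T=1]}_{\text{Effect on the treated}} + \underbrace{\mathbb{E}[Y(0)|T=1] - \mathbb{E}[Y(0)|T=0]}_{\text{Selection bias}}$

The selection-bias term is the baseline gap between the groups that the confounder $Z$ creates. When the treatment effect is the same for everyone, the first term equals the ATE, so the naive estimator is the ATE plus selection bias.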
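
The bias can be checked numerically. In this sketch (all numbers are assumptions), the naive difference splits exactly into the effect among the treated plus the baseline gap $\mathbb{E}[Y(0)|T=1] - \mathbb{E}[Y(0)|T=0]$. The effect is held constant at 0.05, so the effect among the treated coincides with the ATE.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

engagement = rng.beta(2, 5, n)              # confounder Z
y0 = engagement + rng.normal(0, 0.1, n)     # potential outcome without treatment
y1 = y0 + 0.05                              # constant treatment effect (assumption)
treated = rng.random(n) < engagement        # Z -> T: engaged users opt in more often

# What an analyst would compute from observed data alone.
naive = y1[treated].mean() - y0[~treated].mean()

# The two pieces, computable only because the simulation knows both potential outcomes.
effect_on_treated = (y1[treated] - y0[treated]).mean()
selection_bias = y0[treated].mean() - y0[~treated].mean()

print(f"Naive estimator:    {naive:.4f}")
print(f"Effect on treated:  {effect_on_treated:.4f}")
print(f"Selection bias:     {selection_bias:.4f}")
print(f"Sum of the pieces:  {effect_on_treated + selection_bias:.4f}")
```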

Average Treatment Effect on the Treated (ATT)

Often we care about the treatment effect specifically among those who would receive the treatment:

$\text{ATT} = \mathbb{E}[Y_i(1) - Y_i(0) | T_i = 1]$

In ML: "Did the new recommendation model help the users who actually got the new model (the treatment group)?" vs the ATE: "Would the new model help an average random user?"

These differ when treatment effects are heterogeneous - e.g., the new model helps power users more than casual users.
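
A short sketch of how the two quantities separate. The segment sizes, effect sizes, and assignment probabilities below are invented for illustration, and the per-user effect is visible only because this is a simulation.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# Hypothetical segments: 30% power users, 70% casual users.
power_user = rng.random(n) < 0.3

# Heterogeneous effect: the new model helps power users more.
effect = np.where(power_user, 0.10, 0.02)

# Power users are far more likely to end up in the treated group.
treated = rng.random(n) < np.where(power_user, 0.8, 0.2)

ate = effect.mean()            # average over the whole population
att = effect[treated].mean()   # average over those actually treated

print(f"ATE: {ate:.4f}")
print(f"ATT: {att:.4f}")
```

Because the treated group is dominated by power users, the ATT sits well above the ATE. Extrapolating the ATT to a full rollout would overstate the expected lift.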

Difference-in-Differences (DiD)

When you cannot run a randomised experiment, DiD is a popular alternative. It compares the change over time in treatment and control groups:

$\text{DiD} = \underbrace{(\bar{Y}_T^{\text{post}} - \bar{Y}_T^{\text{pre}})}_{\text{Change in treatment}} - \underbrace{(\bar{Y}_C^{\text{post}} - \bar{Y}_C^{\text{pre}})}_{\text{Change in control}}$

Key assumption: In the absence of treatment, the treatment and control groups would have had parallel trends.

                           Treatment group
Outcome                   ____/____________  ← treatment effect = gap
  ^                  ____/
  |                 /   |
  |     Control ___/    |
  |           /         |
  |     _____/          |
  |    /                |
  |___/
  |
  +------------+---------► time
              Pre       Post
              Period    Period

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(42)
n = 500

# Simulate: treatment is rolled out to a subset of markets
# Pre-period: both groups trend upward naturally
# Post-period: treatment group gets an additional boost

time_periods = ['pre', 'post']
markets = range(n)

data = []
for market in markets:
    treatment = 1 if market < n//2 else 0
    # Parallel trends before treatment
    pre_outcome = 100 + 0.5*market + np.random.normal(0, 5)
    # Post treatment: treatment group gets +10 lift
    post_outcome = 110 + 0.5*market + (10 if treatment else 0) + np.random.normal(0, 5)

    data.append({'market': market, 'treatment': treatment,
                 'post': 0, 'outcome': pre_outcome})
    data.append({'market': market, 'treatment': treatment,
                 'post': 1, 'outcome': post_outcome})

df = pd.DataFrame(data)

# DiD estimate via regression
# outcome = alpha + beta1*treatment + beta2*post + beta3*(treatment*post) + error
# beta3 is the DiD estimator (the treatment effect)
model = smf.ols('outcome ~ treatment + post + treatment:post', data=df).fit()
print("Difference-in-Differences Regression:")
print(model.summary().tables[1])
print(f"\nDiD estimate (interaction term): {model.params['treatment:post']:.4f}")
print("True treatment effect: 10.0")

# Manual DiD
pre_control = df[(df['treatment']==0) & (df['post']==0)]['outcome'].mean()
post_control = df[(df['treatment']==0) & (df['post']==1)]['outcome'].mean()
pre_treat = df[(df['treatment']==1) & (df['post']==0)]['outcome'].mean()
post_treat = df[(df['treatment']==1) & (df['post']==1)]['outcome'].mean()

did_manual = (post_treat - pre_treat) - (post_control - pre_control)
print(f"\nManual DiD: {did_manual:.4f}")
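
The parallel-trends assumption is what carries the estimate, and it is worth seeing what happens when it fails. The sketch below assumes the treated markets were already growing faster (+15 per period versus +10 for control; all numbers are hypothetical). DiD then attributes the extra growth to the treatment.

```python
import numpy as np

rng = np.random.default_rng(8)
n_per_group = 5_000
true_effect = 10.0

# Control markets: natural growth of +10 between periods.
control_pre = 100 + rng.normal(0, 5, n_per_group)
control_post = control_pre + 10 + rng.normal(0, 5, n_per_group)

# Treated markets: natural growth of +15 (faster-growing markets), plus the true effect.
treated_pre = 120 + rng.normal(0, 5, n_per_group)
treated_post = treated_pre + 15 + true_effect + rng.normal(0, 5, n_per_group)

did = (treated_post.mean() - treated_pre.mean()) - (control_post.mean() - control_pre.mean())

print(f"True treatment effect: {true_effect:.2f}")
print(f"DiD estimate:          {did:.2f}")
print(f"Bias from non-parallel trends: {did - true_effect:.2f}")
```

The estimate absorbs the 5-point difference in natural growth. With several pre-treatment periods, comparing the groups' pre-period trends is a common sanity check, though it cannot prove the assumption holds after treatment.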

Instrumental Variables (IV)

When there are unobserved confounders that DiD cannot handle, Instrumental Variables can still identify the causal effect under the right conditions.

An instrument $Z$ is a variable that:

Relevance: $Z$ is correlated with the treatment $T$
Exclusion: $Z$ only affects the outcome $Y$ through $T$ (not directly)
Independence: $Z$ is uncorrelated with unobserved confounders

The IV estimator is the ratio of reduced-form to first-stage coefficients:

$\hat{\tau}_{IV} = \frac{\text{Cov}(Z, Y)}{\text{Cov}(Z, T)}$

ML example: You want to know the effect of app loading time on user retention, but fast loading time is correlated with other quality improvements (confounder). Instrument: geography-based server outages that randomly increase loading time for some users independent of other quality signals.
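
The loading-time example can be sketched as a simulation. Everything here is assumed for illustration: the linear model, the coefficients, a true effect of -0.5, and a binary `outage` instrument that satisfies all three conditions by construction. A regression of retention on loading time is badly biased by the hidden quality variable, while the covariance-ratio IV estimator recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000
true_effect = -0.5   # assumed causal effect of one unit of load time on retention

quality = rng.normal(0, 1, n)                    # unobserved confounder
outage = (rng.random(n) < 0.2).astype(float)     # instrument Z: random regional outage

# Treatment T: better-quality releases load faster; outages slow loading down.
load_time = 2.0 - 0.8 * quality + 1.0 * outage + rng.normal(0, 0.5, n)
# Outcome Y: depends on quality and load time, but on the outage only via load time.
retention = 1.0 * quality + true_effect * load_time + rng.normal(0, 0.5, n)

def cov(a, b):
    """Sample covariance between two 1-D arrays."""
    return np.cov(a, b)[0, 1]

ols_slope = cov(load_time, retention) / np.var(load_time, ddof=1)   # confounded
iv_estimate = cov(outage, retention) / cov(outage, load_time)       # Cov(Z,Y) / Cov(Z,T)

print(f"True effect:        {true_effect:.3f}")
print(f"Naive OLS slope:    {ols_slope:.3f}")
print(f"IV estimate:        {iv_estimate:.3f}")
```

In practice the exclusion and independence conditions cannot be verified from data. If outages also hurt retention directly (for example through failed checkouts), the IV estimate would be biased as well.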

Why Offline Evaluation of Recommendation Systems Fails

This deserves special emphasis because it is one of the most common misconceptions in ML engineering.

The Feedback Loop

Model M1 recommends items. Some get clicks, some do not.
You record (user, item, click) as training data.
You train model M2 on this data.
You evaluate M2 offline: does M2 predict which items got clicked?

What M2 is actually learning: M2 learns to predict which items M1 showed that users happened to click. This is NOT the same as "which items would users find most relevant."

Items that M1 never recommended have no click data - M2 cannot learn about them. Items that M1 overexposed have artificially inflated click counts. This is called exposure bias or selection bias in recommendation.
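
A toy catalogue shows how exposure bias hides good items. The sketch assumes 20 items whose true relevance rises with item index, and a logging model that only ever shows items 0-4. Any model trained on these logs has nothing to learn from for the other 15 items.

```python
import numpy as np

rng = np.random.default_rng(6)
n_items, n_impressions = 20, 10_000

# Hypothetical ground truth: item 19 is the most relevant item.
true_relevance = np.linspace(0.05, 0.60, n_items)

# M1 only ever recommends items 0-4.
exposed_items = np.arange(5)
shown = rng.choice(exposed_items, size=n_impressions)
clicks = (rng.random(n_impressions) < true_relevance[shown]).astype(float)

# Aggregate the logs per item.
impressions = np.bincount(shown, minlength=n_items)
click_counts = np.bincount(shown, weights=clicks, minlength=n_items)

# Click-through rate is undefined (NaN) for items that were never shown.
observed_ctr = np.full(n_items, np.nan)
has_data = impressions > 0
observed_ctr[has_data] = click_counts[has_data] / impressions[has_data]

best_by_logs = int(np.nanargmax(observed_ctr))
best_in_truth = int(np.argmax(true_relevance))

print(f"Items with any click data: {int(has_data.sum())} of {n_items}")
print(f"Best item according to logs:  {best_by_logs}")
print(f"Best item in reality:         {best_in_truth}")
```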

Counterfactual Evaluation (IPS)

Inverse Propensity Scoring (IPS) is a technique to correct for this bias. For each observed (user, item, click) tuple, weight it by the inverse of the probability that M1 would show that item:

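$\hat{V}_{IPS}(\pi) = \frac{1}{n}\sum_{i=1}^n \frac{\mathbf{1}[\pi(x_i) = a_i] \cdot r_i}{\pi_0(a_i | x_i)}$

where $a_i$ is the item the logging policy actually showed to user $x_i$, $\pi(x_i)$ is the item the new policy would show, $\pi_0(a|x)$ is the logging policy's probability of showing item $a$ to user $x$, and $r_i$ is the reward (click). Only logged impressions on which the two policies agree contribute to the sum, and each is up-weighted by how rarely the logging policy chose that item.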

import numpy as np

def ips_evaluation(observed_items, true_items, clicks, logging_probs):
    """
    Inverse Propensity Scoring for offline evaluation.

    observed_items: which item the logging policy showed
    true_items: which item the new policy would show
    clicks: whether the user clicked
    logging_probs: probability the logging policy showed this item

    Returns: IPS-corrected estimate of new policy value
    """
    # Only include cases where the new policy agrees with the logging policy
    matches = (observed_items == true_items)
    if matches.sum() == 0:
        return 0.0

    # Inverse propensity weighted average
    ips_weights = matches.astype(float) / (logging_probs + 1e-10)
    ips_value = np.sum(ips_weights * clicks) / len(clicks)
    return ips_value

# Simulation
np.random.seed(42)
n = 1000
n_items = 50

# Logging policy shows popular items with high probability (biased)
item_popularity = np.random.dirichlet(np.ones(n_items) * 0.5)
logging_policy_items = np.random.choice(n_items, n, p=item_popularity)
logging_probs = item_popularity[logging_policy_items]

# True item relevance (unknown to logging policy)
true_relevance = np.random.beta(2, 5, n_items)

# Clicks: depend on true relevance
clicks = (np.random.rand(n) < true_relevance[logging_policy_items]).astype(int)

# New policy: show the most relevant item per user (but we don't know relevance)
# Simulate: new policy shows a random item (for illustration)
new_policy_items = np.random.randint(0, n_items, n)

# Naive offline estimate (ignores selection bias)
naive_estimate = np.mean(clicks)  # Wrong: this is the logging policy's performance

# IPS estimate of new policy
ips_estimate = ips_evaluation(logging_policy_items, new_policy_items, clicks, logging_probs)

# True value of new policy (counterfactual, computed via simulation)
true_new_policy_value = np.mean(true_relevance[new_policy_items])

print(f"Logging policy true value:       {np.mean(true_relevance[logging_policy_items]):.4f}")
print(f"New policy true value:           {true_new_policy_value:.4f}")
print(f"Naive offline estimate (biased): {naive_estimate:.4f}")
print(f"IPS offline estimate:            {ips_estimate:.4f}")
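
A smaller case with a deterministic new policy makes the correction easier to check by hand. The sketch below assumes five items with known click probabilities, a logging policy that rarely shows the best item, and logging probabilities that are the same for every user (all simplifications for illustration). The naive average reports the logging policy's value, while the IPS estimate lands close to the new policy's true value.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

relevance = np.array([0.10, 0.15, 0.20, 0.30, 0.50])      # true click probability per item
logging_probs = np.array([0.40, 0.30, 0.15, 0.10, 0.05])  # pi_0(a), identical for all users

# Logged data: the item shown by the logging policy and whether it was clicked.
logged_items = rng.choice(len(relevance), size=n, p=logging_probs)
clicks = (rng.random(n) < relevance[logged_items]).astype(float)

# New deterministic policy: always show item 4, the most relevant one.
new_policy_item = 4

naive_estimate = clicks.mean()   # value of the LOGGING policy, not the new one

# IPS: keep only impressions where the policies agree, weighted by 1 / pi_0.
matches = (logged_items == new_policy_item).astype(float)
ips_estimate = np.mean(matches * clicks / logging_probs[logged_items])

true_value = relevance[new_policy_item]

print(f"True value of new policy: {true_value:.3f}")
print(f"Naive estimate:           {naive_estimate:.3f}")
print(f"IPS estimate:             {ips_estimate:.3f}")
```

Only about 5% of the log matches the new policy, and each matching row carries a weight of 20, so the estimate is noisy at small sample sizes. The smaller the logging probabilities, the larger the variance.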

:::tip ML Engineering Connection

This is why companies like Netflix, Spotify, and Airbnb rely heavily on online A/B tests for recommendation model changes rather than trusting offline metrics alone. Offline metrics tell you "how well does your model replicate what users clicked on in a world shaped by your previous model." Online A/B tests tell you "does this model actually cause better user outcomes?" The A/B test is the ground truth. Offline metrics are a cheap proxy used to filter candidates before the expensive online test.

:::

Confounders Checklist for ML Systems

When you observe a correlation in your ML system, ask:

Is this correlation causal?
  │
  ├── Is there a plausible mechanism? (T directly causes Y)
  │
  ├── Temporal order: does T happen before Y?
  │
  ├── Is there a confounder Z that causes both T and Y?
  │     Examples:
  │     - User engagement (affects both feature use and outcome)
  │     - Seasonality (affects both exposure and conversion)
  │     - Platform/device type (affects both UI interaction and conversion)
  │
  ├── Is there selection bias? (certain users more likely to be in "treated" group)
  │
  └── Did you run an experiment? (gold standard for causal claims)

Interview Q&A

Q1: What is the fundamental problem of causal inference?

The fundamental problem is that we can never observe both potential outcomes for the same unit at the same time. Each user either gets the treatment or the control - never both. We observe $Y_i(1)$ for treated users and $Y_i(0)$ for control users, but never $Y_i(1) - Y_i(0)$ for any individual. We can only estimate the Average Treatment Effect (ATE = $\mathbb{E}[Y(1)] - \mathbb{E}[Y(0)]$ ) by comparing groups. Randomisation makes this estimate unbiased by ensuring treatment assignment is independent of potential outcomes.

Q2: Why do offline evaluation metrics often disagree with online A/B test results in recommendation systems?

Offline evaluation tests whether the model predicts what users clicked on historically. But historical clicks were generated by the previous recommendation model, which chose which items to show. This creates selection bias: items the old model never showed have no click data, and click rates for items it showed reflect exposure bias (popular items get more clicks not because they're more relevant but because they're shown more). The new model's offline score measures how well it replicates the old model's selections, not how much better it serves users. Online A/B tests randomise which model users see, eliminating these biases. This is why offline metrics are used as cheap filters, and online experiments are the true signal.

Q3: What is difference-in-differences and when is it valid?

Difference-in-differences (DiD) estimates the causal effect when you cannot randomise. It computes: DiD = (change in treated group) − (change in control group) over time. The key assumption is "parallel trends": in the absence of treatment, both groups would have followed the same trend. If this holds, differencing within each group removes fixed (time-invariant) differences between the groups, and subtracting the control group's change removes shocks that hit both groups equally. DiD does not remove confounders that change over time differently in the two groups. It fails when the treatment and control groups are on different trends before treatment (e.g., you deploy a new model first in your fastest-growing markets), or when external shocks affect only one group.

Q4: What is a confounder? Give an example relevant to ML evaluation.

A confounder is a variable that causes both the treatment assignment and the outcome, creating a spurious correlation between treatment and outcome. ML example: suppose you find that users who use the "advanced search" feature have 60% higher retention. But advanced search users are already power users - their high engagement causes both their use of advanced features and their retention. Engagement is the confounder. If you add advanced search to the onboarding flow to increase retention, you will likely see no effect because the causal mechanism is engagement, not the feature itself. To test the causal effect, you need an A/B test that randomly assigns users to see or not see the feature.

Q5: What is the ATE vs ATT distinction and when does it matter?

ATE (Average Treatment Effect) = $\mathbb{E}[Y(1) - Y(0)]$ over the entire population. ATT (Average Treatment Effect on the Treated) = $\mathbb{E}[Y(1) - Y(0) | T=1]$ among only those who received treatment. They differ when treatment effects are heterogeneous - when the effect varies across individuals, and those who receive treatment are systematically different from those who do not. Example: a premium recommendation feature might produce a large lift for power users (who get assigned to treatment) but a small effect for casual users. ATT captures the effect for power users; ATE captures the average across all users. For business decisions like "should we deploy this feature broadly?", ATE matters. For "was it worth deploying to the treatment group specifically?", ATT matters.

