Conformal Prediction - Distribution-Free Uncertainty with Guaranteed Coverage
Reading time: 50 min | Interview relevance: Very High - appears in interviews for ML Engineer, AI Safety Engineer, Applied Scientist at any company deploying ML in regulated or safety-critical contexts | Target roles: ML Engineer, Applied Scientist, AI Safety Engineer, MLOps Engineer
The Diagnostic AI That Cannot Afford to Be Wrong Alone
It is 2023 in a large hospital system in the northeastern United States. The radiology department has deployed a diagnostic AI for chest X-ray interpretation. The model is a ResNet-50 fine-tuned on 150,000 labeled X-rays, with a top-1 accuracy of 91.3% on the held-out test set. The radiology team is enthusiastic: the model processes a study in 0.3 seconds versus 15 minutes for a radiologist, and it catches patterns that tired human eyes miss.
But the deployment team faces a hard problem from hospital legal and administration. When the model says "pneumonia" with a softmax confidence of 0.87, what does 0.87 actually mean? In neural network classification, softmax scores are notoriously miscalibrated. A model that outputs 0.87 might be right only 70% of the time for that confidence level. More importantly: the hospital needs to know not just "the most likely diagnosis" but "the set of diagnoses we can confidently rule out." For a patient presenting with respiratory symptoms, if the model can say "I'm 95% sure the correct diagnosis is one of {pneumonia, COVID-19, normal} - and I can rule out tuberculosis and lung cancer," that is clinically actionable. If the model can only say "pneumonia: 0.87, normal: 0.07, COVID-19: 0.04," the clinical team has to do additional work to interpret what 0.87 means and whether it provides the confidence level they need.
What the hospital needs is a prediction set: a set of diagnoses that contains the true diagnosis with provable probability, say 95%, regardless of the model's calibration quality and without parametric assumptions about the underlying data distribution. The only requirement is that calibration and test cases are exchangeable. Between roughly 2019 and 2023, work from Angelopoulos and Bates, from Tibshirani's group, and from Candès's group crystallized the modern theory of conformal prediction into a toolkit that delivers exactly this.
Why Standard Confidence Intervals Fall Short
Before conformal prediction, there were two main approaches to uncertainty quantification in ML:
Approach 1: Model confidence (softmax score): Use the model's output probability directly. "The model says 87% probability of pneumonia, so we are 87% confident." Problem: neural network softmax scores are not calibrated probabilities. Guo et al. (2017) showed that modern neural networks are systematically overconfident - a model that says 95% is often right only 75–80% of the time. Temperature scaling can help but does not provide formal guarantees.
Approach 2: Bayesian credible intervals: Place a prior over model parameters, update to a posterior given data, and compute credible intervals from the posterior predictive distribution. Advantage: principled uncertainty. Disadvantage: the coverage guarantee (a 95% Bayesian credible interval contains the true value 95% of the time) holds only if the prior and likelihood are correctly specified. In practice, both are approximations, and the coverage guarantee is approximate. For a neural network, computing the exact posterior is intractable.
Conformal prediction takes a different approach. It says: we do not need to model the data distribution. We do not need a correctly specified prior. We only need to assume exchangeability - a condition weaker than i.i.d. The coverage guarantee is finite-sample: with $n$ calibration points and a chosen miscoverage level $\alpha$, the coverage is guaranteed to be at least $1 - \alpha$ for any model, any data distribution, any feature space:
$$P\big(Y_{n+1} \in C(X_{n+1})\big) \geq 1 - \alpha$$
This inequality holds regardless of whether the model is a logistic regression or a 70-billion parameter language model, regardless of whether the data is Gaussian or multimodal, regardless of sample size. The only assumption is exchangeability of the calibration and test data.
The Exchangeability Assumption
Exchangeability is weaker than i.i.d. (independent and identically distributed). A sequence of random variables $Z_1, \dots, Z_n$ is exchangeable if their joint distribution is invariant to any permutation:
$$P(Z_1, \dots, Z_n) = P(Z_{\pi(1)}, \dots, Z_{\pi(n)})$$
for any permutation $\pi$ of $\{1, \dots, n\}$.
Every i.i.d. sequence is exchangeable, but not every exchangeable sequence is i.i.d. One major non-i.i.d. case is still ruled out: time series with temporal dependence, where the order of observations matters. Exchangeability does allow cases like: sampling without replacement from a finite population, mixture models (as long as the mixture proportions are unknown but fixed), and group-correlated data (as long as group membership is exchangeable).
What exchangeability does not allow: distribution shift between calibration and test time. If the calibration data is from one hospital and the test data is from a different hospital with a different patient population, the exchangeability assumption is violated and the coverage guarantee no longer holds. This is the main practical limitation of standard conformal prediction - addressed later by Adaptive Conformal Inference.
Split Conformal Prediction: The Core Algorithm
There are many variants of conformal prediction. We focus on split conformal prediction (also called inductive conformal prediction) because it is simple, computationally efficient, and the most widely deployed in practice.
Setup:
- Training set $D_{\text{train}}$ - used to fit the underlying model
- Calibration set $D_{\text{cal}} = \{(X_i, Y_i)\}_{i=1}^{n}$ - held out from training, used only for conformal calibration
- Test point $(X_{n+1}, Y_{n+1})$ - the new observation we want to predict with uncertainty
Algorithm (Split Conformal):
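- Train model $\hat{f}$ on $D_{\text{train}}$ (any model - this is fully model-agnostic)
- Choose a nonconformity score function $s(x, y)$ that measures "how unusual" the pair $(x, y)$ is according to $\hat{f}$
- Compute nonconformity scores on the calibration set: $s_i = s(X_i, Y_i)$ for $i = 1, \dots, n$
- Compute the $\lceil (n+1)(1-\alpha) \rceil / n$ quantile of the calibration scores: $\hat{q} = \text{Quantile}\big(\{s_1, \dots, s_n\};\ \lceil (n+1)(1-\alpha) \rceil / n\big)$, i.e. the $\lceil (n+1)(1-\alpha) \rceil$-th smallest score
- Return prediction set: $C(X_{n+1}) = \{y : s(X_{n+1}, y) \leq \hat{q}\}$
The coverage guarantee follows from the fact that, under exchangeability (with ties broken at random), the rank of $s_{n+1} = s(X_{n+1}, Y_{n+1})$ among $s_1, \dots, s_{n+1}$ is uniform on $\{1, \dots, n+1\}$. The corrected quantile ensures:
$$P\big(Y_{n+1} \in C(X_{n+1})\big) \geq 1 - \alpha$$
The $(n+1)/n$ correction is important: it makes the guarantee hold for finite $n$, not just asymptotically.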
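To make the rank argument concrete, here is a minimal, dependency-free sketch of the corrected quantile together with a small simulation. The helper names `conformal_quantile` and `simulate_coverage` are illustrative, and the exponential score distribution is an arbitrary assumption; any continuous distribution should behave the same way.

```python
import math
import random

def conformal_quantile(scores, alpha):
    """Return the ceil((n+1)(1-alpha))-th smallest calibration score.

    Returns +inf when that rank exceeds n (too few calibration points for
    the requested coverage), which corresponds to a trivial prediction set.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]

def simulate_coverage(n_cal=200, n_trials=2000, alpha=0.1, seed=0):
    """Fraction of trials in which a fresh score falls at or below q_hat."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # i.i.d. scores are exchangeable; the distribution itself is arbitrary.
        cal_scores = [rng.expovariate(1.0) for _ in range(n_cal)]
        test_score = rng.expovariate(1.0)
        hits += test_score <= conformal_quantile(cal_scores, alpha)
    return hits / n_trials

coverage = simulate_coverage()
print(f"empirical coverage over trials: {coverage:.3f}")
```

With $n = 200$ and $\alpha = 0.1$ the rank is 181, so for continuous scores the theoretical coverage is $181/201 \approx 0.9005$; the simulated value should land close to that, up to Monte Carlo noise.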
Nonconformity Scores
The choice of nonconformity score determines the shape of prediction sets and their efficiency (how small they are). Different scores are appropriate for different tasks.
For Classification
Score 1 - Softmax score (simple but conservative):
$$s(x, y) = 1 - \hat{\pi}_y(x)$$
where $\hat{\pi}_y(x)$ is the softmax probability assigned to class $y$. A label is included in the prediction set if the model assigns it a softmax probability of at least $1 - \hat{q}$.
Problem with the softmax score: the cut-off $1 - \hat{q}$ is the same for every input. When the model is uncertain (softmax probabilities are spread across classes), several classes can clear it; when the model is confident, only one does. But set size tracks input difficulty poorly: hard inputs can receive very small or even empty sets, easy inputs tend to be over-covered, and a low cut-off can admit very improbable classes.
Score 2 - Adaptive Prediction Sets (APS, Romano, Sesia, and Candès 2020):
APS is designed to produce prediction sets that are efficient (small) in the typical case and that adaptively include more classes when the model is uncertain.
The APS score cumulates softmax probabilities in descending order until the true class is covered:
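$$s(x, y) = \sum_{j=1}^{o_x(y) - 1} \hat{\pi}_{(j)}(x) + U \cdot \hat{\pi}_y(x)$$
where $\hat{\pi}_{(j)}(x)$ is the $j$-th largest softmax probability, $o_x(y)$ is the rank of class $y$ in that ordering, and $U \sim \text{Uniform}(0, 1)$ is a uniform random variable added for randomization (necessary to achieve exact rather than conservative coverage). The APS score is the sum of probabilities of classes ranked higher than $y$, plus a fraction $U$ of $y$'s own probability.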
APS sets adapt to difficulty: they stay small for easy examples where the model is confident and grow for ambiguous ones. The price is an average set size that is typically somewhat larger than with the naive softmax score.
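A small worked example may help when reading the APS definition. The sketch below uses a hypothetical four-class softmax vector, a hypothetical threshold `q_hat = 0.90`, and fixes the randomization variable at `u = 0.5` purely for readability; in a real run `u` would be drawn from Uniform(0, 1).

```python
def aps_score(probs, label, u):
    """APS score: mass of classes ranked above `label`, plus u times its own mass."""
    order = sorted(range(len(probs)), key=lambda k: probs[k], reverse=True)
    rank = order.index(label)  # 0 means the most probable class
    mass_above = sum(probs[k] for k in order[:rank])
    return mass_above + u * probs[label]

def aps_set(probs, q_hat, u):
    """All labels whose APS score is at most q_hat."""
    return [k for k in range(len(probs)) if aps_score(probs, k, u) <= q_hat]

probs = [0.05, 0.50, 0.30, 0.15]  # hypothetical softmax output for 4 classes
q_hat = 0.90                       # hypothetical calibrated threshold
u = 0.5                            # fixed for readability; normally random

for label in range(len(probs)):
    print(label, round(aps_score(probs, label, u), 3))
print("prediction set:", aps_set(probs, q_hat, u))
```

With these numbers the scores are 0.25 for class 1, 0.65 for class 2, 0.875 for class 3 and 0.975 for class 0, so the prediction set is `[1, 2, 3]`: classes are admitted in order of probability until the cumulative mass passes the threshold.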
For Regression
Score 1 - Absolute residual (simple):
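$$s(x, y) = |y - \hat{f}(x)|$$
where $\hat{f}(x)$ is the model's point prediction. The prediction interval is:
$$C(x) = \big[\hat{f}(x) - \hat{q},\ \hat{f}(x) + \hat{q}\big]$$
This produces symmetric intervals of constant width $2\hat{q}$. The width does not adapt to input uncertainty - it is the same for high-confidence and low-confidence regions.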
Score 2 - Conformalized Quantile Regression (CQR, Romano et al. 2019):
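CQR starts with a quantile regression model that estimates conditional quantiles $\hat{q}_{\text{lo}}(x)$ and $\hat{q}_{\text{hi}}(x)$ (the lower and upper ends of the prediction interval, typically at levels $\alpha/2$ and $1 - \alpha/2$). The CQR nonconformity score is:
$$s(x, y) = \max\big(\hat{q}_{\text{lo}}(x) - y,\ y - \hat{q}_{\text{hi}}(x)\big)$$
This measures how far outside the model's estimated quantile interval the true label falls (it is negative when the label lies inside). The prediction set is:
$$C(x) = \big[\hat{q}_{\text{lo}}(x) - \hat{q},\ \hat{q}_{\text{hi}}(x) + \hat{q}\big]$$
where $\hat{q}$ is the conformal quantile of the calibration scores.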
CQR produces adaptive intervals - wider where the quantile regression model is uncertain, narrower where it is confident - while maintaining the coverage guarantee.
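The following is a minimal, dependency-free sketch of the CQR correction step. The quantile "model" here is a pair of hypothetical hand-written functions, `q_lo` and `q_hi`, chosen to be deliberately too narrow (about one standard deviation either side), and the data generator is a toy assumption; a real pipeline would plug in fitted quantile regressors instead.

```python
import math
import random

def q_lo(x):
    """Hypothetical lower-quantile model; deliberately too narrow."""
    return 2.0 * x - (0.5 + 0.3 * x)

def q_hi(x):
    """Hypothetical upper-quantile model; deliberately too narrow."""
    return 2.0 * x + (0.5 + 0.3 * x)

def sample(rng, n):
    """Heteroscedastic toy data: the noise level grows with x."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 0.5 + 0.3 * x) for x in xs]
    return xs, ys

def cqr_calibrate(xs, ys, alpha):
    """Conformal correction: order statistic of max(q_lo - y, y - q_hi)."""
    scores = sorted(max(q_lo(x) - y, y - q_hi(x)) for x, y in zip(xs, ys))
    rank = math.ceil((len(scores) + 1) * (1 - alpha))
    # Simplification: cap at n. Strictly, q_hat is +inf when rank > n.
    return scores[min(rank, len(scores)) - 1]

def coverage(xs, ys, q_hat):
    """Fraction of points inside [q_lo(x) - q_hat, q_hi(x) + q_hat]."""
    inside = sum(q_lo(x) - q_hat <= y <= q_hi(x) + q_hat for x, y in zip(xs, ys))
    return inside / len(xs)

rng = random.Random(0)
x_cal, y_cal = sample(rng, 1000)
x_test, y_test = sample(rng, 2000)
q_hat = cqr_calibrate(x_cal, y_cal, alpha=0.1)

raw_cov = coverage(x_test, y_test, 0.0)
cqr_cov = coverage(x_test, y_test, q_hat)
print(f"q_hat={q_hat:.3f}  raw coverage={raw_cov:.3f}  CQR coverage={cqr_cov:.3f}")
```

Because the raw interval under-covers, the calibrated `q_hat` comes out positive and widens every interval by the same amount, while the input-dependent part of the width still comes from the quantile model.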
RAPS: Regularized Adaptive Prediction Sets
Angelopoulos, Bates, Malik, and Jordan (2021) introduced RAPS (Regularized Adaptive Prediction Sets) for classification, addressing a limitation of APS: APS prediction sets can be excessively large for hard examples where the model assigns nearly equal probability to many classes.
RAPS adds a regularization term to the APS score that penalizes large prediction sets:
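$$s_{\text{RAPS}}(x, y) = s_{\text{APS}}(x, y) + \lambda \cdot \max\big(o_x(y) - k_{\text{reg}},\ 0\big)$$
where $o_x(y)$ is the rank of class $y$ in the sorted probability list, $k_{\text{reg}}$ is a threshold (e.g., 5), and $\lambda$ is a regularization weight. The term $\lambda \cdot \max(o_x(y) - k_{\text{reg}}, 0)$ adds a linear penalty for including low-ranked classes, discouraging large prediction sets for hard examples.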
RAPS achieves smaller average prediction set size than APS (tighter uncertainty estimates) while maintaining the coverage guarantee.
Full Python Implementation
import numpy as np
from typing import List, Set, Optional, Tuple, Union
from dataclasses import dataclass
import warnings

warnings.filterwarnings("ignore")


@dataclass
class ConformalResult:
    """Output of conformal prediction."""

    prediction_set: Union[List[int], Tuple[float, float]]  # classes or interval
    coverage_level: float  # 1 - alpha
    quantile: float  # q_hat
    n_calibration: int
    method: str


class ConformalClassifier:
    """
    Split conformal classifier.

    Supports nonconformity scores:
    - "softmax": 1 - softmax_prob(true_class)
    - "aps": Adaptive Prediction Sets (Romano, Sesia, Candès 2020)
    - "raps": Regularized APS (Angelopoulos 2021)

    Usage:
        clf = ConformalClassifier(model, score="aps")
        clf.calibrate(X_cal, y_cal, alpha=0.1)  # 90% coverage
        result = clf.predict(X_test[0])
        print(result.prediction_set)  # e.g., [3, 7] - classes 3 and 7 are plausible
    """

    def __init__(
        self,
        model,
        score: str = "aps",
        raps_lambda: float = 0.01,
        raps_k_reg: int = 5,
        random_seed: int = 42,
    ):
        self.model = model
        self.score_type = score
        self.raps_lambda = raps_lambda
        self.raps_k_reg = raps_k_reg
        self.rng = np.random.default_rng(random_seed)
        self._q_hat: Optional[float] = None
        self._n_cal: int = 0
        self._alpha: float = 0.1
        self._classes: Optional[np.ndarray] = None

    def _get_softmax_probs(self, X: np.ndarray) -> np.ndarray:
        """Get class probabilities from model."""
        probs = self.model.predict_proba(X)
        return probs  # shape (n, n_classes)

    def _score_softmax(
        self, probs: np.ndarray, y: np.ndarray
    ) -> np.ndarray:
        """Score: 1 - softmax(true class). Shape (n,)."""
        n = len(y)
        true_class_probs = probs[np.arange(n), y.astype(int)]
        return 1.0 - true_class_probs

    def _score_aps(
        self, probs: np.ndarray, y: np.ndarray, randomize: bool = True
    ) -> np.ndarray:
        """
        Adaptive Prediction Sets score.
        For each sample: cumulate sorted softmax probabilities until true class.
        """
        n = len(y)
        scores = np.zeros(n)
        for i in range(n):
            p = probs[i]
            y_i = int(y[i])
            # Sort classes by probability descending
            sorted_indices = np.argsort(p)[::-1]
            sorted_probs = p[sorted_indices]
            # Find rank of true class
            true_rank = np.where(sorted_indices == y_i)[0][0]
            # Cumulate probabilities up to (but not including) true class
            cumsum = np.sum(sorted_probs[:true_rank])
            # Add randomized portion of true class probability
            u = self.rng.uniform(0, 1) if randomize else 0.5
            scores[i] = cumsum + u * sorted_probs[true_rank]
        return scores

    def _score_raps(
        self, probs: np.ndarray, y: np.ndarray
    ) -> np.ndarray:
        """RAPS: APS + regularization for large prediction sets."""
        aps_scores = self._score_aps(probs, y, randomize=True)
        n = len(y)
        regularization = np.zeros(n)
        for i in range(n):
            p = probs[i]
            y_i = int(y[i])
            sorted_indices = np.argsort(p)[::-1]
            true_rank = np.where(sorted_indices == y_i)[0][0] + 1  # 1-indexed
            regularization[i] = (
                self.raps_lambda * max(true_rank - self.raps_k_reg, 0)
            )
        return aps_scores + regularization

    def _compute_scores(self, X: np.ndarray, y: np.ndarray) -> np.ndarray:
        probs = self._get_softmax_probs(X)
        if self.score_type == "softmax":
            return self._score_softmax(probs, y)
        elif self.score_type == "aps":
            return self._score_aps(probs, y)
        elif self.score_type == "raps":
            return self._score_raps(probs, y)
        else:
            raise ValueError(f"Unknown score: {self.score_type}")

    def calibrate(
        self, X_cal: np.ndarray, y_cal: np.ndarray, alpha: float = 0.1
    ) -> None:
        """
        Calibrate using held-out calibration set.
        Sets q_hat such that prediction sets have >= 1-alpha coverage.
        alpha: miscoverage level (e.g., 0.1 for 90% coverage)
        """
        self._alpha = alpha
        self._n_cal = len(y_cal)
        self._classes = np.unique(y_cal)
        # Compute calibration scores
        cal_scores = self._compute_scores(X_cal, y_cal)
        # Conformal quantile: the ceil((n+1)(1-alpha))-th smallest score.
        # Taking the order statistic directly avoids np.quantile's default
        # linear interpolation, which can land below the required rank.
        n = len(cal_scores)
        rank = int(np.ceil((n + 1) * (1 - alpha)))
        if rank > n:
            # Too few calibration points for this alpha: only the trivial
            # (all-classes) set carries the guarantee.
            self._q_hat = float("inf")
        else:
            self._q_hat = float(np.sort(cal_scores)[rank - 1])
        print(
            f"Calibration complete: n_cal={n}, alpha={alpha}, "
            f"q_hat={self._q_hat:.4f}, "
            f"target_coverage={1-alpha:.1%}"
        )

    def predict(self, x: np.ndarray, top_k: Optional[int] = None) -> ConformalResult:
        """
        Return prediction set for a single test input.
        Prediction set = all classes y where score(x, y) <= q_hat.
        (top_k is accepted for API compatibility but is currently unused.)
        """
        if self._q_hat is None:
            raise RuntimeError("Call calibrate() first.")
        x = x.reshape(1, -1)
        probs = self._get_softmax_probs(x)[0]  # shape (n_classes,)
        n_classes = len(probs)

        # Rank the classes once; the ordering is shared by every candidate label.
        sorted_indices = np.argsort(probs)[::-1]
        sorted_probs = probs[sorted_indices]
        # One uniform draw per test input, so test scores are computed the same
        # way as the randomized calibration scores (a fixed u=0.5 would not be).
        u = self.rng.uniform(0, 1)

        included = []
        for y_candidate in range(n_classes):
            # Compute score for this candidate class
            if self.score_type == "softmax":
                s = 1.0 - probs[y_candidate]
            elif self.score_type in ("aps", "raps"):
                rank = np.where(sorted_indices == y_candidate)[0][0]
                s = np.sum(sorted_probs[:rank]) + u * sorted_probs[rank]
                if self.score_type == "raps":
                    s += self.raps_lambda * max(rank + 1 - self.raps_k_reg, 0)
            else:
                raise ValueError(f"Unknown score: {self.score_type}")
            if s <= self._q_hat:
                included.append(y_candidate)
        return ConformalResult(
            prediction_set=included,
            coverage_level=1 - self._alpha,
            quantile=self._q_hat,
            n_calibration=self._n_cal,
            method=f"split_conformal_{self.score_type}",
        )

    def predict_batch(
        self, X_test: np.ndarray
    ) -> List[ConformalResult]:
        """Predict conformal sets for all test points."""
        return [self.predict(x) for x in X_test]

    def empirical_coverage(
        self, X_test: np.ndarray, y_test: np.ndarray
    ) -> float:
        """Measure empirical coverage on test set (should be >= 1-alpha)."""
        covered = 0
        for x, y_true in zip(X_test, y_test):
            result = self.predict(x)
            if int(y_true) in result.prediction_set:
                covered += 1
        return covered / len(y_test)

    def avg_prediction_set_size(self, X_test: np.ndarray) -> float:
        """Average size of prediction sets (efficiency measure)."""
        sizes = [len(self.predict(x).prediction_set) for x in X_test]
        return float(np.mean(sizes))


class ConformalRegressor:
    """
    Split conformal regressor.

    Supports:
    - "residual": absolute residual score (constant-width intervals)
    - "cqr": Conformalized Quantile Regression (Romano 2019)

    For CQR, the model must be a quantile regressor with methods
    predict_lower(X) and predict_upper(X).
    """

    def __init__(
        self,
        model,
        score: str = "residual",
    ):
        self.model = model
        self.score_type = score
        self._q_hat: Optional[float] = None
        self._n_cal: int = 0
        self._alpha: float = 0.1

    def _score_residual(self, X: np.ndarray, y: np.ndarray) -> np.ndarray:
        """Absolute residual: |y - f_hat(x)|."""
        preds = self.model.predict(X)
        return np.abs(y - preds)

    def _score_cqr(self, X: np.ndarray, y: np.ndarray) -> np.ndarray:
        """
        CQR score: max(q_low(x) - y, y - q_high(x))
        Requires model to have predict_lower() and predict_upper() methods.
        """
        try:
            q_low = self.model.predict_lower(X)
            q_high = self.model.predict_upper(X)
        except AttributeError as exc:
            raise ValueError(
                "For CQR, model must have predict_lower(X) and predict_upper(X) methods. "
                "Wrap your quantile regression model accordingly."
            ) from exc
        return np.maximum(q_low - y, y - q_high)

    def calibrate(
        self, X_cal: np.ndarray, y_cal: np.ndarray, alpha: float = 0.1
    ) -> None:
        """Calibrate for 1-alpha coverage."""
        self._alpha = alpha
        self._n_cal = len(y_cal)
        if self.score_type == "residual":
            cal_scores = self._score_residual(X_cal, y_cal)
        elif self.score_type == "cqr":
            cal_scores = self._score_cqr(X_cal, y_cal)
        else:
            raise ValueError(f"Unknown score type: {self.score_type}")
        # Conformal quantile: the ceil((n+1)(1-alpha))-th smallest score,
        # taken as an order statistic rather than an interpolated quantile.
        n = len(cal_scores)
        rank = int(np.ceil((n + 1) * (1 - alpha)))
        if rank > n:
            # Too few calibration points: only an unbounded interval is valid.
            self._q_hat = float("inf")
        else:
            self._q_hat = float(np.sort(cal_scores)[rank - 1])
        print(
            f"Calibration: n_cal={n}, alpha={alpha}, "
            f"q_hat={self._q_hat:.4f}, "
            f"target_coverage={1-alpha:.1%}"
        )

    def predict(self, x: np.ndarray) -> ConformalResult:
        """Return prediction interval [lower, upper] with 1-alpha coverage."""
        if self._q_hat is None:
            raise RuntimeError("Call calibrate() first.")
        x = x.reshape(1, -1)
        if self.score_type == "residual":
            center = self.model.predict(x)[0]
            lower = center - self._q_hat
            upper = center + self._q_hat
        elif self.score_type == "cqr":
            q_low = self.model.predict_lower(x)[0]
            q_high = self.model.predict_upper(x)[0]
            lower = q_low - self._q_hat
            upper = q_high + self._q_hat
        else:
            raise ValueError(f"Unknown score: {self.score_type}")
        return ConformalResult(
            prediction_set=(lower, upper),
            coverage_level=1 - self._alpha,
            quantile=self._q_hat,
            n_calibration=self._n_cal,
            method=f"split_conformal_{self.score_type}",
        )

    def empirical_coverage(
        self, X_test: np.ndarray, y_test: np.ndarray
    ) -> float:
        """Fraction of test points covered by their prediction interval."""
        covered = 0
        for x, y_true in zip(X_test, y_test):
            result = self.predict(x)
            lower, upper = result.prediction_set
            if lower <= y_true <= upper:
                covered += 1
        return covered / len(y_test)

    def avg_interval_width(self, X_test: np.ndarray) -> float:
        """Average prediction interval width (efficiency measure)."""
        widths = []
        for x in X_test:
            result = self.predict(x)
            lower, upper = result.prediction_set
            widths.append(upper - lower)
        return float(np.mean(widths))


# ─── DEMO ─────────────────────────────────────────────────────────────────────
def run_classification_demo():
    """Demonstrate ConformalClassifier on a synthetic 5-class dataset."""
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    np.random.seed(42)
    X, y = make_classification(
        n_samples=5000, n_features=20, n_informative=10,
        n_classes=5, n_clusters_per_class=1, random_state=42
    )
    # Three-way split: train / calibrate / test
    X_train, X_temp, y_train, y_temp = train_test_split(
        X, y, test_size=0.4, random_state=42
    )
    X_cal, X_test, y_cal, y_test = train_test_split(
        X_temp, y_temp, test_size=0.5, random_state=42
    )
    # Train model
    model = GradientBoostingClassifier(
        n_estimators=200, max_depth=4, random_state=42
    )
    model.fit(X_train, y_train)
    base_acc = model.score(X_test, y_test)
    print(f"Base model accuracy: {base_acc:.4f}")
    # Compare score types
    print("\n--- Classification Conformal Prediction ---")
    for score_type in ["softmax", "aps", "raps"]:
        clf = ConformalClassifier(model, score=score_type)
        clf.calibrate(X_cal, y_cal, alpha=0.1)  # 90% coverage target
        cov = clf.empirical_coverage(X_test, y_test)
        avg_size = clf.avg_prediction_set_size(X_test)
        print(f"\n{score_type.upper()}: coverage={cov:.3f} (target >= 0.90), avg_set_size={avg_size:.2f}")
        # Show example prediction sets
        for i in range(3):
            result = clf.predict(X_test[i])
            true_label = y_test[i]
            in_set = true_label in result.prediction_set
            print(f"  Sample {i}: true={true_label}, set={result.prediction_set}, covered={in_set}")


def run_regression_demo():
    """Demonstrate ConformalRegressor on synthetic regression data."""
    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    np.random.seed(42)
    X, y = make_regression(
        n_samples=5000, n_features=15, n_informative=10,
        noise=25.0, random_state=42
    )
    X_train, X_temp, y_train, y_temp = train_test_split(
        X, y, test_size=0.4, random_state=42
    )
    X_cal, X_test, y_cal, y_test = train_test_split(
        X_temp, y_temp, test_size=0.5, random_state=42
    )
    # Train base model
    model = GradientBoostingRegressor(
        n_estimators=200, max_depth=4, random_state=42
    )
    model.fit(X_train, y_train)
    # Conformal regression with residual score (constant-width)
    print("\n--- Regression Conformal Prediction (Residual Score) ---")
    reg = ConformalRegressor(model, score="residual")
    reg.calibrate(X_cal, y_cal, alpha=0.1)  # 90% coverage
    cov = reg.empirical_coverage(X_test, y_test)
    avg_width = reg.avg_interval_width(X_test)
    print(f"Empirical coverage: {cov:.3f} (target >= 0.90)")
    print(f"Average interval width: {avg_width:.2f}")
    # Show examples
    for i in range(3):
        result = reg.predict(X_test[i])
        lower, upper = result.prediction_set
        true_val = y_test[i]
        covered = lower <= true_val <= upper
        print(
            f"Sample {i}: true={true_val:.2f}, "
            f"interval=[{lower:.2f}, {upper:.2f}], "
            f"width={upper-lower:.2f}, covered={covered}"
        )
    return reg


if __name__ == "__main__":
    run_classification_demo()
    run_regression_demo()
Historical Context: From Venn Predictors to Modern Conformal
Conformal prediction was first introduced by Vladimir Vovk, Alexander Gammerman, and Glenn Shafer in the late 1990s and formalized in their 2005 book "Algorithmic Learning in a Random World." The original framework - transductive conformal prediction - computed conformal scores by including the test point in the calibration set and checking its conformity. This was computationally expensive: for each candidate label $y$, you had to refit the underlying model or measure a score that depended on the entire dataset including the test point.
Inductive Conformal Prediction (split conformal), introduced in the same era and popularized through the work of Harris Papadopoulos, reduced the computation to a single model fit - making conformal prediction practical for real models. The key insight: split the calibration computation away from the model training, computing scores on a held-out calibration set after training is complete.
The modern renaissance of conformal prediction came from 2019–2023, driven by a handful of influential papers. Romano, Patterson, and Candès (2019) introduced Conformalized Quantile Regression, bringing adaptive prediction intervals to regression. Romano, Sesia, and Candès (2020) introduced Adaptive Prediction Sets (APS) for classification - a nonconformity score whose set sizes adapt to example difficulty - and Angelopoulos, Bates, Malik, and Jordan (2021) regularized it into RAPS to obtain smaller sets. Gibbs and Candès (2021) introduced Adaptive Conformal Inference for online settings with distribution shift. Together, these papers moved conformal prediction from a theoretical curiosity to a practical tool now being deployed in medical AI, autonomous systems, and LLM evaluation.
Conformal Risk Control: Beyond Coverage
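Angelopoulos, Bates, Fisch, Lei, and Schuster (2022) generalized conformal prediction beyond the coverage guarantee. Instead of controlling only the probability of coverage, conformal risk control allows you to control the expected value of any bounded, monotone loss function with a formal guarantee:
$$\mathbb{E}\big[\ell\big(C_{\hat{\lambda}}(X_{n+1}),\ Y_{n+1}\big)\big] \leq \alpha$$
where $C_\lambda$ is a family of prediction sets that grow with a threshold $\lambda$, and $\hat{\lambda}$ is chosen on the calibration set.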
Why this matters: Coverage is a binary loss - either the true label is in the set or it is not. But many real problems have graded losses. In medical diagnosis: missing a severe condition (true label not in set) is worse than missing a mild one. In object detection: a slightly wrong bounding box is better than a completely wrong one. Conformal risk control handles these cases by replacing the 0-1 coverage loss with any bounded, monotone loss function.
Examples of controlled risks:
- False negative rate in binary classification: control the fraction of positive cases missed by the prediction set
- Expected size of prediction set: control average set size while maintaining some coverage
- Graph distance in sequence prediction: control the average edit distance between the prediction set and the true sequence
- F1 score in multi-label classification: control expected F1 measured against the prediction set
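The algorithm mirrors split conformal: replace the coverage indicator with the loss $\ell(C_\lambda(X_i), Y_i)$ for prediction sets $C_\lambda$ that grow with a threshold $\lambda$, compute the empirical risk $\hat{R}_n(\lambda)$ on the calibration set, and pick the smallest $\lambda$ whose corrected risk $\frac{n}{n+1}\hat{R}_n(\lambda) + \frac{B}{n+1}$ is at most $\alpha$, where $B$ is the upper bound on the loss. The coverage guarantee generalizes: $\mathbb{E}[\ell(C_{\hat{\lambda}}(X_{n+1}), Y_{n+1})] \leq \alpha$ under exchangeability.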
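As a rough illustration of the threshold search, the sketch below controls the false negative rate of multi-label prediction sets $C_\lambda(x) = \{k : \hat{\pi}_k(x) \geq 1 - \lambda\}$. Everything about the data is a toy assumption: `make_example` is a hypothetical generator that draws high probabilities for true labels and low ones for false labels, and the search uses a simple grid rather than an exact infimum.

```python
import random

def fnr_loss(probs, truth, lam):
    """Fraction of true labels missing from C_lam = {k : probs[k] >= 1 - lam}."""
    true_idx = [k for k, t in enumerate(truth) if t]
    if not true_idx:
        return 0.0  # nothing to miss
    missed = sum(probs[k] < 1.0 - lam for k in true_idx)
    return missed / len(true_idx)

def calibrate_lambda(cal, alpha, bound=1.0, grid_size=200):
    """Smallest grid lambda with (n * R_hat(lam) + bound) / (n + 1) <= alpha."""
    n = len(cal)
    for step in range(grid_size + 1):
        lam = step / grid_size
        risk = sum(fnr_loss(p, t, lam) for p, t in cal) / n
        if (n * risk + bound) / (n + 1) <= alpha:
            return lam
    return 1.0  # lam = 1 includes every label, so the loss is 0

def make_example(rng, n_labels=5):
    """Hypothetical multi-label example: (predicted probabilities, truth flags)."""
    truth = [rng.random() < 0.3 for _ in range(n_labels)]
    probs = [rng.betavariate(4, 2) if t else rng.betavariate(2, 4) for t in truth]
    return probs, truth

rng = random.Random(0)
cal = [make_example(rng) for _ in range(500)]
test = [make_example(rng) for _ in range(2000)]

lam_hat = calibrate_lambda(cal, alpha=0.1)
test_fnr = sum(fnr_loss(p, t, lam_hat) for p, t in test) / len(test)
print(f"lambda_hat={lam_hat:.3f}  test FNR={test_fnr:.3f}  (target <= 0.10 in expectation)")
```

The loss is non-increasing in $\lambda$ because a larger $\lambda$ lowers the inclusion cut-off, which is the monotonicity the guarantee relies on. The guarantee is on the expected loss, so a single test-set average can sit slightly above or below $\alpha$.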
Weighted Conformal Prediction for Covariate Shift
When calibration and test distributions differ (covariate shift: $P_{\text{test}}(X) \neq P_{\text{cal}}(X)$ while $P(Y \mid X)$ stays the same), the coverage guarantee fails. Tibshirani, Barber, Candès, and Ramdas (2019) introduced weighted conformal prediction to restore the guarantee under covariate shift.
Idea: Weight the calibration scores by their likelihood ratio - points in the calibration set that are "similar to" the test point get higher weight, points that are "dissimilar" get lower weight.
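The weighted quantile is:
$$\hat{q}(x) = \text{Quantile}\Big(1 - \alpha;\ \sum_{i=1}^{n} p_i(x)\, \delta_{s_i} + p_{n+1}(x)\, \delta_{+\infty}\Big), \qquad p_i(x) = \frac{w(X_i)}{\sum_{j=1}^{n} w(X_j) + w(x)}, \quad p_{n+1}(x) = \frac{w(x)}{\sum_{j=1}^{n} w(X_j) + w(x)}$$
where $w(x) = dP_{\text{test}}(x) / dP_{\text{cal}}(x)$ are the likelihood ratio weights (estimated via a classifier trained to distinguish test from calibration data), and $p_{n+1}(x)$ is the weight assigned to infinity (which keeps the set conservative when the test point itself carries most of the weight).
In practice: Train a binary classifier to predict whether a point came from the calibration or test distribution. The classifier's output probability $\hat{p}(x)$ for "test" gives the likelihood ratio up to a constant, $w(x) \propto \hat{p}(x) / (1 - \hat{p}(x))$. This approach works when the test distribution is specified in advance (e.g., a known demographic subgroup), but is harder when the test distribution is unknown at calibration time.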
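A minimal sketch of the weighted quantile itself, assuming the likelihood-ratio weights have already been estimated. The function name `weighted_conformal_quantile` and the example numbers are illustrative only.

```python
import math

def weighted_conformal_quantile(scores, weights, test_weight, alpha):
    """(1 - alpha) quantile of the weighted score distribution.

    Calibration score s_i carries mass w_i / (sum(w) + w_test); the remaining
    mass w_test / (sum(w) + w_test) sits on a point at +infinity.
    """
    total = sum(weights) + test_weight
    target = (1.0 - alpha) * total  # compare on unnormalized weights
    cumulative = 0.0
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight
        if cumulative >= target:
            return score
    return math.inf  # the quantile falls on the point mass at +infinity

scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# Equal weights reduce to the ordinary split-conformal quantile.
print(weighted_conformal_quantile(scores, [1.0] * 9, 1.0, alpha=0.25))

# Up-weighting the calibration point with the largest score pushes q_hat up.
shifted = [1.0] * 8 + [5.0]
print(weighted_conformal_quantile(scores, shifted, 1.0, alpha=0.25))
```

With equal weights and $\alpha = 0.25$ the result is the 8th smallest score, matching $\lceil (n+1)(1-\alpha) \rceil = 8$ for $n = 9$. When the test point's own weight dominates, the function returns infinity, which is the conservative behavior described above.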
Adaptive Conformal Inference for Distribution Shift
Standard split conformal prediction requires exchangeability. When the test distribution shifts from the calibration distribution (a common occurrence in deployed ML systems), the coverage guarantee breaks down. Isaac Gibbs and Emmanuel Candès (2021) introduced Adaptive Conformal Inference (ACI) for this setting.
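ACI Algorithm: Instead of using a fixed quantile computed once from the calibration set, ACI updates the working miscoverage level $\alpha_t$ (and with it the quantile level $1 - \alpha_t$) at each time step based on whether the previous prediction set covered the true label:
$$\alpha_{t+1} = \alpha_t + \gamma\,(\alpha - \text{err}_t), \qquad \text{err}_t = \mathbb{1}\{Y_t \notin C_t(X_t)\}$$
where $\gamma > 0$ is a step size. If the previous prediction missed (coverage error: $\text{err}_t = 1$), $\alpha_t$ drops, so the quantile increases (prediction set gets larger). If it covered correctly (no error: $\text{err}_t = 0$), $\alpha_t$ rises, so the quantile decreases (prediction set gets smaller). This online update ensures that long-run average coverage tracks $1 - \alpha$ even under distribution shift.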
ACI is the right tool for time-series or streaming data where the distribution changes gradually over time - weather forecasting, financial risk, network intrusion detection.
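The update rule can be sketched in a few lines. This is an illustrative, dependency-free version, not a reference implementation: `run_aci` is a hypothetical helper, the quantile is taken over a sliding window of recent scores, and the score stream is a toy assumption in which the noise scale drifts from 1 to 3.

```python
import math
import random

def run_aci(scores, alpha=0.1, gamma=0.01, window=200):
    """Run ACI over a stream of nonconformity scores; return 0/1 miss indicators."""
    alpha_t = alpha
    history = list(scores[:window])  # warm-up scores act as calibration data
    errors = []
    for s in scores[window:]:
        if alpha_t <= 0.0:
            q_t = math.inf    # cover everything
        elif alpha_t >= 1.0:
            q_t = -math.inf   # empty set
        else:
            recent = sorted(history[-window:])
            idx = min(int((1.0 - alpha_t) * len(recent)), len(recent) - 1)
            q_t = recent[idx]
        err = 1 if s > q_t else 0
        errors.append(err)
        alpha_t += gamma * (alpha - err)  # miss: alpha_t drops, next set is wider
        history.append(s)
    return errors

rng = random.Random(0)
T = 5000
# Toy drift: the score scale grows linearly from 1 to 3 over the stream.
stream = [abs(rng.gauss(0.0, 1.0 + 2.0 * t / T)) for t in range(T)]

errors = run_aci(stream)
aci_miss = sum(errors) / len(errors)

# Baseline: a quantile fixed once from the first 200 scores.
fixed_q = sorted(stream[:200])[180]
fixed_miss = sum(s > fixed_q for s in stream[200:]) / (T - 200)
print(f"ACI miss rate={aci_miss:.3f}  fixed-quantile miss rate={fixed_miss:.3f}")
```

The fixed quantile under-covers more and more as the scale drifts, while the ACI miss rate stays near the target $\alpha = 0.1$. The step size $\gamma$ trades off how quickly ACI reacts against how much the set sizes fluctuate.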
Cross-Conformal and Jackknife+ for Small Calibration Sets
Split conformal requires holding out a calibration set from training data. When data is scarce (n < 200), this wasteful split hurts both model quality and calibration reliability. Two alternatives:
Cross-conformal: Split data into $K$ folds (like K-fold CV). Train $K$ models, each leaving one fold out. Use each model to compute scores for its left-out fold. Pool all scores as the calibration set. More data-efficient but adds computation (train $K$ models).
Jackknife+ (Barber et al. 2021): Use leave-one-out training. For each training point $i$, train the model on all data except point $i$ and compute the residual $R_i$ on point $i$. The prediction interval for a new point is built from quantiles of the leave-one-out predictions shifted by these residuals, $\hat{f}_{-i}(x) \pm R_i$. Jackknife+ has a coverage guarantee of $1 - 2\alpha$ (slightly weaker than split conformal's $1 - \alpha$) but uses the full dataset for model training.
For most practical settings where data is not scarce, split conformal with a 20% calibration split is sufficient. Use jackknife+ when data is very scarce or when every training point matters.
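For intuition, here is a small dependency-free sketch of jackknife+ around a one-dimensional least-squares line. The base model (`fit_line`), the toy data, and the helper name `jackknife_plus` are assumptions made for illustration; any regressor that can be refit $n$ times would slot in the same way.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def jackknife_plus(xs, ys, x_new, alpha):
    """Jackknife+ interval from leave-one-out fits and residuals."""
    n = len(xs)
    lows, highs = [], []
    for i in range(n):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        resid = abs(ys[i] - (a + b * xs[i]))  # leave-one-out residual R_i
        pred = a + b * x_new                  # LOO model's prediction at x_new
        lows.append(pred - resid)
        highs.append(pred + resid)
    lows.sort()
    highs.sort()
    k_low = math.floor(alpha * (n + 1))       # 1-indexed order statistics
    k_high = math.ceil((1 - alpha) * (n + 1))
    lower = lows[k_low - 1] if k_low >= 1 else -math.inf
    upper = highs[k_high - 1] if k_high <= n else math.inf
    return lower, upper

rng = random.Random(0)
xs = [rng.uniform(0.0, 1.0) for _ in range(60)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]

lower, upper = jackknife_plus(xs, ys, x_new=0.5, alpha=0.1)
print(f"jackknife+ interval at x=0.5: [{lower:.2f}, {upper:.2f}]")
```

Every point is used both for fitting and for calibration, at the cost of $n$ refits. For expensive models, the $K$-fold variant described above is the usual compromise.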
Production Monitoring of Conformal Coverage
Deploying conformal prediction does not end the work - you must monitor empirical coverage in production to verify that the exchangeability assumption holds.
Coverage monitoring pipeline:
import numpy as np
from collections import deque
from typing import Deque, Tuple
import time


class CoverageMonitor:
    """
    Monitors empirical coverage of conformal prediction sets in production.
    Maintains a rolling window of (prediction_set, true_label) pairs
    and computes empirical coverage over the window.
    Alerts when coverage drops below target level by more than a threshold.
    """

    def __init__(
        self,
        target_coverage: float = 0.90,
        window_size: int = 500,
        alert_threshold: float = 0.05,  # alert if coverage drops by more than 5%
    ):
        self.target_coverage = target_coverage
        self.window_size = window_size
        self.alert_threshold = alert_threshold
        # Rolling window of (is_covered: bool, timestamp: float)
        self._window: Deque[Tuple[bool, float]] = deque(maxlen=window_size)
        self._n_alerts = 0

    def record(
        self,
        prediction_set,  # list of class indices or (lower, upper) tuple
        true_label,  # int for classification, float for regression
    ) -> bool:
        """
        Record a new prediction and its outcome.
        Returns True if coverage alert is triggered.
        """
        # Check coverage
        if isinstance(prediction_set, (list, set)):
            covered = int(true_label) in prediction_set
        else:
            lower, upper = prediction_set
            covered = lower <= true_label <= upper
        self._window.append((covered, time.time()))
        return self._check_alert()

    def _below_threshold(self) -> bool:
        """True if rolling coverage is below target minus the alert threshold."""
        if len(self._window) < 50:  # need minimum samples for reliable estimate
            return False
        return self.empirical_coverage() < self.target_coverage - self.alert_threshold

    def _check_alert(self) -> bool:
        """Evaluate the alert condition and count it if triggered."""
        if self._below_threshold():
            self._n_alerts += 1
            return True
        return False

    def empirical_coverage(self) -> float:
        """Fraction of predictions where true label was covered."""
        if not self._window:
            return 1.0
        return sum(covered for covered, _ in self._window) / len(self._window)

    def coverage_report(self) -> dict:
        """Generate coverage report for monitoring dashboard."""
        return {
            "n_predictions": len(self._window),
            "empirical_coverage": round(self.empirical_coverage(), 4),
            "target_coverage": self.target_coverage,
            "coverage_gap": round(
                self.empirical_coverage() - self.target_coverage, 4
            ),
            # Read the current state without incrementing the alert counter.
            "alert": self._below_threshold(),
            "n_alerts_total": self._n_alerts,
        }


# Usage example
monitor = CoverageMonitor(target_coverage=0.90, window_size=500)
# Simulate incoming predictions
rng = np.random.default_rng(42)
for i in range(1000):
    # Simulate: true label is int in [0, 9]
    true_label = int(rng.integers(0, 10))
    # Simulate: prediction set contains true label ~91% of the time
    pred_set = [true_label] if rng.random() < 0.91 else [true_label + 1]
    alert = monitor.record(pred_set, true_label)
print(monitor.coverage_report())
# Expected: empirical_coverage ≈ 0.91, alert=False
Monitoring checklist in production:
- Log every prediction set size alongside the request ID and model version
- When true labels become available (delayed feedback), log coverage (was label in set)
- Compute empirical coverage in a rolling 500-prediction window
- Alert operations team when rolling coverage drops below $1 - \alpha - \epsilon$ (target coverage minus the alert tolerance $\epsilon$)
- Track average prediction set size over time - growing sets indicate increasing model uncertainty
- Monitor calibration score quantile over time - changes indicate input distribution shift
- Re-calibrate when coverage consistently deviates or when the model is retrained
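When choosing the rolling window and the alert tolerance, it helps to know how much a rolling coverage estimate fluctuates by chance alone. The sketch below uses the binomial standard error, assuming coverage events are independent and the target holds exactly; the helper names and the 3-sigma rule are illustrative choices, not a prescribed standard.

```python
import math

def coverage_std_error(target_coverage, window_size):
    """Binomial standard error of rolling coverage when the target holds exactly."""
    return math.sqrt(target_coverage * (1.0 - target_coverage) / window_size)

def alert_floor(target_coverage, window_size, z=3.0):
    """Rolling coverage below this value is unlikely to be sampling noise."""
    return target_coverage - z * coverage_std_error(target_coverage, window_size)

for window in (50, 500, 5000):
    se = coverage_std_error(0.90, window)
    floor = alert_floor(0.90, window)
    print(f"window={window:5d}  std_error={se:.4f}  3-sigma floor={floor:.4f}")
```

For a 90% target and a 500-prediction window the standard error is about 0.013, so a tolerance of 0.05 corresponds to a drop of several standard errors. With only 50 predictions the standard error is about 0.042, so the same tolerance would fire on noise fairly often.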
Conformal vs Bayesian Credible Intervals
| Dimension | Conformal Prediction | Bayesian Credible Interval |
|---|---|---|
| Coverage guarantee | Exact finite-sample, $\geq 1 - \alpha$ | Approximate - depends on prior/likelihood correctness |
| Distributional assumptions | Exchangeability only | Prior + likelihood must be correctly specified |
| Computational cost | $O(n \log n)$ calibration, $O(1)$ inference overhead | Expensive posterior inference (MCMC, VI) |
| Adaptivity | Depends on score choice | Naturally adaptive via posterior uncertainty |
| Distribution shift | Fails (breaks exchangeability) | Fails (model mismatch) |
| Model agnosticism | Fully agnostic - works with any model | Requires access to model's probability structure |
| Practical use case | Production ML, regulatory compliance | Research, scientific inference |
| Multi-task uncertainty | Requires multi-output conformal | Natural via joint posterior |
When to use conformal: When you need a hard coverage guarantee for a black-box model in production. When regulatory or compliance requirements demand provable uncertainty. When you cannot specify a correct prior (most deployed ML systems).
When to use Bayesian: When you need to express prior domain knowledge. When interpretability of the uncertainty source matters. When you are doing scientific inference and want a posterior over model parameters.
Computational Complexity
Conformal prediction is computationally lightweight:
- Calibration: $O(n)$ model evaluations + $O(n \log n)$ sorting for quantile computation. Run once after model training.
- Inference: $O(1)$ - for regression, two additions. For classification with $K$ classes, $O(K)$ score computations.
- Memory: Store calibration scores ($n$ floats) plus model weights.
Compare to Bayesian methods: MCMC requires on the order of $T$ model evaluations per chain for $T$ samples; variational inference requires an optimization loop. Conformal has essentially no inference overhead beyond the base model call.
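To make the cost accounting concrete, here is a minimal sketch of split conformal classification in plain Python. The function names (`calibrate_threshold`, `predict_set`), the score `1 - p_true`, and the toy probabilities are assumptions chosen for illustration, not an API from any library.

```python
import math

def calibrate_threshold(cal_probs, cal_labels, alpha):
    """Calibration: one score per calibration point, then a sort."""
    # Assumed nonconformity score: 1 - probability assigned to the true class.
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank; round() guards against floating-point noise.
    rank = math.ceil(round((n + 1) * (1 - alpha), 9))
    if rank > n:
        return math.inf  # too few calibration points: every class must be included
    return scores[rank - 1]

def predict_set(probs, q_hat):
    """Inference: one comparison per class, so O(K) on top of the model call."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= q_hat]

# Toy softmax outputs for 9 calibration examples over 3 classes (made-up numbers).
cal_probs = [
    [0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8],
    [0.5, 0.4, 0.1], [0.3, 0.6, 0.1], [0.2, 0.2, 0.6],
]
cal_labels = [0, 1, 2, 0, 1, 2, 0, 1, 2]

q_hat = calibrate_threshold(cal_probs, cal_labels, alpha=0.2)
new_set = predict_set([0.55, 0.35, 0.10], q_hat)
print(q_hat, new_set)
```

In this sketch only `q_hat` is needed at inference time. Keeping the full list of calibration scores is useful mainly for monitoring and re-calibration.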
Deployments: Waymo, Medical AI, and Language Models
Waymo (Object Detection): Conformal prediction sets for 3D bounding box regression - instead of outputting a single bounding box, the system outputs a region guaranteed to contain the true object position with 95% probability. This is safety-critical: an autonomous vehicle must know not just where it thinks a pedestrian is, but the envelope of uncertainty around that position.
Medical Diagnostics (FDA AI/ML): The FDA's proposed framework for AI/ML-based Software as a Medical Device increasingly requires quantified uncertainty bounds. Conformal prediction satisfies this requirement with provable guarantees that posterior predictive intervals from neural networks cannot provide without additional calibration assumptions.
Conformal Language Models (CONFORMAL-LANGUAGE, Quach et al. 2023): Conformal prediction for generative LLMs. Instead of returning a single sampled response, the method keeps sampling candidate responses and calibrates a stopping rule so that the returned set contains at least one acceptable response with a target probability, such as 90%. This gives a statistical handle on hallucination: the guarantee bounds how often every candidate in the set is wrong. The size of the set serves as a calibrated measure of uncertainty for each prompt.
Common Mistakes
:::danger Mistake 1: Violating the calibration–test split by using overlapping data
The coverage guarantee assumes the calibration set and test set are exchangeable with each other - which requires that calibration data was not used for model training. If you compute SHAP values or perform any post-hoc analysis on the calibration set that influences the model (e.g., retraining the model based on calibration errors), you break the exchangeability assumption and the coverage guarantee fails. The calibration set must be completely held out - used only for computing calibration scores and quantile, nothing else.
:::
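:::danger Mistake 2: Forgetting the (n+1) correction in the quantile computation
The standard empirical $(1-\alpha)$ quantile of the calibration scores is slightly too small, so it does not guarantee coverage. The correct threshold uses the finite-sample correction: $\hat{q}$ is the $\lceil (n+1)(1-\alpha) \rceil / n$ empirical quantile, i.e., the $\lceil (n+1)(1-\alpha) \rceil$-th smallest calibration score. Without this correction, coverage is slightly below $1-\alpha$ for small $n$. For large $n$ (>1000) the difference is negligible, but for $n = 20$ the uncorrected quantile can give 90.5% coverage ($19/21$) when you asked for 95%.
:::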
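The arithmetic behind this mistake is easy to check. The sketch below assumes exchangeable scores with no ties, so a threshold set at the `rank`-th smallest of `n` calibration scores covers the test point with probability `rank / (n + 1)`. The helper names are illustrative.

```python
import math

def coverage_of_rank(rank, n):
    """Coverage when the threshold is the rank-th smallest of n calibration scores.

    Under exchangeability with no ties, the test score's rank among all
    n + 1 scores is uniform, so coverage is rank / (n + 1). A rank above n
    means an infinite threshold, which covers everything.
    """
    return min(rank, n + 1) / (n + 1)

def compare(n, alpha):
    # round() guards against floating-point noise before taking the ceiling.
    naive_rank = math.ceil(round(n * (1 - alpha), 9))            # uncorrected quantile
    corrected_rank = math.ceil(round((n + 1) * (1 - alpha), 9))  # (n + 1) correction
    return coverage_of_rank(naive_rank, n), coverage_of_rank(corrected_rank, n)

for n in (20, 100, 1000):
    naive, corrected = compare(n, alpha=0.05)
    print(f"n={n:5d}  naive={naive:.4f}  corrected={corrected:.4f}")
```

For $n = 20$ and $\alpha = 0.05$ the uncorrected rank is 19, giving $19/21 \approx 0.905$, while the corrected rank is 20, giving $20/21 \approx 0.952$.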
:::warning Mistake 3: Using conformal prediction when exchangeability is violated
If your test data comes from a different distribution than your calibration data - different time period, different geography, different demographic - exchangeability is violated and the coverage guarantee fails. Quantify the covariate shift before relying on conformal guarantees. If shift is present, use Adaptive Conformal Inference (ACI) or weighted conformal prediction (Tibshirani et al. 2019), which adjusts the quantile based on covariate shift estimates.
:::
:::warning Mistake 4: Interpreting prediction set size as confidence
A prediction set of size 1 does not mean 100% confidence. It means the model's calibrated uncertainty at this point happens to exclude all other classes. A prediction set of size 4 does not mean the model is uncertain - it means the nonconformity scores for 4 classes all fell below the threshold. Prediction set size is an efficiency metric (smaller is better for a fixed coverage level), not a confidence metric. Always report empirical coverage alongside set size.
:::
YouTube Resources
| Resource | Creator | Focus |
|---|---|---|
| A Gentle Introduction to Conformal Prediction | Anastasios Angelopoulos | Tutorial from ICML 2022, split conformal from scratch |
| Conformal Prediction for Production ML | Emmanuel Candès | Stanford lecture: theory and applications |
| Adaptive Conformal Inference Under Distribution Shift | Isaac Gibbs | ACI paper walkthrough |
| Conformal Risk Control | Angelopoulos et al. | Extending conformal beyond coverage |
| Uncertainty Quantification for ML Practitioners | Chip Huyen | Practical overview including conformal prediction |
Interview Q&A
Q1: Derive the coverage guarantee for split conformal prediction. What role does exchangeability play?
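The proof is elegantly simple. Consider $n+1$ exchangeable random variables: $(X_1, Y_1), \dots, (X_n, Y_n)$ (calibration) and $(X_{n+1}, Y_{n+1})$ (test). Compute nonconformity scores $s_i = s(X_i, Y_i)$ for all $n+1$ points. By exchangeability, the joint distribution is invariant to permutations, so the rank of $s_{n+1}$ among $s_1, \dots, s_{n+1}$ is uniformly distributed on $\{1, \dots, n+1\}$ (assuming ties occur with probability zero or are broken at random). Define $\hat{q}$ as the $\lceil (n+1)(1-\alpha) \rceil / n$ empirical quantile of $s_1, \dots, s_n$, i.e., the $\lceil (n+1)(1-\alpha) \rceil$-th smallest calibration score. Then $P(s_{n+1} \le \hat{q}) = \lceil (n+1)(1-\alpha) \rceil / (n+1) \ge 1 - \alpha$. The prediction set $C(X_{n+1}) = \{y : s(X_{n+1}, y) \le \hat{q}\}$ contains $Y_{n+1}$ exactly when $s_{n+1} \le \hat{q}$. So $P(Y_{n+1} \in C(X_{n+1})) \ge 1 - \alpha$. Exchangeability is the only distributional assumption - it ensures the rank of $s_{n+1}$ is uniform.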
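The rank argument can be checked numerically. This sketch draws scores without replacement from a finite population, so they are exchangeable but not independent, and compares empirical coverage with $\lceil (n+1)(1-\alpha) \rceil / (n+1)$. The population, function name, and trial count are arbitrary choices for the illustration.

```python
import math
import random

def coverage_without_replacement(n, alpha, trials, seed=0):
    rng = random.Random(seed)
    population = list(range(10 * (n + 1)))  # distinct values, so no ties
    rank = math.ceil(round((n + 1) * (1 - alpha), 9))
    hits = 0
    for _ in range(trials):
        # Sampling without replacement: exchangeable, but NOT independent.
        draw = rng.sample(population, n + 1)
        cal, test_score = sorted(draw[:n]), draw[n]
        q_hat = cal[rank - 1] if rank <= n else math.inf
        hits += test_score <= q_hat
    return hits / trials

n, alpha = 20, 0.1
empirical = coverage_without_replacement(n, alpha, trials=20000)
exact = math.ceil(round((n + 1) * (1 - alpha), 9)) / (n + 1)
print(f"empirical={empirical:.3f}  exact={exact:.3f}  target={1 - alpha:.3f}")
```

The empirical value should sit close to the exact finite-sample coverage, which is at least the target $1 - \alpha$.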
Q2: What is the difference between exchangeability and i.i.d.? When does conformal prediction fail?
An i.i.d. sequence has each element independently drawn from the same distribution - independence and identical distribution. An exchangeable sequence has a joint distribution invariant to permutations - independence is not required, but the marginal distributions are necessarily identical. Every i.i.d. sequence is exchangeable; sampling without replacement from a finite population is exchangeable but not i.i.d.; a time series with autocorrelation is generally neither. Conformal prediction fails when the calibration and test data are not exchangeable with each other. Common failure modes: (1) covariate shift - test data comes from a different input distribution; (2) temporal drift - the test data is from a later time period with changed patterns; (3) selection bias - the calibration set was selected non-randomly. In practice, you can test for exchangeability violations by running a two-sample test (MMD, KS, or classifier-based) between calibration and test features. If the test rejects, the conformal guarantee is not reliable.
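As one possible version of the two-sample check mentioned above, here is a standard-library sketch of a Kolmogorov-Smirnov statistic with a permutation p-value on a single feature. The function names and the synthetic Gaussian features are assumptions for the example. A real deployment would more likely use a library test and would need to handle many features.

```python
import random

def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def permutation_p_value(cal, test, n_perm=500, seed=0):
    """P-value under the null that both samples are exchangeable draws."""
    rng = random.Random(seed)
    observed = ks_statistic(cal, test)
    pooled = list(cal) + list(test)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        d = ks_statistic(pooled[:len(cal)], pooled[len(cal):])
        exceed += d >= observed
    return (exceed + 1) / (n_perm + 1)

rng = random.Random(1)
cal_feature = [rng.gauss(0.0, 1.0) for _ in range(200)]
same_feature = [rng.gauss(0.0, 1.0) for _ in range(200)]
shifted_feature = [rng.gauss(1.5, 1.0) for _ in range(200)]  # simulated covariate shift

p_same = permutation_p_value(cal_feature, same_feature)
p_shifted = permutation_p_value(cal_feature, shifted_feature)
print(f"p (same distribution)    = {p_same:.3f}")
print(f"p (shifted distribution) = {p_shifted:.3f}")
```

A small p-value for the shifted feature signals that calibration and test inputs differ. A large p-value does not prove exchangeability, since the test only looks at one marginal.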
Q3: How does Conformalized Quantile Regression differ from standard conformal regression with absolute residual scores? When would you use each?
Standard conformal regression with the absolute residual score $s(x, y) = |y - \hat{f}(x)|$ produces constant-width prediction intervals: $[\hat{f}(x) - \hat{q},\ \hat{f}(x) + \hat{q}]$, where $\hat{q}$ is fixed. This is appropriate when the noise level is homoscedastic - the same variance across all input regions. CQR uses a quantile regression model to estimate conditional lower and upper quantiles $\hat{q}_{\text{lo}}(x)$ and $\hat{q}_{\text{hi}}(x)$, and the CQR nonconformity score is $s(x, y) = \max\{\hat{q}_{\text{lo}}(x) - y,\ y - \hat{q}_{\text{hi}}(x)\}$. The resulting CQR prediction interval $[\hat{q}_{\text{lo}}(x) - \hat{q},\ \hat{q}_{\text{hi}}(x) + \hat{q}]$ is wider where the quantile model estimates more uncertainty and narrower where it estimates less. This is adaptive - the interval width varies across input space. Use residual conformal when you believe noise is roughly homoscedastic. Use CQR when you have a well-calibrated quantile regression model and expect heteroscedastic noise - common in housing prices (high variance in luxury markets), financial returns (volatility clustering), and biological measurements (size-dependent variance).
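The difference between the two scores shows up clearly on a toy example. The numbers below are invented, and the point and quantile predictions stand in for the outputs of models that are assumed to exist already. The helper names are illustrative.

```python
import math

def conformal_quantile(scores, alpha):
    n = len(scores)
    rank = math.ceil(round((n + 1) * (1 - alpha), 9))
    return math.inf if rank > n else sorted(scores)[rank - 1]

def residual_interval(y_hat, q_hat):
    return (y_hat - q_hat, y_hat + q_hat)   # same width everywhere

def cqr_interval(lo, hi, q_hat):
    return (lo - q_hat, hi + q_hat)         # width follows the quantile model

# Toy calibration rows: (y_true, point prediction, lower quantile, upper quantile).
# Noise grows with y, so the assumed quantile band widens too.
cal = [
    (1.0, 1.1, 0.9, 1.3), (2.1, 2.0, 1.8, 2.2), (2.9, 3.0, 2.8, 3.2),
    (4.2, 4.0, 3.7, 4.3), (4.6, 5.0, 4.5, 5.5), (6.9, 6.0, 5.2, 6.8),
    (5.8, 7.0, 6.0, 8.0), (9.9, 8.0, 6.5, 9.5), (7.2, 9.0, 7.0, 11.0),
]
alpha = 0.2

residual_scores = [abs(y - y_hat) for y, y_hat, _, _ in cal]
cqr_scores = [max(lo - y, y - hi) for y, _, lo, hi in cal]
q_res = conformal_quantile(residual_scores, alpha)
q_cqr = conformal_quantile(cqr_scores, alpha)

def width(interval):
    return interval[1] - interval[0]

low_noise = (1.5, 1.3, 1.7)    # (y_hat, lo, hi) for an input in a quiet region
high_noise = (8.5, 6.2, 10.8)  # same, for an input in a noisy region
for name, (y_hat, lo, hi) in (("low noise", low_noise), ("high noise", high_noise)):
    w_res = width(residual_interval(y_hat, q_res))
    w_cqr = width(cqr_interval(lo, hi, q_cqr))
    print(f"{name:10s}  residual width={w_res:.2f}  CQR width={w_cqr:.2f}")
```

With this toy data the residual intervals have the same width in both regions, while the CQR interval is tighter in the quiet region and wider in the noisy one.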
Q4: A medical AI model is deployed. Six months after deployment, the calibration set is from pre-pandemic data but the test data is post-pandemic. How does this affect conformal coverage, and what can you do about it?
The pre-to-post-pandemic shift almost certainly violates exchangeability - patient demographics, disease presentations, and imaging protocols changed. The coverage guarantee is no longer valid, and in practice coverage has likely dropped below the target level (the prediction sets are too small for the new distribution). To address this: (1) Most robust fix: collect new post-pandemic labeled data and re-calibrate. Even 100–200 labeled examples from the new distribution are enough for split conformal calibration, although realized coverage fluctuates more with a small calibration set. (2) Weighted conformal prediction (Tibshirani et al. 2019): estimate the density ratio $w(x) = p_{\text{test}}(x) / p_{\text{cal}}(x)$ (e.g., with a classifier) and use importance-weighted quantile computation. This adjusts for covariate shift without requiring new labeled data for calibration, but it requires the new distribution to overlap with the old and does not correct for changes in $P(Y \mid X)$. (3) Adaptive Conformal Inference (ACI): deploy ACI, which updates the working miscoverage level, and therefore the quantile, based on observed coverage errors. Over time, it adapts to the new distribution. ACI needs true labels to arrive as feedback; it works well for gradual drift but converges slowly for sudden shifts. (4) Empirically monitor coverage: regularly compute empirical coverage on incoming labeled data (if available) and alert when coverage drops significantly below target.
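The ACI update in option (3) is short enough to sketch. The version below follows the rule $\alpha_{t+1} = \alpha_t + \gamma(\alpha - \text{err}_t)$ from Gibbs and Candès, where $\text{err}_t$ is 1 when the true label falls outside the set. The synthetic score streams, the step size `gamma=0.01`, and the function names are assumptions for this illustration, and true labels are assumed to arrive after each prediction.

```python
import math
import random

def quantile_at(sorted_scores, level):
    """Empirical quantile, including the boundary cases ACI can reach."""
    if level >= 1.0:
        return math.inf    # alpha_t <= 0: the set includes everything
    if level <= 0.0:
        return -math.inf   # alpha_t >= 1: the set is empty
    n = len(sorted_scores)
    rank = math.ceil((n + 1) * level)
    return math.inf if rank > n else sorted_scores[rank - 1]

def run_aci(cal_scores, test_scores, alpha, gamma):
    cal = sorted(cal_scores)
    alpha_t = alpha
    errors = []
    for s in test_scores:
        q_t = quantile_at(cal, 1.0 - alpha_t)
        err = 1 if s > q_t else 0            # 1 means the true label was missed
        errors.append(err)
        alpha_t += gamma * (alpha - err)     # shrink alpha_t after a miss
    return errors

rng = random.Random(0)
cal_scores = [rng.random() for _ in range(500)]             # pre-shift scores
shifted = [rng.random() ** 0.25 for _ in range(5000)]       # post-shift scores skew larger

fixed_q = quantile_at(sorted(cal_scores), 0.9)
fixed_miss = sum(s > fixed_q for s in shifted) / len(shifted)
aci_miss = sum(run_aci(cal_scores, shifted, alpha=0.1, gamma=0.01)) / len(shifted)
print(f"miss rate with fixed threshold: {fixed_miss:.3f}")
print(f"miss rate with ACI:             {aci_miss:.3f}")
```

The fixed threshold misses far more often than the 10% target on the shifted stream, while the ACI miss rate returns to the target in the long run. A larger `gamma` adapts faster at the cost of noisier set sizes.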
Q5: What is the tradeoff between prediction set size and coverage level? How do you choose alpha in practice?
Coverage level and average prediction set size trade off directly. Higher coverage ($1 - \alpha$ closer to 1) requires a larger threshold $\hat{q}$, which includes more candidates in the prediction set. Lower coverage allows smaller sets but misses the true label more often. The right $\alpha$ depends on the application: for medical diagnosis (where a miss could be a rare cancer), require 99% coverage - accept large prediction sets because the cost of missing is high. For product recommendations (show user a set of options), 80% coverage may be sufficient - smaller, more curated sets are better user experience. For autonomous driving (object detection), 99.9% coverage may be required by safety standards. The efficiency of the nonconformity score also matters: APS and RAPS achieve smaller average set sizes than naive softmax scoring at the same coverage level. Compare methods by plotting the coverage-vs-size Pareto frontier: run calibration at multiple $\alpha$ values and plot (coverage, avg_set_size) pairs for each method. The method that achieves the smallest set size at each coverage level dominates.
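The frontier described above can be computed with a small loop. This sketch uses a synthetic five-class "model" (random logits with a boost on the true class) and the simple `1 - p_true` score. Both are assumptions for the illustration, and a real comparison would run the same loop once per score function.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def make_data(n, n_classes, rng):
    data = []
    for _ in range(n):
        label = rng.randrange(n_classes)
        logits = [rng.gauss(0.0, 1.0) for _ in range(n_classes)]
        logits[label] += 2.0  # boost the true class: a decent but imperfect model
        data.append((softmax(logits), label))
    return data

def frontier(cal, evaluation, alphas):
    scores = sorted(1.0 - probs[label] for probs, label in cal)
    n = len(scores)
    points = []
    for alpha in alphas:
        rank = math.ceil(round((n + 1) * (1 - alpha), 9))
        q_hat = math.inf if rank > n else scores[rank - 1]
        sets = [[k for k, p in enumerate(probs) if 1.0 - p <= q_hat]
                for probs, _ in evaluation]
        coverage = sum(label in s for s, (_, label) in zip(sets, evaluation)) / len(evaluation)
        avg_size = sum(len(s) for s in sets) / len(evaluation)
        points.append((alpha, coverage, avg_size))
    return points

rng = random.Random(0)
cal = make_data(1000, 5, rng)
evaluation = make_data(2000, 5, rng)
points = frontier(cal, evaluation, alphas=(0.2, 0.1, 0.05))
for alpha, coverage, avg_size in points:
    print(f"alpha={alpha:.2f}  coverage={coverage:.3f}  avg_set_size={avg_size:.2f}")
```

Each row is one point on the frontier: as $\alpha$ shrinks, coverage rises and the average set grows.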
Key Takeaways
Conformal prediction provides a finite-sample coverage guarantee, $P(Y_{n+1} \in C(X_{n+1})) \ge 1 - \alpha$, under only the exchangeability assumption, with no further distributional assumptions and for any underlying model. The algorithm is simple: split data into train/calibrate/test, compute nonconformity scores on the calibration set, take the corrected quantile, include all candidates where the score is below the threshold. The choice of nonconformity score determines efficiency: APS and RAPS produce smaller prediction sets than the naive softmax score at the same coverage level; CQR produces adaptive regression intervals. The main limitation is the exchangeability assumption - violated by distribution shift. Adaptive Conformal Inference addresses gradual drift. Conformal prediction is the right tool when you need a hard coverage guarantee for a black-box model without distributional assumptions, which describes most deployed production ML systems in regulated industries.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Conformal Prediction Coverage demo on the EngineersOfAI Playground - no code required.
:::
