What is conformal prediction?

Conformal prediction constructs prediction sets with provable finite-sample coverage guarantees under only the exchangeability assumption - no distributional assumptions required. This lesson includes a complete Python implementation for classification and regression.

Conformal Prediction - Distribution-Free Uncertainty with Guaranteed Coverage

Reading time: 50 min | Interview relevance: Very High - appears in interviews for ML Engineer, AI Safety Engineer, Applied Scientist at any company deploying ML in regulated or safety-critical contexts | Target roles: ML Engineer, Applied Scientist, AI Safety Engineer, MLOps Engineer

The Diagnostic AI That Cannot Afford to Be Wrong Alone

It is 2023 in a large hospital system in the northeastern United States. The radiology department has deployed a diagnostic AI for chest X-ray interpretation. The model is a ResNet-50 fine-tuned on 150,000 labeled X-rays, with a top-1 accuracy of 91.3% on the held-out test set. The radiology team is enthusiastic: the model processes a study in 0.3 seconds versus 15 minutes for a radiologist, and it catches patterns that tired human eyes miss.

But the deployment team faces a hard problem from hospital legal and administration. When the model says "pneumonia" with a softmax confidence of 0.87, what does 0.87 actually mean? In neural network classification, softmax scores are notoriously miscalibrated. A model that outputs 0.87 might be right only 70% of the time for that confidence level. More importantly: the hospital needs to know not just "the most likely diagnosis" but "the set of diagnoses we can confidently rule out." For a patient presenting with respiratory symptoms, if the model can say "I'm 95% sure the correct diagnosis is one of {pneumonia, COVID-19, normal} - and I can rule out tuberculosis and lung cancer," that is clinically actionable. If the model can only say "pneumonia: 0.87, normal: 0.07, COVID-19: 0.04," the clinical team has to do additional work to interpret what 0.87 means and whether it provides the confidence level they need.

What the hospital needs is a prediction set: a set of diagnoses that contains the true diagnosis with provable probability, say 95%, regardless of the model's calibration quality and without any assumptions about the form of the underlying data distribution, provided the test cases are exchangeable with the calibration cases. Conformal prediction delivers exactly this. The framework dates to the late 1990s, and a wave of papers from 2019 onward (including work by Angelopoulos and Bates, Tibshirani's group, and Candès's group) turned it into a practical toolkit.

Why Standard Confidence Intervals Fall Short

Before conformal prediction, there were two main approaches to uncertainty quantification in ML:

Approach 1: Model confidence (softmax score): Use the model's output probability directly. "The model says 87% probability of pneumonia, so we are 87% confident." Problem: neural network softmax scores are not calibrated probabilities. Guo et al. (2017) showed that modern neural networks are systematically overconfident - a model that says 95% is often right only 75–80% of the time. Temperature scaling can help but does not provide formal guarantees.

Approach 2: Bayesian credible intervals: Place a prior over model parameters, update to a posterior given data, and compute credible intervals from the posterior predictive distribution. Advantage: principled uncertainty. Disadvantage: the coverage guarantee (a 95% Bayesian credible interval contains the true value 95% of the time) holds only if the prior and likelihood are correctly specified. In practice, both are approximations, and the coverage guarantee is approximate. For a neural network, computing the exact posterior is intractable.

Conformal prediction takes a different approach. It says: we do not need to model the data distribution. We do not need a correctly specified prior. We only need to assume exchangeability - a condition much weaker than i.i.d. The coverage guarantee is exact and finite-sample: with exactly $n$ calibration points, the coverage is guaranteed to be at least $1 - \alpha$ for any model, any data distribution, any feature space.

$P(Y_{n+1} \in \hat{C}(X_{n+1})) \geq 1 - \alpha$

This inequality holds regardless of whether the model is a logistic regression or a 70-billion parameter language model, regardless of whether the data is Gaussian or multimodal, regardless of sample size. The only assumption is exchangeability of the calibration and test data.

The Exchangeability Assumption

Exchangeability is weaker than i.i.d. (independent and identically distributed). A sequence of random variables $Z_1, Z_2, \ldots, Z_n$ is exchangeable if their joint distribution is invariant to any permutation:

$P(Z_1 = z_1, \ldots, Z_n = z_n) = P(Z_{\pi(1)} = z_1, \ldots, Z_{\pi(n)} = z_n)$

for any permutation $\pi$ of $\{1, \ldots, n\}$ .

Every i.i.d. sequence is exchangeable, but not every exchangeable sequence is i.i.d. One major non-i.i.d. case still fails the assumption: time series with temporal dependence, where the order of observations matters. Exchangeability does allow cases such as sampling without replacement from a finite population, mixture models (as long as the mixture proportions are fixed, even if unknown), and group-correlated data (as long as group membership is exchangeable).

What exchangeability does not allow: distribution shift between calibration and test time. If the calibration data is from one hospital and the test data is from a different hospital with a different patient population, the exchangeability assumption is violated and the coverage guarantee no longer holds. This is the main practical limitation of standard conformal prediction - addressed later by Adaptive Conformal Inference.
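To make this failure concrete, here is a minimal simulation. It assumes a hypothetical model whose absolute residuals serve as nonconformity scores; the noise scales and sample sizes are illustrative choices, not values from a real deployment. The threshold is calibrated on one population and then applied both to exchangeable test data and to a noisier, shifted population.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1


def conformal_quantile(scores: np.ndarray, alpha: float) -> float:
    """k-th smallest score with k = ceil((n+1)(1-alpha)); inf if k > n."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return float(np.sort(scores)[k - 1]) if k <= n else float("inf")


# Hypothetical setup: scores are absolute residuals of some fitted model.
cal_scores = np.abs(rng.normal(0.0, 1.0, size=1000))       # calibration population
q_hat = conformal_quantile(cal_scores, alpha)

same_scores = np.abs(rng.normal(0.0, 1.0, size=20000))     # exchangeable test data
shifted_scores = np.abs(rng.normal(0.0, 2.0, size=20000))  # noisier population

cov_same = float(np.mean(same_scores <= q_hat))
cov_shifted = float(np.mean(shifted_scores <= q_hat))
print(f"coverage, exchangeable test data: {cov_same:.3f}")
print(f"coverage, shifted test data:      {cov_shifted:.3f}")
```

On exchangeable data the empirical coverage should sit near the 90% target, while on the shifted population it should fall well short, because the calibrated threshold no longer reflects the size of the test-time errors.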

Split Conformal Prediction: The Core Algorithm

There are many variants of conformal prediction. We focus on split conformal prediction (also called inductive conformal prediction) because it is simple, computationally efficient, and the most widely deployed in practice.

Setup:

Training set $\mathcal{D}_{\text{train}}$ - used to fit the underlying model $\hat{f}$
Calibration set $\mathcal{D}_{\text{cal}} = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ - held out from training, used only for conformal calibration
Test point $(X_{n+1}, Y_{n+1})$ - the new observation we want to predict with uncertainty

Algorithm (Split Conformal):

1. Train model $\hat{f}$ on $\mathcal{D}_{\text{train}}$ (any model - this is fully model-agnostic)
2. Choose a nonconformity score function $s(x, y)$ that measures "how unusual" the pair $(x, y)$ is according to $\hat{f}$
3. Compute nonconformity scores on the calibration set: $s_i = s(X_i, Y_i)$ for $i = 1, \ldots, n$
4. Compute the conformal quantile, i.e. the $\lceil (n+1)(1-\alpha) \rceil$-th smallest calibration score, which is the empirical quantile at a level slightly above $1-\alpha$ : $\hat{q} = \text{Quantile}(s_1, \ldots, s_n; \frac{\lceil (n+1)(1-\alpha) \rceil}{n})$
5. Return prediction set: $\hat{C}(X_{n+1}) = \{y : s(X_{n+1}, y) \leq \hat{q}\}$

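The coverage guarantee follows from the fact that, under exchangeability (and with ties broken at random), the rank of $s_{n+1}$ among $\{s_1, \ldots, s_{n+1}\}$ is uniform on $\{1, \ldots, n+1\}$ . Hence $s_{n+1}$ is at most the $k$-th smallest calibration score with probability at least $k/(n+1)$ . Taking $k = \lceil (n+1)(1-\alpha) \rceil$ , which is the order statistic that the corrected quantile level $\frac{\lceil(n+1)(1-\alpha)\rceil}{n}$ selects, gives $k/(n+1) \geq 1 - \alpha$ and therefore: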

$P(Y_{n+1} \in \hat{C}(X_{n+1})) \geq 1 - \alpha$

The $n+1$ correction is important: it makes the guarantee hold for finite $n$ , not just asymptotically.
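The effect of the correction can be checked numerically. The sketch below is an illustration under assumed settings (i.i.d. exponential scores, $n = 20$ , $\alpha = 0.1$ , all chosen arbitrarily): it repeats the calibrate-then-test experiment many times and compares the corrected rank $\lceil (n+1)(1-\alpha) \rceil$ with the naive rank $\lceil n(1-\alpha) \rceil$ .

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n, trials = 0.1, 20, 20000

# Each row holds n calibration scores followed by one test score, all i.i.d.
# (and therefore exchangeable). The exponential distribution is arbitrary.
scores = rng.exponential(size=(trials, n + 1))
cal = np.sort(scores[:, :n], axis=1)
test = scores[:, n]

k_corrected = int(np.ceil((n + 1) * (1 - alpha)))  # finite-sample corrected rank
k_naive = int(np.ceil(n * (1 - alpha)))            # plain (1 - alpha) quantile rank

# Fraction of trials in which the test score falls at or below the threshold.
cov_corrected = float(np.mean(test <= cal[:, k_corrected - 1]))
cov_naive = float(np.mean(test <= cal[:, k_naive - 1]))

print(f"corrected: {cov_corrected:.3f} (theory {k_corrected / (n + 1):.3f})")
print(f"naive:     {cov_naive:.3f} (theory {k_naive / (n + 1):.3f})")
```

Because the rank of the test score is uniform, the coverage of the $k$-th smallest score is $k/(n+1)$ : the corrected rank lands at or above $1-\alpha$ , while the naive rank falls below it when $n$ is small.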

Nonconformity Scores

The choice of nonconformity score determines the shape of prediction sets and their efficiency (how small they are). Different scores are appropriate for different tasks.

For Classification

Score 1 - Softmax score (simple but conservative):

$s(x, y) = 1 - \hat{f}(x)_y$

where $\hat{f}(x)_y$ is the softmax probability assigned to class $y$ . A label $y$ is included in the prediction set if the model assigns it a softmax probability of at least $1 - \hat{q}$ .

Problem with the softmax score: a single global threshold $1 - \hat{q}$ is applied to every input. Set size does vary with how spread out the softmax probabilities are, but coverage is guaranteed only on average: the sets tend to over-cover easy inputs and under-cover hard ones, and an input on which no class reaches the threshold receives an empty set.
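As a small worked example of the thresholding rule, the sketch below reuses the softmax values from the chest X-ray scenario. The class order, the two smallest probabilities, and the calibrated threshold `q_hat = 0.95` are assumptions made for illustration.

```python
import numpy as np

# Hypothetical class order and calibrated threshold, for illustration only.
classes = ["pneumonia", "normal", "COVID-19", "tuberculosis", "lung cancer"]
probs = np.array([0.87, 0.07, 0.04, 0.015, 0.005])  # softmax output for one X-ray
q_hat = 0.95  # assumed result of calibration


def softmax_score_set(probs: np.ndarray, q_hat: float) -> list:
    """Indices y whose score s(x, y) = 1 - probs[y] is at most q_hat."""
    scores = 1.0 - probs
    return [int(y) for y in np.flatnonzero(scores <= q_hat)]


pred_set = softmax_score_set(probs, q_hat)
print([classes[y] for y in pred_set])

# With a stricter threshold and a flat softmax output, no class qualifies.
print(softmax_score_set(np.array([0.4, 0.3, 0.3]), q_hat=0.5))
```

With these numbers only the classes with probability at least $1 - 0.95 = 0.05$ are kept, and the second call shows how a flat output combined with a strict threshold yields an empty set.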

Score 2 - Adaptive Prediction Sets (APS, Romano, Sesia, and Candès 2020):

APS is designed to produce prediction sets that are efficient (small) in the typical case and that adaptively include more classes when the model is uncertain.

The APS score cumulates softmax probabilities in descending order until the true class is covered:

$s_{\text{APS}}(x, y) = \sum_{y' : \hat{f}(x)_{y'} > \hat{f}(x)_y} \hat{f}(x)_{y'} + u \cdot \hat{f}(x)_y$

where $u \sim \text{Uniform}(0, 1)$ is a uniform random variable added for randomization (necessary to achieve exact rather than conservative coverage). The APS score is the sum of probabilities of classes ranked higher than $y$ , plus a fraction of $y$ 's own probability.

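APS sets adapt to difficulty: they are small for easy examples where the model is confident and larger for hard ones. On average they are typically somewhat larger than the sets from the naive softmax score, which is the price paid for coverage that is more even across easy and hard inputs.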
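The APS score can be traced by hand on a single probability vector. The sketch below is a direct transcription of the formula above; the softmax vector and the threshold `q_hat = 0.85` are made-up values, and `u` is passed explicitly so the result is reproducible (in a real run `u` would be drawn uniformly at random).

```python
import numpy as np


def aps_score(probs: np.ndarray, y: int, u: float) -> float:
    """Mass of classes strictly more probable than y, plus u times y's own mass."""
    higher_mass = probs[probs > probs[y]].sum()
    return float(higher_mass + u * probs[y])


def aps_set(probs: np.ndarray, q_hat: float, u: float) -> list:
    """All classes whose APS score is at most q_hat (same u for every class)."""
    return [y for y in range(len(probs)) if aps_score(probs, y, u) <= q_hat]


probs = np.array([0.50, 0.30, 0.15, 0.05])  # illustrative softmax output
q_hat = 0.85                                # assumed calibrated threshold

for y in range(len(probs)):
    print(y, round(aps_score(probs, y, u=0.5), 3))

print(aps_set(probs, q_hat, u=0.5))  # class 2 scores 0.875 and is excluded
print(aps_set(probs, q_hat, u=0.0))  # class 2 scores 0.80 and is included
```

The two calls differ only in `u` , which shows what the randomization does: a class sitting on the boundary of the set is included with a probability that depends on how much of its mass is needed to reach the threshold.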

For Regression

Score 1 - Absolute residual (simple):

$s(x, y) = |y - \hat{f}(x)|$

where $\hat{f}(x)$ is the model's point prediction. The prediction interval is:

$\hat{C}(x_{n+1}) = [\hat{f}(x_{n+1}) - \hat{q}, \; \hat{f}(x_{n+1}) + \hat{q}]$

This produces symmetric intervals of constant width $2\hat{q}$ . The width does not adapt to input uncertainty - it is the same for high-confidence and low-confidence regions.
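The constant-width limitation is easy to see on heteroscedastic data. The following sketch assumes synthetic data in which the noise grows with $x$ and uses a least-squares line as a stand-in for $\hat{f}$ ; both choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1


def make_data(n: int):
    """Synthetic data (an assumption): y = 2x + noise, with noise growing in x."""
    x = rng.uniform(0.0, 2.0, size=n)
    y = 2.0 * x + rng.normal(0.0, 0.2 + x)
    return x, y


x_train, y_train = make_data(2000)
x_cal, y_cal = make_data(2000)
x_test, y_test = make_data(20000)

coef = np.polyfit(x_train, y_train, deg=1)  # simple stand-in for f_hat


def f_hat(x):
    return np.polyval(coef, x)


# Calibrate: k-th smallest absolute residual, k = ceil((n+1)(1-alpha)).
scores = np.sort(np.abs(y_cal - f_hat(x_cal)))
k = int(np.ceil((len(scores) + 1) * (1 - alpha)))
q_hat = scores[k - 1]

lower, upper = f_hat(x_test) - q_hat, f_hat(x_test) + q_hat
covered = (lower <= y_test) & (y_test <= upper)

cov_all = float(covered.mean())
cov_low_noise = float(covered[x_test < 1.0].mean())
cov_high_noise = float(covered[x_test >= 1.0].mean())
print(f"overall={cov_all:.3f} low-noise={cov_low_noise:.3f} high-noise={cov_high_noise:.3f}")
```

Overall coverage should land near the target, but it is uneven: the interval is wider than necessary where the noise is small and too narrow where it is large, which is exactly what a constant width implies.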

Score 2 - Conformalized Quantile Regression (CQR, Romano et al. 2019):

CQR starts with a quantile regression model that estimates conditional quantiles $\hat{q}_{\alpha/2}(x)$ and $\hat{q}_{1-\alpha/2}(x)$ (the lower and upper ends of the prediction interval). The CQR nonconformity score is:

$s_{\text{CQR}}(x, y) = \max\left(\hat{q}_{\alpha/2}(x) - y, \; y - \hat{q}_{1-\alpha/2}(x)\right)$

This measures how far outside the model's estimated quantile interval the true label $y$ falls. The prediction set is:

$\hat{C}(x_{n+1}) = [\hat{q}_{\alpha/2}(x_{n+1}) - \hat{q}, \; \hat{q}_{1-\alpha/2}(x_{n+1}) + \hat{q}]$

CQR produces adaptive intervals - wider where the quantile regression model is uncertain, narrower where it is confident - while maintaining the coverage guarantee.
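A minimal sketch of the CQR correction follows. To keep it self-contained, the lower and upper quantile estimates are hypothetical closed-form functions that are deliberately too narrow, rather than a trained quantile regression model; the data-generating process is also an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1


def make_data(n: int):
    """Synthetic data (an assumption): y = 2x + noise, with noise growing in x."""
    x = rng.uniform(0.0, 2.0, size=n)
    y = 2.0 * x + rng.normal(0.0, 0.2 + x)
    return x, y


# Hypothetical quantile estimates; 1.2 standard deviations is too narrow for 90%.
def q_low(x):
    return 2.0 * x - 1.2 * (0.2 + x)


def q_high(x):
    return 2.0 * x + 1.2 * (0.2 + x)


x_cal, y_cal = make_data(2000)
x_test, y_test = make_data(20000)

# CQR score: positive when y falls outside [q_low, q_high], negative when inside.
scores = np.sort(np.maximum(q_low(x_cal) - y_cal, y_cal - q_high(x_cal)))
k = int(np.ceil((len(scores) + 1) * (1 - alpha)))
q_hat = float(scores[k - 1])


def interval_width(x):
    return (q_high(x) + q_hat) - (q_low(x) - q_hat)


raw_cov = float(np.mean((q_low(x_test) <= y_test) & (y_test <= q_high(x_test))))
cqr_cov = float(np.mean((q_low(x_test) - q_hat <= y_test)
                        & (y_test <= q_high(x_test) + q_hat)))
print(f"raw band coverage={raw_cov:.3f}, CQR coverage={cqr_cov:.3f}, q_hat={q_hat:.3f}")
print(f"width at x=0.1: {interval_width(0.1):.2f}, at x=1.9: {interval_width(1.9):.2f}")
```

The raw band under-covers, the conformal correction widens it by a single constant $\hat{q}$ , and the corrected interval keeps the input-dependent shape of the underlying quantile estimates.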

RAPS: Regularized Adaptive Prediction Sets

Angelopoulos, Bates, Malik, and Jordan (2021) introduced RAPS (Regularized Adaptive Prediction Sets) for classification, addressing a limitation of APS: APS prediction sets can be excessively large for hard examples where the model assigns nearly equal probability to many classes.

RAPS adds a regularization term to the APS score that penalizes large prediction sets:

$s_{\text{RAPS}}(x, y) = s_{\text{APS}}(x, y) + \lambda \cdot \max(o(y|x) - k_{\text{reg}}, 0)$

where $o(y|x)$ is the rank of class $y$ in the sorted probability list, $k_{\text{reg}}$ is a threshold (e.g., 5), and $\lambda > 0$ is a regularization weight. The $\max(o - k_{\text{reg}}, 0)$ term adds a linear penalty for including low-ranked classes, discouraging large prediction sets for hard examples.

RAPS achieves smaller average prediction set size than APS (tighter uncertainty estimates) while maintaining the coverage guarantee.
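To see what the penalty does, the sketch below evaluates the APS and RAPS scores for every class of one illustrative probability vector. The values of `lam` , `k_reg` , and the probabilities are arbitrary, and `u` is fixed at 0.5 only for reproducibility.

```python
import numpy as np


def aps_score(probs: np.ndarray, y: int, u: float) -> float:
    """Mass of classes strictly more probable than y, plus u times y's own mass."""
    return float(probs[probs > probs[y]].sum() + u * probs[y])


def raps_score(probs: np.ndarray, y: int, u: float, lam: float, k_reg: int) -> float:
    """APS score plus lam * max(rank - k_reg, 0), with rank o(y|x) 1-indexed."""
    rank = 1 + int(np.sum(probs > probs[y]))
    return aps_score(probs, y, u) + lam * max(rank - k_reg, 0)


probs = np.array([0.30, 0.22, 0.16, 0.12, 0.08, 0.06, 0.04, 0.02])
lam, k_reg, u = 0.1, 3, 0.5  # illustrative hyperparameters

for y in range(len(probs)):
    a = aps_score(probs, y, u)
    r = raps_score(probs, y, u, lam, k_reg)
    print(f"class {y}: APS={a:.3f}  RAPS={r:.3f}  penalty={r - a:.3f}")
```

The top `k_reg` classes are scored exactly as in APS, while each further rank adds `lam` to the score, so low-ranked classes need a much larger threshold before they enter the prediction set.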

Full Python Implementation

import numpy as np
from typing import List, Set, Optional, Tuple, Union
from dataclasses import dataclass
import warnings
warnings.filterwarnings("ignore")


@dataclass
class ConformalResult:
    """Output of conformal prediction."""
    prediction_set: Union[List[int], Tuple[float, float]]   # classes or interval
    coverage_level: float       # 1 - alpha
    quantile: float             # q_hat
    n_calibration: int
    method: str


class ConformalClassifier:
    """
    Split conformal classifier.

    Supports nonconformity scores:
    - "softmax": 1 - softmax_prob(true_class)
    - "aps": Adaptive Prediction Sets (Romano, Sesia, Candès 2020)
    - "raps": Regularized APS (Angelopoulos et al. 2021)

    Usage:
        clf = ConformalClassifier(model, score="aps")
        clf.calibrate(X_cal, y_cal, alpha=0.1)   # 90% coverage
        result = clf.predict(X_test[0])
        print(result.prediction_set)   # e.g., [3, 7] - classes 3 and 7 are plausible
    """

    def __init__(
        self,
        model,
        score: str = "aps",
        raps_lambda: float = 0.01,
        raps_k_reg: int = 5,
        random_seed: int = 42,
    ):
        self.model = model
        self.score_type = score
        self.raps_lambda = raps_lambda
        self.raps_k_reg = raps_k_reg
        self.rng = np.random.default_rng(random_seed)

        self._q_hat: Optional[float] = None
        self._n_cal: int = 0
        self._alpha: float = 0.1
        self._classes: Optional[np.ndarray] = None

    def _get_softmax_probs(self, X: np.ndarray) -> np.ndarray:
        """Get class probabilities from model."""
        probs = self.model.predict_proba(X)
        return probs   # shape (n, n_classes)

    def _score_softmax(
        self, probs: np.ndarray, y: np.ndarray
    ) -> np.ndarray:
        """Score: 1 - softmax(true class). Shape (n,)."""
        n = len(y)
        true_class_probs = probs[np.arange(n), y.astype(int)]
        return 1.0 - true_class_probs

    def _score_aps(
        self, probs: np.ndarray, y: np.ndarray, randomize: bool = True
    ) -> np.ndarray:
        """
        Adaptive Prediction Sets score.
        For each sample: cumulate sorted softmax probabilities until true class.
        """
        n = len(y)
        scores = np.zeros(n)

        for i in range(n):
            p = probs[i]
            y_i = int(y[i])

            # Sort classes by probability descending
            sorted_indices = np.argsort(p)[::-1]
            sorted_probs = p[sorted_indices]

            # Find rank of true class
            true_rank = np.where(sorted_indices == y_i)[0][0]

            # Cumulate probabilities up to (but not including) true class
            cumsum = np.sum(sorted_probs[:true_rank])

            # Add randomized portion of true class probability
            u = self.rng.uniform(0, 1) if randomize else 0.5
            scores[i] = cumsum + u * sorted_probs[true_rank]

        return scores

    def _score_raps(
        self, probs: np.ndarray, y: np.ndarray
    ) -> np.ndarray:
        """RAPS: APS + regularization for large prediction sets."""
        aps_scores = self._score_aps(probs, y, randomize=True)
        n = len(y)
        regularization = np.zeros(n)

        for i in range(n):
            p = probs[i]
            y_i = int(y[i])
            sorted_indices = np.argsort(p)[::-1]
            true_rank = np.where(sorted_indices == y_i)[0][0] + 1  # 1-indexed
            regularization[i] = (
                self.raps_lambda * max(true_rank - self.raps_k_reg, 0)
            )

        return aps_scores + regularization

    def _compute_scores(self, X: np.ndarray, y: np.ndarray) -> np.ndarray:
        probs = self._get_softmax_probs(X)
        if self.score_type == "softmax":
            return self._score_softmax(probs, y)
        elif self.score_type == "aps":
            return self._score_aps(probs, y)
        elif self.score_type == "raps":
            return self._score_raps(probs, y)
        else:
            raise ValueError(f"Unknown score: {self.score_type}")

    def calibrate(
        self, X_cal: np.ndarray, y_cal: np.ndarray, alpha: float = 0.1
    ) -> None:
        """
        Calibrate using held-out calibration set.
        Sets q_hat such that prediction sets have >= 1-alpha coverage.

        alpha: miscoverage level (e.g., 0.1 for 90% coverage)
        """
        self._alpha = alpha
        self._n_cal = len(y_cal)
        self._classes = np.unique(y_cal)

        # Compute calibration scores
        cal_scores = self._compute_scores(X_cal, y_cal)

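        # Conformal quantile level ceil((n+1)(1-alpha))/n ensures >= 1-alpha coverage.
        # np.quantile's default linear interpolation returns a value at or above the
        # ceil((n+1)(1-alpha))-th smallest score, so the threshold is never too small.
        n = len(cal_scores)
        level = np.ceil((n + 1) * (1 - alpha)) / n
        # If level > 1, n is too small for this alpha and the guarantee would need
        # q_hat = infinity; capping at 1.0 uses the largest score instead.
        level = min(level, 1.0)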

        self._q_hat = float(np.quantile(cal_scores, level))
        print(
            f"Calibration complete: n_cal={n}, alpha={alpha}, "
            f"q_hat={self._q_hat:.4f}, "
            f"target_coverage={1-alpha:.1%}"
        )

    def predict(self, x: np.ndarray, top_k: int = None) -> ConformalResult:
        """
        Return prediction set for a single test input.
        Prediction set = all classes y where score(x, y) <= q_hat.
        """
        if self._q_hat is None:
            raise RuntimeError("Call calibrate() first.")

        x = x.reshape(1, -1)
        probs = self._get_softmax_probs(x)[0]   # shape (n_classes,)
        n_classes = len(probs)

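        # Rank the classes once; APS/RAPS scores depend only on this ordering.
        sorted_indices = np.argsort(probs)[::-1]
        sorted_probs = probs[sorted_indices]
        # Draw one u per test input, matching the randomized scores used in
        # calibrate(); a fixed u=0.5 here would make test scores differ in
        # distribution from the calibration scores.
        u = self.rng.uniform(0, 1)

        included = []
        for y_candidate in range(n_classes):
            # Compute score for this candidate class
            if self.score_type == "softmax":
                s = 1.0 - probs[y_candidate]
            elif self.score_type in ("aps", "raps"):
                rank = np.where(sorted_indices == y_candidate)[0][0]  # 0-indexed
                s = np.sum(sorted_probs[:rank]) + u * sorted_probs[rank]
                if self.score_type == "raps":
                    s += self.raps_lambda * max(rank + 1 - self.raps_k_reg, 0)
            else:
                # Fail loudly instead of silently including every class.
                raise ValueError(f"Unknown score: {self.score_type}")

            if s <= self._q_hat:
                included.append(y_candidate)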

        return ConformalResult(
            prediction_set=included,
            coverage_level=1 - self._alpha,
            quantile=self._q_hat,
            n_calibration=self._n_cal,
            method=f"split_conformal_{self.score_type}",
        )

    def predict_batch(
        self, X_test: np.ndarray
    ) -> List[ConformalResult]:
        """Predict conformal sets for all test points."""
        return [self.predict(x) for x in X_test]

    def empirical_coverage(
        self, X_test: np.ndarray, y_test: np.ndarray
    ) -> float:
        """Measure empirical coverage on test set (should be >= 1-alpha)."""
        covered = 0
        for x, y_true in zip(X_test, y_test):
            result = self.predict(x)
            if int(y_true) in result.prediction_set:
                covered += 1
        return covered / len(y_test)

    def avg_prediction_set_size(self, X_test: np.ndarray) -> float:
        """Average size of prediction sets (efficiency measure)."""
        sizes = [len(self.predict(x).prediction_set) for x in X_test]
        return float(np.mean(sizes))


class ConformalRegressor:
    """
    Split conformal regressor.

    Supports:
    - "residual": absolute residual score (constant-width intervals)
    - "cqr": Conformalized Quantile Regression (Romano 2019)

    For CQR, the model must be a quantile regressor with methods
    predict_lower(X) and predict_upper(X).
    """

    def __init__(
        self,
        model,
        score: str = "residual",
    ):
        self.model = model
        self.score_type = score
        self._q_hat: Optional[float] = None
        self._n_cal: int = 0
        self._alpha: float = 0.1

    def _score_residual(self, X: np.ndarray, y: np.ndarray) -> np.ndarray:
        """Absolute residual: |y - f_hat(x)|."""
        preds = self.model.predict(X)
        return np.abs(y - preds)

    def _score_cqr(self, X: np.ndarray, y: np.ndarray) -> np.ndarray:
        """
        CQR score: max(q_low(x) - y, y - q_high(x))
        Requires model to have predict_lower() and predict_upper() methods.
        """
        try:
            q_low = self.model.predict_lower(X)
            q_high = self.model.predict_upper(X)
        except AttributeError:
            raise ValueError(
                "For CQR, model must have predict_lower(X) and predict_upper(X) methods. "
                "Wrap your quantile regression model accordingly."
            )
        return np.maximum(q_low - y, y - q_high)

    def calibrate(
        self, X_cal: np.ndarray, y_cal: np.ndarray, alpha: float = 0.1
    ) -> None:
        """Calibrate for 1-alpha coverage."""
        self._alpha = alpha
        self._n_cal = len(y_cal)

        if self.score_type == "residual":
            cal_scores = self._score_residual(X_cal, y_cal)
        elif self.score_type == "cqr":
            cal_scores = self._score_cqr(X_cal, y_cal)
        else:
            raise ValueError(f"Unknown score type: {self.score_type}")

        n = len(cal_scores)
        level = np.ceil((n + 1) * (1 - alpha)) / n
        level = min(level, 1.0)
        self._q_hat = float(np.quantile(cal_scores, level))

        print(
            f"Calibration: n_cal={n}, alpha={alpha}, "
            f"q_hat={self._q_hat:.4f}, "
            f"target_coverage={1-alpha:.1%}"
        )

    def predict(self, x: np.ndarray) -> ConformalResult:
        """Return prediction interval [lower, upper] with 1-alpha coverage."""
        if self._q_hat is None:
            raise RuntimeError("Call calibrate() first.")

        x = x.reshape(1, -1)

        if self.score_type == "residual":
            center = self.model.predict(x)[0]
            lower = center - self._q_hat
            upper = center + self._q_hat
        elif self.score_type == "cqr":
            q_low = self.model.predict_lower(x)[0]
            q_high = self.model.predict_upper(x)[0]
            lower = q_low - self._q_hat
            upper = q_high + self._q_hat
        else:
            raise ValueError(f"Unknown score: {self.score_type}")

        return ConformalResult(
            prediction_set=(lower, upper),
            coverage_level=1 - self._alpha,
            quantile=self._q_hat,
            n_calibration=self._n_cal,
            method=f"split_conformal_{self.score_type}",
        )

    def empirical_coverage(
        self, X_test: np.ndarray, y_test: np.ndarray
    ) -> float:
        """Fraction of test points covered by their prediction interval."""
        covered = 0
        for x, y_true in zip(X_test, y_test):
            result = self.predict(x)
            lower, upper = result.prediction_set
            if lower <= y_true <= upper:
                covered += 1
        return covered / len(y_test)

    def avg_interval_width(self, X_test: np.ndarray) -> float:
        """Average prediction interval width (efficiency measure)."""
        widths = []
        for x in X_test:
            result = self.predict(x)
            lower, upper = result.prediction_set
            widths.append(upper - lower)
        return float(np.mean(widths))


# ─── DEMO ─────────────────────────────────────────────────────────────────────

def run_classification_demo():
    """Demonstrate ConformalClassifier on a synthetic 5-class dataset."""
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    np.random.seed(42)
    X, y = make_classification(
        n_samples=5000, n_features=20, n_informative=10,
        n_classes=5, n_clusters_per_class=1, random_state=42
    )

    # Three-way split: train / calibrate / test
    X_train, X_temp, y_train, y_temp = train_test_split(
        X, y, test_size=0.4, random_state=42
    )
    X_cal, X_test, y_cal, y_test = train_test_split(
        X_temp, y_temp, test_size=0.5, random_state=42
    )

    # Train model
    model = GradientBoostingClassifier(
        n_estimators=200, max_depth=4, random_state=42
    )
    model.fit(X_train, y_train)
    base_acc = model.score(X_test, y_test)
    print(f"Base model accuracy: {base_acc:.4f}")

    # Compare score types
    print("\n--- Classification Conformal Prediction ---")
    for score_type in ["softmax", "aps", "raps"]:
        clf = ConformalClassifier(model, score=score_type)
        clf.calibrate(X_cal, y_cal, alpha=0.1)  # 90% coverage target

        cov = clf.empirical_coverage(X_test, y_test)
        avg_size = clf.avg_prediction_set_size(X_test)
        print(f"\n{score_type.upper()}: coverage={cov:.3f} (target >= 0.90), avg_set_size={avg_size:.2f}")

        # Show example prediction sets
        for i in range(3):
            result = clf.predict(X_test[i])
            true_label = y_test[i]
            in_set = true_label in result.prediction_set
            print(f"  Sample {i}: true={true_label}, set={result.prediction_set}, covered={in_set}")


def run_regression_demo():
    """Demonstrate ConformalRegressor on synthetic regression data."""
    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    np.random.seed(42)
    X, y = make_regression(
        n_samples=5000, n_features=15, n_informative=10,
        noise=25.0, random_state=42
    )

    X_train, X_temp, y_train, y_temp = train_test_split(
        X, y, test_size=0.4, random_state=42
    )
    X_cal, X_test, y_cal, y_test = train_test_split(
        X_temp, y_temp, test_size=0.5, random_state=42
    )

    # Train base model
    model = GradientBoostingRegressor(
        n_estimators=200, max_depth=4, random_state=42
    )
    model.fit(X_train, y_train)

    # Conformal regression with residual score (constant-width)
    print("\n--- Regression Conformal Prediction (Residual Score) ---")
    reg = ConformalRegressor(model, score="residual")
    reg.calibrate(X_cal, y_cal, alpha=0.1)  # 90% coverage

    cov = reg.empirical_coverage(X_test, y_test)
    avg_width = reg.avg_interval_width(X_test)
    print(f"Empirical coverage: {cov:.3f} (target >= 0.90)")
    print(f"Average interval width: {avg_width:.2f}")

    # Show examples
    for i in range(3):
        result = reg.predict(X_test[i])
        lower, upper = result.prediction_set
        true_val = y_test[i]
        covered = lower <= true_val <= upper
        print(
            f"Sample {i}: true={true_val:.2f}, "
            f"interval=[{lower:.2f}, {upper:.2f}], "
            f"width={upper-lower:.2f}, covered={covered}"
        )

    return reg


if __name__ == "__main__":
    run_classification_demo()
    run_regression_demo()

Historical Context: From Venn Predictors to Modern Conformal

Conformal prediction was first introduced by Vladimir Vovk, Alexander Gammerman, and Glenn Shafer in the late 1990s and formalized in their 2005 book "Algorithmic Learning in a Random World." The original framework - transductive conformal prediction - computed conformal scores by including the test point in the calibration set and checking its conformity. This was computationally expensive: for each candidate label $y$ , you had to refit the underlying model or measure a score that depended on the entire dataset including the test point.

Inductive Conformal Prediction (split conformal), introduced in the same era and popularized through the work of Harris Papadopoulos, reduced the computation to a single model fit - making conformal prediction practical for real models. The key insight: split the calibration computation away from the model training, computing scores on a held-out calibration set after training is complete.

The modern renaissance of conformal prediction came from 2019–2023, driven by a handful of influential papers. Romano, Patterson, and Candès (2019) introduced Conformalized Quantile Regression, bringing adaptive prediction intervals to regression. Romano, Sesia, and Candès (2020) introduced Adaptive Prediction Sets (APS) for classification, a nonconformity score whose set size adapts to the difficulty of each input, and Angelopoulos, Bates, Malik, and Jordan (2021) regularized it into RAPS. Gibbs and Candès (2021) introduced Adaptive Conformal Inference for online settings with distribution shift. Together, these papers moved conformal prediction from a theoretical curiosity to a practical tool now being deployed in medical AI, autonomous systems, and LLM evaluation.

Conformal Risk Control: Beyond Coverage

Angelopoulos, Bates, Fisch, Lei, and Schuster (2022) generalized conformal prediction beyond the coverage guarantee. Instead of controlling only the probability of miscoverage, conformal risk control lets you control the expected value of any bounded, monotone loss function $\ell(C(X), Y)$ with a formal guarantee:

$\mathbb{E}[\ell(\hat{C}(X_{n+1}), Y_{n+1})] \leq \alpha$

Why this matters: Coverage is a binary loss - either the true label is in the set or it is not. But many real problems have graded losses. In medical diagnosis: missing a severe condition (true label not in set) is worse than missing a mild one. In object detection: a slightly wrong bounding box is better than a completely wrong one. Conformal risk control handles these cases by replacing the 0-1 coverage loss with any bounded, monotone loss function.

Examples of controlled risks:

False negative rate in binary classification: control the fraction of positive cases missed by the prediction set
Expected size of prediction set: control average set size while maintaining some coverage
Graph distance in sequence prediction: control the average edit distance between the prediction set and the true sequence
F1 score in multi-label classification: control expected F1 measured against the prediction set

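The algorithm mirrors split conformal. Index the prediction sets by a threshold $\lambda$ so that a larger $\lambda$ gives larger sets and a smaller loss, replace the miscoverage indicator $\mathbf{1}[Y \notin C(X)]$ with the loss $\ell(C_\lambda(X), Y) \leq B$ , and pick the smallest $\lambda$ whose slightly inflated empirical risk on the calibration set, $\frac{n}{n+1}\hat{R}_n(\lambda) + \frac{B}{n+1}$ , is at most $\alpha$ . Under exchangeability this gives $\mathbb{E}[\ell] \leq \alpha$ , and under mild conditions the bound is tight up to an $O(1/n)$ term. With the 0-1 miscoverage loss, the procedure reduces to the corrected quantile of split conformal.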
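As an illustration of risk control for a non-binary loss, the sketch below controls the false negative rate of multi-label prediction sets $C_\lambda(x) = \{k : \hat{f}(x)_k \geq 1 - \lambda\}$ . It picks the smallest $\lambda$ on a grid whose inflated empirical risk $\frac{n}{n+1}\hat{R}_n(\lambda) + \frac{B}{n+1}$ is at most $\alpha$ , with $B = 1$ the upper bound on the loss. The synthetic data generator, the grid, and all function names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, B, K = 0.1, 1.0, 10  # target risk, loss upper bound, number of labels


def simulate(n: int):
    """Synthetic multi-label data: true labels tend to receive higher scores."""
    labels = rng.random((n, K)) < 0.3
    scores = np.where(labels, rng.beta(4, 2, (n, K)), rng.beta(2, 4, (n, K)))
    return scores, labels


def fnr_loss(scores, labels, lam):
    """Per-example false negative rate of C_lam(x) = {k : score_k >= 1 - lam}."""
    pred = scores >= 1.0 - lam
    n_pos = labels.sum(axis=1)
    missed = (labels & ~pred).sum(axis=1)
    # Examples without any positive label contribute zero loss.
    return np.where(n_pos > 0, missed / np.maximum(n_pos, 1), 0.0)


def calibrate_lambda(scores, labels, alpha, grid):
    """Smallest lam on the grid with n/(n+1) * R_hat(lam) + B/(n+1) <= alpha."""
    n = len(scores)
    for lam in grid:  # ascending: sets grow, so the loss can only shrink
        r_hat = fnr_loss(scores, labels, lam).mean()
        if n / (n + 1) * r_hat + B / (n + 1) <= alpha:
            return float(lam)
    return float(grid[-1])


cal_scores, cal_labels = simulate(2000)
test_scores, test_labels = simulate(20000)

lam_hat = calibrate_lambda(cal_scores, cal_labels, alpha, np.linspace(0, 1, 201))
test_risk = float(fnr_loss(test_scores, test_labels, lam_hat).mean())
print(f"lam_hat={lam_hat:.3f}, test FNR={test_risk:.3f} (target <= {alpha})")
```

The guarantee is on the expected loss over the draw of the calibration set, so the test risk of a single run can land slightly above or below $\alpha$ .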

Weighted Conformal Prediction for Covariate Shift

When calibration and test distributions differ (covariate shift: $p_{\text{test}}(x) \neq p_{\text{cal}}(x)$ ), the coverage guarantee fails. Tibshirani, Barber, Candès, and Ramdas (2019) introduced weighted conformal prediction to restore the guarantee under covariate shift.

Idea: Weight the calibration scores by their likelihood ratio. Calibration points that are typical of the test distribution get higher weight, and points that are rare under it get lower weight.

The weighted quantile is:

$\hat{q}_w = \text{Quantile}\left(\sum_{i=1}^n p_i \delta_{s_i} + p_{n+1} \delta_\infty;\; 1-\alpha\right), \quad p_i = \frac{w_i}{\sum_{j=1}^{n+1} w_j}$

where $w_i = p_{\text{test}}(x_i) / p_{\text{cal}}(x_i)$ are the likelihood ratio weights and $w_{n+1}$ is the same ratio evaluated at the test point $x_{n+1}$ . The point mass at infinity stands in for the unknown score of the test point: when the test point carries a large share of the total weight, $\hat{q}_w$ can be infinite and the prediction set becomes the whole label space.

In practice: Train a binary classifier to predict whether a point came from the calibration or test distribution. Its predicted odds, $\hat{p}(x) / (1 - \hat{p}(x))$ , estimate the likelihood ratio up to a constant factor, which cancels when the weights are normalized. This approach works when the test distribution is specified in advance (e.g., a known demographic subgroup), but is harder when the test distribution is unknown at calibration time.
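The weighted quantile can be sketched in a few lines. In the example below the weights are normalized to sum to one together with the test point's weight, and the remaining mass sits at infinity. The covariate shift (calibration $x \sim N(0,1)$ , test $x \sim N(1,1)$ ), the score model, and the use of the exact density ratio in place of an estimated one are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1


def weighted_conformal_quantile(cal_scores, cal_weights, test_weight, alpha):
    """(1-alpha) quantile of the weighted score distribution with mass at +inf."""
    order = np.argsort(cal_scores)
    total = cal_weights.sum() + test_weight
    cum_mass = np.cumsum(cal_weights[order]) / total
    # First sorted score whose cumulative normalized weight reaches 1 - alpha.
    idx = np.searchsorted(cum_mass, 1 - alpha, side="left")
    return float(cal_scores[order][idx]) if idx < len(cal_scores) else float("inf")


def likelihood_ratio(x):
    """Exact density ratio N(1,1) / N(0,1); in practice this must be estimated."""
    return np.exp(x - 0.5)


def score(x, eps):
    """Illustrative nonconformity score whose scale grows with x."""
    return np.abs(eps) * np.exp(0.5 * x)


x_cal = rng.normal(0.0, 1.0, 5000)
s_cal = score(x_cal, rng.normal(size=5000))
x_test = rng.normal(1.0, 1.0, 2000)
s_test = score(x_test, rng.normal(size=2000))

w_cal = likelihood_ratio(x_cal)
# Unit weights recover the ordinary split conformal quantile.
q_plain = weighted_conformal_quantile(s_cal, np.ones_like(s_cal), 1.0, alpha)
q_weighted = np.array([
    weighted_conformal_quantile(s_cal, w_cal, likelihood_ratio(x), alpha)
    for x in x_test
])

cov_plain = float(np.mean(s_test <= q_plain))
cov_weighted = float(np.mean(s_test <= q_weighted))
print(f"unweighted coverage={cov_plain:.3f}, weighted coverage={cov_weighted:.3f}")
```

With unit weights the function returns the $\lceil (n+1)(1-\alpha) \rceil$-th smallest score, so it is a strict generalization of the split conformal threshold. Under the simulated shift the unweighted threshold should under-cover noticeably, while the weighted one should return close to the target.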

Adaptive Conformal Inference for Distribution Shift

Standard split conformal prediction requires exchangeability. When the test distribution shifts from the calibration distribution (a common occurrence in deployed ML systems), the coverage guarantee breaks down. Isaac Gibbs and Emmanuel Candès (2021) introduced Adaptive Conformal Inference (ACI) for this setting.

ACI Algorithm: Instead of using a fixed quantile $\hat{q}$ computed once from the calibration set, ACI updates the quantile at each time step based on whether the previous prediction set covered the true label:

$\hat{q}_{t+1} = \hat{q}_t + \gamma \left(\mathbf{1}[Y_t \notin \hat{C}_t(X_t)] - \alpha\right)$

where $\gamma > 0$ is a step size. If the previous prediction missed (coverage error: $\mathbf{1}[Y_t \notin \hat{C}_t] = 1$ ), the quantile increases (prediction set gets larger). If it covered correctly (no error: $\mathbf{1} = 0$ ), the quantile decreases (prediction set gets smaller). This online update ensures that long-run average coverage tracks $1 - \alpha$ even under distribution shift.

ACI is the right tool for time-series or streaming data where the distribution changes gradually over time - weather forecasting, financial risk, network intrusion detection.
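The online behaviour described above (a miss raises the threshold, a cover lowers it) can be sketched in a few lines. This is an illustrative threshold-tracking variant under stated assumptions: the function name `run_aci`, the step size, and the synthetic stream whose noise scale doubles halfway through are all hypothetical, and the scores stand in for something like $|y_t - \hat{f}(x_t)|$.

```python
import numpy as np


def run_aci(scores, alpha=0.1, gamma=0.05, q_init=0.0):
    """Track a conformal threshold online.

    scores[t] is the nonconformity score of the true label at step t.
    The prediction set at step t covers the label iff scores[t] <= q_t.
    Returns the thresholds that were used and the miss indicators.
    """
    q = q_init
    thresholds = np.empty(len(scores))
    misses = np.empty(len(scores), dtype=bool)
    for t, s in enumerate(scores):
        thresholds[t] = q
        misses[t] = s > q
        # Miss: q rises by gamma * (1 - alpha). Cover: q falls by gamma * alpha.
        q += gamma * (float(misses[t]) - alpha)
    return thresholds, misses


rng = np.random.default_rng(0)
# Synthetic drift: the residual scale doubles halfway through the stream.
scale = np.where(np.arange(4000) < 2000, 1.0, 2.0)
scores = np.abs(rng.normal(0.0, scale))

thresholds, misses = run_aci(scores, alpha=0.1, gamma=0.05)
print("miss rate, first half :", misses[:2000].mean())
print("miss rate, second half:", misses[2000:].mean())
print("mean threshold before / after shift:",
      thresholds[1000:2000].mean(), thresholds[3000:].mean())
```

A larger `gamma` adapts faster after a shift but makes the threshold, and therefore the set size, noisier from step to step.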

Cross-Conformal and Jackknife+ for Small Calibration Sets

Split conformal requires holding out a calibration set from training data. When data is scarce (n < 200), this wasteful split hurts both model quality and calibration reliability. Two alternatives:

Cross-conformal: Split data into $K$ folds (like K-fold CV). Train $K$ models, each leaving one fold out. Use each model to compute scores for its left-out fold. Pool all scores as the calibration set. More data-efficient but adds computation (train $K$ models).

Jackknife+ (Barber et al. 2021): Use leave-one-out (LOO) training. For each training point $i$, train the model on all data except point $i$ and compute the residual on point $i$. The prediction interval for a new point is built from quantiles of the $n$ LOO predictions at that point, each shifted down or up by its own LOO residual. Jackknife+ has a coverage guarantee of $1 - 2\alpha$ (weaker than split conformal's $1 - \alpha$, although empirical coverage is typically close to $1 - \alpha$) but uses the full dataset for model training, at the cost of fitting $n$ models.

For most practical settings with $n > 500$ , split conformal with a 20% calibration split is sufficient. Use jackknife+ when data is very scarce or when every training point matters.
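To make the jackknife+ construction concrete, here is a minimal sketch that uses a straight-line fit (`np.polyfit`) as a stand-in base model. The function name `jackknife_plus_interval` and the synthetic data are assumptions for illustration; any regressor that can be refit $n$ times could replace the line fit.

```python
import numpy as np


def jackknife_plus_interval(x, y, x_new, alpha=0.1):
    """Jackknife+ interval at x_new with a degree-1 polynomial as base model."""
    n = len(x)
    lo_vals = np.empty(n)
    hi_vals = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                      # leave point i out
        slope, intercept = np.polyfit(x[keep], y[keep], deg=1)
        resid = abs(y[i] - (slope * x[i] + intercept))  # LOO residual
        pred_new = slope * x_new + intercept            # LOO prediction at x_new
        lo_vals[i] = pred_new - resid
        hi_vals[i] = pred_new + resid

    # Order statistics used by jackknife+ (1-indexed ranks).
    k_hi = int(np.ceil((1 - alpha) * (n + 1)))
    k_lo = int(np.floor(alpha * (n + 1)))
    if k_hi > n or k_lo < 1:
        return -np.inf, np.inf                        # too few points for this alpha
    lower = np.sort(lo_vals)[k_lo - 1]
    upper = np.sort(hi_vals)[k_hi - 1]
    return lower, upper


rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=60)
y = 2.0 * x + rng.normal(0.0, 1.0, size=60)

lower, upper = jackknife_plus_interval(x, y, x_new=5.0, alpha=0.1)
print("jackknife+ interval at x=5:", lower, upper)
```

The loop refits the model once per data point, which is the computational price paid for not holding out a calibration set.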

Production Monitoring of Conformal Coverage

Deploying conformal prediction does not end the work - you must monitor empirical coverage in production to verify that the exchangeability assumption holds.

Coverage monitoring pipeline:

import numpy as np
from collections import deque
from typing import Deque, Tuple
import time


class CoverageMonitor:
    """
    Monitors empirical coverage of conformal prediction sets in production.

    Maintains a rolling window of (prediction_set, true_label) pairs
    and computes empirical coverage over the window.

    Alerts when coverage drops below target level by more than a threshold.
    """

    def __init__(
        self,
        target_coverage: float = 0.90,
        window_size: int = 500,
        alert_threshold: float = 0.05,  # alert if coverage falls more than 0.05 (absolute) below target
    ):
        self.target_coverage = target_coverage
        self.window_size = window_size
        self.alert_threshold = alert_threshold

        # Rolling window of (is_covered: bool, timestamp: float)
        self._window: Deque[Tuple[bool, float]] = deque(maxlen=window_size)
        self._n_alerts = 0

    def record(
        self,
        prediction_set,   # list of class indices or (lower, upper) tuple
        true_label,       # int for classification, float for regression
    ) -> bool:
        """
        Record a new prediction and its outcome.
        Returns True if coverage alert is triggered.
        """
        # Check coverage
        if isinstance(prediction_set, (list, set)):
            covered = int(true_label) in prediction_set
        else:
            lower, upper = prediction_set
            covered = lower <= true_label <= upper

        self._window.append((covered, time.time()))
        return self._check_alert()

    def _check_alert(self) -> bool:
        if len(self._window) < 50:   # need minimum samples for reliable estimate
            return False
        empirical = self.empirical_coverage()
        if empirical < self.target_coverage - self.alert_threshold:
            self._n_alerts += 1
            return True
        return False

    def empirical_coverage(self) -> float:
        """Fraction of predictions where true label was covered."""
        if not self._window:
            return 1.0
        return sum(covered for covered, _ in self._window) / len(self._window)

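    def coverage_report(self) -> dict:
        """Generate coverage report for monitoring dashboard."""
        empirical = self.empirical_coverage()
        # Evaluate the alert condition directly instead of calling
        # _check_alert(), which would increment the alert counter
        # every time a report is generated.
        alert = (
            len(self._window) >= 50
            and empirical < self.target_coverage - self.alert_threshold
        )
        return {
            "n_predictions": len(self._window),
            "empirical_coverage": round(empirical, 4),
            "target_coverage": self.target_coverage,
            "coverage_gap": round(empirical - self.target_coverage, 4),
            "alert": alert,
            "n_alerts_total": self._n_alerts,
        }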


# Usage example
monitor = CoverageMonitor(target_coverage=0.90, window_size=500)

# Simulate incoming predictions
rng = np.random.default_rng(42)
for i in range(1000):
    # Simulate: true label is int in [0, 9]
    true_label = rng.integers(0, 10)
    # Simulate: prediction set contains true label ~91% of the time
    pred_set = [true_label] if rng.random() < 0.91 else [true_label + 1]
    alert = monitor.record(pred_set, true_label)

print(monitor.coverage_report())
# Expected: empirical_coverage ≈ 0.91, alert=False

Monitoring checklist in production:

Log every prediction set size alongside the request ID and model version
When true labels become available (delayed feedback), log whether each label fell inside its prediction set
Compute empirical coverage in a rolling 500-prediction window
Alert operations team when rolling coverage drops below $1-\alpha - 0.03$
Track average prediction set size over time - growing sets indicate increasing model uncertainty
Monitor calibration score quantile over time - changes indicate input distribution shift
Re-calibrate when coverage consistently deviates or when the model is retrained
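One way to sanity-check the alert margin in the checklist is to compare it with the sampling noise of a rolling window. The sketch below assumes coverage outcomes are independent, so the rolling estimate has a binomial standard error; the helper name `coverage_standard_error` is hypothetical.

```python
import math


def coverage_standard_error(target_coverage, window_size):
    """Binomial standard error of empirical coverage over one window."""
    return math.sqrt(target_coverage * (1 - target_coverage) / window_size)


se = coverage_standard_error(0.90, 500)
print(round(se, 4))          # 0.0134
print(round(0.03 / se, 2))   # 2.24
```

Under that assumption, a margin of 0.03 on a 500-prediction window sits a little over two standard errors below target, so occasional false alerts are to be expected; a wider window or a larger margin trades alert latency for fewer false alarms.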

Conformal vs Bayesian Credible Intervals

Dimension	Conformal Prediction	Bayesian Credible Interval
Coverage guarantee	Exact finite-sample, $\geq 1-\alpha$	Approximate - depends on prior/likelihood correctness
Distributional assumptions	Exchangeability only	Prior + likelihood must be correctly specified
Computational cost	$O(n)$ calibration, $O(1)$ inference	Expensive posterior inference (MCMC, VI)
Adaptivity	Depends on score choice	Naturally adaptive via posterior uncertainty
Distribution shift	Fails (breaks exchangeability)	Fails (model mismatch)
Model agnosticism	Fully agnostic - works with any model	Requires access to model's probability structure
Practical use case	Production ML, regulatory compliance	Research, scientific inference
Multi-task uncertainty	Requires multi-output conformal	Natural via joint posterior

When to use conformal: When you need a hard coverage guarantee for a black-box model in production. When regulatory or compliance requirements demand provable uncertainty. When you cannot specify a correct prior (most deployed ML systems).

When to use Bayesian: When you need to express prior domain knowledge. When interpretability of the uncertainty source matters. When you are doing scientific inference and want posterior over model parameters.

Computational Complexity

Conformal prediction is computationally lightweight:

Calibration: $O(n)$ model evaluations + $O(n \log n)$ sorting for quantile computation. Run once after model training.
Inference: $O(1)$ - for regression, two additions. For classification with $K$ classes, $O(K)$ score computations.
Memory: Store calibration scores ( $n$ floats) plus model weights.

Compare to Bayesian methods: MCMC requires $O(T \cdot n)$ per chain for $T$ samples; variational inference requires an optimization loop. Conformal has essentially no inference overhead beyond the base model call.

Deployments: Waymo, Medical AI, and Language Models

Waymo (Object Detection): Conformal prediction sets for 3D bounding box regression - instead of outputting a single bounding box, the system outputs a region guaranteed to contain the true object position with 95% probability. This is safety-critical: an autonomous vehicle must know not just where it thinks a pedestrian is, but the envelope of uncertainty around that position.

Medical Diagnostics (FDA AI/ML): The FDA's proposed framework for AI/ML-based Software as a Medical Device increasingly requires quantified uncertainty bounds. Conformal prediction satisfies this requirement with provable guarantees that posterior predictive intervals from neural networks cannot provide without additional calibration assumptions.

Conformal Language Models (CONFORMAL-LANGUAGE, Quach et al. 2023): Conformal prediction for LLM generation. Instead of returning a single sampled response, the method keeps sampling candidate responses until a calibrated stopping rule fires, and returns a set that contains at least one acceptable response with a user-specified probability (e.g., 90%). A calibrated rejection rule removes low-quality candidates from the set. The size of the returned set then acts as a measure of uncertainty for that prompt: easy prompts yield small sets, hard prompts yield large ones.

Common Mistakes

:::danger Mistake 1: Violating the train–calibration split by using overlapping data The coverage guarantee assumes the calibration scores are exchangeable with the test score, which requires that the calibration data was not used to train or tune the model. A model that has already seen the calibration points fits them better than unseen points, so their nonconformity scores are too small, the quantile is too low, and coverage falls below $1-\alpha$. The same applies to any post-hoc analysis on the calibration set that feeds back into the model (e.g., retraining the model based on calibration errors). The calibration set must be completely held out: used only for computing calibration scores and the quantile, nothing else. :::

:::danger Mistake 2: Forgetting the (n+1) correction in the quantile computation The plain empirical quantile $\hat{q} = \text{Quantile}(s_1, \ldots, s_n; 1-\alpha)$ is slightly too small, so it does not guarantee coverage. The correct quantile uses the finite-sample correction: $\hat{q} = \text{Quantile}(s_1, \ldots, s_n; \lceil(n+1)(1-\alpha)\rceil / n)$, i.e., the $\lceil(n+1)(1-\alpha)\rceil$-th smallest calibration score. Without this correction, coverage can fall below $1-\alpha$ for small $n$. For large $n$ (>1000) the difference is negligible, but for $n = 50$ and a 95% target the uncorrected quantile is at best the 48th smallest score, which gives marginal coverage of at most $48/51 \approx 94.1\%$, while the corrected quantile is the 49th smallest score, which gives $49/51 \approx 96.1\%$. :::
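A short sketch of the corrected quantile as an order statistic, which avoids any dependence on how a library interpolates quantiles. The helper name `conformal_quantile` is an assumption for illustration; the worked numbers use $n = 50$ and $\alpha = 0.05$.

```python
import numpy as np


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    scores = np.sort(np.asarray(scores, dtype=float))
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points for this alpha: only the trivial
        # (infinite) threshold keeps the guarantee.
        return np.inf
    return scores[k - 1]


scores = np.arange(1, 51)                           # stand-in scores 1..50
n = len(scores)
k_corrected = int(np.ceil((n + 1) * (1 - 0.05)))    # 49
k_plain = int(np.ceil(n * (1 - 0.05)))              # 48

print(conformal_quantile(scores, 0.05))             # 49.0
print(k_plain / (n + 1), k_corrected / (n + 1))     # about 0.941 and 0.961
```

The ratio $k/(n+1)$ is the marginal coverage obtained by thresholding at the $k$-th smallest score, so the one-rank difference is exactly what moves coverage from below the target to above it.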

:::warning Mistake 3: Using conformal prediction when exchangeability is violated If your test data comes from a different distribution than your calibration data - different time period, different geography, different demographic - exchangeability is violated and the coverage guarantee fails. Quantify the covariate shift before relying on conformal guarantees. If shift is present, use Adaptive Conformal Inference (ACI) or weighted conformal prediction (Tibshirani et al. 2019), which adjusts the quantile based on covariate shift estimates. :::

:::warning Mistake 4: Interpreting prediction set size as confidence A prediction set of size 1 does not mean 100% confidence. It means only one class had a nonconformity score at or below the threshold; the guarantee is still that sets cover the true label with probability at least $1-\alpha$ on average, not for this particular input. Likewise, a prediction set of size 4 is not a calibrated probability statement: it means the nonconformity scores for 4 classes all fell at or below the threshold. Larger sets do tend to flag harder inputs, but prediction set size is primarily an efficiency metric (smaller is better for a fixed coverage level), not a per-prediction confidence metric. Always report empirical coverage alongside set size. :::

YouTube Resources

Resource	Creator	Focus
A Gentle Introduction to Conformal Prediction	Anastasios Angelopoulos	Tutorial from ICML 2022, split conformal from scratch
Conformal Prediction for Production ML	Emmanuel Candès	Stanford lecture: theory and applications
Adaptive Conformal Inference Under Distribution Shift	Isaac Gibbs	ACI paper walkthrough
Conformal Risk Control	Angelopoulos et al.	Extending conformal beyond coverage
Uncertainty Quantification for ML Practitioners	Chip Huyen	Practical overview including conformal prediction

Interview Q&A

Q1: Derive the coverage guarantee for split conformal prediction. What role does exchangeability play?

The proof is elegantly simple. Consider $n+1$ exchangeable random variables: $(X_1, Y_1), \ldots, (X_n, Y_n)$ (calibration) and $(X_{n+1}, Y_{n+1})$ (test). Compute nonconformity scores $S_i = s(X_i, Y_i)$ for all $n+1$ points. By exchangeability, the joint distribution is invariant to permutations, so the rank of $S_{n+1}$ among $\{S_1, \ldots, S_{n+1}\}$ is uniformly distributed on $\{1, \ldots, n+1\}$ . Define $\hat{q}$ as the $\lceil(n+1)(1-\alpha)\rceil / n$ empirical quantile of $\{S_1, \ldots, S_n\}$ . Then $P(S_{n+1} \leq \hat{q}) = P(\text{rank}(S_{n+1}) \leq \lceil(n+1)(1-\alpha)\rceil) = \lceil(n+1)(1-\alpha)\rceil / (n+1) \geq 1-\alpha$ . The prediction set $\hat{C}(X_{n+1}) = \{y : s(X_{n+1}, y) \leq \hat{q}\}$ contains $Y_{n+1}$ exactly when $S_{n+1} \leq \hat{q}$ . So $P(Y_{n+1} \in \hat{C}) \geq 1-\alpha$ . Exchangeability is the only distributional assumption - it ensures the rank of $S_{n+1}$ is uniform.
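The rank argument can be checked numerically. The simulation below is a sketch under simple assumptions: scores are drawn i.i.d. from an exponential distribution (any continuous distribution would do, since only ranks matter), and the sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, trials = 50, 0.1, 20000
k = int(np.ceil((n + 1) * (1 - alpha)))     # rank used for the threshold

# Each row holds n calibration scores followed by one test score.
draws = rng.exponential(size=(trials, n + 1))
cal_sorted = np.sort(draws[:, :n], axis=1)
q_hat = cal_sorted[:, k - 1]                # k-th smallest calibration score
covered = draws[:, n] <= q_hat

print("rank k:", k)
print("theoretical coverage k/(n+1):", k / (n + 1))
print("simulated coverage:", covered.mean())
```

The simulated coverage should sit close to $k/(n+1)$, slightly above $1-\alpha$, which is the small amount of conservatism introduced by the ceiling.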

Q2: What is the difference between exchangeability and i.i.d.? When does conformal prediction fail?

An i.i.d. sequence has each element independently drawn from the same distribution - independence and identical distribution. An exchangeable sequence has a joint distribution invariant to permutations. Independence is not required, but the marginal distributions are necessarily identical, because permuting the indices cannot change the joint distribution. Every i.i.d. sequence is exchangeable; sampling without replacement from a finite population is exchangeable but not i.i.d.; a time series with autocorrelation is generally neither. Conformal prediction fails when the calibration and test data are not exchangeable with each other. Common failure modes: (1) covariate shift - test data comes from a different input distribution; (2) temporal drift - the test data is from a later time period with changed patterns; (3) selection bias - the calibration set was selected non-randomly. In practice, you can check for covariate shift by running a two-sample test (MMD, KS, or classifier-based) between calibration and test features. If the test rejects, the conformal guarantee is not reliable. A feature-level test cannot detect a change in the label distribution given the features, so passing it does not prove exchangeability.

Q3: How does Conformalized Quantile Regression differ from standard conformal regression with absolute residual scores? When would you use each?

Standard conformal regression with absolute residual score $s(x,y) = |y - \hat{f}(x)|$ produces constant-width prediction intervals: $[\hat{f}(x) \pm \hat{q}]$ where $\hat{q}$ is fixed. This is appropriate when the noise level is homoscedastic - the same variance across all input regions. CQR uses a quantile regression model to estimate conditional lower and upper quantiles $\hat{q}_{\alpha/2}(x)$ and $\hat{q}_{1-\alpha/2}(x)$ , and the CQR nonconformity score is $\max(\hat{q}_{\alpha/2}(x) - y, y - \hat{q}_{1-\alpha/2}(x))$ . The resulting CQR prediction interval $[\hat{q}_{\alpha/2}(x) - \hat{q}_{\text{cal}}, \hat{q}_{1-\alpha/2}(x) + \hat{q}_{\text{cal}}]$ is wider where the quantile model estimates more uncertainty and narrower where it estimates less. This is adaptive - the interval width varies across input space. Use residual conformal when you believe noise is roughly homoscedastic. Use CQR when you have a well-calibrated quantile regression model and expect heteroscedastic noise - common in housing prices (high variance in luxury markets), financial returns (volatility clustering), and biological measurements (size-dependent variance).
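A minimal sketch of the CQR score and interval from the answer above. To stay self-contained it does not fit a quantile regressor: the "quantile model" is a hypothetical band `narrow_band` that is deliberately too narrow on synthetic heteroscedastic data, so the conformal correction has something to fix. The function names and data are assumptions for illustration.

```python
import numpy as np


def cqr_calibrate(lo_cal, hi_cal, y_cal, alpha):
    """Conformal correction for lower/upper quantile predictions."""
    # CQR score: positive when y falls outside [lo, hi], negative inside.
    scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf
    return np.sort(scores)[k - 1]


def cqr_interval(lo_new, hi_new, q_cal):
    """Widen (or shrink, if q_cal < 0) the band by the calibrated amount."""
    return lo_new - q_cal, hi_new + q_cal


rng = np.random.default_rng(0)


def sample(n):
    x = rng.uniform(0.0, 5.0, size=n)
    y = x + (0.5 + x) * rng.normal(size=n)   # noise grows with x
    return x, y


def narrow_band(x):
    # Stand-in for a quantile model: roughly a 68% band, too narrow for 90%.
    half_width = 1.0 * (0.5 + x)
    return x - half_width, x + half_width


x_cal, y_cal = sample(1000)
x_test, y_test = sample(5000)

q_cal = cqr_calibrate(*narrow_band(x_cal), y_cal, alpha=0.1)
lo, hi = cqr_interval(*narrow_band(x_test), q_cal)
coverage = np.mean((lo <= y_test) & (y_test <= hi))
print("calibrated correction:", q_cal)
print("test coverage:", coverage)
print("width at small x vs large x:",
      (hi - lo)[x_test < 1].mean(), (hi - lo)[x_test > 4].mean())
```

The correction `q_cal` is a single constant, but the interval width still varies with `x` because the underlying band does; that is the adaptivity that constant-width residual intervals lack.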

Q4: A medical AI model is deployed. Six months after deployment, the calibration set is from pre-pandemic data but the test data is post-pandemic. How does this affect conformal coverage, and what can you do about it?

The pre-to-post-pandemic shift almost certainly violates exchangeability - patient demographics, disease presentations, and imaging protocols changed. The coverage guarantee is no longer valid. In practice, the coverage has likely dropped below the target level (the prediction sets are too small for the new distribution). To address this: (1) Most robust fix: collect new post-pandemic labeled data and re-calibrate. Even 100–200 labeled examples from the new distribution are typically enough for split conformal calibration, although the achieved coverage varies more from one calibration set to another when the set is that small. (2) Weighted conformal prediction (Tibshirani et al. 2019): estimate the density ratio $w(x) = p_{\text{new}}(x) / p_{\text{old}}(x)$ (e.g., with a classifier) and use importance-weighted quantile computation. This adjusts for covariate shift without requiring new labeled data for calibration - but requires the new distribution to overlap with the old. (3) Adaptive Conformal Inference (ACI): deploy ACI, which updates the quantile based on observed coverage errors and therefore needs true labels to arrive as feedback. Over time, it adapts to the new distribution. ACI works well for gradual drift but converges slowly for sudden shifts. (4) Empirically monitor coverage: regularly compute empirical coverage on incoming labeled data (if available) and alert when coverage drops significantly below target.

Q5: What is the tradeoff between prediction set size and coverage level? How do you choose alpha in practice?

Coverage level $1-\alpha$ and average prediction set size trade off directly. Higher coverage ($1-\alpha$ closer to 1) requires a larger threshold $\hat{q}$, which includes more candidates in the prediction set. Lower coverage allows smaller sets but misses the true label more often. The right $\alpha$ depends on the application: for medical diagnosis, where a miss can mean an undetected rare cancer, require 99% coverage and accept large prediction sets because the cost of missing is high. For product recommendations (show the user a set of options), 80% coverage may be sufficient - smaller, more curated sets give a better user experience. For autonomous driving (object detection), 99.9% coverage may be required by safety standards. The efficiency of the nonconformity score also matters: APS and RAPS achieve smaller average set sizes than naive softmax scoring at the same coverage level. Compare methods by plotting the coverage-vs-size Pareto frontier: run calibration at multiple $\alpha$ values and plot (coverage, avg_set_size) pairs for each method. The method that achieves the smallest set size at each coverage level dominates.
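The sweep over $\alpha$ described above can be sketched as follows. Everything here is synthetic and illustrative: `synthetic_probs` fabricates softmax outputs for a 10-class problem in which the true class gets a logit boost, and the score is the simple $1 - \hat{p}(y \mid x)$ softmax score. Running the same loop with a second score function would give the second curve for a Pareto comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes = 10


def synthetic_probs(n):
    """Fake softmax outputs: the true class receives a logit boost of 2."""
    labels = rng.integers(0, n_classes, size=n)
    logits = rng.normal(0.0, 1.0, size=(n, n_classes))
    logits[np.arange(n), labels] += 2.0
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True), labels


def threshold(probs, labels, alpha):
    """Corrected quantile of the calibration scores 1 - p(true class)."""
    scores = np.sort(1.0 - probs[np.arange(len(labels)), labels])
    k = int(np.ceil((len(scores) + 1) * (1 - alpha)))
    return scores[k - 1] if k <= len(scores) else np.inf


cal_probs, cal_labels = synthetic_probs(1000)
test_probs, test_labels = synthetic_probs(2000)

results = []
for alpha in (0.20, 0.10, 0.05, 0.01):
    q = threshold(cal_probs, cal_labels, alpha)
    sets = (1.0 - test_probs) <= q            # boolean mask: class in set?
    coverage = sets[np.arange(len(test_labels)), test_labels].mean()
    avg_size = sets.sum(axis=1).mean()
    results.append((alpha, coverage, avg_size))
    print(f"alpha={alpha:.2f}  coverage={coverage:.3f}  avg_set_size={avg_size:.2f}")
```

Because the threshold can only grow as $\alpha$ shrinks, the prediction sets are nested across the sweep, so average set size never decreases as the coverage target rises.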

Key Takeaways

Conformal prediction provides a finite-sample coverage guarantee - $P(Y_{n+1} \in \hat{C}(X_{n+1})) \geq 1-\alpha$ - under only the exchangeability assumption, with no distributional assumptions and for any underlying model. The algorithm is simple: split data into train/calibrate/test, compute nonconformity scores on the calibration set, take the corrected quantile, include all candidates where the score is below the threshold. The choice of nonconformity score determines efficiency: APS and RAPS produce smaller prediction sets than the naive softmax score at the same coverage level; CQR produces adaptive regression intervals. The main limitation is the exchangeability assumption - violated by distribution shift. Adaptive Conformal Inference addresses gradual drift. Conformal prediction is the right tool when you need a hard coverage guarantee for a black-box model without distributional assumptions, which describes most deployed production ML systems in regulated industries.

