
Uncertainty Quantification - Knowing What Your Model Doesn't Know

Reading time: 45–55 minutes
Interview relevance: Very High - calibration appears in every production ML interview
Target roles: Machine Learning Engineer, MLOps Engineer, AI Engineer, Data Scientist

The Real Interview Moment

It is 2021. A hospital system has deployed a clinical decision support system to flag high-risk sepsis patients for early intervention. The model achieves 91% accuracy on held-out test data. The clinical team is told: "When the model says 90% probability of sepsis, it is usually right." They believe it.

Over the next six months, something insidious happens. Clinicians begin noticing that when the model is wrong, it is confidently wrong - outputting 95% or higher on cases that turn out to be false positives. They start to doubt the model. One clinician delays treatment in a high-confidence case because the presentation does not match her clinical intuition. The patient deteriorates.

An audit reveals the problem: the model's 90% confidence predictions are correct only 68% of the time. It is accurate in aggregate (91% overall) but severely miscalibrated. The softmax output is a normalized score, not a probability. No one had checked whether confidence actually corresponded to accuracy.

Calibration is the property that makes a model's confidence scores trustworthy. A perfectly calibrated model saying "90% confidence" is correct exactly 90% of the time. This is not an automatic property of any model - neural networks in particular are systematically overconfident. Modern deep learning has made the calibration problem worse, not better (Guo et al. 2017 showed that ResNet-style models are significantly more miscalibrated than shallow networks).

This lesson covers the full toolkit: how to measure miscalibration, how to fix it, and how to detect when a model is being asked about something it was never trained to handle.

What Is Calibration?

The Formal Definition

A model with predicted label $\hat{Y}$ and reported confidence $\hat{P}$ is perfectly calibrated if:

$P(\hat{Y} = Y \mid \hat{P} = p) = p \quad \forall\, p \in [0, 1]$

In words: among all predictions made with confidence $p$, exactly a fraction $p$ of them are correct.

This is a frequentist notion of probability. A 70% prediction should be correct 70 times out of 100 - not more, not less. If a model reports "90% confidence" on a set of predictions and is wrong on 10% of them, it is calibrated at that level; if it is wrong on 50% of them, it is severely miscalibrated.

Why Neural Networks Are Overconfident

The core problem: the softmax function is not guaranteed to produce calibrated probabilities. It only exponentiates the logits and normalizes them to sum to 1.

Given logits $z_1, z_2, \ldots, z_K$ , softmax outputs:

$\hat{p}_k = \frac{\exp(z_k)}{\sum_{j} \exp(z_j)}$

As logit magnitudes grow (which happens with more training, larger networks, and better feature representations), the softmax output concentrates toward 0 and 1 - even when the model has not actually seen many examples of the class.
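The effect of logit magnitude can be seen in a minimal sketch. The logit values and the `scale` factor below are made up for illustration; `scale` simply stands in for logits growing during training.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logits for a single input (illustrative values only).
base_logits = np.array([2.0, 1.0, 0.5])

max_probs = []
argmaxes = []
for scale in (1.0, 2.0, 4.0, 8.0):
    p = softmax(scale * base_logits)
    max_probs.append(float(p.max()))
    argmaxes.append(int(p.argmax()))
    print(f"scale={scale}: max prob = {p.max():.3f}, predicted class = {p.argmax()}")
```

The predicted class never changes, but the reported confidence climbs toward 1 as the logits are scaled up, even though no new evidence about the input has been added.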

Guo et al. (2017) found that modern networks such as ResNet trained on CIFAR-100 are dramatically more overconfident than older, shallower networks such as LeNet. They linked the effect to greater depth and width, batch normalization, weaker weight decay, and training that keeps lowering NLL after accuracy has stopped improving.

Overconfident model:         Well-calibrated model:
────────────────────         ──────────────────────
Pred 90% → correct 70%       Pred 90% → correct 90%
Pred 80% → correct 55%       Pred 80% → correct 80%
Pred 70% → correct 50%       Pred 70% → correct 70%
Systematic overestimation    Confidence = accuracy

Measuring Calibration

Reliability Diagram

The reliability diagram (also called calibration curve) is the primary visualization tool for calibration. The procedure:

1. Sort all test predictions by predicted confidence.
2. Bin predictions into $M$ bins of equal width (e.g., $[0, 0.1), [0.1, 0.2), \ldots, [0.9, 1.0]$).
3. For each bin $B_m$: compute average confidence $\text{conf}(B_m)$ and accuracy $\text{acc}(B_m)$.
4. Plot $\text{acc}(B_m)$ vs. $\text{conf}(B_m)$ for each bin.

A perfectly calibrated model falls on the diagonal $y = x$ . Above the diagonal: underconfident (model is more accurate than it claims). Below the diagonal: overconfident (model claims more accuracy than it has).

Expected Calibration Error (ECE)

The ECE quantifies the total calibration gap, weighted by how many predictions fall in each bin:

$\text{ECE} = \sum_{m=1}^M \frac{|B_m|}{n} \left|\text{acc}(B_m) - \text{conf}(B_m)\right|$

where $n$ is the total number of predictions, $|B_m|$ is the number in bin $m$ .

A well-calibrated model has $\text{ECE} \approx 0$ . Modern neural networks typically have $\text{ECE} = 3\%$ – $15\%$ without calibration. After temperature scaling, ECE can drop below $1\%$ .

Maximum Calibration Error (MCE)

The MCE focuses on the worst bin:

$\text{MCE} = \max_{m \in \{1,\ldots,M\}} \left|\text{acc}(B_m) - \text{conf}(B_m)\right|$

MCE is more relevant for safety-critical applications where any miscalibrated confidence band is unacceptable.
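A small worked example shows how the two metrics can disagree. The per-bin numbers below are hypothetical, chosen so that one small bin is badly miscalibrated while the large bins are fine.

```python
import numpy as np

# Hypothetical per-bin statistics (made up for illustration).
bin_sizes = np.array([50, 100, 850])       # |B_m|
bin_conf  = np.array([0.65, 0.85, 0.97])   # conf(B_m)
bin_acc   = np.array([0.40, 0.83, 0.96])   # acc(B_m)

gaps = np.abs(bin_acc - bin_conf)                         # |acc - conf| per bin
ece = float(np.sum(bin_sizes / bin_sizes.sum() * gaps))   # size-weighted average gap
mce = float(gaps.max())                                   # worst single bin

print(f"ECE = {ece:.4f}, MCE = {mce:.4f}")
```

Here ECE is about 0.023 because the bad bin holds only 5% of the predictions, while MCE is 0.25 because it reports that bin directly.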

Negative Log-Likelihood (NLL)

A proper scoring rule that measures both calibration and sharpness:

$\text{NLL} = -\frac{1}{n}\sum_{i=1}^n \log \hat{p}(y_i|x_i)$

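In expectation, NLL is minimized only when the predicted distribution matches the true conditional distribution, so a model cannot lower it by misreporting its confidence. Crucially, NLL penalizes overconfident wrong predictions very heavily - assigning only 1% probability to the true class (being wrong with 99% confidence in a binary problem) costs $-\log(0.01) \approx 4.6$ nats.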
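To make the penalty concrete, here is a minimal sketch comparing two hypothetical models that make the same single mistake on four examples. The probability values are invented for illustration.

```python
import math

def nll(probs_of_true_class):
    """Mean negative log-likelihood in nats, given p(true class) per example."""
    return -sum(math.log(p) for p in probs_of_true_class) / len(probs_of_true_class)

# Probability each hypothetical model assigned to the TRUE class.
cautious      = [0.70, 0.70, 0.70, 0.30]   # one mistake, made with modest confidence
overconfident = [0.99, 0.99, 0.99, 0.01]   # one mistake, made with 99% confidence

nll_cautious = nll(cautious)
nll_overconfident = nll(overconfident)
single_mistake_cost = -math.log(0.01)

print(f"cautious NLL      = {nll_cautious:.3f}")
print(f"overconfident NLL = {nll_overconfident:.3f}")
print(f"cost of one 99%-confident mistake = {single_mistake_cost:.2f} nats")
```

Both models have 75% accuracy, yet the overconfident one ends up with roughly twice the NLL because its one confident mistake dominates the average.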

[Diagram: Calibration Pipeline]

Post-Hoc Calibration Methods

Method 1: Temperature Scaling

Temperature scaling is the simplest post-hoc calibration method and is often among the most effective in practice (Guo et al. 2017). A single scalar $T > 0$ divides all logits before the softmax:

$\hat{p}_k = \text{softmax}(z / T)_k = \frac{\exp(z_k / T)}{\sum_j \exp(z_j / T)}$

$T > 1$ : softer distribution (more uncertain) - fixes overconfidence.
$T < 1$ : sharper distribution (more confident) - fixes underconfidence (rare).
$T = 1$ : no change.

The temperature is fit by minimizing NLL on the validation set (never the test set):

$T^* = \arg\min_T \text{NLL}(y_{\text{val}}, \text{softmax}(z_{\text{val}} / T))$

This is a one-dimensional optimization problem with a single minimum (the NLL is convex in $1/T$), so it is easy to solve with L-BFGS or a scalar minimizer such as SciPy's minimize_scalar.

Why it works: temperature scaling preserves the relative ordering of predictions (same accuracy) but adjusts the spread of the softmax output. It does not change which class is predicted - only the confidence level. This is why accuracy and calibration can be improved independently.

Limitation: it applies the same scaling to all inputs. If the model is overconfident on some input types and underconfident on others, temperature scaling cannot capture this.
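The fitting step can be sketched with NumPy and SciPy's bounded scalar minimizer. This is an illustration under stated assumptions, not a reference implementation: the helper names, the search bounds, and the synthetic data (labels drawn from `softmax(logits / 2)`) are all made up here.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import logsumexp

def nll_at_temperature(T, logits, labels):
    """Mean NLL of softmax(logits / T) against integer labels."""
    z = logits / T
    log_probs = z - logsumexp(z, axis=1, keepdims=True)
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels, bounds=(0.05, 20.0)):
    """Fit T by bounded 1-D minimization of validation NLL (bounds are an assumption)."""
    result = minimize_scalar(
        nll_at_temperature, bounds=bounds, args=(logits, labels), method="bounded"
    )
    return float(result.x)

# Synthetic validation set: labels follow softmax(logits / 2), so the raw
# softmax(logits) is overconfident and the fitted T should come out near 2.
rng = np.random.default_rng(0)
val_logits = rng.normal(0.0, 3.0, size=(5000, 5))
true_log_probs = val_logits / 2.0 - logsumexp(val_logits / 2.0, axis=1, keepdims=True)
val_labels = np.array([rng.choice(5, p=p) for p in np.exp(true_log_probs)])

T_hat = fit_temperature(val_logits, val_labels)
nll_before = nll_at_temperature(1.0, val_logits, val_labels)
nll_after = nll_at_temperature(T_hat, val_logits, val_labels)
print(f"fitted T = {T_hat:.2f}, NLL {nll_before:.3f} -> {nll_after:.3f}")
```

Because only one parameter is fitted, a few thousand validation examples are usually enough for a stable estimate in this kind of setup.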

Method 2: Platt Scaling

More flexible than temperature scaling: apply a learned affine transformation to the logits before the sigmoid (for binary classification):

$\hat{p} = \sigma(a \cdot z + b)$

Fit $a$ and $b$ by minimizing NLL on the validation set. For multi-class, apply a learned weight matrix and bias to the logit vector.

Beyond rescaling, the bias term $b$ lets Platt scaling correct a systematic shift toward one class, which temperature scaling cannot do. The mapping is still a sigmoid, so it cannot correct arbitrary miscalibration patterns.
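For the binary case, fitting $a$ and $b$ can be sketched with SciPy's general-purpose minimizer. The function names and the synthetic scores below are assumptions for illustration; the data are generated so that the true relationship is $\sigma(0.5 z - 1)$.

```python
import numpy as np
from scipy.optimize import minimize

def platt_nll(params, z, y):
    """Mean binary NLL of sigmoid(a * z + b)."""
    a, b = params
    s = a * z + b
    # -[y log sigma(s) + (1 - y) log(1 - sigma(s))] = log(1 + exp(s)) - y * s
    return np.mean(np.logaddexp(0.0, s) - y * s)

def fit_platt(z, y):
    """Return (a, b) minimizing validation NLL, starting from the identity map."""
    result = minimize(platt_nll, x0=np.array([1.0, 0.0]), args=(z, y), method="L-BFGS-B")
    return result.x

# Synthetic validation scores: the raw sigmoid(z) is too sharp and shifted,
# because labels are actually drawn from sigmoid(0.5 * z - 1.0).
rng = np.random.default_rng(0)
z = rng.normal(0.0, 3.0, size=20000)
p_true = 1.0 / (1.0 + np.exp(-(0.5 * z - 1.0)))
y = (rng.uniform(size=z.size) < p_true).astype(float)

a_hat, b_hat = fit_platt(z, y)
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}")
```

The fitted values should land close to 0.5 and -1.0, recovering both the scale and the shift.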

Method 3: Isotonic Regression

A non-parametric, monotonic mapping from predicted probabilities to calibrated probabilities. Fitted on the validation set using the pool adjacent violators (PAV) algorithm. Most flexible - can correct any monotonic miscalibration - but requires more data to fit reliably and may overfit on small validation sets.
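A compact version of pool adjacent violators is sketched below. It is a teaching sketch rather than a production calibrator: `fit_isotonic` is a hypothetical helper, it assumes distinct validation probabilities, and it interpolates linearly between fitted points as a simplification.

```python
import numpy as np

def pav(values):
    """Pool adjacent violators: least-squares non-decreasing fit to `values`."""
    blocks = []  # each block is [sum, count]; block means are kept non-decreasing
    for v in values:
        blocks.append([float(v), 1])
        # Merge backwards while the previous block's mean exceeds the last block's mean.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return np.concatenate([np.full(c, s / c) for s, c in blocks])

def fit_isotonic(val_probs, val_labels):
    """Fit on validation data; return a function raw probability -> calibrated probability."""
    val_probs = np.asarray(val_probs, dtype=float)
    val_labels = np.asarray(val_labels, dtype=float)
    order = np.argsort(val_probs)
    xs, ys = val_probs[order], pav(val_labels[order])
    # np.interp clips inputs outside the fitted range to the end values.
    return lambda p: np.interp(p, xs, ys)

fitted = pav([0, 1, 0, 1, 1])
print(fitted)  # fitted values: 0, 0.5, 0.5, 1, 1

calibrate = fit_isotonic([0.2, 0.6, 0.7, 0.9, 0.95], [0, 1, 0, 1, 1])
print(calibrate(0.65))
```

The out-of-order pair (a positive at 0.6 followed by a negative at 0.7) is pooled into a single level of 0.5, which is exactly how the method trades resolution for monotonicity.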

Comparison

| Method | Parameters | Flexibility | Risk of Overfitting | When to Use |
| --- | --- | --- | --- | --- |
| Temperature scaling | 1 | Low | Very low | Almost always (try first) |
| Platt scaling | 2 | Medium | Low | When T-scaling insufficient |
| Isotonic regression | Non-param | High | Medium | Large validation set |
| Beta calibration | 3 | Medium | Low | Skewed confidence distributions |

Code: Calibration Toolkit

Reliability Diagram and ECE from Scratch

import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple


def compute_calibration_stats(
    y_true: np.ndarray,
    y_prob: np.ndarray,
    n_bins: int = 10,
) -> dict:
    """
    Compute calibration statistics from predictions.

    Args:
        y_true: binary labels [0, 1], shape [n]
        y_prob: predicted probabilities for class 1, shape [n]
        n_bins: number of equal-width bins

    Returns:
        dict with bin_conf, bin_acc, bin_sizes, ECE, MCE
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(y_true)

    bin_conf  = np.zeros(n_bins)
    bin_acc   = np.zeros(n_bins)
    bin_sizes = np.zeros(n_bins, dtype=int)

    for m in range(n_bins):
        lo, hi = bins[m], bins[m + 1]
        # Include upper endpoint in last bin
        if m == n_bins - 1:
            mask = (y_prob >= lo) & (y_prob <= hi)
        else:
            mask = (y_prob >= lo) & (y_prob < hi)

        if mask.sum() == 0:
            continue

        bin_sizes[m] = mask.sum()
        bin_conf[m]  = y_prob[mask].mean()
        bin_acc[m]   = y_true[mask].mean()

    # ECE = sum_m (|B_m| / n) * |acc(B_m) - conf(B_m)|
    weights = bin_sizes / n
    gaps    = np.abs(bin_acc - bin_conf)
    ece     = (weights * gaps).sum()
    mce     = gaps[bin_sizes > 0].max() if (bin_sizes > 0).any() else 0.0

    return {
        "bin_conf":  bin_conf,
        "bin_acc":   bin_acc,
        "bin_sizes": bin_sizes,
        "ece":       ece,
        "mce":       mce,
    }


def plot_reliability_diagram(
    y_true: np.ndarray,
    y_prob: np.ndarray,
    n_bins: int = 10,
    title: str = "Reliability Diagram",
    ax=None,
) -> None:
    """Plot reliability diagram with gap visualization."""
    stats = compute_calibration_stats(y_true, y_prob, n_bins)

    if ax is None:
        fig, ax = plt.subplots(figsize=(6, 6))

    bins = np.linspace(0, 1, n_bins + 1)
    bin_centers = 0.5 * (bins[:-1] + bins[1:])
    mask = stats["bin_sizes"] > 0

    # Perfect calibration diagonal
    ax.plot([0, 1], [0, 1], "k--", linewidth=1.5, label="Perfect calibration")

    # Gap (overconfidence) shading
    ax.bar(
        bin_centers[mask],
        stats["bin_acc"][mask],
        width=1 / n_bins,
        alpha=0.3,
        color="#60a5fa",
        label="Accuracy",
        align="center",
    )
    ax.bar(
        bin_centers[mask],
        stats["bin_conf"][mask] - stats["bin_acc"][mask],
        bottom=stats["bin_acc"][mask],
        width=1 / n_bins,
        alpha=0.4,
        color="#f87171",
        label="Gap (overconfidence)",
        align="center",
    )

    ax.set_xlabel("Mean predicted confidence")
    ax.set_ylabel("Fraction of positives (accuracy)")
    ax.set_title(f"{title}\nECE = {stats['ece']:.4f} | MCE = {stats['mce']:.4f}")
    ax.legend(fontsize=9)
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    plt.tight_layout()


# Example: a calibrated model and an overconfident version of it
rng = np.random.default_rng(42)
n = 2000
# True P(y = 1 | x) for each example; labels are sampled from it
p_true = rng.uniform(0.02, 0.98, n)
y_true = (rng.uniform(size=n) < p_true).astype(int)
# Well-calibrated model: reports the true probability
y_prob_good = p_true
# Overconfident model: triples the log-odds, pushing outputs toward 0 and 1
log_odds = np.log(p_true / (1 - p_true))
y_prob_bad = 1 / (1 + np.exp(-3.0 * log_odds))

stats_bad  = compute_calibration_stats(y_true, y_prob_bad)
stats_good = compute_calibration_stats(y_true, y_prob_good)
print(f"Overconfident model ECE: {stats_bad['ece']:.4f}")
print(f"Calibrated model ECE:    {stats_good['ece']:.4f}")

Temperature Scaling Implementation

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import numpy as np


class TemperatureScaling(nn.Module):
    """
    Temperature scaling calibration (Guo et al. 2017).
    Fits a single temperature T on validation logits.
    """

    def __init__(self):
        super().__init__()
        # Initialize T = 1.5 (slight underconfidence to encourage search)
        self.temperature = nn.Parameter(torch.ones(1) * 1.5)

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        """Scale logits and return softmax probabilities."""
        return torch.softmax(logits / self.temperature, dim=-1)

    def calibrate(
        self,
        logits: torch.Tensor,
        labels: torch.Tensor,
        max_iter: int = 1000,
        lr: float = 0.01,
    ) -> float:
        """
        Fit temperature T on validation logits + labels.
        Minimizes NLL (cross-entropy) w.r.t. T.
        """
        optimizer = optim.LBFGS(
            [self.temperature], lr=lr, max_iter=max_iter, line_search_fn="strong_wolfe"
        )
        criterion = nn.CrossEntropyLoss()

        def eval_closure():
            optimizer.zero_grad()
            scaled_logits = logits / self.temperature
            loss = criterion(scaled_logits, labels)
            loss.backward()
            return loss

        optimizer.step(eval_closure)

        print(f"Optimal temperature: {self.temperature.item():.4f}")
        return self.temperature.item()

    def get_calibrated_probs(self, logits: torch.Tensor) -> np.ndarray:
        """Return calibrated probabilities as numpy array."""
        self.eval()
        with torch.no_grad():
            return self(logits).numpy()


def evaluate_calibration(logits: torch.Tensor, labels: torch.Tensor, n_bins: int = 10):
    """Compute ECE from logits."""
    probs = torch.softmax(logits, dim=-1).numpy()
    preds = probs.argmax(axis=1)
    max_probs = probs.max(axis=1)
    correct = (preds == labels.numpy()).astype(float)

    bins = np.linspace(0, 1, n_bins + 1)
    n = len(labels)
    ece = 0.0
    for i in range(n_bins):
        # Use (lo, hi] bins so that a confidence of exactly 1.0 is counted
        mask = (max_probs > bins[i]) & (max_probs <= bins[i + 1])
        if mask.sum() == 0:
            continue
        acc  = correct[mask].mean()
        conf = max_probs[mask].mean()
        ece += (mask.sum() / n) * abs(acc - conf)
    return ece


def demo_temperature_scaling():
    torch.manual_seed(42)
    n_val  = 1000
    n_test = 2000
    n_cls  = 10

    # Simulate an overconfident model: labels are sampled from
    # softmax(logits / 2.5), but the model reports softmax(logits),
    # which is too sharp. The fitted temperature should land near 2.5.
    def make_logits(n):
        logits = torch.randn(n, n_cls) * 3.0
        true_probs = torch.softmax(logits / 2.5, dim=-1)
        true_labels = torch.multinomial(true_probs, num_samples=1).squeeze(1)
        return logits, true_labels

    val_logits,  val_labels  = make_logits(n_val)
    test_logits, test_labels = make_logits(n_test)

    # Before calibration
    ece_before = evaluate_calibration(test_logits, test_labels)
    print(f"ECE before calibration: {ece_before:.4f}")

    # Fit temperature on validation set
    ts = TemperatureScaling()
    T  = ts.calibrate(val_logits, val_labels)

    # Apply to test set
    calibrated_logits = test_logits / T
    ece_after = evaluate_calibration(calibrated_logits, test_labels)
    print(f"ECE after  calibration: {ece_after:.4f}")
    print(f"Temperature T = {T:.4f}")


demo_temperature_scaling()

Platt Scaling

import torch
import torch.nn as nn


class PlattScaling(nn.Module):
    """
    Multi-class generalization of Platt scaling.

    Binary Platt scaling is sigma(a * logit + b). With softmax, a single shared
    bias b cancels out, so this class learns one scale and one bias per class
    (the "vector scaling" variant described by Guo et al. 2017).
    """

    def __init__(self, n_classes: int):
        super().__init__()
        self.a = nn.Parameter(torch.ones(n_classes))
        self.b = nn.Parameter(torch.zeros(n_classes))
        self.n_classes = n_classes

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        scaled = self.a * logits + self.b
        return torch.softmax(scaled, dim=-1)

    def calibrate(
        self,
        logits: torch.Tensor,
        labels: torch.Tensor,
        n_epochs: int = 500,
        lr: float = 0.01,
    ):
        """Fit a and b by minimizing NLL on validation logits and labels."""
        optimizer = torch.optim.Adam([self.a, self.b], lr=lr)
        criterion = nn.CrossEntropyLoss()

        for epoch in range(n_epochs):
            optimizer.zero_grad()
            scaled = self.a * logits + self.b
            loss = criterion(scaled, labels)
            loss.backward()
            optimizer.step()

        print(f"Platt scaling: a={self.a.detach().tolist()}, b={self.b.detach().tolist()}")

Out-of-Distribution Detection

A calibrated model handles in-distribution inputs well. But what about inputs from a completely different distribution - images from a class the model was never trained on? A standard model may still output high-confidence predictions for these inputs.

OOD detection asks: can we detect when the model is being asked about something outside its training distribution?

Method 1: Maximum Softmax Probability (MSP)

The simplest baseline (Hendrycks & Gimpel 2017): use the maximum softmax probability as the OOD score. If $\max_k \hat{p}_k$ is low, the input is likely OOD.

Limitation: softmax probabilities can be high even for OOD inputs because large logit magnitudes still produce peaked distributions. The method is a weak baseline.

Method 2: Energy Score (Liu et al. 2020)

The energy function maps logits to a scalar score:

$E(x) = -T \log \sum_{k=1}^K \exp(f_k(x) / T)$

OOD inputs tend to have higher energy, because their logits tend to be smaller and so $\sum_k \exp(f_k(x)/T)$ is smaller. Use $-E(x)$ as the in-distribution score (higher = more in-distribution).

The energy score is theoretically motivated: if the classifier is interpreted as an energy-based model, $-E(x)/T$ equals $\log p(x)$ up to an additive constant that does not depend on $x$. That makes it a better proxy for data likelihood than the maximum softmax probability.
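One way to see what energy keeps and MSP throws away: softmax is unchanged when a constant is added to every logit, so MSP cannot see the overall logit level, while the energy shifts with it. The two logit vectors below are hypothetical and differ only by a constant.

```python
import math

def softmax_max(logits):
    """Maximum softmax probability, computed stably."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return max(exps) / sum(exps)

def energy(logits, T=1.0):
    """E(x) = -T * log sum_k exp(f_k(x) / T), computed stably."""
    m = max(logits)
    return -T * (m / T + math.log(sum(math.exp((z - m) / T) for z in logits)))

# Two hypothetical logit vectors that differ only by a constant shift of 8.
large = [10.0, 8.0, 8.0]
small = [2.0, 0.0, 0.0]

msp_large, msp_small = softmax_max(large), softmax_max(small)
e_large, e_small = energy(large), energy(small)

print(f"MSP:    large = {msp_large:.4f}, small = {msp_small:.4f}")
print(f"Energy: large = {e_large:.4f}, small = {e_small:.4f}")
```

MSP is identical for both inputs, whereas the energy differs by exactly the shift, so the input with larger logits scores as more in-distribution.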

Method 3: Mahalanobis Distance (Lee et al. 2018)

Fit class-conditional Gaussians $\mathcal{N}(\mu_c, \Sigma)$ with a shared covariance on the penultimate layer features $h(x)$. For a test input $x$, compute the negative squared Mahalanobis distance to the nearest class mean:

$M(x) = \max_c -(h(x) - \mu_c)^\top \Sigma^{-1} (h(x) - \mu_c)$

Higher $M(x)$ = more in-distribution. This method captures the feature-space distribution, not just the output space.

Method 4: Deep Ensembles for OOD

Train $M$ models. Variance in predictions across ensemble members signals OOD: if all models agree (low variance), the input is likely in-distribution. If models disagree (high variance), the input may be OOD.
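A minimal sketch of the disagreement signal, assuming the softmax outputs of each ensemble member are already available. The function name and the example probabilities are made up for illustration.

```python
import numpy as np

def ensemble_disagreement(member_probs):
    """
    member_probs: softmax outputs, shape [n_members, n_samples, n_classes].
    Returns the ensemble-averaged probabilities and, per sample, the variance
    across members summed over classes.
    """
    member_probs = np.asarray(member_probs, dtype=float)
    mean_probs = member_probs.mean(axis=0)
    disagreement = member_probs.var(axis=0).sum(axis=-1)
    return mean_probs, disagreement

# Hypothetical outputs of 3 ensemble members on 2 inputs (2 classes each).
member_probs = [
    [[0.90, 0.10], [0.90, 0.10]],   # member 1
    [[0.88, 0.12], [0.15, 0.85]],   # member 2
    [[0.92, 0.08], [0.50, 0.50]],   # member 3
]
mean_probs, disagreement = ensemble_disagreement(member_probs)
print("mean probabilities:", mean_probs)
print("disagreement:      ", disagreement)
```

The members agree on the first input, so its disagreement is near zero; they split on the second, which would be the one flagged for review.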

OOD Detection Method Comparison (illustrative AUROC; actual values vary by dataset and architecture):
─────────────────────────────────────────────────────────────────
Method                  AUROC    Speed      Retraining?
────────────────────    ─────    ─────      ───────────
MSP (baseline)          87%      Fast       No
Energy score            90%      Fast       No (or fine-tune energy)
Mahalanobis             93%      Moderate   No (fit on features)
Deep ensembles          95%      Slow       Yes (train M models)

OOD Detection Code

import torch
import torch.nn as nn
import numpy as np
from sklearn.metrics import roc_auc_score


class OODDetector:
    """
    Collection of OOD detection methods.
    """

    def __init__(self, model: nn.Module, device: torch.device = torch.device("cpu")):
        self.model = model
        self.device = device
        self.model.eval()

    def msp_score(self, x: torch.Tensor) -> np.ndarray:
        """Maximum Softmax Probability - higher = more in-distribution."""
        with torch.no_grad():
            logits = self.model(x.to(self.device))
            probs = torch.softmax(logits, dim=-1)
            return probs.max(dim=-1).values.cpu().numpy()

    def energy_score(self, x: torch.Tensor, T: float = 1.0) -> np.ndarray:
        """
        Energy score: E(x) = -T * log sum_k exp(f_k(x) / T)
        Return -E(x) so higher = more in-distribution.
        """
        with torch.no_grad():
            logits = self.model(x.to(self.device))
            # logsumexp for numerical stability
            energy = -T * torch.logsumexp(logits / T, dim=-1)
            # Negate: lower energy = more in-distribution
            return (-energy).cpu().numpy()

    def fit_mahalanobis(
        self,
        train_loader,
        n_classes: int,
        feature_dim: int,
    ) -> None:
        """
        Fit class-conditional Gaussians on penultimate layer features.
        Requires model to have a `get_features(x)` method.
        """
        all_features = [[] for _ in range(n_classes)]

        with torch.no_grad():
            for x, y in train_loader:
                # Move features to CPU so they can be indexed with the CPU label mask
                feats = self.model.get_features(x.to(self.device)).cpu()
                for c in range(n_classes):
                    mask = (y == c)
                    if mask.sum() > 0:
                        all_features[c].append(feats[mask])

        # Class means, plus features centered by their own class mean
        self.class_means = []
        centered = []
        for c in range(n_classes):
            feats_c = torch.cat(all_features[c], dim=0)
            mu_c = feats_c.mean(0)
            self.class_means.append(mu_c)
            centered.append(feats_c - mu_c)

        # Tied covariance: pooled within-class covariance shared by all classes
        centered = torch.cat(centered, dim=0)
        self.precision = torch.linalg.pinv(centered.T @ centered / len(centered))

    def mahalanobis_score(self, x: torch.Tensor) -> np.ndarray:
        """Negative squared Mahalanobis distance to the nearest class mean (higher = more in-distribution)."""
        with torch.no_grad():
            feats = self.model.get_features(x.to(self.device)).cpu()

        scores = []
        for feat in feats:
            dists = []
            for mu in self.class_means:
                d = feat - mu
                dist = -(d @ self.precision @ d).item()
                dists.append(dist)
            scores.append(max(dists))
        return np.array(scores)


def evaluate_ood_detection(
    in_scores: np.ndarray,
    ood_scores: np.ndarray,
    method_name: str = "Method",
) -> float:
    """Evaluate OOD detection via AUROC."""
    y_true = np.concatenate([np.ones(len(in_scores)), np.zeros(len(ood_scores))])
    scores = np.concatenate([in_scores, ood_scores])
    auroc  = roc_auc_score(y_true, scores)
    print(f"{method_name}: AUROC = {auroc:.4f}")
    return auroc


# Simulated demo
np.random.seed(42)
in_scores_msp  = np.random.beta(8, 2, 500)   # high confidence in-distribution
ood_scores_msp = np.random.beta(2, 5, 500)   # lower confidence OOD

evaluate_ood_detection(in_scores_msp, ood_scores_msp, method_name="MSP")

in_scores_energy  = -np.random.normal(-3.0, 0.5, 500)  # low energy in-dist
ood_scores_energy = -np.random.normal(-1.0, 0.8, 500)  # higher energy OOD
evaluate_ood_detection(in_scores_energy, ood_scores_energy, method_name="Energy Score")

The Full Calibration Pipeline in Production

:::danger Never Calibrate on the Test Set

The temperature $T$ (and all calibration parameters) must be fitted on a held-out validation set that is separate from both the training set and the final test set. Using the test set to fit calibration parameters leads to optimistic ECE estimates and leakage. Use a three-way split: train / calibration / test.

:::
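The three-way split can be sketched as an index shuffle. The function name and the 70/15/15 proportions are assumptions for illustration, not a recommendation from the lesson.

```python
import numpy as np

def three_way_split(n, frac_train=0.7, frac_cal=0.15, seed=0):
    """Return disjoint index arrays (train, calibration, test) covering range(n)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = round(frac_train * n)
    n_cal = round(frac_cal * n)
    return idx[:n_train], idx[n_train:n_train + n_cal], idx[n_train + n_cal:]

train_idx, cal_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(cal_idx), len(test_idx))
```

Calibration parameters such as $T$ are then fitted only on `cal_idx`, and ECE is reported only on `test_idx`.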

:::warning Calibration Is Not a Fix for Poor Models

Calibration adjusts confidence levels but does not improve accuracy. A model with 60% accuracy and perfect calibration is still wrong 40% of the time. Calibration makes a model's uncertainty estimates trustworthy - it does not make it smarter. Always fix accuracy issues before worrying about calibration.

:::

Practical Rules of Thumb

| Situation | Recommended Action |
| --- | --- |
| Simple baseline, any model | Temperature scaling - always try first |
| ECE > 5% after T-scaling | Add Platt scaling or isotonic regression |
| Need per-class calibration | Fit separate temperatures per class |
| Deployment with OOD risk | Add energy score threshold for flagging |
| Safety-critical (medical, autonomous) | Use Deep Ensembles + energy OOD detector |
| LLMs, generative models | Use conformal prediction (next lesson) |
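For the energy-threshold row, one common way to pick the threshold is to keep a fixed fraction of held-out in-distribution inputs. The sketch below assumes scores are $-E(x)$ (higher = more in-distribution); the function names, the 95% keep rate, and the simulated scores are all hypothetical.

```python
import numpy as np

def fit_ood_threshold(in_dist_scores, keep_fraction=0.95):
    """Threshold such that about `keep_fraction` of in-distribution scores lie above it."""
    return float(np.quantile(in_dist_scores, 1.0 - keep_fraction))

def flag_ood(scores, threshold):
    """True where the input should be flagged as possibly out-of-distribution."""
    return np.asarray(scores) < threshold

# Simulated -E(x) scores (illustrative distributions only).
rng = np.random.default_rng(0)
val_in_scores = rng.normal(3.0, 0.5, 2000)    # held-out in-distribution data
new_ood_scores = rng.normal(1.0, 0.8, 2000)   # inputs from a different distribution

threshold = fit_ood_threshold(val_in_scores)
in_flag_rate = flag_ood(val_in_scores, threshold).mean()
ood_flag_rate = flag_ood(new_ood_scores, threshold).mean()
print(f"threshold = {threshold:.2f}")
print(f"flagged: {in_flag_rate:.1%} of in-distribution, {ood_flag_rate:.1%} of OOD")
```

The keep fraction is a product decision: a higher value flags fewer legitimate inputs but lets more OOD inputs through.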

YouTube Resources

| Resource | What You Will Learn |
| --- | --- |
| Guo et al. 2017 - On Calibration of Modern Neural Networks | The foundational paper - read alongside the video |
| Yannic Kilcher - Calibration Paper Walkthrough | Temperature scaling explained visually |
| Energy-Based OOD Detection (Liu et al. 2020) | Energy score derivation and experiments |
| MIT 6.S191 - Uncertainty in Deep Learning | Full lecture covering calibration, OOD, BNNs |

Interview Q&A

Q1: What is ECE and how is it different from MCE?

Answer: Both metrics measure calibration - the alignment between predicted confidence and observed accuracy.

ECE (Expected Calibration Error) is the average miscalibration, weighted by bin size:

$\text{ECE} = \sum_{m=1}^M \frac{|B_m|}{n} |\text{acc}(B_m) - \text{conf}(B_m)|$

It measures the expected difference between confidence and accuracy across the confidence distribution. A model with ECE = 0.05 is, on average, 5 percentage points off in its confidence.

MCE (Maximum Calibration Error) is the worst-case miscalibration across any bin:

$\text{MCE} = \max_m |\text{acc}(B_m) - \text{conf}(B_m)|$

ECE is appropriate for most use cases - it gives an overall picture. MCE is appropriate for safety-critical systems where even a single confidence band being badly miscalibrated is unacceptable. Example: in a medical system, having high confidence in the wrong direction at the 80%-90% confidence band could be life-threatening - MCE catches this even if ECE is low because that bin is small.

Q2: How does temperature scaling work, and why does it not change accuracy?

Answer: Temperature scaling divides all logits by a scalar $T > 0$ before the softmax: $\hat{p}_k = \text{softmax}(z/T)_k$ .

For classification, the predicted class is $\arg\max_k z_k = \arg\max_k z_k/T$ - the argmax is invariant to positive scaling. Temperature only affects the spread of the softmax distribution:

$T > 1$ : logits are compressed → softer distribution → lower max probability → fixes overconfidence.
$T < 1$ : logits are amplified → sharper distribution → higher max probability → fixes underconfidence.

Accuracy is unchanged because the predicted class does not change. Only the confidence level changes.

The optimal $T$ is fit by minimizing NLL on the validation set. NLL is a proper scoring rule: it rewards assigning high probability to the true class and heavily penalizes confident mistakes. The one-parameter problem is convex in $1/T$ and solvable in milliseconds.

Limitation: temperature scaling applies the same scaling to all inputs. It cannot fix input-dependent miscalibration (e.g., overconfident on some regions of input space and underconfident on others).

Q3: What are the main OOD detection methods and how do they compare?

Answer: The main methods in increasing order of sophistication:

1. Maximum Softmax Probability (MSP): use $\max_k \hat{p}_k$ as the in-distribution score. Simple but weak - softmax can be high for OOD inputs when logits are large.
2. Energy Score: $-T \log \sum_k \exp(f_k(x)/T)$. Lower energy = more in-distribution. Theoretically motivated - aligns with the log-likelihood under an energy-based model. Outperforms MSP by ~3% AUROC on standard benchmarks.
3. Mahalanobis Distance: fit class-conditional Gaussians on penultimate layer features. Compute distance to the nearest class in feature space. Captures the geometric structure of the feature distribution - outperforms energy on many benchmarks.
4. Deep Ensembles: variance across ensemble members signals OOD. The most reliable but requires $M\times$ training cost.

In practice: start with energy score (free, no retraining). Add Mahalanobis if energy is insufficient. Use ensembles only when calibration and OOD quality are the primary product requirement.

Q4: When would you prioritize calibration over accuracy?

Answer: Calibration matters most when the downstream decision depends on the probability estimate, not just the class label. Specifically:

Risk-stratified decisions: a clinical model predicting sepsis risk is used to prioritize patients for ICU beds. The decision depends on the absolute probability, not just "high vs. low." A miscalibrated 90% may be only 60% in reality - the wrong patients get prioritized.
Cost-sensitive classification: when the cost of false positives and false negatives is asymmetric and varies by confidence. Calibrated probabilities allow optimal thresholding via expected cost.
Probability ensembles: if multiple model outputs are combined, miscalibrated individual probabilities produce incorrect ensemble probabilities.
Conformal prediction: conformal prediction sets (next lesson) rely on the ordering of model scores. Miscalibrated probabilities can still work with conformal prediction - but calibration improves efficiency (tighter sets).

When calibration is less critical: when you only need a ranked list (search ranking, recommendation), or when the threshold is fixed and only relative ordering matters. In these cases, accuracy and AUC are more relevant than ECE.
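The cost-sensitive case can be made concrete with the standard expected-cost rule: act when the expected cost of a miss, $p \cdot C_{FN}$, exceeds the expected cost of a false alarm, $(1 - p) \cdot C_{FP}$. The costs and probabilities below are hypothetical.

```python
def act_threshold(cost_fp, cost_fn):
    """Acting is optimal when p * cost_fn > (1 - p) * cost_fp, i.e. when p exceeds this value."""
    return cost_fp / (cost_fp + cost_fn)

def should_act(p, cost_fp, cost_fn):
    """Decide whether to intervene given a predicted probability p of the positive class."""
    return p > act_threshold(cost_fp, cost_fn)

# Hypothetical costs: a missed case is 4x as costly as a false alarm.
threshold = act_threshold(cost_fp=1.0, cost_fn=4.0)

true_risk = 0.15       # what a calibrated model would report
reported_risk = 0.30   # what an overconfident model reports for the same case

decision_calibrated = should_act(true_risk, 1.0, 4.0)
decision_miscalibrated = should_act(reported_risk, 1.0, 4.0)
print(f"threshold = {threshold:.2f}")
print(f"calibrated model acts: {decision_calibrated}, miscalibrated model acts: {decision_miscalibrated}")
```

With a threshold of 0.2, the calibrated estimate leads to no intervention while the inflated one triggers it, so the ranking of cases is untouched but the decision flips. This rule is only optimal when the input probabilities are calibrated.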

Q5: Why is out-of-distribution detection fundamentally hard?

Answer: OOD detection is fundamentally hard because:

High-dimensional input spaces: in high dimensions (images, text), the "in-distribution" region is a tiny manifold in a vast space. Drawing the boundary precisely requires either explicit density estimation (hard) or implicit proxies (energy, Mahalanobis).
Softmax concentration: the softmax function can produce confident predictions for any input with large logit magnitude. Deep networks can extrapolate logit magnitude far outside the training distribution.
Semantic vs. covariate shift: OOD inputs can shift in distribution while remaining semantically in-distribution (a different imaging scanner produces different pixel statistics but the same anatomy). The model needs to distinguish "different distribution" from "semantically OOD."
No negative training data: standard training uses only in-distribution data. The model has no concept of "outside." Outlier exposure (training on known OOD examples) helps but requires knowing which distributions to expose.
Impossibility results: it can be shown that for any OOD detector, there exists a distribution that it will classify as in-distribution with high confidence. Perfect OOD detection is impossible without distributional assumptions.

The practical approach: combine multiple OOD signals (energy + Mahalanobis + ensemble variance), use monitoring in production to detect distribution shift over time, and design the system to fail safely when uncertainty is high.

Key Takeaways

Calibration: a model's confidence score should equal its empirical accuracy. Neural networks are systematically overconfident without explicit calibration.
ECE measures weighted average miscalibration. MCE measures worst-case. Both should be reported for safety-critical systems.
Reliability diagram: the standard visual tool - plot accuracy vs. confidence per bin. Perfect calibration is the diagonal.
Temperature scaling: one parameter, one validation set, near-zero cost. It often cuts ECE from several percent to around 1% (Guo et al. 2017). Always try this first.
OOD detection: energy score is the modern baseline (fast, no retraining). Mahalanobis for higher accuracy. Ensembles for the best performance.
Calibration and OOD detection are complementary - calibration fixes confidence for in-distribution inputs; OOD detection flags when the model should not be trusted at all.
Never fit calibration parameters on the test set. Always use a three-way train/calibration/test split.

