
import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem';

Bayesian Neural Networks - Uncertainty Quantification for Deep Learning

Reading time: 45–55 minutes

Interview relevance: Very High - BNNs appear in MLE, AI Eng, and Research Eng rounds

Target roles: Machine Learning Engineer, AI Engineer, Research Engineer, MLOps Engineer


The Real Interview Moment

It is 2021. A startup has just deployed a self-driving perception model to a small fleet of test vehicles. The model has passed every benchmark - 98.4% accuracy on the KITTI dataset, excellent performance on all held-out test splits. The team is proud.

On a Tuesday afternoon, vehicle 7 encounters an unusual scene at a crosswalk: a life-size cardboard cutout of a person, placed there by a local artist. The standard neural network processes the input through its deterministic forward pass and returns its verdict: "pedestrian, confidence 87%." The vehicle slows.

A second vehicle, running an experimental Bayesian neural network on the same backbone, also processes the scene. Its verdict: "pedestrian, confidence 60%, uncertainty interval wide." The uncertainty flag triggers a fallback protocol - the vehicle slows further and sends the scene for human review.

The reviewer immediately recognizes the cardboard cutout. She notes in the system log: "Model B knew what it didn't know."

This is not a story about accuracy. Model A was more confident and would have made the same correct decision. This is a story about calibration - does your model's confidence reflect its actual knowledge? A model that says 87% when it is genuinely uncertain is dangerous at deployment scale, because you cannot tell when to trust it.

Bayesian neural networks are the principled answer to this problem. Instead of learning a single set of weights, a BNN learns a distribution over weights. Predictions come from integrating over that distribution - and the width of that distribution tells you how much the model knows.


Why Point Estimates Are Dangerous

A standard neural network produces a single weight vector $\mathbf{W}^*$ via maximum likelihood estimation:

$$\mathbf{W}^* = \arg\max_\mathbf{W} \log p(\mathcal{D}|\mathbf{W})$$

This is a point estimate. It has no notion of uncertainty in the parameters themselves. When the model encounters an input far from the training distribution, the softmax output can still be high-confidence - because the logits are large, not because the model has genuine evidence.
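The effect of large logits can be seen in a few lines. The sketch below is illustrative only: the weight matrix `W` and the input `direction` are made-up values, and a single linear layer stands in for the last layer of a network. It shows the maximum softmax probability rising toward 1 as the input moves away from the origin, even though no additional evidence has been observed.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


# Hypothetical fixed weights of a 3-class linear classifier (not from the article).
W = np.array([[1.0, -0.5], [-0.3, 0.8], [0.2, 0.1]])
direction = np.array([1.0, 0.2])  # arbitrary direction in input space

# Move the input further and further from the origin along that direction.
scales = [1.0, 5.0, 25.0]
confidences = [softmax(W @ (s * direction)).max() for s in scales]

for s, c in zip(scales, confidences):
    print(f"input scale {s:5.1f} -> max softmax probability {c:.3f}")
```

The logits scale linearly with the input, so the gap between the largest logit and the rest grows without bound, and softmax turns that gap into near-certainty.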

Consider the three failure modes of point estimates:

| Failure Mode | Description | Real-World Cost |
| --- | --- | --- |
| Overconfident OOD | High confidence on inputs the model has never seen | Missed safety flags, adversarial vulnerability |
| Miscalibrated softmax | Softmax output is a normalized score, not a calibrated probability | Wrong risk thresholds in clinical, financial settings |
| No parameter uncertainty | Cannot distinguish "unseen region" from "well-trained region" | Silent failures at deployment boundary |

The Bayesian solution is to maintain a posterior distribution over weights rather than a point estimate.


The Bayesian Deep Learning Objective

Posterior Inference

Place a prior $p(\mathbf{W})$ over the network weights. Given data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$, compute the posterior:

$$p(\mathbf{W}|\mathcal{D}) = \frac{p(\mathcal{D}|\mathbf{W})\, p(\mathbf{W})}{p(\mathcal{D})}$$

The denominator $p(\mathcal{D}) = \int p(\mathcal{D}|\mathbf{W})\, p(\mathbf{W})\, d\mathbf{W}$ is the marginal likelihood - an integral over all possible weight configurations. For a network with millions of parameters, this is utterly intractable.

Predictive Distribution

Given a new input $x^*$, the Bayesian prediction integrates over all possible weight settings, weighted by their posterior probability:

$$p(y^*|x^*, \mathcal{D}) = \int p(y^*|x^*, \mathbf{W})\, p(\mathbf{W}|\mathcal{D})\, d\mathbf{W}$$

This integral is also intractable - but it is what we want. It says: average over all plausible models, weighted by how much the data supports them. The result is a proper predictive distribution, not a point prediction.
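In the one setting where the posterior is available in closed form, Bayesian linear regression with known noise, the predictive integral can be checked directly against its Monte Carlo approximation. The sketch below uses made-up toy data and illustrative values for `noise_std` and `prior_std`; it is meant to show the "sample weights, average predictions" recipe that the approximate methods in this article rely on, not a BNN itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 2*x + noise, with known noise std (all values are illustrative).
noise_std = 0.5
prior_std = 1.0
X = rng.normal(size=(20, 1))
y = 2.0 * X[:, 0] + noise_std * rng.normal(size=20)

# Exact Gaussian posterior over the single weight w: N(post_mean, post_cov).
post_prec = np.eye(1) / prior_std**2 + X.T @ X / noise_std**2
post_cov = np.linalg.inv(post_prec)
post_mean = post_cov @ X.T @ y / noise_std**2

x_star = np.array([3.0])

# Closed-form predictive mean and variance.
exact_mean = x_star @ post_mean
exact_var = x_star @ post_cov @ x_star + noise_std**2

# Monte Carlo version of the predictive integral: sample weights, average predictions.
w_samples = rng.multivariate_normal(post_mean, post_cov, size=20000)  # [S, 1]
f_samples = w_samples @ x_star                                        # [S]
mc_mean = f_samples.mean()
mc_var = f_samples.var() + noise_std**2  # spread across weights + observation noise

print(f"predictive mean: exact {exact_mean:.4f} | Monte Carlo {mc_mean:.4f}")
print(f"predictive var:  exact {exact_var:.4f} | Monte Carlo {mc_var:.4f}")
```

The spread of `f_samples` comes from uncertainty about the weight, while `noise_std**2` is added separately as the noise inherent in the observations.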

The two types of uncertainty this captures:

| Epistemic uncertainty | Aleatoric uncertainty |
| --- | --- |
| Model uncertainty | Data uncertainty |
| Reducible with more data | Irreducible - inherent noise |
| Comes from the width of $p(\mathbf{W} \mid \mathcal{D})$ | Comes from $p(y \mid x, \mathbf{W})$ |
| "We don't know enough" | "The world is noisy" |

Mermaid: The BNN Inference Pipeline


Method 1: Variational Inference

The ELBO

Since we cannot compute $p(\mathbf{W}|\mathcal{D})$ exactly, variational inference (VI) approximates it with a tractable family $q_\phi(\mathbf{W})$, parameterized by $\phi$. We minimize the KL divergence:

$$\phi^* = \arg\min_\phi D_{\text{KL}}(q_\phi(\mathbf{W}) \| p(\mathbf{W}|\mathcal{D}))$$

Expanding this (using $\log p(\mathcal{D}) = \text{const}$ w.r.t. $\phi$):

$$D_{\text{KL}}(q_\phi \| p(\mathbf{W}|\mathcal{D})) = -\mathcal{L}(\phi) + \log p(\mathcal{D})$$

Because $\log p(\mathcal{D})$ does not depend on $\phi$, minimizing this KL divergence is equivalent to maximizing the Evidence Lower BOund (ELBO):

$$\mathcal{L}(\phi) = \underbrace{\mathbb{E}_{q_\phi(\mathbf{W})}[\log p(\mathcal{D}|\mathbf{W})]}_{\text{expected log-likelihood}} - \underbrace{D_{\text{KL}}(q_\phi(\mathbf{W}) \| p(\mathbf{W}))}_{\text{KL regularization}}$$

Intuition:

  • First term: weights should explain the data well (fit).
  • Second term: the approximate posterior should not deviate too far from the prior (regularization).

This is a familiar trade-off - but derived from first principles of Bayesian inference.
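The identity $\log p(\mathcal{D}) = \mathcal{L}(\phi) + D_{\text{KL}}(q_\phi \| p(\mathbf{W}|\mathcal{D}))$ can be verified numerically on a model small enough to solve exactly. The sketch below assumes a single scalar weight with prior $\mathcal{N}(0, 1)$ and observations $y_i \sim \mathcal{N}(w, s^2)$; the data, the noise variance `s2`, and the choice of `q` are all illustrative.

```python
import numpy as np


def kl_gauss(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for scalar Gaussians (v = variance)."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)


rng = np.random.default_rng(1)
s2 = 0.25  # known observation noise variance (illustrative)
y = rng.normal(loc=0.8, scale=np.sqrt(s2), size=10)
n = len(y)

# Exact posterior for prior w ~ N(0, 1) and likelihood y_i ~ N(w, s2).
post_var = 1.0 / (1.0 + n / s2)
post_mean = post_var * y.sum() / s2

# Exact log evidence: marginally, y ~ N(0, s2 * I + 1 1^T).
cov = s2 * np.eye(n) + np.ones((n, n))
_, logdet = np.linalg.slogdet(cov)
log_evidence = -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(cov, y))


def elbo(m, v):
    """ELBO for q(w) = N(m, v): expected log-likelihood minus KL to the prior."""
    exp_loglik = np.sum(-0.5 * np.log(2 * np.pi * s2) - ((y - m) ** 2 + v) / (2 * s2))
    return exp_loglik - kl_gauss(m, v, 0.0, 1.0)


m, v = 0.3, 0.5  # an arbitrary, deliberately poor q
gap = kl_gauss(m, v, post_mean, post_var)
print(f"log p(D)           = {log_evidence:.4f}")
print(f"ELBO + KL(q||post) = {elbo(m, v) + gap:.4f}")
print(f"ELBO at posterior  = {elbo(post_mean, post_var):.4f}")
```

For any `q` the ELBO sits below the log evidence by exactly the KL gap, and the bound becomes tight when `q` equals the true posterior.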

Mean Field Assumption

The most common choice is a mean-field variational family: each weight is an independent Gaussian:

$$q_\phi(\mathbf{W}) = \prod_{i} \mathcal{N}(w_i; \mu_i, \sigma_i^2)$$

Parameters to learn: $\phi = \{\mu_i, \sigma_i\}$ for every weight. A network with $N$ weights now has $2N$ variational parameters. The KL term has a closed form for Gaussians.


Method 2: Bayes by Backprop (Blundell et al. 2015)

Bayes by Backprop is the canonical algorithm for training mean-field BNNs via gradient descent. The key challenge: the ELBO contains $\mathbb{E}_{q_\phi}[\cdot]$, and we need gradients through this expectation.

The Reparameterization Trick

Sample $\varepsilon \sim \mathcal{N}(0, I)$, then:

$$w = \mu + \sigma \cdot \varepsilon$$

Now $w$ is a deterministic function of $(\mu, \sigma, \varepsilon)$. Gradients flow through $\mu$ and $\sigma$ via backpropagation. The stochasticity is isolated in $\varepsilon$.

$$\nabla_\phi \mathbb{E}_{q_\phi}[f(w)] = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0,I)}\left[\nabla_\phi f(\mu + \sigma \cdot \varepsilon)\right]$$

Estimate with Monte Carlo (one sample is usually enough per step):

$$\nabla_\phi \mathcal{L} \approx \nabla_\phi \left[\log p(\mathcal{D}|\mu + \sigma \cdot \varepsilon) - D_{\text{KL}}(q_\phi \| p)\right]$$
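To see the pathwise gradient at work without any autograd machinery, the sketch below picks a test function with a known answer, $f(w) = \sin(w)$, for which $\mathbb{E}[\sin(w)] = \sin(\mu)\, e^{-\sigma^2/2}$ when $w \sim \mathcal{N}(\mu, \sigma^2)$. The values of `mu` and `sigma` and the sample count are arbitrary choices for illustration; the chain rule is applied by hand where a framework would apply it automatically.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.7, 0.5
eps = rng.normal(size=200_000)  # noise drawn independently of (mu, sigma)
w = mu + sigma * eps            # reparameterized samples, w ~ N(mu, sigma^2)

# Pathwise gradients of f(w) = sin(w): chain rule through w = mu + sigma * eps.
grad_mu_mc = np.mean(np.cos(w))           # df/dw * dw/dmu,    with dw/dmu = 1
grad_sigma_mc = np.mean(np.cos(w) * eps)  # df/dw * dw/dsigma, with dw/dsigma = eps

# Analytic reference, from E[sin(w)] = sin(mu) * exp(-sigma^2 / 2).
grad_mu_exact = np.cos(mu) * np.exp(-sigma**2 / 2)
grad_sigma_exact = -sigma * np.sin(mu) * np.exp(-sigma**2 / 2)

print(f"d/dmu:    Monte Carlo {grad_mu_mc:.4f} | exact {grad_mu_exact:.4f}")
print(f"d/dsigma: Monte Carlo {grad_sigma_mc:.4f} | exact {grad_sigma_exact:.4f}")
```

In training only one `eps` draw is typically used per step, so each gradient is noisy but unbiased; the large sample here is only to make the agreement visible.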

KL for Gaussians - Closed Form

For mean-field Gaussians vs. standard normal prior $p(w_i) = \mathcal{N}(0, 1)$:

$$D_{\text{KL}}(\mathcal{N}(\mu_i, \sigma_i^2) \| \mathcal{N}(0, 1)) = \frac{1}{2}\left(\sigma_i^2 + \mu_i^2 - 1 - \log\sigma_i^2\right)$$

Summing over all weights:

$$D_{\text{KL}}(q_\phi \| p) = \frac{1}{2}\sum_i \left(\sigma_i^2 + \mu_i^2 - 1 - \log\sigma_i^2\right)$$

No sampling needed for this term - it is analytically computed and differentiable.
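As a sanity check on the closed form, it can be compared with a brute-force Monte Carlo estimate of $\mathbb{E}_q[\log q(w) - \log p(w)]$. The three `mu` and `sigma` values below are arbitrary, and `kl_to_standard_normal` is a helper name introduced only for this sketch.

```python
import numpy as np


def kl_to_standard_normal(mu: np.ndarray, sigma: np.ndarray) -> float:
    """Sum over weights of KL( N(mu_i, sigma_i^2) || N(0, 1) )."""
    return float(0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2)))


mu = np.array([0.5, -1.0, 0.0])
sigma = np.array([0.3, 1.0, 2.0])
kl_exact = kl_to_standard_normal(mu, sigma)

# Monte Carlo check: KL = E_q[log q(w) - log p(w)], estimated from samples of q.
rng = np.random.default_rng(0)
w = mu + sigma * rng.normal(size=(200_000, 3))
log_q = -0.5 * np.log(2 * np.pi * sigma**2) - (w - mu) ** 2 / (2 * sigma**2)
log_p = -0.5 * np.log(2 * np.pi) - w**2 / 2
kl_mc = float(np.mean(np.sum(log_q - log_p, axis=1)))

print(f"closed form: {kl_exact:.4f} | Monte Carlo: {kl_mc:.4f}")
```

The closed form costs one pass over the parameters and has no sampling noise, which is why the KL term is usually computed analytically when the prior allows it.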

Practical Parameterization

To ensure $\sigma_i > 0$, parameterize via softplus:

$$\sigma_i = \log(1 + \exp(\rho_i))$$

Learn $\rho_i$ (unconstrained), convert to $\sigma_i$ during the forward pass.
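Evaluating $\log(1 + \exp(\rho))$ literally overflows for large $\rho$, so implementations use an algebraically equivalent stable form. The sketch below shows one such form together with its inverse, which is handy for choosing an initial $\rho$ that corresponds to a desired starting $\sigma$. The function names and the target value 0.05 are illustrative, not taken from the article.

```python
import math


def softplus(rho: float) -> float:
    """Numerically stable log(1 + exp(rho))."""
    return max(rho, 0.0) + math.log1p(math.exp(-abs(rho)))


def inverse_softplus(sigma: float) -> float:
    """Return rho such that softplus(rho) == sigma (requires sigma > 0)."""
    return math.log(math.expm1(sigma))


for rho in (-3.0, 0.0, 50.0, 1000.0):
    print(f"rho = {rho:7.1f} -> sigma = {softplus(rho):.6f}")

# Pick rho so that training starts from a chosen posterior std.
rho_init = inverse_softplus(0.05)
print(f"rho for sigma = 0.05: {rho_init:.4f}")
```

For strongly negative $\rho$ softplus behaves like $\exp(\rho)$ and for large positive $\rho$ like $\rho$ itself, so $\sigma$ stays positive across the whole real line.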


Method 3: Monte Carlo Dropout (Gal & Ghahramani 2016)

In 2016, Yarin Gal and Zoubin Ghahramani showed that training a standard neural network with dropout can be interpreted as approximate variational inference in a deep Gaussian process. The key insight:

:::tip Key Insight
Keeping dropout active at test time and running multiple forward passes approximates sampling from the posterior predictive distribution.
:::

This requires no architecture changes to an existing model - just keep dropout on at inference time.

MC Dropout Prediction

Run $T$ stochastic forward passes with different dropout masks:

$$\hat{y}_t = f(x^*, \mathbf{W}_t), \quad \mathbf{W}_t \sim \text{dropout mask}$$

Mean prediction:

$$\bar{y} = \frac{1}{T}\sum_{t=1}^T \hat{y}_t$$

Total predictive variance (combining epistemic and aleatoric):

$$\text{Var}[y^*] \approx \frac{1}{T}\sum_{t=1}^T \hat{y}_t^2 - \left(\frac{1}{T}\sum_{t=1}^T \hat{y}_t\right)^2 + \sigma^2$$

where $\sigma^2$ is the aleatoric noise term (if modeled explicitly).

For classification, the predictive entropy of the mean prediction measures total uncertainty:

$$H[\bar{p}] = -\sum_c \bar{p}_c \log \bar{p}_c$$

The mutual information between predictions and weights measures epistemic uncertainty specifically:

$$I[y^*, \mathbf{W}|\mathcal{D}, x^*] = H[\bar{p}] - \frac{1}{T}\sum_{t=1}^T H[p_t]$$
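Two hand-built cases make the decomposition concrete. In the sketch below the "passes" are hard-coded probability vectors rather than real network outputs, and `uncertainty_decomposition` is a name chosen for this illustration. Both cases have the same mean prediction, and therefore the same predictive entropy, but the mutual information tells them apart.

```python
import numpy as np


def uncertainty_decomposition(probs: np.ndarray, eps: float = 1e-12) -> dict:
    """probs has shape [T, C]: softmax outputs of T stochastic passes for one input."""
    mean_probs = probs.mean(axis=0)
    total = -np.sum(mean_probs * np.log(mean_probs + eps))            # H[p_bar]
    expected = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))  # mean_t H[p_t]
    return {"total": total, "aleatoric": expected, "epistemic": total - expected}


# Case 1: every pass agrees that the input is ambiguous -> aleatoric only.
agree = np.array([[0.5, 0.5]] * 4)
# Case 2: passes are individually confident but contradict each other -> epistemic.
disagree = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])

for name, probs in [("agree", agree), ("disagree", disagree)]:
    u = uncertainty_decomposition(probs)
    print(f"{name:9s} total {u['total']:.3f} | aleatoric {u['aleatoric']:.3f} "
          f"| epistemic {u['epistemic']:.3f}")
```

Looking only at the averaged probabilities would make the two inputs indistinguishable; the per-pass entropies are what separate "the data is ambiguous" from "the sampled models disagree".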


Method 4: Deep Ensembles (Lakshminarayanan et al. 2017)

The simplest approach that works surprisingly well: train $M$ independently initialized models, each with a different random seed. At prediction time, average their outputs.

$$\bar{y} = \frac{1}{M}\sum_{m=1}^M f(x^*; \mathbf{W}_m)$$

$$\text{Var}[y^*] \approx \frac{1}{M}\sum_{m=1}^M (f(x^*; \mathbf{W}_m) - \bar{y})^2$$

Why does this work? Each optimization run finds a different local minimum in the loss landscape. These minima are functionally diverse - they make different mistakes. The ensemble captures functional diversity, which serves as a proxy for posterior uncertainty.

Key result from the paper: Deep ensembles with $M=5$ consistently outperform MC Dropout with $T=50$ on calibration and OOD detection metrics. The method is simple, parallelizable, and production-ready.
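The variance formula above covers only the disagreement between members. For regression, Lakshminarayanan et al. additionally have each member predict its own variance and treat the ensemble as a uniformly weighted Gaussian mixture. A minimal sketch of that combination step follows, assuming per-member means and variances are already available; the function name and the toy numbers are illustrative.

```python
import numpy as np


def combine_gaussian_members(means: np.ndarray, variances: np.ndarray):
    """
    Moment-match a uniform mixture of M Gaussians N(means[m], variances[m]).
    means, variances: shape [M, ...]. Returns (mixture mean, mixture variance).
    """
    mix_mean = means.mean(axis=0)
    aleatoric = variances.mean(axis=0)                   # average predicted noise
    epistemic = (means**2).mean(axis=0) - mix_mean**2    # disagreement between members
    return mix_mean, aleatoric + epistemic


# Two members that each predict unit noise but disagree about the mean.
means = np.array([0.0, 2.0])
variances = np.array([1.0, 1.0])
mix_mean, mix_var = combine_gaussian_members(means, variances)
print(f"mixture mean {mix_mean:.1f}, mixture variance {mix_var:.1f}")
```

When all members agree on the mean, the second term vanishes and the mixture variance reduces to the average predicted noise.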

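| Method | Parameters | Inference Cost | Calibration | OOD Detection | Retraining Needed |
| --- | --- | --- | --- | --- | --- |
| Point estimate | $N$ | 1x | Poor | Poor | No |
| MC Dropout | $N$ | $T\times$ | Medium | Medium | No |
| Bayes by Backprop | $2N$ | $T\times$ | Good | Good | Yes |
| Deep Ensembles | $MN$ | $M\times$ | Best | Best | Yes (x M) |
| Last-layer Laplace | $\approx K^2$ | ~1x | Good | Good | Partial |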

Code: Complete Implementations

MC Dropout in PyTorch

import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple


class MCDropoutNet(nn.Module):
    """
    Neural network with MC Dropout for uncertainty estimation.
    Dropout is kept active at both train and test time.
    """

    def __init__(
        self,
        input_dim: int,
        hidden_dim: int,
        output_dim: int,
        dropout_rate: float = 0.1,
    ):
        super().__init__()
        self.dropout_rate = dropout_rate
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=dropout_rate),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=dropout_rate),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

    def predict_with_uncertainty(
        self,
        x: torch.Tensor,
        n_passes: int = 50,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Run T stochastic forward passes.
        Returns (mean, variance) of predictions.
        """
        # Keep model in train() so Dropout stays active
        self.train()
        predictions = []
        with torch.no_grad():
            for _ in range(n_passes):
                pred = self.forward(x)
                predictions.append(pred)

        predictions = torch.stack(predictions, dim=0)  # [T, batch, output]
        mean = predictions.mean(dim=0)
        variance = predictions.var(dim=0)
        return mean, variance

    def predict_classification_uncertainty(
        self,
        x: torch.Tensor,
        n_passes: int = 50,
    ) -> dict:
        """
        Predictive entropy and mutual information for classification tasks.
        """
        self.train()
        all_probs = []
        with torch.no_grad():
            for _ in range(n_passes):
                logits = self.forward(x)
                probs = F.softmax(logits, dim=-1)
                all_probs.append(probs)

        all_probs = torch.stack(all_probs, dim=0)  # [T, batch, C]
        mean_probs = all_probs.mean(dim=0)  # [batch, C]

        eps = 1e-8
        # Total uncertainty: predictive entropy H[E_W[p(y|x,W)]]
        pred_entropy = -(mean_probs * (mean_probs + eps).log()).sum(dim=-1)

        # Aleatoric component: E_W[H[p(y|x,W)]]
        entropy_per_pass = -(all_probs * (all_probs + eps).log()).sum(dim=-1)
        mean_entropy = entropy_per_pass.mean(dim=0)

        # Epistemic component: mutual information
        mutual_info = pred_entropy - mean_entropy

        return {
            "mean_probs": mean_probs,
            "predictive_entropy": pred_entropy,  # total uncertainty
            "aleatoric_uncertainty": mean_entropy,
            "epistemic_uncertainty": mutual_info,
        }


def train_mc_dropout():
    torch.manual_seed(42)
    X_train = torch.randn(500, 10)
    y_train = (
        X_train[:, 0] * 2 + X_train[:, 1] - 0.5 + 0.1 * torch.randn(500)
    ).unsqueeze(1)

    X_in = torch.randn(20, 10)
    X_ood = torch.randn(20, 10) * 5  # far from training distribution

    model = MCDropoutNet(input_dim=10, hidden_dim=128, output_dim=1, dropout_rate=0.1)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.MSELoss()

    model.train()
    for epoch in range(200):
        optimizer.zero_grad()
        loss = criterion(model(X_train), y_train)
        loss.backward()
        optimizer.step()

    mean_in, var_in = model.predict_with_uncertainty(X_in, n_passes=100)
    mean_ood, var_ood = model.predict_with_uncertainty(X_ood, n_passes=100)

    print(f"In-distribution - mean variance: {var_in.mean().item():.4f}")
    print(f"Out-of-distribution - mean variance: {var_ood.mean().item():.4f}")
    # OOD variance should be substantially higher


train_mc_dropout()

Bayes by Backprop

import torch
import torch.nn as nn
import torch.nn.functional as F
import math


class BayesianLinear(nn.Module):
    """
    Bayesian linear layer with mean-field Gaussian variational posterior.

    Weight: w ~ N(mu_w, sigma_w^2)
    Bias: b ~ N(mu_b, sigma_b^2)
    Prior: p(w) = N(0, prior_std^2)
    """

    def __init__(self, in_features: int, out_features: int, prior_std: float = 1.0):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.prior_std = prior_std
        self.prior_log_var = 2 * math.log(prior_std)

        # Variational parameters: mu and rho (sigma = softplus(rho) > 0)
        self.weight_mu = nn.Parameter(
            torch.empty(out_features, in_features).normal_(0, 0.1)
        )
        self.weight_rho = nn.Parameter(
            torch.empty(out_features, in_features).fill_(-3.0)
        )
        self.bias_mu = nn.Parameter(torch.empty(out_features).normal_(0, 0.1))
        self.bias_rho = nn.Parameter(torch.empty(out_features).fill_(-3.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight_sigma = F.softplus(self.weight_rho)
        bias_sigma = F.softplus(self.bias_rho)

        # Reparameterization: w = mu + sigma * eps, eps ~ N(0, I)
        weight = self.weight_mu + weight_sigma * torch.randn_like(self.weight_mu)
        bias = self.bias_mu + bias_sigma * torch.randn_like(self.bias_mu)
        return F.linear(x, weight, bias)

    def kl_loss(self) -> torch.Tensor:
        """KL(N(mu, sigma^2) || N(0, prior_std^2)) - closed form."""
        weight_sigma = F.softplus(self.weight_rho)
        bias_sigma = F.softplus(self.bias_rho)

        def _kl(mu, sigma):
            return 0.5 * (
                (sigma ** 2 + mu ** 2) / (self.prior_std ** 2)
                - 1
                - 2 * sigma.log()
                + self.prior_log_var
            ).sum()

        return _kl(self.weight_mu, weight_sigma) + _kl(self.bias_mu, bias_sigma)


class BayesianMLP(nn.Module):
    def __init__(
        self,
        input_dim: int,
        hidden_dim: int,
        output_dim: int,
        prior_std: float = 1.0,
    ):
        super().__init__()
        self.l1 = BayesianLinear(input_dim, hidden_dim, prior_std)
        self.l2 = BayesianLinear(hidden_dim, hidden_dim, prior_std)
        self.l3 = BayesianLinear(hidden_dim, output_dim, prior_std)

    def forward(self, x):
        return self.l3(F.relu(self.l2(F.relu(self.l1(x)))))

    def kl_loss(self):
        return self.l1.kl_loss() + self.l2.kl_loss() + self.l3.kl_loss()


def elbo_loss(
    model: BayesianMLP,
    x: torch.Tensor,
    y: torch.Tensor,
    n_samples: int = 1,
    n_train: int = 1000,
    noise_std: float = 0.1,
) -> torch.Tensor:
    """
    Negative ELBO = -E_q[log p(D|W)] + KL(q||p)
    For regression: log p(D|W) = -||y - f(x)||^2 / (2*noise_std^2) - const
    """
    log_lik = 0.0
    for _ in range(n_samples):
        pred = model(x)
        log_lik += -0.5 * ((y - pred) ** 2).sum() / (noise_std ** 2)
    log_lik /= n_samples

    # Mini-batch KL weighting: each batch carries its share (batch_size / n_train)
    # of the KL term, so the shares sum to one full KL over an epoch.
    kl = model.kl_loss() * (len(x) / n_train)
    return -(log_lik - kl)


def train_bayesian_mlp():
    torch.manual_seed(42)
    N = 500
    X = torch.randn(N, 5)
    y = (X[:, 0] * 1.5 - X[:, 1] + 0.1 * torch.randn(N)).unsqueeze(1)

    model = BayesianMLP(5, 64, 1, prior_std=1.0)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(300):
        model.train()
        optimizer.zero_grad()
        # Full-batch training: len(X) == n_train, so the KL weight is 1
        loss = elbo_loss(model, X, y, n_samples=1, n_train=N)
        loss.backward()
        optimizer.step()
        if (epoch + 1) % 100 == 0:
            print(f"Epoch {epoch + 1:3d} | ELBO loss: {loss.item():.4f}")

    # Prediction with uncertainty: each forward pass samples new weights
    X_test = torch.randn(5, 5)
    preds = torch.stack([model(X_test).detach() for _ in range(100)])
    print("\nPrediction mean:\n", preds.mean(0))
    print("Prediction std:\n", preds.std(0))


train_bayesian_mlp()

Deep Ensembles

import torch
import torch.nn as nn
from typing import List


class BaseNet(nn.Module):
    """Single deterministic member of an ensemble."""

    def __init__(self, input_dim, hidden_dim, output_dim, seed=0):
        super().__init__()
        torch.manual_seed(seed)
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, x):
        return self.net(x)


class DeepEnsemble:
    """
    Deep Ensemble: M independently trained models.
    Variance across members approximates epistemic uncertainty.
    """

    def __init__(self, models: List[nn.Module]):
        self.models = models

    def predict(self, x: torch.Tensor) -> dict:
        preds = []
        for model in self.models:
            model.eval()
            with torch.no_grad():
                preds.append(model(x))
        preds = torch.stack(preds, dim=0)  # [M, batch, output]
        return {
            "mean": preds.mean(0),
            "variance": preds.var(0),
            "std": preds.std(0),
            "all_preds": preds,
        }

    @classmethod
    def train_ensemble(
        cls,
        X_train: torch.Tensor,
        y_train: torch.Tensor,
        input_dim: int,
        hidden_dim: int,
        output_dim: int,
        M: int = 5,
        epochs: int = 200,
        lr: float = 1e-3,
    ) -> "DeepEnsemble":
        models = []
        for m in range(M):
            model = BaseNet(input_dim, hidden_dim, output_dim, seed=m * 42)
            opt = torch.optim.Adam(model.parameters(), lr=lr)
            crit = nn.MSELoss()
            model.train()
            for _ in range(epochs):
                opt.zero_grad()
                crit(model(X_train), y_train).backward()
                opt.step()
            print(f" Member {m + 1}/{M} trained.")
            models.append(model)
        return cls(models)


def demo_ensemble():
    torch.manual_seed(0)
    N = 400
    X = torch.randn(N, 8)
    y = (X[:, 0] - X[:, 2] * 0.5 + 0.05 * torch.randn(N)).unsqueeze(1)

    X_in = torch.randn(10, 8)
    X_ood = torch.randn(10, 8) * 8  # far from training distribution

    ensemble = DeepEnsemble.train_ensemble(X, y, 8, 64, 1, M=5, epochs=200)
    r_in = ensemble.predict(X_in)
    r_ood = ensemble.predict(X_ood)
    print(f"\nIn-distribution - mean std: {r_in['std'].mean().item():.4f}")
    print(f"Out-of-distribution - mean std: {r_ood['std'].mean().item():.4f}")


demo_ensemble()

Uncertainty Calibration - Reliability Diagram and ECE

import numpy as np


def expected_calibration_error(
    y_true: np.ndarray,
    y_prob: np.ndarray,
    n_bins: int = 10,
) -> float:
    """
    Binary ECE on the predicted positive-class probability:
    ECE = sum_m (|B_m| / n) * |acc(B_m) - conf(B_m)|
    acc(B_m) is the observed positive rate in bin m,
    conf(B_m) is the mean predicted probability in bin m.
    """
    bins = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    n = len(y_true)
    for i in range(n_bins):
        if i == n_bins - 1:
            # Last bin is closed on the right so y_prob == 1.0 is counted
            mask = (y_prob >= bins[i]) & (y_prob <= bins[i + 1])
        else:
            mask = (y_prob >= bins[i]) & (y_prob < bins[i + 1])
        if mask.sum() == 0:
            continue
        acc = y_true[mask].mean()
        conf = y_prob[mask].mean()
        ece += (mask.sum() / n) * abs(acc - conf)
    return float(ece)


# Simulated comparison: overconfident standard net vs. better-calibrated MC Dropout.
# Labels are drawn from a known probability p_true, so calibration is well defined.
rng = np.random.default_rng(42)
n = 5000
p_true = rng.uniform(0.05, 0.95, size=n)
y_true = (rng.uniform(size=n) < p_true).astype(int)

# Standard network - systematically overconfident (true logits scaled up by 3)
logit_true = np.log(p_true / (1 - p_true))
y_prob_standard = 1.0 / (1.0 + np.exp(-3.0 * logit_true))
# MC Dropout - better calibrated (close to the true probability)
y_prob_mc = np.clip(p_true + rng.normal(0, 0.03, n), 0.01, 0.99)

ece_std = expected_calibration_error(y_true, y_prob_standard)
ece_mc = expected_calibration_error(y_true, y_prob_mc)

print(f"ECE (standard network): {ece_std:.4f}")
print(f"ECE (MC Dropout): {ece_mc:.4f}")
# Lower ECE = better calibration
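A reliability diagram plots the same per-bin quantities that ECE sums up: mean predicted probability against observed frequency. The sketch below only computes the points to plot and leaves the plotting itself out; `reliability_curve` is a hypothetical helper name, and the six data points are made up so the binning is easy to follow by eye.

```python
import numpy as np


def reliability_curve(y_true: np.ndarray, y_prob: np.ndarray, n_bins: int = 10):
    """
    Per-bin (mean predicted probability, observed positive rate, count)
    for a binary classifier. Empty bins are skipped.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitizing against the interior edges yields bin ids 0..n_bins-1,
    # and a probability of exactly 1.0 falls into the last bin.
    bin_ids = np.digitize(y_prob, edges[1:-1])
    rows = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            rows.append((y_prob[mask].mean(), y_true[mask].mean(), int(mask.sum())))
    return rows


y_true = np.array([0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.3, 0.45, 0.9, 0.95, 1.0])
rows = reliability_curve(y_true, y_prob, n_bins=5)
for conf, acc, count in rows:
    print(f"mean confidence {conf:.2f} | observed frequency {acc:.2f} | n = {count}")
```

Points on the diagonal (confidence equal to observed frequency) indicate good calibration; points below it indicate overconfidence in that bin.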

Production Trade-offs

When to Use Each Method

Do you need uncertainty estimates?

├─ NO → Standard neural network (fastest, simplest)

└─ YES → What is your inference budget?

├─ Tight (latency-critical) ──────── MC Dropout (small T, e.g. T=10; the T passes can be batched)

├─ Moderate (batch inference) ─────── Deep Ensembles M=3–5
│ (parallel, most reliable)

├─ Post-hoc (no retraining budget) ── Last-layer Laplace approximation

└─ Research / safety-critical ──────── Full BNN (Bayes by Backprop)

Last-Layer Laplace Approximation

A practical middle ground: train a standard network, then apply the Laplace approximation only to the last layer. The last-layer posterior is Gaussian - tractable and fast. This gives decent uncertainty with near-zero additional training cost.

# pip install laplace-torch
import torch
from laplace import Laplace

# Train a standard model first, then wrap it.
# `model` (a trained nn.Module) and `train_loader` (its DataLoader) are assumed to exist.
la = Laplace(
    model,
    "regression",
    subset_of_weights="last_layer",
    hessian_structure="kron",
)
la.fit(train_loader)
la.optimize_prior_precision()

x_test = torch.randn(5, 10)
mean, var = la(x_test)

Mermaid: Production Decision Map


:::warning Common Mistake: Treating Softmax as Probability
Softmax output is NOT a calibrated probability. A model can output [0.01, 0.02, 0.97] for an input it has never seen - because the logits happen to be large. Always calibrate (temperature scaling) or use a method with principled uncertainty before trusting softmax scores for high-stakes decisions.
:::
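Temperature scaling itself is small enough to sketch in full: divide the logits by a single scalar $T$ chosen to minimize negative log-likelihood on held-out data. The example below is a simplified illustration, not a production recipe. It uses a grid search instead of a gradient-based optimizer, and the "model" is synthetic: labels are sampled from `softmax(true_logits)` while the reported logits are deliberately three times too large. The function names and grid bounds are assumptions made for this sketch.

```python
import numpy as np


def nll(logits: np.ndarray, labels: np.ndarray, temperature: float) -> float:
    """Mean negative log-likelihood of softmax(logits / temperature)."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def fit_temperature(logits, labels, t_min=0.25, t_max=10.0, n_grid=400) -> float:
    """Pick the single scalar T that minimizes validation NLL (simple grid search)."""
    grid = np.linspace(t_min, t_max, n_grid)
    losses = [nll(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])


# Synthetic validation set: overconfident by construction (logits scaled by 3).
rng = np.random.default_rng(0)
true_logits = rng.normal(size=(5000, 3))
probs = np.exp(true_logits) / np.exp(true_logits).sum(axis=1, keepdims=True)
u = rng.uniform(size=(5000, 1))
labels = (probs.cumsum(axis=1) > u).argmax(axis=1)  # inverse-CDF sampling per row
model_logits = 3.0 * true_logits

T = fit_temperature(model_logits, labels)
print(f"fitted temperature: {T:.2f}")
print(f"NLL before: {nll(model_logits, labels, 1.0):.4f} | after: {nll(model_logits, labels, T):.4f}")
```

Because dividing by $T$ does not change the argmax, accuracy is untouched; only the confidence values move. Temperature scaling corrects average over- or under-confidence and does not by itself make a model aware of out-of-distribution inputs.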

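:::danger Pitfall: MC Dropout at Evaluation Time
If you call `model.eval()` before MC Dropout inference, PyTorch disables dropout and every forward pass will be identical - giving zero variance. Always call `model.train()` before MC Dropout stochastic passes, or use a custom Dropout subclass that ignores training mode. Note that `model.train()` also switches layers such as BatchNorm to training behavior, so for models that contain them it is safer to re-enable only the dropout modules.
:::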
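Calling `model.train()` on the whole network also flips layers such as BatchNorm into training mode, which changes their behavior at inference. One common workaround, sketched here under the assumption of a standard PyTorch model built from the stock dropout modules, is to put everything in eval mode and then re-enable only the dropout layers. `enable_mc_dropout` is a hypothetical helper name, not a PyTorch API.

```python
import torch
import torch.nn as nn

# Stock dropout module types to re-enable; extend this tuple for custom variants.
DROPOUT_TYPES = (nn.Dropout, nn.Dropout2d, nn.Dropout3d, nn.AlphaDropout)


def enable_mc_dropout(model: nn.Module) -> nn.Module:
    """Put the model in eval mode, then switch only dropout layers back to train mode."""
    model.eval()
    for module in model.modules():
        if isinstance(module, DROPOUT_TYPES):
            module.train()
    return model


# Small illustrative model containing both BatchNorm and Dropout.
model = nn.Sequential(
    nn.Linear(4, 16), nn.BatchNorm1d(16), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(16, 1)
)
enable_mc_dropout(model)

x = torch.randn(8, 4)
with torch.no_grad():
    samples = torch.stack([model(x) for _ in range(20)])  # [20, 8, 1]

print("BatchNorm in training mode:", model[1].training)
print("Dropout in training mode:  ", model[3].training)
print("max std across passes:     ", samples.std(dim=0).max().item())
```

With this setup BatchNorm keeps using its running statistics while dropout masks are still resampled on every pass.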


YouTube Resources

| Resource | What You Will Learn |
| --- | --- |
| Yarin Gal - Uncertainty in Deep Learning | Theoretical foundation of MC Dropout as variational inference |
| Pieter Abbeel - Bayesian Deep Learning (UC Berkeley) | Full lecture: priors, posteriors, variational inference in neural networks |
| Weights & Biases - Deep Ensembles Tutorial | Hands-on: training and evaluating deep ensembles |
| Andrej Karpathy - Calibration and Uncertainty | Why confidence calibration matters in production |

Interview Q&A

Q1: What is the difference between MC Dropout and Bayes by Backprop?

Answer: A proper BNN trained with Bayes by Backprop explicitly maintains a variational posterior $q_\phi(\mathbf{W})$ over every weight and optimizes the ELBO. It doubles the parameter count (mean + variance per weight) and requires retraining from scratch. MC Dropout approximates a specific variational distribution implicitly - Gal and Ghahramani (2016) showed it corresponds to a Bernoulli variational distribution. The practical differences:

  • MC Dropout requires no architecture changes to an existing model - just keep dropout active at test time.
  • BbB doubles the parameter count and requires retraining from scratch.
  • MC Dropout's uncertainty estimates are noisier - the Bernoulli approximation is less expressive than full Gaussian mean-field.
  • Deep Ensembles (not strictly Bayesian) outperform both in practice on calibration and OOD benchmarks (Lakshminarayanan et al. 2017).

For production: MC Dropout for quick retrofitting. Deep Ensembles when training cost is acceptable. Full BNNs for research or safety-critical settings.


Q2: Derive the ELBO and explain each term.

Answer: Starting from the KL divergence between the variational posterior and the true posterior:

$$D_{\text{KL}}(q_\phi(\mathbf{W}) \| p(\mathbf{W}|\mathcal{D})) = \mathbb{E}_{q_\phi}\left[\log \frac{q_\phi(\mathbf{W})}{p(\mathbf{W}|\mathcal{D})}\right] \geq 0$$

Expanding using Bayes' theorem $p(\mathbf{W}|\mathcal{D}) = p(\mathcal{D}|\mathbf{W})p(\mathbf{W})/p(\mathcal{D})$:

$$D_{\text{KL}}(q_\phi(\mathbf{W}) \| p(\mathbf{W}|\mathcal{D})) = \mathbb{E}_{q_\phi}[\log q_\phi] - \mathbb{E}_{q_\phi}[\log p(\mathcal{D}|\mathbf{W})] - \mathbb{E}_{q_\phi}[\log p(\mathbf{W})] + \log p(\mathcal{D})$$

The first and third terms combine into $D_{\text{KL}}(q_\phi \| p(\mathbf{W}))$. Rearranging (since KL $\geq 0$, $\log p(\mathcal{D}) \geq \mathcal{L}$):

$$\mathcal{L}(\phi) = \underbrace{\mathbb{E}_{q_\phi}[\log p(\mathcal{D}|\mathbf{W})]}_{\text{fit the data}} - \underbrace{D_{\text{KL}}(q_\phi \| p)}_{\text{stay close to prior}}$$

First term: push the posterior to explain the data well (likelihood). Second term: prevent overfitting by pulling toward the prior. This is analogous to L2 regularization: MAP estimation with a Gaussian prior adds a squared-norm (weight decay) penalty to a single point estimate, whereas the KL term penalizes the whole distribution $q_\phi$ for moving away from the prior.


Q3: Why does the reparameterization trick enable backpropagation through sampling?

Answer: The naive gradient $\nabla_\phi \mathbb{E}_{q_\phi}[f(w)]$ requires differentiating through the sampling distribution, which is not directly possible with standard autograd. The REINFORCE estimator exists but has high variance.

The reparameterization trick rewrites $w \sim q_\phi$ as a deterministic function of $\phi$ and a fixed-distribution noise variable: $w = g(\phi, \varepsilon)$, $\varepsilon \sim p(\varepsilon)$ independent of $\phi$.

For Gaussians: $w = \mu + \sigma \cdot \varepsilon$, $\varepsilon \sim \mathcal{N}(0, I)$.

Now: $\nabla_\phi \mathbb{E}_{q_\phi}[f(w)] = \mathbb{E}_\varepsilon[\nabla_\phi f(\mu + \sigma \varepsilon)]$

The gradient moves inside the expectation and can be estimated with a single Monte Carlo sample. The computation graph flows through $\mu$ and $\sigma$ directly - standard backpropagation handles the rest. The key requirement: $g(\phi, \varepsilon)$ must be differentiable w.r.t. $\phi$.
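The variance claim about REINFORCE can be checked on a toy problem. The sketch below compares the score-function (REINFORCE) and pathwise (reparameterized) estimators of $\nabla_\mu \mathbb{E}[w^2]$ for $w \sim \mathcal{N}(\mu, \sigma^2)$, where the exact gradient is $2\mu$. The function $f(w) = w^2$ and the values $\mu = \sigma = 1$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0
eps = rng.normal(size=200_000)
w = mu + sigma * eps

# Per-sample estimates of d/dmu E[w^2]; the exact value is 2 * mu.
score_function = w**2 * (w - mu) / sigma**2  # f(w) * d/dmu log N(w; mu, sigma^2)
pathwise = 2.0 * w                           # f'(w) * dw/dmu, with w = mu + sigma * eps

print(f"score-function: mean {score_function.mean():.3f}, variance {score_function.var():.2f}")
print(f"pathwise:       mean {pathwise.mean():.3f}, variance {pathwise.var():.2f}")
```

Both estimators are unbiased, but the score-function version multiplies the raw function value by a noisy score term, so its per-sample variance is several times larger in this example. That is why a single sample per step is workable for the pathwise estimator.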


Q4: What are epistemic and aleatoric uncertainty, and why does the distinction matter?

Answer:

Aleatoric uncertainty (data uncertainty): inherent randomness in the observation process. Even with infinite data, the model would still be uncertain - the world itself is stochastic or some inputs are genuinely ambiguous. Example: predicting stock prices given news headlines. This is irreducible.

Epistemic uncertainty (model uncertainty): uncertainty due to limited data - the model has not seen enough examples to be confident. Example: a medical model that has seen few examples of a rare disease variant. This is reducible: gathering more data decreases it.

The distinction matters because:

  • Safety systems: epistemic uncertainty signals the model should defer to a human. Aleatoric uncertainty means the problem is hard - more data won't help.
  • Active learning: collect new labels in regions of high epistemic uncertainty only. Aleatoric uncertainty doesn't guide data collection.
  • OOD detection: inputs far from training distribution show high epistemic uncertainty.

In practice: epistemic $\approx$ mutual information $I[y^*, \mathbf{W}|\mathcal{D}, x^*]$. Aleatoric $\approx$ expected entropy $\mathbb{E}_{q_\phi}[H[p(y^*|x^*, \mathbf{W})]]$.


Q5: Why do Deep Ensembles outperform MC Dropout despite being less "Bayesian"?

Answer: Lakshminarayanan et al. 2017 showed empirically that ensembles with $M=5$ consistently outperform MC Dropout with $T=50$ on calibration and OOD detection. Several reasons:

  1. Functional diversity: each ensemble member converges to a different local minimum. These minima make different predictions in uncertain regions. MC Dropout samples from a single Bernoulli variational approximation - far less diverse.

  2. Loss landscape geometry: deep networks have many functionally distinct local minima separated by high barriers. Ensembles explore across these barriers. Variational inference tends to collapse to a single mode.

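  3. No restriction to a variational family: variational inference introduces a gap between $q_\phi$ and the true posterior because $q_\phi$ is confined to a simple family. Ensembles are not confined to such a family, although their members are not exact posterior samples either.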

  4. Scalability: ensembles are trivially parallelizable. Each model trains independently on different GPUs.

Trade-off: ensembles require $M\times$ training compute and memory. For very large models (LLMs, large vision transformers), this is often prohibitive - MC Dropout or Laplace approximation becomes more practical at scale.


Key Takeaways

  • BNNs replace point weight estimates with distributions, enabling principled uncertainty quantification.
  • The predictive integral $\int p(y^*|x^*, \mathbf{W})\, p(\mathbf{W}|\mathcal{D})\, d\mathbf{W}$ is intractable - we approximate with VI, MC Dropout, or ensembles.
  • ELBO = expected log-likelihood minus KL divergence - the objective for variational BNNs.
  • Reparameterization trick: $w = \mu + \sigma \varepsilon$ makes sampling differentiable through $\mu$ and $\sigma$.
  • MC Dropout: keep dropout at test time, run $T$ passes - zero architecture change, easy retrofit.
  • Deep Ensembles: train $M$ models independently - surprisingly competitive, production-proven.
  • Epistemic uncertainty is reducible; aleatoric is irreducible - this distinction drives active learning and safety protocols.
  • In practice: MC Dropout for quick wins, Deep Ensembles for production reliability, full BNNs for research.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Uncertainty Quantification demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.