
The Probabilistic Perspective on ML - Learning as Bayesian Inference

:::note Reading time: ~35 minutes | Interview relevance: Very High | Target roles: MLE, AI Engineer, Research Engineer, Data Scientist :::

The Real Interview Moment

It is 2002. A spam filter is being trained on 1,000 emails. The word "Viagra" appears in 450 of the 500 spam emails and in 2 of the 500 legitimate emails.

A frequentist model computes: "the word 'Viagra' appears in 90% of spam." This is a point estimate. When a new email arrives containing "Viagra," the model plugs that estimate into Bayes' rule and outputs a single number - spam probability: about 0.996, given the 50/50 class split in the training set. Confident. Definitive. No acknowledgment that this estimate was derived from a relatively small sample, no acknowledgment that the true rate might plausibly be anywhere from about 87% to 93%.

A Bayesian model reasons differently. It starts with a prior belief about spam word rates - based on linguistic expectations, it believes word frequencies in spam are probably high but with real uncertainty. After seeing the 500 spam emails, it updates that prior with the observed data. The result is not a single number but a probability distribution: "I believe the true rate is around 90%, and I am about 95% confident it lies between roughly 87% and 93%." When a new email arrives, it integrates over this entire distribution to produce a calibrated probability estimate - one that correctly reflects how much data it has seen.
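To make the spam example concrete, here is a minimal sketch of the posterior over the word rate. It assumes a uniform Beta(1, 1) prior, which the text does not specify, and the helper name `beta_posterior_summary` is purely illustrative. The second call uses a hypothetical 9-out-of-10 sample to show how the interval widens when the same observed rate comes from far less data.

```python
from scipy import stats

def beta_posterior_summary(successes, trials, prior_a=1.0, prior_b=1.0, level=0.95):
    """Posterior mean and equal-tailed credible interval for a rate,
    under a Beta(prior_a, prior_b) prior and a Binomial likelihood."""
    posterior = stats.beta(prior_a + successes, prior_b + trials - successes)
    tail = (1.0 - level) / 2.0
    return posterior.mean(), posterior.ppf(tail), posterior.ppf(1.0 - tail)

# Spam example from the text: "Viagra" appears in 450 of 500 spam emails.
mean_500, lo_500, hi_500 = beta_posterior_summary(450, 500)

# Hypothetical comparison: the same 90% rate observed in only 10 emails.
mean_10, lo_10, hi_10 = beta_posterior_summary(9, 10)

print(f"n=500: mean={mean_500:.3f}, 95% interval=[{lo_500:.3f}, {hi_500:.3f}]")
print(f"n=10:  mean={mean_10:.3f}, 95% interval=[{lo_10:.3f}, {hi_10:.3f}]")
```

The point estimate barely moves between the two cases; what changes is the width of the interval, which is exactly the information a single number hides.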

In production, this distinction is often more valuable than the point estimate itself. A medical diagnostic model that outputs "cancer probability: 0.73" with no uncertainty is not the same as one that outputs "cancer probability: 0.73, 95% credible interval [0.61, 0.84], based on 12 similar cases in training data." The second model tells a doctor whether to trust the output. The first model is a black box that sometimes hallucinates confidence.

This lesson builds the Bayesian framework from the ground up. By the end, you will understand why MLE is a special case of Bayesian reasoning, why L2 regularisation is secretly a Gaussian prior, and why the full Bayesian approach is both more honest and more expensive than its alternatives. Every subsequent lesson in this module builds on these foundations.


Why the Probabilistic Framework Exists

In the standard approach to machine learning, models learn parameters by optimising a single loss function. Find the weights that minimise cross-entropy. Find the slope and intercept that minimise squared error. The result is a single best guess - a point in parameter space.

This approach works well when you have enormous amounts of data, when the model is well-specified, and when you only care about average-case performance on the training distribution. For ImageNet classification with millions of images and a well-designed architecture, maximum likelihood with a good optimizer gets you to state-of-the-art performance.

But the world is not always like ImageNet. Consider:

  • A drug interaction model trained on 200 clinical trials. Each parameter estimate is noisy. The model should reflect this uncertainty rather than acting as if its estimates are exact.
  • A robotics control system that has never encountered a particular terrain type. It should recognise that it is in unfamiliar territory and take cautious actions, rather than confidently applying a policy learned on different terrain.
  • A fraud detection system where an analyst needs to explain to a regulator not just "this transaction looks suspicious" but "here is our confidence level and here is what drove that assessment."

The probabilistic framework solves these problems by treating model parameters not as fixed values to be found, but as random variables over which we maintain a distribution. Learning is not optimisation - it is inference. We observe data and update our beliefs.


Bayes' Theorem: The Foundation

Bayes' theorem is a statement about conditional probabilities. Given two events $A$ and $B$:

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$

In machine learning, we replace the abstract events with model parameters $\theta$ and observed data $\mathcal{D}$:

$$\boxed{P(\theta | \mathcal{D}) = \frac{P(\mathcal{D} | \theta)\, P(\theta)}{P(\mathcal{D})}}$$

Each term has a name and a role:

| Term | Name | Meaning |
| --- | --- | --- |
| $P(\theta \mid \mathcal{D})$ | Posterior | Updated beliefs about $\theta$ after seeing data |
| $P(\mathcal{D} \mid \theta)$ | Likelihood | How probable is the data under parameter setting $\theta$? |
| $P(\theta)$ | Prior | What do we believe about $\theta$ before seeing data? |
| $P(\mathcal{D})$ | Evidence (marginal likelihood) | How probable is the data under all possible $\theta$? |

The denominator $P(\mathcal{D}) = \int P(\mathcal{D}|\theta)P(\theta)\,d\theta$ is a normalising constant that ensures the posterior integrates to 1. It is often computationally intractable - this is the central challenge of Bayesian inference.

Proportionally, what matters is:

$$P(\theta | \mathcal{D}) \propto P(\mathcal{D} | \theta)\, P(\theta)$$

Posterior is proportional to likelihood times prior. Data updates beliefs. More data pushes the posterior closer to the truth. Better priors improve estimates when data is scarce.
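For a one-dimensional parameter, "likelihood times prior, then normalise" can be carried out numerically on a grid. The sketch below assumes a uniform prior and hypothetical data (7 heads in 10 flips); the evidence is approximated by a simple Riemann sum, so this illustrates the formula rather than a method that scales to many parameters.

```python
import numpy as np

# Grid of candidate parameter values (coin bias theta).
theta = np.linspace(0.001, 0.999, 999)
dtheta = theta[1] - theta[0]

# Hypothetical data: 7 heads in 10 flips.
heads, flips = 7, 10

prior = np.ones_like(theta)                                # uniform prior
likelihood = theta**heads * (1 - theta)**(flips - heads)   # Binomial kernel
unnormalised = likelihood * prior

# The evidence P(D) is the normalising constant (a Riemann sum here).
evidence = np.sum(unnormalised) * dtheta
posterior = unnormalised / evidence

posterior_mean = np.sum(theta * posterior) * dtheta
print(f"posterior integrates to {np.sum(posterior) * dtheta:.4f}")
print(f"posterior mean = {posterior_mean:.4f}  (exact Beta(8, 4) mean = {8 / 12:.4f})")
```

In one dimension the grid is cheap. The cost grows exponentially with the number of parameters, which is why the evidence becomes intractable for realistic models.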


The Three Levels of ML: MLE, MAP, Full Bayesian

Every machine learning algorithm implicitly sits at one of three levels of the Bayesian hierarchy. Understanding these levels is the most important conceptual move in this lesson.

Level 1 - Maximum Likelihood Estimation (MLE)

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta P(\mathcal{D}|\theta) = \arg\max_\theta \sum_{i=1}^n \log P(x_i|\theta)$$

MLE ignores the prior entirely. It finds the parameter setting that makes the observed data most probable. This is equivalent to:

  • Minimising cross-entropy loss (for classification with categorical distributions)
  • Minimising mean squared error (for regression with Gaussian noise)
  • Minimising negative log-likelihood (in general)

MLE is consistent and asymptotically efficient under standard regularity conditions. As the amount of data grows, the likelihood overwhelms any reasonable prior you might have chosen, and MLE converges to the true parameters (assuming the model is well-specified). But with finite data - especially scarce data - MLE can dramatically overfit. The parameters that maximise likelihood on a training set of 50 examples are not necessarily the parameters that generalise.
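As a small illustration of "maximising likelihood equals minimising negative log-likelihood", the sketch below evaluates the Bernoulli NLL on a grid and compares the minimiser with the closed-form MLE (the sample mean). The data are hypothetical and the function name `bernoulli_nll` is illustrative.

```python
import numpy as np

def bernoulli_nll(theta, data):
    """Negative log-likelihood of i.i.d. Bernoulli observations."""
    data = np.asarray(data)
    return -np.sum(data * np.log(theta) + (1 - data) * np.log(1 - theta))

# Hypothetical flips: 6 heads, 2 tails.
data = np.array([1, 1, 0, 1, 1, 1, 0, 1])

# Evaluate the NLL on a grid of candidate biases and pick the minimiser.
grid = np.linspace(0.01, 0.99, 99)
nll_values = np.array([bernoulli_nll(t, data) for t in grid])
theta_grid_mle = grid[np.argmin(nll_values)]

# Closed-form MLE for a Bernoulli parameter is the sample mean.
theta_closed_form = data.mean()

print(f"grid minimiser of NLL: {theta_grid_mle:.2f}")
print(f"closed-form MLE:       {theta_closed_form:.2f}")
```

With only eight observations the estimate is whatever the sample happens to say, and nothing in the procedure pulls it toward a more plausible value.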

Level 2 - Maximum A Posteriori (MAP)

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta P(\mathcal{D}|\theta)\,P(\theta) = \arg\max_\theta \left[\log P(\mathcal{D}|\theta) + \log P(\theta)\right]$$

MAP adds the prior to the optimisation objective. You still get a single point estimate, but the prior pulls that estimate toward what you believed before seeing data.

The connection to regularisation is exact:

Gaussian prior $P(\theta) = \mathcal{N}(0, \tau^2 I)$ gives:

$$\log P(\theta) = -\frac{1}{2\tau^2}\|\theta\|^2 + \text{const}$$

So MAP becomes:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \left[\log P(\mathcal{D}|\theta) - \frac{1}{2\tau^2}\|\theta\|^2\right]$$

This is exactly L2 regularisation (ridge regression / weight decay) with $\lambda = 1/(2\tau^2)$ when the penalty is written as $\lambda\|\theta\|^2$ (equivalently, $\lambda = 1/\tau^2$ if the penalty is written as $\frac{\lambda}{2}\|\theta\|^2$).

Laplace prior $P(\theta) \propto \exp(-\|\theta\|_1 / b)$ with scale $b$ gives L1 regularisation (Lasso) with $\lambda = 1/b$. The sparsity-inducing property of L1 corresponds to the sharp peak of the Laplace density at zero: the log-prior has a kink there, so the MAP estimate can set weights exactly to zero.

Every regularised model is doing MAP inference with an implicit prior. When someone asks "why do we use L2 regularisation?" the Bayesian answer is: "because we believe the true weights are drawn from a zero-mean Gaussian."

Level 3 - Full Bayesian Inference

Rather than collapsing the posterior to a single point, full Bayesian inference maintains the entire distribution. Predictions integrate over all plausible parameter settings:

$$P(y^*|x^*, \mathcal{D}) = \int P(y^*|x^*, \theta)\,P(\theta|\mathcal{D})\,d\theta$$

This is Bayesian model averaging. Instead of betting on a single model, you average over a committee of all models weighted by their posterior probability. The result is:

  1. Better calibrated - uncertainty in parameters propagates to uncertainty in predictions
  2. More robust - not committed to a single model
  3. Computationally intractable in general - the integral over continuous high-dimensional $\theta$ is usually not available in closed form

This is why most of the Bayesian ML literature is about approximation: variational inference, MCMC sampling, Laplace approximation, and MC Dropout are all ways to approximate this integral.


The Bayesian Update: Learning as Sequential Belief Revision

One of the most beautiful properties of the Bayesian framework is that it supports sequential updating. Today's posterior becomes tomorrow's prior. As new data arrives, beliefs sharpen:

Prior → [observe data batch 1] → Posterior 1
Posterior 1 → [observe data batch 2] → Posterior 2
Posterior 2 → [observe data batch 3] → Posterior 3
...

This is how human expertise works. A radiologist looking at their first X-ray has broad uncertainty (weak prior). After ten years of practice (vast data), their posterior over diagnoses is sharp and well-calibrated.

The posterior after all data is exactly the same regardless of whether you process all data at once or batch by batch - this is a consequence of Bayes' theorem and the conditional independence of observations given parameters.
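The batch-versus-sequential equivalence is easy to check in a conjugate model. The sketch below uses a Beta(2, 2) prior and simulated flips from a hypothetical coin with bias 0.7. It updates the Beta parameters batch by batch and compares the result with a single update on all the data.

```python
import numpy as np

rng = np.random.default_rng(0)
flips = rng.binomial(1, 0.7, size=90)      # hypothetical coin with bias 0.7
batches = np.split(flips, 3)               # three batches of 30 flips

# Sequential: each posterior becomes the prior for the next batch.
alpha, beta = 2.0, 2.0                     # Beta(2, 2) prior
for batch in batches:
    alpha += batch.sum()
    beta += len(batch) - batch.sum()

# Batch: one update using all the data at once.
alpha_all = 2.0 + flips.sum()
beta_all = 2.0 + len(flips) - flips.sum()

print(f"sequential posterior: Beta({alpha:.0f}, {beta:.0f})")
print(f"batch posterior:      Beta({alpha_all:.0f}, {beta_all:.0f})")
```

Because the posterior depends on the data only through the head and tail counts, the order and grouping of the updates cannot matter.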


Conjugate Priors: When the Math Works Out

In general, the posterior $P(\theta|\mathcal{D})$ does not have a closed form. But for certain pairs of prior and likelihood, it does. These are called conjugate pairs - the posterior is in the same family as the prior.

Beta-Binomial: Click Rates and Conversion

Setting: You observe $k$ clicks out of $n$ ad impressions. What is the true click-through rate $\theta$?

Likelihood: Binomial - $P(k|\theta) = \binom{n}{k}\theta^k(1-\theta)^{n-k}$

Prior: Beta - $P(\theta) = \text{Beta}(\alpha, \beta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$

The parameters $\alpha$ and $\beta$ encode prior beliefs:

  • $\alpha = \beta = 1$: uniform prior - all click rates equally likely
  • $\alpha = 10, \beta = 90$: strong prior belief that CTR is around 10%
  • $\alpha + \beta$: the "effective sample size" of the prior - how many pseudo-observations it represents

Posterior: Also Beta -

$$P(\theta|k,n) = \text{Beta}(\alpha + k,\; \beta + n - k)$$

The posterior mean is:

$$\mathbb{E}[\theta|k,n] = \frac{\alpha + k}{\alpha + \beta + n}$$

With $\alpha = \beta = 1$ this is exactly the Laplace-smoothed estimate $\frac{k+1}{n+2}$. The prior adds pseudo-observations that prevent the estimate from collapsing to 0 or 1 when data is sparse.

Production use: A/B testing platform at scale. For a new ad variant with only 10 impressions and 1 click, MLE gives CTR = 10% - a wildly unreliable estimate. With a uniform Beta(1, 1) prior, the Beta-Binomial posterior mean is $\frac{1 + 1}{10 + 2} \approx 16.7\%$, with a wide credible interval communicating the uncertainty. As impressions accumulate, the interval narrows and the estimate converges to the true rate.
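The effect of the prior on a sparse A/B-test estimate can be sketched as follows. The two priors are the ones listed earlier in this section (uniform Beta(1, 1) and the stronger Beta(10, 90)); the dictionary layout and variable names are illustrative only.

```python
from scipy import stats

clicks, impressions = 1, 10

priors = {
    "uniform Beta(1, 1)": (1.0, 1.0),
    "informative Beta(10, 90)": (10.0, 90.0),
}

results = {}
for name, (a, b) in priors.items():
    # Conjugate update: add clicks to alpha and non-clicks to beta.
    post = stats.beta(a + clicks, b + impressions - clicks)
    lo, hi = post.ppf(0.025), post.ppf(0.975)
    results[name] = (post.mean(), lo, hi)
    print(f"{name:>26}: mean={post.mean():.3f}, 95% interval=[{lo:.3f}, {hi:.3f}]")
```

With only ten impressions the informative prior dominates and yields a much narrower interval. That is helpful if the prior is right and misleading if it is not, which is why the prior should be chosen and reported deliberately.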

Gaussian-Gaussian: The Linear Regression Case

Setting: Regression with Gaussian noise and Gaussian prior on weights.

Likelihood: $P(y|\theta) = \mathcal{N}(X\theta, \sigma^2 I)$

Prior: $P(\theta) = \mathcal{N}(0, \tau^2 I)$

Posterior: Also Gaussian (derived in Lesson 03):

$$P(\theta|y, X) = \mathcal{N}(m_N, S_N)$$

This gives Bayesian linear regression with a closed-form posterior. The posterior mean $m_N$ is the ridge regression solution.
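As a numerical check of the claim that the posterior mean is the ridge solution, the sketch below uses the standard closed-form expressions $S_N = (\tau^{-2} I + \sigma^{-2} X^\top X)^{-1}$ and $m_N = \sigma^{-2} S_N X^\top y$. These are the textbook results for this model and are stated here as an assumption rather than quoted from Lesson 03. The data, `theta_true`, and the variance values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
sigma2, tau2 = 0.25, 4.0                     # noise and prior variances (assumed known)

X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])      # hypothetical ground-truth weights
y = X @ theta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Posterior covariance: S_N = (I / tau^2 + X^T X / sigma^2)^{-1}
S_N = np.linalg.inv(np.eye(d) / tau2 + X.T @ X / sigma2)
# Posterior mean: m_N = S_N X^T y / sigma^2
m_N = S_N @ X.T @ y / sigma2

# Ridge regression minimising ||y - X theta||^2 + lam * ||theta||^2
lam = sigma2 / tau2
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print("posterior mean:", np.round(m_N, 4))
print("ridge solution:", np.round(theta_ridge, 4))
print("posterior std :", np.round(np.sqrt(np.diag(S_N)), 4))
```

The ridge fit recovers only the mean. The covariance $S_N$ is the extra information the Bayesian treatment provides, and it is what turns a point prediction into a predictive interval.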

Dirichlet-Categorical: Topic Modelling

Setting: Documents are mixtures of topics; each topic is a distribution over words.

Prior: Dirichlet $P(\theta) = \text{Dir}(\alpha)$ - a distribution over probability vectors

Likelihood: Categorical/Multinomial

Posterior: Also Dirichlet - $P(\theta|\text{word counts}) = \text{Dir}(\alpha + \text{counts})$

This conjugacy is the foundation of Latent Dirichlet Allocation (LDA), one of the most widely used topic models.
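The Dirichlet update amounts to adding observed counts to the prior concentration parameters. A minimal sketch with a hypothetical four-word vocabulary and a symmetric Dirichlet(1, 1, 1, 1) prior:

```python
import numpy as np

vocab = ["model", "data", "goal", "team"]     # hypothetical vocabulary
alpha = np.ones(len(vocab))                   # symmetric Dirichlet prior
counts = np.array([12, 7, 0, 1])              # hypothetical word counts

alpha_post = alpha + counts                   # posterior is Dir(alpha + counts)
posterior_mean = alpha_post / alpha_post.sum()
mle = counts / counts.sum()

for word, p_mle, p_bayes in zip(vocab, mle, posterior_mean):
    print(f"{word:>6}: MLE={p_mle:.3f}, posterior mean={p_bayes:.3f}")
```

The unseen word "goal" gets probability zero under MLE but a small positive probability under the posterior mean. This is the multi-class analogue of Laplace smoothing in the Beta-Binomial case.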


Epistemic vs Aleatoric Uncertainty

This distinction is fundamental to applied Bayesian ML and appears in almost every ML system design interview.

Epistemic uncertainty (also called "model uncertainty") is uncertainty about model parameters. It arises because we have finite data and cannot perfectly identify the true data-generating process. It is:

  • Reducible: gathering more data shrinks it
  • High in regions where training data is sparse
  • The type of uncertainty that Bayesian methods explicitly model

Aleatoric uncertainty (also called "data uncertainty") is irreducible noise inherent in the data. It arises from:

  • Measurement error (sensor noise, labelling noise)
  • Fundamental stochasticity in the process being modelled
  • Missing features - variables that affect the outcome but are not observed

It cannot be reduced by gathering more data. Even with infinite training data, the prediction on a new point still has aleatoric uncertainty because the process itself is noisy.

Example: Medical imaging. A chest X-ray model predicts pneumonia.

  • Epistemic: "I have only seen 50 X-rays that look like this" - the model is uncertain because this is a rare presentation. Collect more data on this presentation and epistemic uncertainty drops.
  • Aleatoric: "Radiologists disagree on this X-ray 40% of the time" - even expert humans cannot agree. No amount of additional data resolves this fundamental ambiguity.

A well-designed clinical system should communicate both types separately. "I'm confident in my diagnosis, but this is a hard case even for experts" is a different message from "I haven't seen many cases like this."
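For a binary outcome with a Beta posterior over the rate $\theta$, the two kinds of uncertainty can be separated with the law of total variance: the predictive variance equals $\mathbb{E}[\theta(1-\theta)]$ (aleatoric) plus $\mathrm{Var}(\theta)$ (epistemic). The sketch below estimates both from posterior samples. The data sizes, the uniform prior, and the function name `decompose_uncertainty` are assumptions made for illustration.

```python
import numpy as np

def decompose_uncertainty(alpha, beta, n_samples=200_000, seed=0):
    """Split the predictive variance of a Bernoulli outcome into
    aleatoric and epistemic parts, given a Beta(alpha, beta) posterior."""
    rng = np.random.default_rng(seed)
    theta = rng.beta(alpha, beta, size=n_samples)   # posterior samples of the rate
    aleatoric = np.mean(theta * (1.0 - theta))      # expected outcome noise
    epistemic = np.var(theta)                       # disagreement between plausible rates
    return aleatoric, epistemic

# Hypothetical data: the same 70% positive rate from 10 vs 1,000 observations,
# each combined with a uniform Beta(1, 1) prior.
small = decompose_uncertainty(1 + 7, 1 + 3)
large = decompose_uncertainty(1 + 700, 1 + 300)

print(f"n=10:   aleatoric={small[0]:.4f}, epistemic={small[1]:.4f}")
print(f"n=1000: aleatoric={large[0]:.4f}, epistemic={large[1]:.4f}")
```

More data shrinks the epistemic term by roughly two orders of magnitude, while the aleatoric term stays near $0.7 \times 0.3 = 0.21$ because the outcome itself remains noisy.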


From Scratch: Coin Flip Bayesian Updating

Let us build intuition with code. A coin has unknown bias $\theta$. We observe flips and update our belief.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# ─── Bayesian coin flip: Beta-Binomial conjugate model ──────────────────────
np.random.seed(42)

TRUE_BIAS = 0.7 # true (unknown) coin bias
N_FLIPS = 100 # total observations we will collect
PRIOR_ALPHA = 2.0 # prior: weak belief that the coin is roughly fair
PRIOR_BETA = 2.0 # symmetric Beta(2, 2) → mean = 0.5, but weak

# Generate observations
flips = np.random.binomial(1, TRUE_BIAS, N_FLIPS)

# Bayesian update at each step
checkpoints = [1, 5, 10, 25, 50, 100]
theta_grid = np.linspace(0, 1, 500)

print(f"{'Flips':>6} | {'Heads':>5} | {'Post Mean':>10} | {'95% CI':>20} | {'Width':>8}")
print("-" * 60)

alpha, beta = PRIOR_ALPHA, PRIOR_BETA

for n in checkpoints:
    # Posterior after the first n flips: add heads to alpha, tails to beta
    heads = flips[:n].sum()
    tails = n - heads
    alpha_n = alpha + heads
    beta_n = beta + tails

    posterior = stats.beta(alpha_n, beta_n)
    mean = posterior.mean()
    lo, hi = posterior.ppf(0.025), posterior.ppf(0.975)
    print(f"{n:>6} | {heads:>5} | {mean:>10.4f} | [{lo:.4f}, {hi:.4f}] | {hi-lo:>8.4f}")

# ─── Compare MLE vs MAP vs Bayesian posterior mean ──────────────────────────
print("\n--- Estimator Comparison After 10 Flips ---")
n_small = 10
heads_small = flips[:n_small].sum()

mle = heads_small / n_small
map_est = (PRIOR_ALPHA + heads_small - 1) / (PRIOR_ALPHA + PRIOR_BETA + n_small - 2)
bayes_mean = (PRIOR_ALPHA + heads_small) / (PRIOR_ALPHA + PRIOR_BETA + n_small)

print(f"MLE: {mle:.4f} (no regularisation)")
print(f"MAP: {map_est:.4f} (mode of posterior)")
print(f"Bayesian mean: {bayes_mean:.4f} (mean of posterior)")
print(f"True bias: {TRUE_BIAS:.4f}")

# ─── MLE vs Bayesian posterior mean across dataset sizes ────────────────────
print("\n--- MLE vs Bayesian Mean vs True Bias Across Dataset Sizes ---")
sizes = [5, 10, 20, 50, 100, 500, 1000]
np.random.seed(0)
for sz in sizes:
    data = np.random.binomial(1, TRUE_BIAS, sz)
    h = data.sum()
    mle_est = h / sz
    bayes_est = (PRIOR_ALPHA + h) / (PRIOR_ALPHA + PRIOR_BETA + sz)
    print(f"n={sz:>5}: MLE={mle_est:.3f}, Bayes={bayes_est:.3f}, True={TRUE_BIAS:.3f}")

Output (representative; these figures are illustrative rather than regenerated from the script, so a real run will differ):

 Flips | Heads |  Post Mean |               95% CI |    Width
------------------------------------------------------------
     1 |     0 |     0.3750 | [0.0253, 0.7955] |   0.7702
     5 |     2 |     0.4444 | [0.1566, 0.7584] |   0.6018
    10 |     6 |     0.5357 | [0.2706, 0.7932] |   0.5226
    25 |    18 |     0.6552 | [0.4563, 0.8299] |   0.3736
    50 |    36 |     0.6792 | [0.5380, 0.8044] |   0.2664
   100 |    72 |     0.6923 | [0.5997, 0.7782] |   0.1785

Key observations:

  1. After a single tails flip, the posterior is Beta(2, 3) with mean 2/5 = 0.4 - the prior pulls the estimate toward 0.5, whereas MLE would give 0.0
  2. The credible interval narrows as data accumulates
  3. At n=100, the Bayesian mean is nearly identical to MLE - large data washes out the prior

MAP as Regularised MLE: The Exact Derivation

This is the connection every ML engineer should understand deeply. Let us derive it.

Setup: Logistic regression for binary classification. Parameters $\theta \in \mathbb{R}^d$. Data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$.

MLE objective:

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \sum_{i=1}^n \log P(y_i|x_i, \theta)$$

For logistic regression with $\sigma(z) = 1/(1+e^{-z})$:

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \sum_{i=1}^n \left[y_i \log \sigma(\theta^\top x_i) + (1-y_i)\log(1 - \sigma(\theta^\top x_i))\right]$$

MAP with Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2 I)$:

$$\begin{aligned}
\hat{\theta}_{\text{MAP}} &= \arg\max_\theta \left[\sum_{i=1}^n \log P(y_i|x_i, \theta) + \log P(\theta)\right] \\
&= \arg\max_\theta \left[\sum_{i=1}^n \log P(y_i|x_i, \theta) - \frac{\|\theta\|^2}{2\tau^2}\right] \\
&= \arg\min_\theta \left[-\sum_{i=1}^n \log P(y_i|x_i, \theta) + \frac{1}{2\tau^2}\|\theta\|^2\right]
\end{aligned}$$

This is exactly L2-regularised logistic regression with $\lambda = 1/(2\tau^2)$, where the penalty is written as $\lambda\|\theta\|^2$. The regularisation strength $\lambda$ encodes the strength of the Gaussian prior. A tighter prior (smaller $\tau^2$) means stronger regularisation.

import numpy as np
from scipy.special import expit # sigmoid
from scipy.optimize import minimize

np.random.seed(42)

# ─── Synthetic binary classification ────────────────────────────────────────
N, D = 200, 10
X = np.random.randn(N, D)
theta_true = np.random.randn(D) * 0.5
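# Sample labels from the logistic model. Thresholding the probabilities at 0.5
# would make the data linearly separable, and the unregularised MLE would diverge.
y = np.random.binomial(1, expit(X @ theta_true)).astype(float)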

def neg_log_likelihood(theta, X, y):
    """Standard logistic regression NLL (MLE objective)."""
    logits = X @ theta
    # Numerically stable computation: log(1 + exp(-s * logit)) with s = +1/-1
    nll = np.logaddexp(0, -logits * (2*y - 1))
    return nll.sum()

def neg_log_posterior(theta, X, y, tau_sq=1.0):
    """MAP objective: NLL + Gaussian prior regularisation."""
    nll = neg_log_likelihood(theta, X, y)
    # Gaussian prior: -log P(theta) = ||theta||^2 / (2*tau^2) + const
    prior_term = np.sum(theta**2) / (2 * tau_sq)
    return nll + prior_term

# Gradient of NLL
def grad_nll(theta, X, y):
    probs = expit(X @ theta)
    return X.T @ (probs - y)

# Gradient of MAP objective
def grad_map(theta, X, y, tau_sq=1.0):
    return grad_nll(theta, X, y) + theta / tau_sq

# ─── Fit MLE ────────────────────────────────────────────────────────────────
theta0 = np.zeros(D)
result_mle = minimize(
    neg_log_likelihood, theta0, args=(X, y),
    jac=grad_nll, method='L-BFGS-B'
)
theta_mle = result_mle.x

# ─── Fit MAP with various prior strengths ───────────────────────────────────
tau_sq_values = [0.1, 1.0, 10.0, 100.0]
print(f"{'tau^2':>8} | {'lambda':>8} | {'||theta||':>10} | {'Interpretation':>25}")
print("-" * 60)

for tau_sq in tau_sq_values:
    # Extra arguments (X, y, tau_sq) are forwarded to both the objective and its gradient
    result_map = minimize(
        neg_log_posterior, theta0, args=(X, y, tau_sq),
        jac=grad_map, method='L-BFGS-B'
    )
    theta_map = result_map.x
    lam = 1 / (2 * tau_sq)
    norm = np.linalg.norm(theta_map)
    interp = "Strong regularisation" if tau_sq < 1 else ("Weak regularisation" if tau_sq > 10 else "Moderate")
    print(f"{tau_sq:>8.1f} | {lam:>8.3f} | {norm:>10.4f} | {interp:>25}")

print(f"\n{'MLE':>8} | {'0.000':>8} | {np.linalg.norm(theta_mle):>10.4f} | {'No regularisation':>25}")

Probabilistic Graphical Models: The Big Picture

Bayesian reasoning is not just about posteriors over weights. It is a framework for reasoning about complex joint distributions over many variables. Probabilistic graphical models (PGMs) represent these joint distributions using graphs where nodes are random variables and edges represent conditional dependencies.

The key factorisation:

$$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^n P(x_i | \text{Parents}(x_i))$$

This is the chain rule of probability, structured by the graph's topology.

Consider a diagnostic Bayesian network in which "True Disease" is a latent variable - unobserved - that influences the observed symptoms and test results. We observe the symptoms and test results and infer the posterior over the disease status. The treatment decision then acts on this posterior, not on a point estimate.

This is exactly the Bayesian view of a neural network: weights $\theta$ are latent variables, training data $\mathcal{D}$ is observed, and the posterior $P(\theta|\mathcal{D})$ is what we want to reason about when making predictions.
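The diagnostic network described above can be sketched by brute-force enumeration. The structure assumed here is Disease → Symptom and Disease → Test, so the joint factorises as $P(D)\,P(S|D)\,P(T|D)$. All probability values are hypothetical, chosen only to illustrate the computation.

```python
# Hypothetical conditional probability tables.
p_disease = 0.01
p_symptom_given = {True: 0.80, False: 0.10}   # P(symptom present | disease status)
p_test_given = {True: 0.95, False: 0.05}      # P(test positive | disease status)

def joint(disease, symptom, test):
    """P(disease, symptom, test) via the graph factorisation."""
    p_d = p_disease if disease else 1 - p_disease
    p_s = p_symptom_given[disease] if symptom else 1 - p_symptom_given[disease]
    p_t = p_test_given[disease] if test else 1 - p_test_given[disease]
    return p_d * p_s * p_t

def posterior_disease(symptom, test):
    """P(disease | symptom, test) by summing the joint over the latent variable."""
    numerator = joint(True, symptom, test)
    evidence = numerator + joint(False, symptom, test)
    return numerator / evidence

print(f"P(disease | symptom, positive test)    = {posterior_disease(True, True):.3f}")
print(f"P(disease | no symptom, positive test) = {posterior_disease(False, True):.3f}")
```

Enumeration is exact but scales exponentially with the number of latent variables, which is the same obstacle that makes the posterior over neural-network weights intractable.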


Posterior Predictive: Making Predictions Under Uncertainty

Once we have a posterior $P(\theta|\mathcal{D})$, how do we make predictions?

The posterior predictive distribution marginalises over all possible parameter settings:

$$\boxed{P(y^*|x^*, \mathcal{D}) = \int P(y^*|x^*, \theta)\,P(\theta|\mathcal{D})\,d\theta}$$

This integral is the heart of Bayesian machine learning. It says: to predict $y^*$ at new input $x^*$, average over every possible model, weighted by how probable that model is given the training data.

For a simple coin flip example with a single parameter $\theta$:

$$P(\text{heads}|\mathcal{D}) = \int_0^1 \theta \cdot \text{Beta}(\theta|\alpha_n, \beta_n)\,d\theta = \frac{\alpha_n}{\alpha_n + \beta_n}$$

This is simply the posterior mean. For Gaussian processes (Lesson 02) and Bayesian linear regression (Lesson 03), this integral has a closed form. For neural networks, it requires approximation.
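When no closed form exists, the predictive integral is typically approximated by averaging predictions over posterior samples. The sketch below does this for the coin example, where the exact answer is known and the approximation can be checked. The posterior Beta(8, 4) is a hypothetical choice.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_n, beta_n = 8.0, 4.0          # hypothetical posterior Beta(8, 4)

# Draw parameter samples from the posterior.
theta_samples = rng.beta(alpha_n, beta_n, size=100_000)

# P(heads | theta) = theta, so the Monte Carlo predictive is the sample average.
p_heads_mc = theta_samples.mean()
p_heads_exact = alpha_n / (alpha_n + beta_n)

print(f"Monte Carlo estimate: {p_heads_mc:.4f}")
print(f"Exact value:          {p_heads_exact:.4f}")
```

The same recipe (sample parameters, predict with each, average) carries over to models where the posterior samples come from MCMC or a variational approximation instead of a known Beta distribution.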


Approximating the Posterior: An Overview

When the integral is intractable, three main approximation strategies exist:

1. Laplace Approximation: Fit a Gaussian centred at the MAP estimate, with covariance given by the inverse Hessian of the log-posterior. Fast but assumes the posterior is unimodal and Gaussian.

2. Markov Chain Monte Carlo (MCMC): Generate samples from the posterior by constructing a Markov chain whose stationary distribution is the posterior. Asymptotically exact, but slow and difficult to scale to large models.

3. Variational Inference: Approximate the posterior with a simpler distribution $q(\theta)$ from a tractable family, minimising the KL divergence from $q$ to the true posterior. Fast and scalable, but biased.

The subsequent lessons in this module cover the specific cases where these approximations are used: Bayesian neural networks use variational inference, Gaussian processes have exact posteriors, and Bayesian optimisation uses GP posteriors as surrogate models.
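As a small illustration of the first strategy, the sketch below applies the Laplace approximation to a Beta posterior, where the mode and the curvature of the log-density are available in closed form and the result can be compared with the exact distribution. The posterior Beta(20, 9) is a hypothetical choice, and the formulas assume both parameters exceed 1 so that an interior mode exists.

```python
import numpy as np
from scipy import stats

a, b = 20.0, 9.0                    # hypothetical Beta posterior (a, b > 1)

# Mode of Beta(a, b): maximiser of (a-1) log(theta) + (b-1) log(1-theta).
mode = (a - 1) / (a + b - 2)

# Negative second derivative of the log-density at the mode.
neg_hessian = (a - 1) / mode**2 + (b - 1) / (1 - mode)**2

# Laplace approximation: Gaussian centred at the mode with variance 1 / neg_hessian.
laplace = stats.norm(loc=mode, scale=np.sqrt(1.0 / neg_hessian))
exact = stats.beta(a, b)

print(f"Laplace: mean={laplace.mean():.4f}, std={laplace.std():.4f}")
print(f"Exact:   mean={exact.mean():.4f}, std={exact.std():.4f}")
print(f"Laplace 95% interval: [{laplace.ppf(0.025):.3f}, {laplace.ppf(0.975):.3f}]")
print(f"Exact   95% interval: [{exact.ppf(0.025):.3f}, {exact.ppf(0.975):.3f}]")
```

The approximation is centred at the mode rather than the mean and is symmetric by construction, so it tends to be closest when the posterior is well concentrated and roughly bell-shaped.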


Calibration: Are Your Probabilities Trustworthy?

A model is calibrated if its predicted probabilities match empirical frequencies. A calibrated model that says "70% chance of rain" should be right about 70% of the time it says that.

Bayesian models tend to be better calibrated than point-estimate models because they account for parameter uncertainty in their predictions. But they are not automatically well calibrated, especially when the prior or likelihood is misspecified or the posterior is approximated.

Expected Calibration Error (ECE):

$$\text{ECE} = \sum_{b=1}^{B} \frac{|B_b|}{n} \left|\text{acc}(B_b) - \text{conf}(B_b)\right|$$

where predictions are grouped into bins $B_b$ by confidence, and we measure the gap between average confidence and average accuracy in each bin.

import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """
    Compute ECE for binary classification.
    probs: predicted probabilities for positive class
    labels: true binary labels
    """
    bin_edges = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    n = len(labels)

    for i in range(n_bins):
        lo, hi = bin_edges[i], bin_edges[i + 1]
        # Half-open bins [lo, hi); the last bin is closed so probs == 1.0 are counted
        if i == n_bins - 1:
            mask = (probs >= lo) & (probs <= hi)
        else:
            mask = (probs >= lo) & (probs < hi)
        if mask.sum() == 0:
            continue
        bin_probs = probs[mask]
        bin_labels = labels[mask]
        bin_conf = bin_probs.mean()   # average predicted probability in the bin
        bin_acc = bin_labels.mean()   # observed positive rate in the bin
        bin_size = mask.sum()
        ece += (bin_size / n) * abs(bin_acc - bin_conf)

    return ece

# Simulate a well-calibrated model vs an overconfident model
np.random.seed(42)
n = 1000

# Well-calibrated by construction: each label is drawn with the predicted probability
calibrated_probs = np.random.uniform(0, 1, n)
true_labels = np.random.binomial(1, calibrated_probs)

# Overconfident: sharpen the same predictions toward 0 and 1
# (equivalent to multiplying the logits by 3)
overconfident_probs = calibrated_probs**3 / (calibrated_probs**3 + (1 - calibrated_probs)**3)

print(f"Calibrated model ECE: {expected_calibration_error(calibrated_probs, true_labels):.4f}")
print(f"Overconfident model ECE: {expected_calibration_error(overconfident_probs, true_labels):.4f}")

Production Engineering Notes

:::tip When to use Bayesian methods Use full Bayesian inference when: dataset has fewer than ~10,000 examples, uncertainty quantification is critical for downstream decisions, or you are doing sequential experimentation (active learning, Bayesian optimisation). For large-scale production with millions of examples, temperature scaling or conformal prediction (Lesson 07) gives calibration with much lower cost. :::

:::warning Prior selection is not neutral Every prior is a modelling assumption. A "non-informative" prior is not truly non-informative - it is informative in some parameterisation. Sensitivity analysis (how much do conclusions change if the prior changes?) is essential for credible Bayesian analyses in high-stakes applications. :::

:::danger MAP estimate can be worse than MLE in multimodal settings If the true posterior is multimodal, the MAP estimate might be at a local mode that is far from the data-generating parameters. Full posterior inference (or at least multiple restarts) is required. This is particularly relevant in deep learning where the loss landscape has many local optima. :::

Computational costs at a glance:

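| Method | Cost | Memory | Parallelisable |
| --- | --- | --- | --- |
| MLE / MAP | Training cost only | O(d): one copy of the parameters | Yes |
| Laplace approximation | Training + Hessian computation, O(d^2) | O(d^2) for the full Hessian | Yes |
| MCMC (HMC) | 100-10,000x training | O(d) per stored sample | Chains in parallel |
| Variational inference | 2-5x training | O(number of variational parameters) | Yes |
| MC Dropout | K forward passes at inference | O(d): no extra parameters | Yes, over K |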

YouTube Resources

  • Bayesian Statistics Made Simple - Allen Downey (PyCon talk): concrete intuition with coin flips, accessible to any Python engineer
  • Probabilistic Machine Learning - Kevin Murphy (MIT lecture series): graduate-level depth, covers all of MLE/MAP/Bayesian systematically
  • A Student's Guide to Bayesian Statistics - Ben Lambert: the clearest introductory treatment of conjugate priors and MCMC
  • Bayesian Deep Learning - Yarin Gal (NIPS 2016 workshop): foundational talk connecting Bayesian inference to modern deep learning

Interview Q&A

Q1: What is the difference between MLE, MAP, and full Bayesian inference?

MLE finds the parameter setting that maximises the likelihood of the observed data - it ignores any prior beliefs and gives a point estimate. With small datasets, MLE tends to overfit. MAP adds a prior and finds the mode of the posterior - it is equivalent to regularised MLE (L2 regularisation corresponds to a Gaussian prior, L1 to a Laplace prior). Full Bayesian inference maintains the entire posterior distribution over parameters rather than collapsing to a point. Predictions integrate over all plausible parameter settings, giving calibrated uncertainty estimates. The trade-off is computational: MLE and MAP require optimisation (fast), while full Bayesian inference usually requires approximate inference (MCMC or variational) which is much slower.

Q2: How is L2 regularisation connected to Bayesian inference?

L2 regularisation (ridge regression, weight decay) is exactly MAP inference with a zero-mean isotropic Gaussian prior on the weights. With the penalty written as $\lambda\|\theta\|^2$, the regularisation coefficient is $\lambda = 1/(2\tau^2)$, where $\tau^2$ is the prior variance. A stronger prior (smaller $\tau^2$, larger $\lambda$) pulls the weights harder toward zero. This gives a principled interpretation: you are encoding a prior belief that the true weights are small. Similarly, L1 regularisation corresponds to a Laplace prior, which has heavier tails than the Gaussian and a sharp peak at zero; that peak is what makes MAP estimates sparse.

Q3: What is the difference between epistemic and aleatoric uncertainty, and why does it matter in production?

Epistemic uncertainty is uncertainty about model parameters - it arises from having finite data and is reducible with more observations. Aleatoric uncertainty is irreducible noise in the data itself - sensor noise, measurement error, fundamental stochasticity. In production, the distinction matters for decision-making. High epistemic uncertainty suggests the model needs more data or that the input is out-of-distribution. High aleatoric uncertainty suggests the problem is inherently hard regardless of how much data you collect. A medical AI system should communicate these differently: "I haven't seen many cases like this" (epistemic) vs "even experts disagree on this presentation" (aleatoric). Conflating them leads to poor decisions - trying to collect more data to resolve aleatoric uncertainty is futile.

Q4: What are conjugate priors, and why are they useful?

A conjugate prior is a prior distribution whose posterior, after observing data from the corresponding likelihood, is in the same family as the prior. For example, a Beta prior combined with a Binomial likelihood gives a Beta posterior. Conjugate priors are useful because they give exact closed-form posteriors - no MCMC or variational approximation needed. This makes inference fast, exact, and interpretable. Common conjugate pairs: Beta-Binomial (click rates, A/B tests), Gaussian-Gaussian (linear regression, Kalman filter), Dirichlet-Categorical (topic models, LDA). The limitation is that conjugate priors constrain the expressiveness of your prior beliefs - you must choose a prior from the conjugate family even if your domain knowledge suggests a different shape.

Q5: When would you choose Bayesian methods over standard deep learning?

Bayesian methods shine in four scenarios: (1) Small datasets - when you have fewer than roughly 1,000-10,000 examples, the prior acts as regularisation and posterior uncertainty is meaningful. (2) Active learning and sequential experimentation - drug discovery, robotics, hyperparameter search - where uncertainty drives exploration and every experiment is expensive. (3) Safety-critical applications - medical diagnosis, autonomous vehicles - where calibrated uncertainty is required for safe deployment. (4) Out-of-distribution detection - Bayesian models naturally assign high uncertainty to inputs unlike anything in the training distribution. Standard deep learning wins when you have large datasets, need low inference latency, and can post-hoc calibrate with temperature scaling. The computational cost of full Bayesian inference rarely justifies itself at scale when cheaper calibration methods exist.


Common Mistakes

:::danger Confusing probability with frequency A Bayesian probability is a degree of belief, not a long-run frequency. "The probability this patient has cancer is 0.73" means "given everything we know, we believe there is a 73% chance." It does not mean "if we tested this exact patient 1,000 times, 73% of the results would come back positive." This distinction matters for how you communicate uncertainty to stakeholders. :::

:::warning Using an uninformative prior when you have domain knowledge Defaulting to a uniform or very broad prior wastes information. In A/B testing, you know CTRs are rarely above 10% - a Beta(2,18) prior encoding this knowledge dramatically improves estimates from sparse data. "I don't know" is rarely truly justified; almost always you know something. :::

:::warning Forgetting to check sensitivity to prior choice If your conclusions change dramatically when you change the prior from Beta(1,1) to Beta(2,2), your posterior is prior-dominated - you need more data. Always report credible intervals, not just posterior means, and consider running analyses with multiple plausible priors. :::

:::danger MAP is not the posterior The MAP estimate is a single point - the mode of the posterior distribution. It does not represent the full posterior. If the posterior is skewed, the MAP can be far from the posterior mean. If the posterior is multimodal, the MAP is at one mode and may not represent the typical behaviour of the model at all. Never refer to MAP inference as "Bayesian" - it discards most of the information in the posterior. :::


Practice Problems

  1. You observe 3 heads and 7 tails from a possibly biased coin. Starting from a Beta(2,2) prior, compute the posterior distribution. What is the posterior mean? How does it compare to the MLE estimate? What would the posterior look like after 30 heads and 70 tails?

  2. Derive that L1 regularisation corresponds to a Laplace prior. Write out the Laplace distribution, take the log, and show that MAP with a Laplace prior gives the Lasso objective.

  3. A spam filter observes 10,000 emails, of which 3,000 are spam. The word "meeting" appears in 100 spam emails and 4,000 legitimate emails. Using Bayes' theorem, compute the probability that an email containing the word "meeting" is spam. What prior did you need to assume?

  4. Explain why the posterior predictive distribution $\int P(y^*|x^*, \theta)P(\theta|\mathcal{D})\,d\theta$ gives better calibrated uncertainty than just using $P(y^*|x^*, \hat{\theta}_{\text{MLE}})$. What information is the MLE prediction discarding?

  5. A neural network with 10 million parameters is trained on 50,000 images. Describe qualitatively what the posterior $P(\theta|\mathcal{D})$ over the 10 million weights looks like. Why is MCMC impractical? What approximation would you use instead?

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Prior to Posterior demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.