Skip to main content

Probabilistic View of Machine Learning

Reading time: ~40 minutes | Level: ML Foundations | Role: MLE, Research Engineer, Data Scientist, AI Engineer

A senior interviewer asks you: "What is cross-entropy loss and why do we use it?" Most candidates answer: "It penalises confident wrong predictions more than unconfident ones." That's a start. But the interviewer probes deeper: "What probability model are you implicitly assuming when you minimise cross-entropy? And what happens to your model's behaviour if that assumption is violated?"

This is the probabilistic view of ML. Every loss function encodes a probabilistic assumption about how data was generated. Every optimizer is secretly maximising a likelihood. Every regulariser encodes a prior over parameters. When you know this framework, you understand why techniques work - and when to expect them to fail.

What You Will Learn

  • The generative model: how probabilistic thinking frames ML as density estimation
  • Maximum Likelihood Estimation (MLE) - derivation for regression and classification
  • Why minimising MSE ≡ maximising Gaussian likelihood
  • Why minimising cross-entropy ≡ maximising Bernoulli/Categorical likelihood
  • Maximum A Posteriori (MAP) estimation - MLE + prior
  • Why L2 regularisation ≡ Gaussian prior, L1 ≡ Laplace prior
  • Probability calibration - the gap between predicted probabilities and empirical frequencies
  • Bayesian reasoning in deep learning - MC Dropout, deep ensembles
  • Uncertainty types: aleatoric vs epistemic
  • Eight interview Q&As at senior research/MLE level

Part 1 - The Probabilistic Framework

From Patterns to Distributions

Traditional ML framing: "Find a function f:XYf: X \to Y that minimises training error."

Probabilistic framing: "Find parameters θ\theta such that the model distribution Pθ(YX)P_\theta(Y | X) best explains the observed data."

Generative view of supervised learning:

1. Nature draws a sample: (x, y) from true distribution P*(x, y)
2. You observe n samples: D = {(x₁,y₁), ..., (xₙ,yₙ)}
3. You posit a parametric model: P_θ(y|x)
4. Goal: find θ that makes D most probable under your model

This is Maximum Likelihood Estimation (MLE).

Why this framework is powerful:

  • It makes assumptions explicit (what distribution are you assuming?)
  • It gives you confidence estimates, not just point predictions
  • It connects every loss function to a probability model
  • It tells you what regularisation means (prior beliefs)
  • It quantifies uncertainty in predictions

Part 2 - Maximum Likelihood Estimation (MLE)

The Core Idea

Given observed data D={(xi,yi)}i=1n\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n, and a parametric model Pθ(yx)P_\theta(y | x), MLE finds:

θ^MLE=argmaxθi=1nPθ(yixi)\hat{\theta}_{\text{MLE}} = \arg\max_\theta \prod_{i=1}^n P_\theta(y_i | x_i)

We take the log (log-likelihood is easier to optimise - sums instead of products):

θ^MLE=argmaxθi=1nlogPθ(yixi)\hat{\theta}_{\text{MLE}} = \arg\max_\theta \sum_{i=1}^n \log P_\theta(y_i | x_i)

Minimising the negative log-likelihood (NLL) is the standard training objective.

Case 1: Regression - Gaussian Likelihood → MSE

Assumption: y=fθ(x)+εy = f_\theta(x) + \varepsilon where εN(0,σ2)\varepsilon \sim \mathcal{N}(0, \sigma^2).

Then Pθ(yx)=N(y;fθ(x),σ2)P_\theta(y | x) = \mathcal{N}(y; f_\theta(x), \sigma^2).

logPθ(yixi)=12log(2πσ2)(yifθ(xi))22σ2\log P_\theta(y_i | x_i) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(y_i - f_\theta(x_i))^2}{2\sigma^2}

Summing over nn samples and removing constants with respect to θ\theta:

θ^MLE=argmaxθi=1nlogPθ(yixi)=argminθi=1n(yifθ(xi))2\hat{\theta}_{\text{MLE}} = \arg\max_\theta \sum_{i=1}^n \log P_\theta(y_i | x_i) = \arg\min_\theta \sum_{i=1}^n (y_i - f_\theta(x_i))^2

MLE with Gaussian likelihoodMinimising MSE\boxed{\text{MLE with Gaussian likelihood} \equiv \text{Minimising MSE}}

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

# Demonstrate: MLE with Gaussian assumption = MSE minimisation
np.random.seed(42)
x = np.linspace(0, 10, 50)
y_true = 2 * x + 1
y_obs = y_true + np.random.normal(0, 2, 50) # σ=2

# Method 1: Minimise MSE (standard linear regression)
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x.reshape(-1, 1), y_obs)
mse_predictions = lr.predict(x.reshape(-1, 1))

# Method 2: Maximise Gaussian log-likelihood (same result)
sigma = np.std(y_obs - mse_predictions)
log_likelihoods = norm.logpdf(y_obs, loc=mse_predictions, scale=sigma)
total_ll = log_likelihoods.sum()

print(f"MSE regression: slope={lr.coef_[0]:.4f}, intercept={lr.intercept_:.4f}")
print(f"Total log-likelihood at MLE solution: {total_ll:.4f}")
print(f"NLL = -log_likelihood = {-total_ll:.4f}")
print(f"Scaled NLL = {-total_ll/50:.4f} ≈ MSE / (2σ²) + const")

# Visualise the Gaussian assumption
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
ax1.scatter(x, y_obs, alpha=0.5, label='Observed')
ax1.plot(x, mse_predictions, 'r-', label='MLE/MSE fit')
ax1.set_title('Gaussian Regression: y ~ N(f(x), σ²)')
ax1.legend()

ax2.hist(y_obs - mse_predictions, bins=20, density=True, alpha=0.7, label='Residuals')
x_range = np.linspace(-8, 8, 200)
ax2.plot(x_range, norm.pdf(x_range, 0, sigma), 'r-', label=f'N(0, {sigma:.2f}²)')
ax2.set_title('Residual Distribution (should be Gaussian for MLE=MSE)')
ax2.legend()
plt.tight_layout()

Case 2: Binary Classification - Bernoulli Likelihood → Cross-Entropy

Assumption: yixiBernoulli(pθ(xi))y_i | x_i \sim \text{Bernoulli}(p_\theta(x_i)) where pθ(xi)=σ(fθ(xi))p_\theta(x_i) = \sigma(f_\theta(x_i)) (sigmoid output).

Pθ(yixi)=pθ(xi)yi(1pθ(xi))1yiP_\theta(y_i | x_i) = p_\theta(x_i)^{y_i} (1 - p_\theta(x_i))^{1 - y_i}

Taking the negative log-likelihood:

logPθ(yixi)=yilogpθ(xi)(1yi)log(1pθ(xi))-\log P_\theta(y_i | x_i) = -y_i \log p_\theta(x_i) - (1 - y_i) \log(1 - p_\theta(x_i))

Summing over all samples:

L=1ni=1n[yilogp^i+(1yi)log(1p^i)]\mathcal{L} = -\frac{1}{n}\sum_{i=1}^n \left[y_i \log \hat{p}_i + (1-y_i) \log(1 - \hat{p}_i)\right]

MLE with Bernoulli likelihoodBinary Cross-Entropy Loss\boxed{\text{MLE with Bernoulli likelihood} \equiv \text{Binary Cross-Entropy Loss}}

import numpy as np
from scipy.special import expit # sigmoid

# Demonstrate: cross-entropy IS negative log-likelihood for Bernoulli model
np.random.seed(42)
n = 1000

# True data generating process
X_clf = np.random.randn(n, 2)
true_logits = 2 * X_clf[:, 0] - 1.5 * X_clf[:, 1]
true_probs = expit(true_logits)
y_clf = np.random.binomial(1, true_probs)

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

model = LogisticRegression(C=1e6, max_iter=500) # large C ≈ no regularisation = pure MLE
model.fit(X_clf, y_clf)
p_hat = model.predict_proba(X_clf)[:, 1]

# Cross-entropy (sklearn)
ce_sklearn = log_loss(y_clf, p_hat)

# Manual NLL
nll_manual = -np.mean(
y_clf * np.log(p_hat + 1e-15) + (1 - y_clf) * np.log(1 - p_hat + 1e-15)
)

print(f"sklearn log_loss: {ce_sklearn:.6f}")
print(f"Manual NLL: {nll_manual:.6f}")
print(f"Difference: {abs(ce_sklearn - nll_manual):.2e} (numerical precision)")

Case 3: Multi-Class - Categorical Likelihood → Softmax Cross-Entropy

Pθ(y=kx)=efθ(x)kj=1Kefθ(x)jP_\theta(y = k | x) = \frac{e^{f_\theta(x)_k}}{\sum_{j=1}^K e^{f_\theta(x)_j}}

logPθ(yi=kxi)=logp^i,k=cross-entropy loss-\log P_\theta(y_i = k | x_i) = -\log \hat{p}_{i,k} = \text{cross-entropy loss}

import torch
import torch.nn.functional as F

# Softmax cross-entropy = NLL of Categorical distribution
logits = torch.tensor([[2.0, 1.0, -0.5],
[0.5, 3.0, 0.1],
[-1.0, 0.5, 2.5]]) # (batch=3, classes=3)
labels = torch.tensor([0, 1, 2]) # true classes

# Method 1: PyTorch cross-entropy
ce_loss = F.cross_entropy(logits, labels)

# Method 2: Manual NLL
probs = F.softmax(logits, dim=-1)
log_probs = torch.log(probs)
nll_manual = -log_probs[torch.arange(3), labels].mean()

print(f"F.cross_entropy: {ce_loss.item():.6f}")
print(f"Manual NLL: {nll_manual.item():.6f}")

# Demonstrate the assumed distribution
print("\nPredicted probabilities (Categorical distribution parameters):")
for i in range(3):
p = probs[i].detach().numpy()
print(f" Sample {i}: P(C=0)={p[0]:.3f}, P(C=1)={p[1]:.3f}, P(C=2)={p[2]:.3f}")
print(f" True class: {labels[i].item()}, NLL: {-torch.log(probs[i][labels[i]]).item():.4f}")

Part 3 - Loss Functions and Their Implied Distributions

Loss Function │ Distribution Assumption │ Optimal Prediction
───────────────────────┼─────────────────────────────────┼───────────────────
MSE (L2 loss) │ Gaussian noise (ε ~ N(0,σ²)) │ Conditional mean E[y|x]
MAE (L1 loss) │ Laplace noise (ε ~ Laplace) │ Conditional median
Huber loss │ Mix Gaussian + Laplace │ Robust conditional mean
Binary cross-entropy │ Bernoulli (y ∈ {0,1}) │ P(y=1|x)
Softmax cross-entropy │ Categorical (y ∈ {1..K}) │ P(y=k|x)
KL divergence │ Any distribution P vs Q │ P = Q (perfect match)
Poisson NLL │ Poisson (count data y ∈ ℕ) │ Rate parameter λ(x)
Beta NLL │ Beta (y ∈ [0,1]) │ Mean of Beta(α,β)
Ordinal cross-entropy │ Ordered categorical │ Cumulative probabilities

When the Assumed Distribution Doesn't Match

import numpy as np
from scipy.stats import poisson
import matplotlib.pyplot as plt

"""
Example: count data (bookings per hour) is Poisson-distributed.
If you use MSE loss (Gaussian assumption), predictions may be negative
and the loss weights large counts disproportionately.
Poisson NLL handles this correctly.
"""
np.random.seed(42)
# True rate is time-dependent
times = np.linspace(1, 24, 1000) # hours of day
true_rate = 5 + 20 * np.exp(-0.5 * ((times - 12) / 3)**2) # lunch peak
y_counts = np.array([poisson.rvs(lam) for lam in true_rate])

from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
import warnings
warnings.filterwarnings('ignore')

X_times = times.reshape(-1, 1)

# Wrong: Gaussian assumption (standard linear regression)
from sklearn.linear_model import LinearRegression
lr = LinearRegression().fit(X_times, y_counts)
y_pred_lr = lr.predict(X_times)

# Right: Poisson assumption
pr = PoissonRegressor(alpha=0.01).fit(X_times, y_counts)
y_pred_pr = pr.predict(X_times)

print("Count data prediction comparison:")
print(f" Linear Reg (Gaussian) MAE: {mean_absolute_error(y_counts, y_pred_lr):.3f}")
print(f" Poisson Reg MAE: {mean_absolute_error(y_counts, y_pred_pr):.3f}")
print(f" Linear Reg negative predictions: {(y_pred_lr < 0).sum()} ({(y_pred_lr<0).mean()*100:.1f}%)")
print(f" Poisson Reg min prediction: {y_pred_pr.min():.4f} (always > 0)")

Part 4 - MAP Estimation: Adding a Prior

MLE finds parameters that maximise the likelihood. But if you have prior beliefs about what parameters should look like, you can incorporate them through a prior distribution P(θ)P(\theta).

Maximum A Posteriori (MAP):

θ^MAP=argmaxθP(θD)=argmaxθP(Dθ)likelihoodP(θ)prior\hat{\theta}_{\text{MAP}} = \arg\max_\theta P(\theta | \mathcal{D}) = \arg\max_\theta \underbrace{P(\mathcal{D} | \theta)}_{\text{likelihood}} \cdot \underbrace{P(\theta)}_{\text{prior}}

=argmaxθ[logP(Dθ)+logP(θ)]= \arg\max_\theta \left[\log P(\mathcal{D} | \theta) + \log P(\theta)\right]

=argmaxθ[log-likelihood+log-prior]= \arg\max_\theta \left[\text{log-likelihood} + \text{log-prior}\right]

MAPMLE+regularisation(log-prior=regulariser)\boxed{\text{MAP} \equiv \text{MLE} + \text{regularisation} \quad (\text{log-prior} = \text{regulariser})}

L2 Regularisation ≡ Gaussian Prior

A Gaussian prior P(θ)=N(0,τ2)P(\theta) = \mathcal{N}(0, \tau^2) gives:

logP(θ)=12τ2θ22+const\log P(\theta) = -\frac{1}{2\tau^2} \|\theta\|_2^2 + \text{const}

θ^MAP=argminθ[NLL+λ2θ22]where λ=σ2τ2\hat{\theta}_{\text{MAP}} = \arg\min_\theta \left[\text{NLL} + \frac{\lambda}{2} \|\theta\|_2^2\right] \quad \text{where } \lambda = \frac{\sigma^2}{\tau^2}

This is exactly Ridge regression / L2 regularisation!

L1 Regularisation ≡ Laplace Prior

A Laplace prior P(θ)=Laplace(0,b)P(\theta) = \text{Laplace}(0, b) gives:

logP(θ)=θb+const\log P(\theta) = -\frac{|\theta|}{b} + \text{const}

θ^MAP=argminθ[NLL+λθ1]\hat{\theta}_{\text{MAP}} = \arg\min_\theta \left[\text{NLL} + \lambda \|\theta\|_1\right]

This is exactly Lasso / L1 regularisation! The Laplace prior has heavier tails than Gaussian, which corresponds to the sparsity-inducing property of L1 - it strongly prefers most weights to be exactly zero.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, laplace
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error

# Visualise Gaussian vs Laplace prior
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

theta_range = np.linspace(-5, 5, 1000)
for sigma in [0.5, 1.0, 2.0]:
axes[0].plot(theta_range, norm.pdf(theta_range, 0, sigma),
label=f'N(0, {sigma}²)')
for b in [0.5, 1.0, 2.0]:
axes[1].plot(theta_range, laplace.pdf(theta_range, 0, b),
label=f'Laplace(0, {b})')

axes[0].set_title('Gaussian Prior → L2/Ridge Regularisation\n(smooth - large weights discouraged)')
axes[1].set_title('Laplace Prior → L1/Lasso Regularisation\n(sharp peak - promotes sparsity/zero weights)')
for ax in axes:
ax.legend()
ax.set_xlabel('θ (parameter value)')
ax.set_ylabel('Prior density P(θ)')
ax.axvline(0, color='gray', linestyle=':', alpha=0.7)

plt.tight_layout()

# Demonstrate regularisation as MAP
np.random.seed(42)
n, p = 100, 50 # more features than effective signal
X, y, coef = make_regression(n_samples=n, n_features=p, n_informative=10,
noise=15, coef=True, random_state=42)

models = {
'MLE (no reg)': LinearRegression(),
'MAP Gaussian (Ridge, λ=1)': Ridge(alpha=1.0),
'MAP Gaussian (Ridge, λ=10)': Ridge(alpha=10.0),
'MAP Laplace (Lasso, λ=0.1)': Lasso(alpha=0.1),
'MAP Laplace (Lasso, λ=1)': Lasso(alpha=1.0),
}

train_size = 80
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]

print(f"\n{'Model':<35} {'Train RMSE':>12} {'Test RMSE':>12} {'Non-zero coefs':>15}")
print("-" * 80)
for name, model in models.items():
model.fit(X_train, y_train)
train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
nnz = np.sum(np.abs(model.coef_) > 1e-6)
print(f"{name:<35} {train_rmse:>12.4f} {test_rmse:>12.4f} {nnz:>15}")

The Bayesian Posterior

MLE and MAP both give point estimates. Full Bayesian inference gives the entire posterior distribution over parameters:

P(θD)=P(Dθ)P(θ)P(D)P(\theta | \mathcal{D}) = \frac{P(\mathcal{D} | \theta) P(\theta)}{P(\mathcal{D})}

The normalising constant P(D)=P(Dθ)P(θ)dθP(\mathcal{D}) = \int P(\mathcal{D}|\theta)P(\theta)d\theta is the model evidence - intractable for deep neural networks, motivating approximate inference methods (variational inference, Monte Carlo).

Part 5 - Probability Calibration

A model's predicted probability p^\hat{p} should match the empirical frequency of the event. If your classifier says 80% probability, it should be correct 80% of the time among all predictions with that score.

Well-calibrated model:
For all p ∈ [0,1]: P(Y=1 | ŷ = p) = p
"When I say 70%, it happens 70% of the time"

Overconfident model:
Predicts p=0.9 when actual frequency is 0.7
"When I say 90%, only 70% is right"
→ Neural networks are typically overconfident

Underconfident model:
Predicts p=0.6 when actual frequency is 0.8
→ Random forests tend to be underconfident

Reliability Diagrams

import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import log_loss, brier_score_loss

X, y = make_classification(n_samples=3000, n_features=20,
n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

models = {
'Logistic Regression': LogisticRegression(max_iter=500),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'SVM (uncalibrated)': SVC(probability=True),
'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
ax1.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')

for name, model in models.items():
model.fit(X_train, y_train)
p_pred = model.predict_proba(X_test)[:, 1]

fraction_pos, mean_pred = calibration_curve(y_test, p_pred, n_bins=10)

brier = brier_score_loss(y_test, p_pred)
logloss = log_loss(y_test, p_pred)

ax1.plot(mean_pred, fraction_pos, 's-', label=f'{name} (Brier={brier:.3f})')
print(f"{name}:")
print(f" Log Loss: {logloss:.4f}")
print(f" Brier Score: {brier:.4f}")

ax1.set_xlabel('Mean Predicted Probability')
ax1.set_ylabel('Fraction of Positives')
ax1.set_title('Calibration Curves (Reliability Diagram)')
ax1.legend(loc='upper left')

# Show calibration improvement
rf_uncal = RandomForestClassifier(n_estimators=100, random_state=42)
rf_cal = CalibratedClassifierCV(
RandomForestClassifier(n_estimators=100, random_state=42),
method='isotonic', cv=3
)

rf_uncal.fit(X_train, y_train)
rf_cal.fit(X_train, y_train)

for name, clf in [('RF Uncalibrated', rf_uncal), ('RF Calibrated', rf_cal)]:
p = clf.predict_proba(X_test)[:, 1]
frac, mean = calibration_curve(y_test, p, n_bins=10)
ax2.plot(mean, frac, 's-', label=name)

ax2.plot([0, 1], [0, 1], 'k--', label='Perfect')
ax2.set_xlabel('Mean Predicted Probability')
ax2.set_ylabel('Fraction of Positives')
ax2.set_title('Before vs After Calibration (Random Forest)')
ax2.legend()
plt.tight_layout()

Calibration Methods

from sklearn.calibration import CalibratedClassifierCV

# 1. Platt Scaling (sigmoid calibration)
# Fits a logistic regression on top of raw scores
# Works well when calibration error is systematic (monotone)
platt_calibrated = CalibratedClassifierCV(
GradientBoostingClassifier(n_estimators=100),
method='sigmoid', # Platt scaling
cv=5
)
platt_calibrated.fit(X_train, y_train)

# 2. Isotonic Regression
# Non-parametric, more flexible than Platt
# Risk of overfitting with small datasets
isotonic_calibrated = CalibratedClassifierCV(
GradientBoostingClassifier(n_estimators=100),
method='isotonic', # Isotonic regression
cv=5
)
isotonic_calibrated.fit(X_train, y_train)

for name, clf in [('Platt', platt_calibrated), ('Isotonic', isotonic_calibrated)]:
p = clf.predict_proba(X_test)[:, 1]
print(f"{name} calibration - Brier: {brier_score_loss(y_test, p):.4f}, "
f"LogLoss: {log_loss(y_test, p):.4f}")

Brier Score - The Calibration Metric

Brier=1ni=1n(yip^i)2\text{Brier} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{p}_i)^2

The Brier score is like MSE for probabilities. A perfect model has Brier = 0. A model that always predicts 0.5 has Brier = 0.25. It decomposes into calibration and refinement:

Brier=CalibrationBias in probabilities+ResolutionSpread of predictionsUncertaintyInherent difficulty\text{Brier} = \underbrace{\text{Calibration}}_{\text{Bias in probabilities}} + \underbrace{\text{Resolution}}_{\text{Spread of predictions}} - \underbrace{\text{Uncertainty}}_{\text{Inherent difficulty}}

from sklearn.metrics import brier_score_loss

# Brier score decomposition
def brier_decomposition(y_true, y_prob, n_bins=10):
"""Decompose Brier score into calibration + resolution - uncertainty."""
bins = np.linspace(0, 1, n_bins + 1)
bin_centers = (bins[:-1] + bins[1:]) / 2
p_bar = y_true.mean()

calibration = 0
resolution = 0
n = len(y_true)

for i in range(n_bins):
in_bin = (y_prob >= bins[i]) & (y_prob < bins[i+1])
if in_bin.sum() == 0:
continue
n_k = in_bin.sum()
o_k = y_true[in_bin].mean() # fraction of positives in bin
f_k = y_prob[in_bin].mean() # mean predicted prob in bin

calibration += n_k * (f_k - o_k)**2
resolution += n_k * (o_k - p_bar)**2

calibration /= n
resolution /= n
uncertainty = p_bar * (1 - p_bar)

print(f"Brier Decomposition:")
print(f" Calibration (lower=better): {calibration:.4f}")
print(f" Resolution (higher=better): {resolution:.4f}")
print(f" Uncertainty (baseline): {uncertainty:.4f}")
print(f" Brier Score = Cal - Res + Unc = {calibration - resolution + uncertainty:.4f}")
print(f" Direct Brier Score: {brier_score_loss(y_true, y_prob):.4f}")

rf_preds = rf_uncal.predict_proba(X_test)[:, 1]
brier_decomposition(y_test, rf_preds)

Part 6 - Aleatoric vs Epistemic Uncertainty

Understanding uncertainty types is critical for real-world ML deployment:

ALEATORIC UNCERTAINTY (irreducible):
→ Inherent randomness in the data
→ Cannot be reduced with more data
→ Example: rolling a fair die - we can never be certain of the outcome
→ Example: predicting a patient's exact blood pressure from demographics
(there's inherent biological variation)
→ In regression: captured by σ in N(μ(x), σ(x)²) - heteroscedastic models

EPISTEMIC UNCERTAINTY (reducible):
→ Uncertainty due to limited data / unknown parameters
→ Can be reduced with more training data
→ Example: predicting outcomes in an unusual region of input space
→ Example: edge cases / rare events in training data
→ In Bayesian models: width of the posterior P(θ|D)

The key question: "Am I uncertain because the world is noisy,
or because I haven't seen enough data?"
import numpy as np
import matplotlib.pyplot as plt

# Illustrate: aleatoric vs epistemic uncertainty
np.random.seed(42)

def true_function(x):
return np.sin(x) + 0.1 * x

# Sparse training data in [0, 3] and [7, 10], none in [4, 6]
x_train_1 = np.linspace(0, 3, 30)
x_train_2 = np.linspace(7, 10, 30)
x_train = np.concatenate([x_train_1, x_train_2])
y_train = true_function(x_train) + np.random.normal(0, 0.3, len(x_train))

x_test = np.linspace(0, 10, 200)
y_true = true_function(x_test)

# Simple uncertainty: predict with ensemble of models
# (epistemic uncertainty → ensemble members disagree in data-sparse regions)
from sklearn.linear_model import BayesianRidge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Bootstrap ensemble
n_ensemble = 20
ensemble_preds = np.zeros((n_ensemble, len(x_test)))

for i in range(n_ensemble):
boot_idx = np.random.choice(len(x_train), len(x_train), replace=True)
x_b = x_train[boot_idx]
y_b = y_train[boot_idx]
pipe = Pipeline([
('poly', PolynomialFeatures(degree=5)),
('ridge', BayesianRidge())
])
pipe.fit(x_b.reshape(-1, 1), y_b)
ensemble_preds[i] = pipe.predict(x_test.reshape(-1, 1))

mean_pred = ensemble_preds.mean(axis=0)
std_pred = ensemble_preds.std(axis=0) # epistemic uncertainty

fig, ax = plt.subplots(figsize=(14, 6))
ax.plot(x_test, y_true, 'k-', linewidth=2, label='True function', zorder=5)
ax.scatter(x_train, y_train, c='black', s=20, zorder=10, label='Training data')
ax.plot(x_test, mean_pred, 'b-', linewidth=2, label='Ensemble mean')
ax.fill_between(x_test, mean_pred - 2*std_pred, mean_pred + 2*std_pred,
alpha=0.3, color='blue', label='±2σ (epistemic uncertainty)')

# Mark data-sparse region (high epistemic uncertainty)
ax.axvspan(3.5, 6.5, alpha=0.1, color='red', label='No training data here')
ax.set_title('Epistemic Uncertainty: High in Data-Sparse Regions, Low Where Data Is Dense')
ax.legend()
ax.set_xlabel('x')
plt.tight_layout()

print("Uncertainty comparison:")
print(f" Mean std in [0,3] (data-rich): {std_pred[(x_test < 3)].mean():.4f}")
print(f" Mean std in [4,6] (data-poor): {std_pred[(x_test > 4) & (x_test < 6)].mean():.4f}")
print(f" Mean std in [7,10] (data-rich): {std_pred[(x_test > 7)].mean():.4f}")

Part 7 - Bayesian Deep Learning: Practical Approximations

Full Bayesian inference over neural network weights is intractable. In practice, two approximations dominate:

MC Dropout

Gal & Ghahramani (2016) showed that dropout at inference time approximates Bayesian inference:

import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

class BayesianMLP(nn.Module):
"""
MLP with dropout at inference time for Bayesian uncertainty estimation.
(MC Dropout - approximates Bayesian inference)
"""
def __init__(self, input_dim: int, hidden_dim: int, output_dim: int,
dropout_rate: float = 0.2):
super().__init__()
self.net = nn.Sequential(
nn.Linear(input_dim, hidden_dim),
nn.ReLU(),
nn.Dropout(dropout_rate), # active at BOTH train and test
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Dropout(dropout_rate),
nn.Linear(hidden_dim, output_dim)
)

def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.net(x)

def mc_dropout_predict(model: nn.Module, x: torch.Tensor,
n_samples: int = 50) -> tuple:
"""
Enable dropout at inference (train mode) for uncertainty estimation.
Returns: (mean_prediction, uncertainty_std)
"""
model.train() # keep dropout active
with torch.no_grad():
preds = torch.stack([model(x) for _ in range(n_samples)])
model.eval()

mean = preds.mean(dim=0)
uncertainty = preds.std(dim=0)
return mean, uncertainty

# Build and train on sparse 1D data
np.random.seed(42)
torch.manual_seed(42)

n_train = 100
x_data = torch.linspace(-3, 3, n_train).unsqueeze(1)
y_data = torch.sin(x_data) + 0.1 * torch.randn_like(x_data)

bayesian_model = BayesianMLP(input_dim=1, hidden_dim=64, output_dim=1, dropout_rate=0.2)
optimizer = torch.optim.Adam(bayesian_model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

bayesian_model.train()
for epoch in range(500):
optimizer.zero_grad()
y_pred = bayesian_model(x_data)
loss = criterion(y_pred, y_data)
loss.backward()
optimizer.step()

# Predict on test range (including extrapolation)
x_test_mc = torch.linspace(-5, 5, 200).unsqueeze(1)
mean_pred, uncertainty = mc_dropout_predict(bayesian_model, x_test_mc, n_samples=100)

x_np = x_test_mc.numpy().flatten()
mean_np = mean_pred.numpy().flatten()
std_np = uncertainty.numpy().flatten()

fig, ax = plt.subplots(figsize=(14, 6))
ax.scatter(x_data.numpy(), y_data.numpy(), c='black', s=20, alpha=0.6, label='Train data', zorder=5)
ax.plot(x_np, mean_np, 'b-', linewidth=2, label='MC Dropout mean')
ax.fill_between(x_np, mean_np - 2*std_np, mean_np + 2*std_np,
alpha=0.3, label='±2σ (MC Dropout uncertainty)')
ax.axvspan(3.1, 5.0, alpha=0.1, color='red', label='Extrapolation region')
ax.set_title('MC Dropout: Uncertainty grows outside training region')
ax.legend()
plt.tight_layout()

Deep Ensembles

Training multiple independent models and averaging their predictions gives a practical and effective uncertainty estimate:

import torch
import torch.nn as nn
import numpy as np
from typing import List, Tuple

class NNEnsemble:
"""
Deep ensemble for uncertainty estimation.
Each model is trained from a different random initialisation.
Epistemic uncertainty = disagreement between ensemble members.
"""

def __init__(self, n_models: int = 5, hidden_dim: int = 64, dropout: float = 0.1):
self.n_models = n_models
self.hidden_dim = hidden_dim
self.models: List[nn.Module] = []

def _build_model(self, input_dim: int) -> nn.Module:
return nn.Sequential(
nn.Linear(input_dim, self.hidden_dim),
nn.ReLU(),
nn.Linear(self.hidden_dim, self.hidden_dim),
nn.ReLU(),
nn.Linear(self.hidden_dim, 2) # predict mean + log_variance
)

def fit(self, X: torch.Tensor, y: torch.Tensor, epochs: int = 300):
input_dim = X.shape[1]
self.models = []

for i in range(self.n_models):
torch.manual_seed(i * 42)
model = self._build_model(input_dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(epochs):
opt.zero_grad()
out = model(X)
mu = out[:, 0:1]
logv = out[:, 1:2]
# Gaussian NLL loss (learns both mean and variance)
loss = 0.5 * (logv + ((y - mu)**2) / logv.exp()).mean()
loss.backward()
opt.step()

model.eval()
self.models.append(model)

def predict(self, X: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
"""Returns: (mean_prediction, total_uncertainty_std)"""
means = []
vars_ = []

with torch.no_grad():
for model in self.models:
out = model(X)
means.append(out[:, 0])
vars_.append(out[:, 1].exp()) # aleatoric variance per model

means_t = torch.stack(means) # (n_models, n_samples)
vars_t = torch.stack(vars_) # (n_models, n_samples)

# Mixture of Gaussians approximation
ensemble_mean = means_t.mean(dim=0)
aleatoric_var = vars_t.mean(dim=0) # avg aleatoric
epistemic_var = means_t.var(dim=0) # disagreement
total_std = (aleatoric_var + epistemic_var).sqrt()

return ensemble_mean, total_std, aleatoric_var.sqrt(), epistemic_var.sqrt()

Part 8 - The KL Divergence and Entropy Connection

KL Divergence

DKL(PQ)=xP(x)logP(x)Q(x)=EP[logP(X)Q(X)]D_{\text{KL}}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} = \mathbb{E}_P\left[\log \frac{P(X)}{Q(X)}\right]

Interpretation: how much information is lost when we use QQ to approximate PP. Always 0\geq 0, equals 0 iff P=QP = Q.

Cross-Entropy = Entropy + KL Divergence

H(P,Q)=H(P)+DKL(PQ)H(P, Q) = H(P) + D_{\text{KL}}(P \| Q)

H(P,Q)=xP(x)logQ(x)H(P, Q) = -\sum_x P(x) \log Q(x)

When minimising cross-entropy loss, PP = true label distribution (one-hot), QQ = model's predicted probabilities. Since H(P)H(P) is fixed (the true labels don't change), minimising H(P,Q)H(P, Q) is equivalent to minimising DKL(PQ)D_{\text{KL}}(P \| Q):

Minimising cross-entropyMinimising KL from true distribution to model\boxed{\text{Minimising cross-entropy} \equiv \text{Minimising KL from true distribution to model}}

import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, epsilon: float = 1e-10) -> float:
"""KL(P||Q) - KL divergence from P to Q."""
p = np.clip(p, epsilon, 1)
q = np.clip(q, epsilon, 1)
return np.sum(p * np.log(p / q))

def cross_entropy(p: np.ndarray, q: np.ndarray, epsilon: float = 1e-10) -> float:
"""H(P, Q) = -sum P * log Q"""
q = np.clip(q, epsilon, 1)
return -np.sum(p * np.log(q))

def entropy(p: np.ndarray, epsilon: float = 1e-10) -> float:
"""H(P) = -sum P * log P"""
p = np.clip(p, epsilon, 1)
return -np.sum(p * np.log(p))

# Verify: H(P, Q) = H(P) + KL(P||Q)
p_true = np.array([1.0, 0.0, 0.0]) # one-hot: class 0
q_model = np.array([0.7, 0.2, 0.1]) # model predictions

ce = cross_entropy(p_true, q_model)
h = entropy(p_true)
kl = kl_divergence(p_true, q_model)

print(f"H(P): {h:.4f}")
print(f"KL(P||Q): {kl:.4f}")
print(f"H(P,Q): {ce:.4f}")
print(f"H(P)+KL(P||Q):{h + kl:.4f}") # should equal H(P,Q)

# One-hot → H(P) = 0 → H(P,Q) = KL(P||Q) = -log q[true_class]
print(f"\nFor one-hot labels: H(P)=0, so H(P,Q) = KL(P||Q) = -log(q[true]) = {-np.log(0.7):.4f}")

Part 9 - Label Smoothing

Hard one-hot labels create overconfident models. Label smoothing prevents this by softening the target distribution:

ysmooth=(1ϵ)yone-hot+ϵKy_{\text{smooth}} = (1 - \epsilon) \cdot y_{\text{one-hot}} + \frac{\epsilon}{K}

import torch
import torch.nn.functional as F

def label_smoothing_loss(logits: torch.Tensor, targets: torch.Tensor,
smoothing: float = 0.1) -> torch.Tensor:
"""
Cross-entropy with label smoothing.
Prevents model from being overconfident (probabilities near 0 or 1).
Equivalent to regularising the KL divergence from a uniform distribution.
"""
n_classes = logits.shape[-1]
log_probs = F.log_softmax(logits, dim=-1)

# Hard targets: one-hot
nll_loss = F.nll_loss(log_probs, targets)

# Smooth targets: uniform over all classes
smooth_loss = -log_probs.mean(dim=-1).mean()

loss = (1 - smoothing) * nll_loss + smoothing * smooth_loss
return loss

# Compare standard vs label smoothing
logits = torch.randn(32, 10) # batch=32, classes=10
targets = torch.randint(0, 10, (32,))

ce_standard = F.cross_entropy(logits, targets)
ce_smoothed = label_smoothing_loss(logits, targets, smoothing=0.1)

print(f"Standard CE loss: {ce_standard.item():.4f}")
print(f"Label-smoothed CE loss: {ce_smoothed.item():.4f}")

# In PyTorch 1.10+, use built-in label smoothing
ce_builtin = F.cross_entropy(logits, targets, label_smoothing=0.1)
print(f"PyTorch built-in: {ce_builtin.item():.4f}")

YouTube Resources

VideoChannelFocus
MLE and MAP estimationStatQuestMLE intuition and derivation
Cross Entropy LossStatQuestCross-entropy as NLL
Probability CalibrationritvikmathCalibration curves
Bayesian Deep LearningYannic KilcherMC Dropout and uncertainty
KL DivergenceIntuitively ExplainedKL divergence visual intuition

Interview Questions

Q1: What probability distribution does minimising MSE implicitly assume? What are the consequences if that assumption is violated?

Minimising MSE is equivalent to MLE under a Gaussian noise assumption: y=fθ(x)+εy = f_\theta(x) + \varepsilon, εN(0,σ2)\varepsilon \sim \mathcal{N}(0, \sigma^2). Consequences when violated: (1) If errors are heavy-tailed (Laplace or t-distributed), MSE over-weights outliers - Huber loss or MAE would be more appropriate. (2) If errors are heteroscedastic (variance depends on xx), MSE treats all samples equally but they shouldn't be - Weighted Least Squares or a model that predicts variance is better. (3) If the target is bounded (e.g., probabilities in [0,1]), MSE can produce predictions outside the valid range - use a model with appropriate output activation (sigmoid/softmax) and corresponding NLL loss.

Q2: What is the probabilistic interpretation of L2 regularisation?

L2 regularisation (Ridge) is MAP estimation with a Gaussian prior over parameters: P(θ)=N(0,τ2I)P(\theta) = \mathcal{N}(0, \tau^2 I). The MAP objective is: argmaxθ[logP(Dθ)+logP(θ)]=argminθ[NLL+λ2θ22]\arg\max_\theta [\log P(D|\theta) + \log P(\theta)] = \arg\min_\theta [\text{NLL} + \frac{\lambda}{2}\|\theta\|_2^2] where λ=σ2/τ2\lambda = \sigma^2/\tau^2 (ratio of noise variance to prior variance). A smaller prior variance τ2\tau^2 (stronger belief that parameters are near zero) corresponds to larger λ\lambda (stronger regularisation). Similarly, L1 regularisation corresponds to a Laplace prior, which has heavier tails and a sharper peak at zero - this is why L1 promotes sparsity: the Laplace distribution assigns high probability to weights being exactly zero.

Q3: Why are neural networks typically poorly calibrated, and how do you fix it?

Neural networks trained with cross-entropy on modern deep architectures (with batch normalisation, weight decay) tend to be overconfident - they produce probabilities near 0 or 1 more often than the empirical frequency justifies. The reasons: (1) Cross-entropy training encourages the model to be confident to reduce loss; (2) Batch normalisation and weight decay interact in ways that push logit magnitudes high; (3) Overfitting to training distribution affects probability scaling. Fixes: (1) Temperature scaling - divide logits by a temperature T>1T > 1 before softmax (single parameter, cheap); (2) Platt scaling - fit a logistic regression on top of model scores; (3) Isotonic regression - non-parametric, more flexible; (4) Label smoothing - prevents the model from being maximally confident during training; (5) MC Dropout / deep ensembles - produce better-calibrated uncertainty estimates.

Q4: What is the difference between aleatoric and epistemic uncertainty? Give an example of each in a medical ML context.

Aleatoric uncertainty is irreducible - it's inherent randomness in the data that cannot be resolved with more data. Epistemic uncertainty is reducible - it arises from limited data or model assumptions, and decreases as more relevant data is collected. In medical ML: aleatoric - two identical patients (same age, labs, vitals) may have different outcomes due to genetic variation, unobserved factors, or random biological processes; no amount of data resolves this. Epistemic - a rare disease with 50 training cases: the model is uncertain about rare presentations because it hasn't seen enough examples; gathering 5,000 cases would dramatically reduce this uncertainty. Models that distinguish both types (heteroscedastic NNs for aleatoric, ensembles/MC Dropout for epistemic) are more trustworthy for deployment, especially in safety-critical applications where knowing "I don't know" is crucial.

Q5: Derive why binary cross-entropy is the correct loss for logistic regression from a probabilistic perspective.

Logistic regression assumes Pθ(y=1x)=σ(xTθ)P_\theta(y=1|x) = \sigma(x^T\theta) where σ\sigma is the sigmoid function. The likelihood of the entire dataset under Bernoulli independence is:

P(Dθ)=i=1np^iyi(1p^i)1yiP(D|\theta) = \prod_{i=1}^n \hat{p}_i^{y_i}(1-\hat{p}_i)^{1-y_i}

Taking the negative log-likelihood (NLL):

logP(Dθ)=i=1n[yilogp^i+(1yi)log(1p^i)]-\log P(D|\theta) = -\sum_{i=1}^n [y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)]

This is exactly the binary cross-entropy loss. Minimising BCE = finding the MLE solution for the Bernoulli model. There's no "arbitrary" choice here - BCE is the theoretically correct objective given the Bernoulli assumption. Choosing MSE instead for classification would correspond to assuming Gaussian noise on binary outputs, which doesn't match the data-generating process and leads to worse probability estimates.

Q6: What is the Brier score and how does it differ from log loss for evaluating probabilistic predictions?

The Brier score is 1ni(yip^i)2\frac{1}{n}\sum_i(y_i - \hat{p}_i)^2 - the MSE between predicted probabilities and binary outcomes. Log loss is 1ni[yilogp^i+(1yi)log(1p^i)]-\frac{1}{n}\sum_i[y_i\log\hat{p}_i + (1-y_i)\log(1-\hat{p}_i)]. Key differences: (1) Log loss is unbounded - a confident wrong prediction (predicts 0.99, true label is 0) gives loss log(0.01)4.6\log(0.01) \approx 4.6 while Brier gives (00.99)20.98(0-0.99)^2 \approx 0.98. Log loss penalises confident mistakes much more severely. (2) Brier score is bounded [0, 1] and more interpretable. (3) Log loss has a direct probabilistic interpretation (NLL of Bernoulli), while Brier score is a proper scoring rule from a more general family. For model comparison: use log loss to penalise confident errors; use Brier score for overall probability accuracy. Both decompose into calibration and resolution components.

Q7: Explain label smoothing from a probabilistic perspective. When is it useful and when does it hurt?

Label smoothing replaces one-hot targets yone-hoty_{\text{one-hot}} with ysmooth=(1ϵ)yone-hot+ϵ/Ky_{\text{smooth}} = (1-\epsilon) \cdot y_{\text{one-hot}} + \epsilon/K. Probabilistically, it changes the target distribution from the empirical (true class = 100% probability) to a mixture: mostly the true class but with ϵ\epsilon probability spread uniformly. This prevents the model from minimising CE by pushing logit margins to infinity (which it could with hard labels). Equivalently, label smoothing adds a KL divergence term from the model's distribution to a uniform distribution: L=(1ϵ)H(Phard,Q)+ϵH(Puniform,Q)\mathcal{L} = (1-\epsilon) H(P_{\text{hard}}, Q) + \epsilon H(P_{\text{uniform}}, Q). Useful when: the model tends to be overconfident, for knowledge distillation (teacher labels are already soft), for datasets with label noise. Hurts when: you use the model for calibration-sensitive applications (label smoothing systematically changes the temperature of predictions); for tasks requiring calibrated probabilities, apply temperature scaling after training instead.

Q8: How would you implement predictive uncertainty estimation for a production classification model without retraining from scratch?

Three practical approaches in order of implementation cost: (1) Temperature scaling: fit a single temperature parameter TT on validation data such that σ(logit/T)\sigma(\text{logit}/T) is well-calibrated. One parameter, no retraining, takes seconds. Best for reducing overconfidence without changing ranking. (2) MC Dropout: if the model has dropout layers, enable dropout at inference and run N=50N=50 forward passes. Average predictions for the mean, use standard deviation as uncertainty. Works without retraining if dropout is already present; slightly increases inference latency by N×N\times. (3) Deep ensemble: train M=5M=5 models from different random seeds. Average their softmax outputs. High-quality uncertainty estimates, but requires M×M\times training compute and serving infrastructure. For production, I'd implement (1) immediately (calibration fix), then evaluate if (3) is worth the cost given the latency/cost constraints.

:::tip Role-Specific Angles MLE Interview: MLE derivation for regression/classification, connection between loss functions and distributions, MAP = MLE + prior = regularisation, calibration and Brier score Research Engineer: Aleatoric vs epistemic uncertainty, MC Dropout theory, deep ensembles, label smoothing, KL divergence decomposition Data Scientist: Probability calibration, reliability diagrams, calibration curves, Platt vs isotonic scaling AI Engineer: Label smoothing implementation in PyTorch, uncertainty quantification in production APIs, confidence thresholds for rejection :::

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Probability Distributions demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.