Probabilistic View of Machine Learning
Reading time: ~40 minutes | Level: ML Foundations | Role: MLE, Research Engineer, Data Scientist, AI Engineer
A senior interviewer asks you: "What is cross-entropy loss and why do we use it?" Most candidates answer: "It penalises confident wrong predictions more than unconfident ones." That's a start. But the interviewer probes deeper: "What probability model are you implicitly assuming when you minimise cross-entropy? And what happens to your model's behaviour if that assumption is violated?"
This is the probabilistic view of ML. Every loss function encodes a probabilistic assumption about how data was generated. Every optimizer is secretly maximising a likelihood. Every regulariser encodes a prior over parameters. When you know this framework, you understand why techniques work - and when to expect them to fail.
What You Will Learn
- The generative model: how probabilistic thinking frames ML as density estimation
- Maximum Likelihood Estimation (MLE) - derivation for regression and classification
- Why minimising MSE ≡ maximising Gaussian likelihood
- Why minimising cross-entropy ≡ maximising Bernoulli/Categorical likelihood
- Maximum A Posteriori (MAP) estimation - MLE + prior
- Why L2 regularisation ≡ Gaussian prior, L1 ≡ Laplace prior
- Probability calibration - the gap between predicted probabilities and empirical frequencies
- Bayesian reasoning in deep learning - MC Dropout, deep ensembles
- Uncertainty types: aleatoric vs epistemic
- Eight interview Q&As at senior research/MLE level
Part 1 - The Probabilistic Framework
From Patterns to Distributions
Traditional ML framing: "Find a function that minimises training error."
Probabilistic framing: "Find parameters such that the model distribution best explains the observed data."
Generative view of supervised learning:
1. Nature draws a sample: (x, y) from true distribution P*(x, y)
2. You observe n samples: D = {(x₁,y₁), ..., (xₙ,yₙ)}
3. You posit a parametric model: P_θ(y|x)
4. Goal: find θ that makes D most probable under your model
This is Maximum Likelihood Estimation (MLE).
Why this framework is powerful:
- It makes assumptions explicit (what distribution are you assuming?)
- It gives you confidence estimates, not just point predictions
- It connects every loss function to a probability model
- It tells you what regularisation means (prior beliefs)
- It quantifies uncertainty in predictions
Part 2 - Maximum Likelihood Estimation (MLE)
The Core Idea
Given observed data , and a parametric model , MLE finds:
We take the log (log-likelihood is easier to optimise - sums instead of products):
Minimising the negative log-likelihood (NLL) is the standard training objective.
Case 1: Regression - Gaussian Likelihood → MSE
Assumption: where .
Then .
Summing over samples and removing constants with respect to :
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
# Demonstrate: MLE with Gaussian assumption = MSE minimisation
np.random.seed(42)
x = np.linspace(0, 10, 50)
y_true = 2 * x + 1
y_obs = y_true + np.random.normal(0, 2, 50) # σ=2
# Method 1: Minimise MSE (standard linear regression)
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x.reshape(-1, 1), y_obs)
mse_predictions = lr.predict(x.reshape(-1, 1))
# Method 2: Maximise Gaussian log-likelihood (same result)
sigma = np.std(y_obs - mse_predictions)
log_likelihoods = norm.logpdf(y_obs, loc=mse_predictions, scale=sigma)
total_ll = log_likelihoods.sum()
print(f"MSE regression: slope={lr.coef_[0]:.4f}, intercept={lr.intercept_:.4f}")
print(f"Total log-likelihood at MLE solution: {total_ll:.4f}")
print(f"NLL = -log_likelihood = {-total_ll:.4f}")
print(f"Scaled NLL = {-total_ll/50:.4f} ≈ MSE / (2σ²) + const")
# Visualise the Gaussian assumption
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
ax1.scatter(x, y_obs, alpha=0.5, label='Observed')
ax1.plot(x, mse_predictions, 'r-', label='MLE/MSE fit')
ax1.set_title('Gaussian Regression: y ~ N(f(x), σ²)')
ax1.legend()
ax2.hist(y_obs - mse_predictions, bins=20, density=True, alpha=0.7, label='Residuals')
x_range = np.linspace(-8, 8, 200)
ax2.plot(x_range, norm.pdf(x_range, 0, sigma), 'r-', label=f'N(0, {sigma:.2f}²)')
ax2.set_title('Residual Distribution (should be Gaussian for MLE=MSE)')
ax2.legend()
plt.tight_layout()
Case 2: Binary Classification - Bernoulli Likelihood → Cross-Entropy
Assumption: where (sigmoid output).
Taking the negative log-likelihood:
Summing over all samples:
import numpy as np
from scipy.special import expit # sigmoid
# Demonstrate: cross-entropy IS negative log-likelihood for Bernoulli model
np.random.seed(42)
n = 1000
# True data generating process
X_clf = np.random.randn(n, 2)
true_logits = 2 * X_clf[:, 0] - 1.5 * X_clf[:, 1]
true_probs = expit(true_logits)
y_clf = np.random.binomial(1, true_probs)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
model = LogisticRegression(C=1e6, max_iter=500) # large C ≈ no regularisation = pure MLE
model.fit(X_clf, y_clf)
p_hat = model.predict_proba(X_clf)[:, 1]
# Cross-entropy (sklearn)
ce_sklearn = log_loss(y_clf, p_hat)
# Manual NLL
nll_manual = -np.mean(
y_clf * np.log(p_hat + 1e-15) + (1 - y_clf) * np.log(1 - p_hat + 1e-15)
)
print(f"sklearn log_loss: {ce_sklearn:.6f}")
print(f"Manual NLL: {nll_manual:.6f}")
print(f"Difference: {abs(ce_sklearn - nll_manual):.2e} (numerical precision)")
Case 3: Multi-Class - Categorical Likelihood → Softmax Cross-Entropy
import torch
import torch.nn.functional as F
# Softmax cross-entropy = NLL of Categorical distribution
logits = torch.tensor([[2.0, 1.0, -0.5],
[0.5, 3.0, 0.1],
[-1.0, 0.5, 2.5]]) # (batch=3, classes=3)
labels = torch.tensor([0, 1, 2]) # true classes
# Method 1: PyTorch cross-entropy
ce_loss = F.cross_entropy(logits, labels)
# Method 2: Manual NLL
probs = F.softmax(logits, dim=-1)
log_probs = torch.log(probs)
nll_manual = -log_probs[torch.arange(3), labels].mean()
print(f"F.cross_entropy: {ce_loss.item():.6f}")
print(f"Manual NLL: {nll_manual.item():.6f}")
# Demonstrate the assumed distribution
print("\nPredicted probabilities (Categorical distribution parameters):")
for i in range(3):
p = probs[i].detach().numpy()
print(f" Sample {i}: P(C=0)={p[0]:.3f}, P(C=1)={p[1]:.3f}, P(C=2)={p[2]:.3f}")
print(f" True class: {labels[i].item()}, NLL: {-torch.log(probs[i][labels[i]]).item():.4f}")
Part 3 - Loss Functions and Their Implied Distributions
Loss Function │ Distribution Assumption │ Optimal Prediction
───────────────────────┼─────────────────────────────────┼───────────────────
MSE (L2 loss) │ Gaussian noise (ε ~ N(0,σ²)) │ Conditional mean E[y|x]
MAE (L1 loss) │ Laplace noise (ε ~ Laplace) │ Conditional median
Huber loss │ Mix Gaussian + Laplace │ Robust conditional mean
Binary cross-entropy │ Bernoulli (y ∈ {0,1}) │ P(y=1|x)
Softmax cross-entropy │ Categorical (y ∈ {1..K}) │ P(y=k|x)
KL divergence │ Any distribution P vs Q │ P = Q (perfect match)
Poisson NLL │ Poisson (count data y ∈ ℕ) │ Rate parameter λ(x)
Beta NLL │ Beta (y ∈ [0,1]) │ Mean of Beta(α,β)
Ordinal cross-entropy │ Ordered categorical │ Cumulative probabilities
When the Assumed Distribution Doesn't Match
import numpy as np
from scipy.stats import poisson
import matplotlib.pyplot as plt
"""
Example: count data (bookings per hour) is Poisson-distributed.
If you use MSE loss (Gaussian assumption), predictions may be negative
and the loss weights large counts disproportionately.
Poisson NLL handles this correctly.
"""
np.random.seed(42)
# True rate is time-dependent
times = np.linspace(1, 24, 1000) # hours of day
true_rate = 5 + 20 * np.exp(-0.5 * ((times - 12) / 3)**2) # lunch peak
y_counts = np.array([poisson.rvs(lam) for lam in true_rate])
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
import warnings
warnings.filterwarnings('ignore')
X_times = times.reshape(-1, 1)
# Wrong: Gaussian assumption (standard linear regression)
from sklearn.linear_model import LinearRegression
lr = LinearRegression().fit(X_times, y_counts)
y_pred_lr = lr.predict(X_times)
# Right: Poisson assumption
pr = PoissonRegressor(alpha=0.01).fit(X_times, y_counts)
y_pred_pr = pr.predict(X_times)
print("Count data prediction comparison:")
print(f" Linear Reg (Gaussian) MAE: {mean_absolute_error(y_counts, y_pred_lr):.3f}")
print(f" Poisson Reg MAE: {mean_absolute_error(y_counts, y_pred_pr):.3f}")
print(f" Linear Reg negative predictions: {(y_pred_lr < 0).sum()} ({(y_pred_lr<0).mean()*100:.1f}%)")
print(f" Poisson Reg min prediction: {y_pred_pr.min():.4f} (always > 0)")
Part 4 - MAP Estimation: Adding a Prior
MLE finds parameters that maximise the likelihood. But if you have prior beliefs about what parameters should look like, you can incorporate them through a prior distribution .
Maximum A Posteriori (MAP):
L2 Regularisation ≡ Gaussian Prior
A Gaussian prior gives:
This is exactly Ridge regression / L2 regularisation!
L1 Regularisation ≡ Laplace Prior
A Laplace prior gives:
This is exactly Lasso / L1 regularisation! The Laplace prior has heavier tails than Gaussian, which corresponds to the sparsity-inducing property of L1 - it strongly prefers most weights to be exactly zero.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, laplace
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
# Visualise Gaussian vs Laplace prior
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
theta_range = np.linspace(-5, 5, 1000)
for sigma in [0.5, 1.0, 2.0]:
axes[0].plot(theta_range, norm.pdf(theta_range, 0, sigma),
label=f'N(0, {sigma}²)')
for b in [0.5, 1.0, 2.0]:
axes[1].plot(theta_range, laplace.pdf(theta_range, 0, b),
label=f'Laplace(0, {b})')
axes[0].set_title('Gaussian Prior → L2/Ridge Regularisation\n(smooth - large weights discouraged)')
axes[1].set_title('Laplace Prior → L1/Lasso Regularisation\n(sharp peak - promotes sparsity/zero weights)')
for ax in axes:
ax.legend()
ax.set_xlabel('θ (parameter value)')
ax.set_ylabel('Prior density P(θ)')
ax.axvline(0, color='gray', linestyle=':', alpha=0.7)
plt.tight_layout()
# Demonstrate regularisation as MAP
np.random.seed(42)
n, p = 100, 50 # more features than effective signal
X, y, coef = make_regression(n_samples=n, n_features=p, n_informative=10,
noise=15, coef=True, random_state=42)
models = {
'MLE (no reg)': LinearRegression(),
'MAP Gaussian (Ridge, λ=1)': Ridge(alpha=1.0),
'MAP Gaussian (Ridge, λ=10)': Ridge(alpha=10.0),
'MAP Laplace (Lasso, λ=0.1)': Lasso(alpha=0.1),
'MAP Laplace (Lasso, λ=1)': Lasso(alpha=1.0),
}
train_size = 80
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]
print(f"\n{'Model':<35} {'Train RMSE':>12} {'Test RMSE':>12} {'Non-zero coefs':>15}")
print("-" * 80)
for name, model in models.items():
model.fit(X_train, y_train)
train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
nnz = np.sum(np.abs(model.coef_) > 1e-6)
print(f"{name:<35} {train_rmse:>12.4f} {test_rmse:>12.4f} {nnz:>15}")
The Bayesian Posterior
MLE and MAP both give point estimates. Full Bayesian inference gives the entire posterior distribution over parameters:
The normalising constant is the model evidence - intractable for deep neural networks, motivating approximate inference methods (variational inference, Monte Carlo).
Part 5 - Probability Calibration
A model's predicted probability should match the empirical frequency of the event. If your classifier says 80% probability, it should be correct 80% of the time among all predictions with that score.
Well-calibrated model:
For all p ∈ [0,1]: P(Y=1 | ŷ = p) = p
"When I say 70%, it happens 70% of the time"
Overconfident model:
Predicts p=0.9 when actual frequency is 0.7
"When I say 90%, only 70% is right"
→ Neural networks are typically overconfident
Underconfident model:
Predicts p=0.6 when actual frequency is 0.8
→ Random forests tend to be underconfident
Reliability Diagrams
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import log_loss, brier_score_loss
X, y = make_classification(n_samples=3000, n_features=20,
n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
models = {
'Logistic Regression': LogisticRegression(max_iter=500),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'SVM (uncalibrated)': SVC(probability=True),
'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
}
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
ax1.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
for name, model in models.items():
model.fit(X_train, y_train)
p_pred = model.predict_proba(X_test)[:, 1]
fraction_pos, mean_pred = calibration_curve(y_test, p_pred, n_bins=10)
brier = brier_score_loss(y_test, p_pred)
logloss = log_loss(y_test, p_pred)
ax1.plot(mean_pred, fraction_pos, 's-', label=f'{name} (Brier={brier:.3f})')
print(f"{name}:")
print(f" Log Loss: {logloss:.4f}")
print(f" Brier Score: {brier:.4f}")
ax1.set_xlabel('Mean Predicted Probability')
ax1.set_ylabel('Fraction of Positives')
ax1.set_title('Calibration Curves (Reliability Diagram)')
ax1.legend(loc='upper left')
# Show calibration improvement
rf_uncal = RandomForestClassifier(n_estimators=100, random_state=42)
rf_cal = CalibratedClassifierCV(
RandomForestClassifier(n_estimators=100, random_state=42),
method='isotonic', cv=3
)
rf_uncal.fit(X_train, y_train)
rf_cal.fit(X_train, y_train)
for name, clf in [('RF Uncalibrated', rf_uncal), ('RF Calibrated', rf_cal)]:
p = clf.predict_proba(X_test)[:, 1]
frac, mean = calibration_curve(y_test, p, n_bins=10)
ax2.plot(mean, frac, 's-', label=name)
ax2.plot([0, 1], [0, 1], 'k--', label='Perfect')
ax2.set_xlabel('Mean Predicted Probability')
ax2.set_ylabel('Fraction of Positives')
ax2.set_title('Before vs After Calibration (Random Forest)')
ax2.legend()
plt.tight_layout()
Calibration Methods
from sklearn.calibration import CalibratedClassifierCV
# 1. Platt Scaling (sigmoid calibration)
# Fits a logistic regression on top of raw scores
# Works well when calibration error is systematic (monotone)
platt_calibrated = CalibratedClassifierCV(
GradientBoostingClassifier(n_estimators=100),
method='sigmoid', # Platt scaling
cv=5
)
platt_calibrated.fit(X_train, y_train)
# 2. Isotonic Regression
# Non-parametric, more flexible than Platt
# Risk of overfitting with small datasets
isotonic_calibrated = CalibratedClassifierCV(
GradientBoostingClassifier(n_estimators=100),
method='isotonic', # Isotonic regression
cv=5
)
isotonic_calibrated.fit(X_train, y_train)
for name, clf in [('Platt', platt_calibrated), ('Isotonic', isotonic_calibrated)]:
p = clf.predict_proba(X_test)[:, 1]
print(f"{name} calibration - Brier: {brier_score_loss(y_test, p):.4f}, "
f"LogLoss: {log_loss(y_test, p):.4f}")
Brier Score - The Calibration Metric
The Brier score is like MSE for probabilities. A perfect model has Brier = 0. A model that always predicts 0.5 has Brier = 0.25. It decomposes into calibration and refinement:
from sklearn.metrics import brier_score_loss
# Brier score decomposition
def brier_decomposition(y_true, y_prob, n_bins=10):
"""Decompose Brier score into calibration + resolution - uncertainty."""
bins = np.linspace(0, 1, n_bins + 1)
bin_centers = (bins[:-1] + bins[1:]) / 2
p_bar = y_true.mean()
calibration = 0
resolution = 0
n = len(y_true)
for i in range(n_bins):
in_bin = (y_prob >= bins[i]) & (y_prob < bins[i+1])
if in_bin.sum() == 0:
continue
n_k = in_bin.sum()
o_k = y_true[in_bin].mean() # fraction of positives in bin
f_k = y_prob[in_bin].mean() # mean predicted prob in bin
calibration += n_k * (f_k - o_k)**2
resolution += n_k * (o_k - p_bar)**2
calibration /= n
resolution /= n
uncertainty = p_bar * (1 - p_bar)
print(f"Brier Decomposition:")
print(f" Calibration (lower=better): {calibration:.4f}")
print(f" Resolution (higher=better): {resolution:.4f}")
print(f" Uncertainty (baseline): {uncertainty:.4f}")
print(f" Brier Score = Cal - Res + Unc = {calibration - resolution + uncertainty:.4f}")
print(f" Direct Brier Score: {brier_score_loss(y_true, y_prob):.4f}")
rf_preds = rf_uncal.predict_proba(X_test)[:, 1]
brier_decomposition(y_test, rf_preds)
Part 6 - Aleatoric vs Epistemic Uncertainty
Understanding uncertainty types is critical for real-world ML deployment:
ALEATORIC UNCERTAINTY (irreducible):
→ Inherent randomness in the data
→ Cannot be reduced with more data
→ Example: rolling a fair die - we can never be certain of the outcome
→ Example: predicting a patient's exact blood pressure from demographics
(there's inherent biological variation)
→ In regression: captured by σ in N(μ(x), σ(x)²) - heteroscedastic models
EPISTEMIC UNCERTAINTY (reducible):
→ Uncertainty due to limited data / unknown parameters
→ Can be reduced with more training data
→ Example: predicting outcomes in an unusual region of input space
→ Example: edge cases / rare events in training data
→ In Bayesian models: width of the posterior P(θ|D)
The key question: "Am I uncertain because the world is noisy,
or because I haven't seen enough data?"
import numpy as np
import matplotlib.pyplot as plt
# Illustrate: aleatoric vs epistemic uncertainty
np.random.seed(42)
def true_function(x):
return np.sin(x) + 0.1 * x
# Sparse training data in [0, 3] and [7, 10], none in [4, 6]
x_train_1 = np.linspace(0, 3, 30)
x_train_2 = np.linspace(7, 10, 30)
x_train = np.concatenate([x_train_1, x_train_2])
y_train = true_function(x_train) + np.random.normal(0, 0.3, len(x_train))
x_test = np.linspace(0, 10, 200)
y_true = true_function(x_test)
# Simple uncertainty: predict with ensemble of models
# (epistemic uncertainty → ensemble members disagree in data-sparse regions)
from sklearn.linear_model import BayesianRidge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
# Bootstrap ensemble
n_ensemble = 20
ensemble_preds = np.zeros((n_ensemble, len(x_test)))
for i in range(n_ensemble):
boot_idx = np.random.choice(len(x_train), len(x_train), replace=True)
x_b = x_train[boot_idx]
y_b = y_train[boot_idx]
pipe = Pipeline([
('poly', PolynomialFeatures(degree=5)),
('ridge', BayesianRidge())
])
pipe.fit(x_b.reshape(-1, 1), y_b)
ensemble_preds[i] = pipe.predict(x_test.reshape(-1, 1))
mean_pred = ensemble_preds.mean(axis=0)
std_pred = ensemble_preds.std(axis=0) # epistemic uncertainty
fig, ax = plt.subplots(figsize=(14, 6))
ax.plot(x_test, y_true, 'k-', linewidth=2, label='True function', zorder=5)
ax.scatter(x_train, y_train, c='black', s=20, zorder=10, label='Training data')
ax.plot(x_test, mean_pred, 'b-', linewidth=2, label='Ensemble mean')
ax.fill_between(x_test, mean_pred - 2*std_pred, mean_pred + 2*std_pred,
alpha=0.3, color='blue', label='±2σ (epistemic uncertainty)')
# Mark data-sparse region (high epistemic uncertainty)
ax.axvspan(3.5, 6.5, alpha=0.1, color='red', label='No training data here')
ax.set_title('Epistemic Uncertainty: High in Data-Sparse Regions, Low Where Data Is Dense')
ax.legend()
ax.set_xlabel('x')
plt.tight_layout()
print("Uncertainty comparison:")
print(f" Mean std in [0,3] (data-rich): {std_pred[(x_test < 3)].mean():.4f}")
print(f" Mean std in [4,6] (data-poor): {std_pred[(x_test > 4) & (x_test < 6)].mean():.4f}")
print(f" Mean std in [7,10] (data-rich): {std_pred[(x_test > 7)].mean():.4f}")
Part 7 - Bayesian Deep Learning: Practical Approximations
Full Bayesian inference over neural network weights is intractable. In practice, two approximations dominate:
MC Dropout
Gal & Ghahramani (2016) showed that dropout at inference time approximates Bayesian inference:
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
class BayesianMLP(nn.Module):
"""
MLP with dropout at inference time for Bayesian uncertainty estimation.
(MC Dropout - approximates Bayesian inference)
"""
def __init__(self, input_dim: int, hidden_dim: int, output_dim: int,
dropout_rate: float = 0.2):
super().__init__()
self.net = nn.Sequential(
nn.Linear(input_dim, hidden_dim),
nn.ReLU(),
nn.Dropout(dropout_rate), # active at BOTH train and test
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Dropout(dropout_rate),
nn.Linear(hidden_dim, output_dim)
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.net(x)
def mc_dropout_predict(model: nn.Module, x: torch.Tensor,
n_samples: int = 50) -> tuple:
"""
Enable dropout at inference (train mode) for uncertainty estimation.
Returns: (mean_prediction, uncertainty_std)
"""
model.train() # keep dropout active
with torch.no_grad():
preds = torch.stack([model(x) for _ in range(n_samples)])
model.eval()
mean = preds.mean(dim=0)
uncertainty = preds.std(dim=0)
return mean, uncertainty
# Build and train on sparse 1D data
np.random.seed(42)
torch.manual_seed(42)
n_train = 100
x_data = torch.linspace(-3, 3, n_train).unsqueeze(1)
y_data = torch.sin(x_data) + 0.1 * torch.randn_like(x_data)
bayesian_model = BayesianMLP(input_dim=1, hidden_dim=64, output_dim=1, dropout_rate=0.2)
optimizer = torch.optim.Adam(bayesian_model.parameters(), lr=1e-3)
criterion = nn.MSELoss()
bayesian_model.train()
for epoch in range(500):
optimizer.zero_grad()
y_pred = bayesian_model(x_data)
loss = criterion(y_pred, y_data)
loss.backward()
optimizer.step()
# Predict on test range (including extrapolation)
x_test_mc = torch.linspace(-5, 5, 200).unsqueeze(1)
mean_pred, uncertainty = mc_dropout_predict(bayesian_model, x_test_mc, n_samples=100)
x_np = x_test_mc.numpy().flatten()
mean_np = mean_pred.numpy().flatten()
std_np = uncertainty.numpy().flatten()
fig, ax = plt.subplots(figsize=(14, 6))
ax.scatter(x_data.numpy(), y_data.numpy(), c='black', s=20, alpha=0.6, label='Train data', zorder=5)
ax.plot(x_np, mean_np, 'b-', linewidth=2, label='MC Dropout mean')
ax.fill_between(x_np, mean_np - 2*std_np, mean_np + 2*std_np,
alpha=0.3, label='±2σ (MC Dropout uncertainty)')
ax.axvspan(3.1, 5.0, alpha=0.1, color='red', label='Extrapolation region')
ax.set_title('MC Dropout: Uncertainty grows outside training region')
ax.legend()
plt.tight_layout()
Deep Ensembles
Training multiple independent models and averaging their predictions gives a practical and effective uncertainty estimate:
import torch
import torch.nn as nn
import numpy as np
from typing import List, Tuple
class NNEnsemble:
"""
Deep ensemble for uncertainty estimation.
Each model is trained from a different random initialisation.
Epistemic uncertainty = disagreement between ensemble members.
"""
def __init__(self, n_models: int = 5, hidden_dim: int = 64, dropout: float = 0.1):
self.n_models = n_models
self.hidden_dim = hidden_dim
self.models: List[nn.Module] = []
def _build_model(self, input_dim: int) -> nn.Module:
return nn.Sequential(
nn.Linear(input_dim, self.hidden_dim),
nn.ReLU(),
nn.Linear(self.hidden_dim, self.hidden_dim),
nn.ReLU(),
nn.Linear(self.hidden_dim, 2) # predict mean + log_variance
)
def fit(self, X: torch.Tensor, y: torch.Tensor, epochs: int = 300):
input_dim = X.shape[1]
self.models = []
for i in range(self.n_models):
torch.manual_seed(i * 42)
model = self._build_model(input_dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(epochs):
opt.zero_grad()
out = model(X)
mu = out[:, 0:1]
logv = out[:, 1:2]
# Gaussian NLL loss (learns both mean and variance)
loss = 0.5 * (logv + ((y - mu)**2) / logv.exp()).mean()
loss.backward()
opt.step()
model.eval()
self.models.append(model)
def predict(self, X: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
"""Returns: (mean_prediction, total_uncertainty_std)"""
means = []
vars_ = []
with torch.no_grad():
for model in self.models:
out = model(X)
means.append(out[:, 0])
vars_.append(out[:, 1].exp()) # aleatoric variance per model
means_t = torch.stack(means) # (n_models, n_samples)
vars_t = torch.stack(vars_) # (n_models, n_samples)
# Mixture of Gaussians approximation
ensemble_mean = means_t.mean(dim=0)
aleatoric_var = vars_t.mean(dim=0) # avg aleatoric
epistemic_var = means_t.var(dim=0) # disagreement
total_std = (aleatoric_var + epistemic_var).sqrt()
return ensemble_mean, total_std, aleatoric_var.sqrt(), epistemic_var.sqrt()
Part 8 - The KL Divergence and Entropy Connection
KL Divergence
Interpretation: how much information is lost when we use to approximate . Always , equals 0 iff .
Cross-Entropy = Entropy + KL Divergence
When minimising cross-entropy loss, = true label distribution (one-hot), = model's predicted probabilities. Since is fixed (the true labels don't change), minimising is equivalent to minimising :
import numpy as np
def kl_divergence(p: np.ndarray, q: np.ndarray, epsilon: float = 1e-10) -> float:
"""KL(P||Q) - KL divergence from P to Q."""
p = np.clip(p, epsilon, 1)
q = np.clip(q, epsilon, 1)
return np.sum(p * np.log(p / q))
def cross_entropy(p: np.ndarray, q: np.ndarray, epsilon: float = 1e-10) -> float:
"""H(P, Q) = -sum P * log Q"""
q = np.clip(q, epsilon, 1)
return -np.sum(p * np.log(q))
def entropy(p: np.ndarray, epsilon: float = 1e-10) -> float:
"""H(P) = -sum P * log P"""
p = np.clip(p, epsilon, 1)
return -np.sum(p * np.log(p))
# Verify: H(P, Q) = H(P) + KL(P||Q)
p_true = np.array([1.0, 0.0, 0.0]) # one-hot: class 0
q_model = np.array([0.7, 0.2, 0.1]) # model predictions
ce = cross_entropy(p_true, q_model)
h = entropy(p_true)
kl = kl_divergence(p_true, q_model)
print(f"H(P): {h:.4f}")
print(f"KL(P||Q): {kl:.4f}")
print(f"H(P,Q): {ce:.4f}")
print(f"H(P)+KL(P||Q):{h + kl:.4f}") # should equal H(P,Q)
# One-hot → H(P) = 0 → H(P,Q) = KL(P||Q) = -log q[true_class]
print(f"\nFor one-hot labels: H(P)=0, so H(P,Q) = KL(P||Q) = -log(q[true]) = {-np.log(0.7):.4f}")
Part 9 - Label Smoothing
Hard one-hot labels create overconfident models. Label smoothing prevents this by softening the target distribution:
import torch
import torch.nn.functional as F
def label_smoothing_loss(logits: torch.Tensor, targets: torch.Tensor,
smoothing: float = 0.1) -> torch.Tensor:
"""
Cross-entropy with label smoothing.
Prevents model from being overconfident (probabilities near 0 or 1).
Equivalent to regularising the KL divergence from a uniform distribution.
"""
n_classes = logits.shape[-1]
log_probs = F.log_softmax(logits, dim=-1)
# Hard targets: one-hot
nll_loss = F.nll_loss(log_probs, targets)
# Smooth targets: uniform over all classes
smooth_loss = -log_probs.mean(dim=-1).mean()
loss = (1 - smoothing) * nll_loss + smoothing * smooth_loss
return loss
# Compare standard vs label smoothing
logits = torch.randn(32, 10) # batch=32, classes=10
targets = torch.randint(0, 10, (32,))
ce_standard = F.cross_entropy(logits, targets)
ce_smoothed = label_smoothing_loss(logits, targets, smoothing=0.1)
print(f"Standard CE loss: {ce_standard.item():.4f}")
print(f"Label-smoothed CE loss: {ce_smoothed.item():.4f}")
# In PyTorch 1.10+, use built-in label smoothing
ce_builtin = F.cross_entropy(logits, targets, label_smoothing=0.1)
print(f"PyTorch built-in: {ce_builtin.item():.4f}")
YouTube Resources
| Video | Channel | Focus |
|---|---|---|
| MLE and MAP estimation | StatQuest | MLE intuition and derivation |
| Cross Entropy Loss | StatQuest | Cross-entropy as NLL |
| Probability Calibration | ritvikmath | Calibration curves |
| Bayesian Deep Learning | Yannic Kilcher | MC Dropout and uncertainty |
| KL Divergence | Intuitively Explained | KL divergence visual intuition |
Interview Questions
Q1: What probability distribution does minimising MSE implicitly assume? What are the consequences if that assumption is violated?
Minimising MSE is equivalent to MLE under a Gaussian noise assumption: , . Consequences when violated: (1) If errors are heavy-tailed (Laplace or t-distributed), MSE over-weights outliers - Huber loss or MAE would be more appropriate. (2) If errors are heteroscedastic (variance depends on ), MSE treats all samples equally but they shouldn't be - Weighted Least Squares or a model that predicts variance is better. (3) If the target is bounded (e.g., probabilities in [0,1]), MSE can produce predictions outside the valid range - use a model with appropriate output activation (sigmoid/softmax) and corresponding NLL loss.
Q2: What is the probabilistic interpretation of L2 regularisation?
L2 regularisation (Ridge) is MAP estimation with a Gaussian prior over parameters: . The MAP objective is: where (ratio of noise variance to prior variance). A smaller prior variance (stronger belief that parameters are near zero) corresponds to larger (stronger regularisation). Similarly, L1 regularisation corresponds to a Laplace prior, which has heavier tails and a sharper peak at zero - this is why L1 promotes sparsity: the Laplace distribution assigns high probability to weights being exactly zero.
Q3: Why are neural networks typically poorly calibrated, and how do you fix it?
Neural networks trained with cross-entropy on modern deep architectures (with batch normalisation, weight decay) tend to be overconfident - they produce probabilities near 0 or 1 more often than the empirical frequency justifies. The reasons: (1) Cross-entropy training encourages the model to be confident to reduce loss; (2) Batch normalisation and weight decay interact in ways that push logit magnitudes high; (3) Overfitting to training distribution affects probability scaling. Fixes: (1) Temperature scaling - divide logits by a temperature before softmax (single parameter, cheap); (2) Platt scaling - fit a logistic regression on top of model scores; (3) Isotonic regression - non-parametric, more flexible; (4) Label smoothing - prevents the model from being maximally confident during training; (5) MC Dropout / deep ensembles - produce better-calibrated uncertainty estimates.
Q4: What is the difference between aleatoric and epistemic uncertainty? Give an example of each in a medical ML context.
Aleatoric uncertainty is irreducible - it's inherent randomness in the data that cannot be resolved with more data. Epistemic uncertainty is reducible - it arises from limited data or model assumptions, and decreases as more relevant data is collected. In medical ML: aleatoric - two identical patients (same age, labs, vitals) may have different outcomes due to genetic variation, unobserved factors, or random biological processes; no amount of data resolves this. Epistemic - a rare disease with 50 training cases: the model is uncertain about rare presentations because it hasn't seen enough examples; gathering 5,000 cases would dramatically reduce this uncertainty. Models that distinguish both types (heteroscedastic NNs for aleatoric, ensembles/MC Dropout for epistemic) are more trustworthy for deployment, especially in safety-critical applications where knowing "I don't know" is crucial.
Q5: Derive why binary cross-entropy is the correct loss for logistic regression from a probabilistic perspective.
Logistic regression assumes where is the sigmoid function. The likelihood of the entire dataset under Bernoulli independence is:
Taking the negative log-likelihood (NLL):
This is exactly the binary cross-entropy loss. Minimising BCE = finding the MLE solution for the Bernoulli model. There's no "arbitrary" choice here - BCE is the theoretically correct objective given the Bernoulli assumption. Choosing MSE instead for classification would correspond to assuming Gaussian noise on binary outputs, which doesn't match the data-generating process and leads to worse probability estimates.
Q6: What is the Brier score and how does it differ from log loss for evaluating probabilistic predictions?
The Brier score is - the MSE between predicted probabilities and binary outcomes. Log loss is . Key differences: (1) Log loss is unbounded - a confident wrong prediction (predicts 0.99, true label is 0) gives loss while Brier gives . Log loss penalises confident mistakes much more severely. (2) Brier score is bounded [0, 1] and more interpretable. (3) Log loss has a direct probabilistic interpretation (NLL of Bernoulli), while Brier score is a proper scoring rule from a more general family. For model comparison: use log loss to penalise confident errors; use Brier score for overall probability accuracy. Both decompose into calibration and resolution components.
Q7: Explain label smoothing from a probabilistic perspective. When is it useful and when does it hurt?
Label smoothing replaces one-hot targets with . Probabilistically, it changes the target distribution from the empirical (true class = 100% probability) to a mixture: mostly the true class but with probability spread uniformly. This prevents the model from minimising CE by pushing logit margins to infinity (which it could with hard labels). Equivalently, label smoothing adds a KL divergence term from the model's distribution to a uniform distribution: . Useful when: the model tends to be overconfident, for knowledge distillation (teacher labels are already soft), for datasets with label noise. Hurts when: you use the model for calibration-sensitive applications (label smoothing systematically changes the temperature of predictions); for tasks requiring calibrated probabilities, apply temperature scaling after training instead.
Q8: How would you implement predictive uncertainty estimation for a production classification model without retraining from scratch?
Three practical approaches in order of implementation cost: (1) Temperature scaling: fit a single temperature parameter on validation data such that is well-calibrated. One parameter, no retraining, takes seconds. Best for reducing overconfidence without changing ranking. (2) MC Dropout: if the model has dropout layers, enable dropout at inference and run forward passes. Average predictions for the mean, use standard deviation as uncertainty. Works without retraining if dropout is already present; slightly increases inference latency by . (3) Deep ensemble: train models from different random seeds. Average their softmax outputs. High-quality uncertainty estimates, but requires training compute and serving infrastructure. For production, I'd implement (1) immediately (calibration fix), then evaluate if (3) is worth the cost given the latency/cost constraints.
:::tip Role-Specific Angles MLE Interview: MLE derivation for regression/classification, connection between loss functions and distributions, MAP = MLE + prior = regularisation, calibration and Brier score Research Engineer: Aleatoric vs epistemic uncertainty, MC Dropout theory, deep ensembles, label smoothing, KL divergence decomposition Data Scientist: Probability calibration, reliability diagrams, calibration curves, Platt vs isotonic scaling AI Engineer: Label smoothing implementation in PyTorch, uncertainty quantification in production APIs, confidence thresholds for rejection :::
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Probability Distributions demo on the EngineersOfAI Playground - no code required.
:::
