Probabilistic View of Machine Learning

Reading time: ~40 minutes | Level: ML Foundations | Role: MLE, Research Engineer, Data Scientist, AI Engineer

A senior interviewer asks you: "What is cross-entropy loss and why do we use it?" Most candidates answer: "It penalises confident wrong predictions more than unconfident ones." That's a start. But the interviewer probes deeper: "What probability model are you implicitly assuming when you minimise cross-entropy? And what happens to your model's behaviour if that assumption is violated?"

This is the probabilistic view of ML. Every loss function encodes a probabilistic assumption about how data was generated. Every optimizer is secretly maximising a likelihood. Every regulariser encodes a prior over parameters. When you know this framework, you understand why techniques work - and when to expect them to fail.

What You Will Learn

The generative model: how probabilistic thinking frames ML as density estimation
Maximum Likelihood Estimation (MLE) - derivation for regression and classification
Why minimising MSE ≡ maximising Gaussian likelihood
Why minimising cross-entropy ≡ maximising Bernoulli/Categorical likelihood
Maximum A Posteriori (MAP) estimation - MLE + prior
Why L2 regularisation ≡ Gaussian prior, L1 ≡ Laplace prior
Probability calibration - the gap between predicted probabilities and empirical frequencies
Bayesian reasoning in deep learning - MC Dropout, deep ensembles
Uncertainty types: aleatoric vs epistemic
Eight interview Q&As at senior research/MLE level

Part 1 - The Probabilistic Framework

From Patterns to Distributions

Traditional ML framing: "Find a function $f: X \to Y$ that minimises training error."

Probabilistic framing: "Find parameters $\theta$ such that the model distribution $P_\theta(Y | X)$ best explains the observed data."

Generative view of supervised learning:

1. Nature draws a sample: (x, y) from true distribution P*(x, y)
2. You observe n samples: D = {(x₁,y₁), ..., (xₙ,yₙ)}
3. You posit a parametric model: P_θ(y|x)
4. Goal: find θ that makes D most probable under your model

This is Maximum Likelihood Estimation (MLE).

Why this framework is powerful:

It makes assumptions explicit (what distribution are you assuming?)
It gives you confidence estimates, not just point predictions
It connects every loss function to a probability model
It tells you what regularisation means (prior beliefs)
It quantifies uncertainty in predictions

Part 2 - Maximum Likelihood Estimation (MLE)

The Core Idea

Given observed data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ , and a parametric model $P_\theta(y | x)$ , MLE finds:

$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \prod_{i=1}^n P_\theta(y_i | x_i)$

We take the log (log-likelihood is easier to optimise - sums instead of products):

$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \sum_{i=1}^n \log P_\theta(y_i | x_i)$

Minimising the negative log-likelihood (NLL) is the standard training objective.

Case 1: Regression - Gaussian Likelihood → MSE

Assumption: $y = f_\theta(x) + \varepsilon$ where $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ .

Then $P_\theta(y | x) = \mathcal{N}(y; f_\theta(x), \sigma^2)$ .

$\log P_\theta(y_i | x_i) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(y_i - f_\theta(x_i))^2}{2\sigma^2}$

Summing over $n$ samples and removing constants with respect to $\theta$ :

$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \sum_{i=1}^n \log P_\theta(y_i | x_i) = \arg\min_\theta \sum_{i=1}^n (y_i - f_\theta(x_i))^2$

$\boxed{\text{MLE with Gaussian likelihood} \equiv \text{Minimising MSE}}$

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

# Demonstrate: MLE with Gaussian assumption = MSE minimisation
np.random.seed(42)
x = np.linspace(0, 10, 50)
y_true = 2 * x + 1
y_obs  = y_true + np.random.normal(0, 2, 50)  # σ=2

# Method 1: Minimise MSE (standard linear regression)
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x.reshape(-1, 1), y_obs)
mse_predictions = lr.predict(x.reshape(-1, 1))

# Method 2: Maximise Gaussian log-likelihood (same result)
sigma = np.std(y_obs - mse_predictions)
log_likelihoods = norm.logpdf(y_obs, loc=mse_predictions, scale=sigma)
total_ll = log_likelihoods.sum()

print(f"MSE regression: slope={lr.coef_[0]:.4f}, intercept={lr.intercept_:.4f}")
print(f"Total log-likelihood at MLE solution: {total_ll:.4f}")
print(f"NLL = -log_likelihood = {-total_ll:.4f}")
print(f"Scaled NLL = {-total_ll/50:.4f} ≈ MSE / (2σ²) + const")

# Visualise the Gaussian assumption
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
ax1.scatter(x, y_obs, alpha=0.5, label='Observed')
ax1.plot(x, mse_predictions, 'r-', label='MLE/MSE fit')
ax1.set_title('Gaussian Regression: y ~ N(f(x), σ²)')
ax1.legend()

ax2.hist(y_obs - mse_predictions, bins=20, density=True, alpha=0.7, label='Residuals')
x_range = np.linspace(-8, 8, 200)
ax2.plot(x_range, norm.pdf(x_range, 0, sigma), 'r-', label=f'N(0, {sigma:.2f}²)')
ax2.set_title('Residual Distribution (should be Gaussian for MLE=MSE)')
ax2.legend()
plt.tight_layout()

Case 2: Binary Classification - Bernoulli Likelihood → Cross-Entropy

Assumption: $y_i | x_i \sim \text{Bernoulli}(p_\theta(x_i))$ where $p_\theta(x_i) = \sigma(f_\theta(x_i))$ (sigmoid output).

$P_\theta(y_i | x_i) = p_\theta(x_i)^{y_i} (1 - p_\theta(x_i))^{1 - y_i}$

Taking the negative log-likelihood:

$-\log P_\theta(y_i | x_i) = -y_i \log p_\theta(x_i) - (1 - y_i) \log(1 - p_\theta(x_i))$

Summing over all samples:

$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^n \left[y_i \log \hat{p}_i + (1-y_i) \log(1 - \hat{p}_i)\right]$

$\boxed{\text{MLE with Bernoulli likelihood} \equiv \text{Binary Cross-Entropy Loss}}$

import numpy as np
from scipy.special import expit  # sigmoid

# Demonstrate: cross-entropy IS negative log-likelihood for Bernoulli model
np.random.seed(42)
n = 1000

# True data generating process
X_clf = np.random.randn(n, 2)
true_logits = 2 * X_clf[:, 0] - 1.5 * X_clf[:, 1]
true_probs  = expit(true_logits)
y_clf       = np.random.binomial(1, true_probs)

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

model = LogisticRegression(C=1e6, max_iter=500)  # large C ≈ no regularisation = pure MLE
model.fit(X_clf, y_clf)
p_hat = model.predict_proba(X_clf)[:, 1]

# Cross-entropy (sklearn)
ce_sklearn = log_loss(y_clf, p_hat)

# Manual NLL
nll_manual = -np.mean(
    y_clf * np.log(p_hat + 1e-15) + (1 - y_clf) * np.log(1 - p_hat + 1e-15)
)

print(f"sklearn log_loss:  {ce_sklearn:.6f}")
print(f"Manual NLL:        {nll_manual:.6f}")
print(f"Difference:        {abs(ce_sklearn - nll_manual):.2e} (numerical precision)")

Case 3: Multi-Class - Categorical Likelihood → Softmax Cross-Entropy

$P_\theta(y = k | x) = \frac{e^{f_\theta(x)_k}}{\sum_{j=1}^K e^{f_\theta(x)_j}}$

$-\log P_\theta(y_i = k | x_i) = -\log \hat{p}_{i,k} = \text{cross-entropy loss}$

import torch
import torch.nn.functional as F

# Softmax cross-entropy = NLL of Categorical distribution
logits = torch.tensor([[2.0, 1.0, -0.5],
                        [0.5, 3.0,  0.1],
                        [-1.0, 0.5, 2.5]])  # (batch=3, classes=3)
labels = torch.tensor([0, 1, 2])           # true classes

# Method 1: PyTorch cross-entropy
ce_loss = F.cross_entropy(logits, labels)

# Method 2: Manual NLL
probs = F.softmax(logits, dim=-1)
log_probs = torch.log(probs)
nll_manual = -log_probs[torch.arange(3), labels].mean()

print(f"F.cross_entropy: {ce_loss.item():.6f}")
print(f"Manual NLL:      {nll_manual.item():.6f}")

# Demonstrate the assumed distribution
print("\nPredicted probabilities (Categorical distribution parameters):")
for i in range(3):
    p = probs[i].detach().numpy()
    print(f"  Sample {i}: P(C=0)={p[0]:.3f}, P(C=1)={p[1]:.3f}, P(C=2)={p[2]:.3f}")
    print(f"           True class: {labels[i].item()}, NLL: {-torch.log(probs[i][labels[i]]).item():.4f}")

Part 3 - Loss Functions and Their Implied Distributions

Loss Function          │ Distribution Assumption         │ Optimal Prediction
───────────────────────┼─────────────────────────────────┼───────────────────
MSE (L2 loss)          │ Gaussian noise (ε ~ N(0,σ²))   │ Conditional mean E[y|x]
MAE (L1 loss)          │ Laplace noise (ε ~ Laplace)     │ Conditional median
Huber loss             │ Mix Gaussian + Laplace          │ Robust conditional mean
Binary cross-entropy   │ Bernoulli (y ∈ {0,1})           │ P(y=1|x)
Softmax cross-entropy  │ Categorical (y ∈ {1..K})        │ P(y=k|x)
KL divergence          │ Any distribution P vs Q         │ P = Q (perfect match)
Poisson NLL            │ Poisson (count data y ∈ ℕ)      │ Rate parameter λ(x)
Beta NLL               │ Beta (y ∈ [0,1])                │ Mean of Beta(α,β)
Ordinal cross-entropy  │ Ordered categorical             │ Cumulative probabilities

When the Assumed Distribution Doesn't Match

import numpy as np
from scipy.stats import poisson
import matplotlib.pyplot as plt

"""
Example: count data (bookings per hour) is Poisson-distributed.
If you use MSE loss (Gaussian assumption), predictions may be negative
and the loss weights large counts disproportionately.
Poisson NLL handles this correctly.
"""
np.random.seed(42)
# True rate is time-dependent
times = np.linspace(1, 24, 1000)  # hours of day
true_rate = 5 + 20 * np.exp(-0.5 * ((times - 12) / 3)**2)  # lunch peak
y_counts = np.array([poisson.rvs(lam) for lam in true_rate])

from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
import warnings
warnings.filterwarnings('ignore')

X_times = times.reshape(-1, 1)

# Wrong: Gaussian assumption (standard linear regression)
from sklearn.linear_model import LinearRegression
lr = LinearRegression().fit(X_times, y_counts)
y_pred_lr = lr.predict(X_times)

# Right: Poisson assumption
pr = PoissonRegressor(alpha=0.01).fit(X_times, y_counts)
y_pred_pr = pr.predict(X_times)

print("Count data prediction comparison:")
print(f"  Linear Reg (Gaussian) MAE: {mean_absolute_error(y_counts, y_pred_lr):.3f}")
print(f"  Poisson Reg           MAE: {mean_absolute_error(y_counts, y_pred_pr):.3f}")
print(f"  Linear Reg negative predictions: {(y_pred_lr < 0).sum()} ({(y_pred_lr<0).mean()*100:.1f}%)")
print(f"  Poisson Reg min prediction:      {y_pred_pr.min():.4f} (always > 0)")

Part 4 - MAP Estimation: Adding a Prior

MLE finds parameters that maximise the likelihood. But if you have prior beliefs about what parameters should look like, you can incorporate them through a prior distribution $P(\theta)$ .

Maximum A Posteriori (MAP):

$\hat{\theta}_{\text{MAP}} = \arg\max_\theta P(\theta | \mathcal{D}) = \arg\max_\theta \underbrace{P(\mathcal{D} | \theta)}_{\text{likelihood}} \cdot \underbrace{P(\theta)}_{\text{prior}}$

$= \arg\max_\theta \left[\log P(\mathcal{D} | \theta) + \log P(\theta)\right]$

$= \arg\max_\theta \left[\text{log-likelihood} + \text{log-prior}\right]$

$\boxed{\text{MAP} \equiv \text{MLE} + \text{regularisation} \quad (\text{log-prior} = \text{regulariser})}$

L2 Regularisation ≡ Gaussian Prior

A Gaussian prior $P(\theta) = \mathcal{N}(0, \tau^2)$ gives:

$\log P(\theta) = -\frac{1}{2\tau^2} \|\theta\|_2^2 + \text{const}$

$\hat{\theta}_{\text{MAP}} = \arg\min_\theta \left[\text{NLL} + \frac{\lambda}{2} \|\theta\|_2^2\right] \quad \text{where } \lambda = \frac{\sigma^2}{\tau^2}$

This is exactly Ridge regression / L2 regularisation!

L1 Regularisation ≡ Laplace Prior

A Laplace prior $P(\theta) = \text{Laplace}(0, b)$ gives:

$\log P(\theta) = -\frac{|\theta|}{b} + \text{const}$

$\hat{\theta}_{\text{MAP}} = \arg\min_\theta \left[\text{NLL} + \lambda \|\theta\|_1\right]$

This is exactly Lasso / L1 regularisation! The Laplace prior has heavier tails than Gaussian, which corresponds to the sparsity-inducing property of L1 - it strongly prefers most weights to be exactly zero.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, laplace
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error

# Visualise Gaussian vs Laplace prior
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

theta_range = np.linspace(-5, 5, 1000)
for sigma in [0.5, 1.0, 2.0]:
    axes[0].plot(theta_range, norm.pdf(theta_range, 0, sigma),
                 label=f'N(0, {sigma}²)')
for b in [0.5, 1.0, 2.0]:
    axes[1].plot(theta_range, laplace.pdf(theta_range, 0, b),
                 label=f'Laplace(0, {b})')

axes[0].set_title('Gaussian Prior → L2/Ridge Regularisation\n(smooth - large weights discouraged)')
axes[1].set_title('Laplace Prior → L1/Lasso Regularisation\n(sharp peak - promotes sparsity/zero weights)')
for ax in axes:
    ax.legend()
    ax.set_xlabel('θ (parameter value)')
    ax.set_ylabel('Prior density P(θ)')
    ax.axvline(0, color='gray', linestyle=':', alpha=0.7)

plt.tight_layout()

# Demonstrate regularisation as MAP
np.random.seed(42)
n, p = 100, 50  # more features than effective signal
X, y, coef = make_regression(n_samples=n, n_features=p, n_informative=10,
                               noise=15, coef=True, random_state=42)

models = {
    'MLE (no reg)':    LinearRegression(),
    'MAP Gaussian (Ridge, λ=1)':  Ridge(alpha=1.0),
    'MAP Gaussian (Ridge, λ=10)': Ridge(alpha=10.0),
    'MAP Laplace (Lasso, λ=0.1)': Lasso(alpha=0.1),
    'MAP Laplace (Lasso, λ=1)':   Lasso(alpha=1.0),
}

train_size = 80
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]

print(f"\n{'Model':<35} {'Train RMSE':>12} {'Test RMSE':>12} {'Non-zero coefs':>15}")
print("-" * 80)
for name, model in models.items():
    model.fit(X_train, y_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    test_rmse  = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    nnz = np.sum(np.abs(model.coef_) > 1e-6)
    print(f"{name:<35} {train_rmse:>12.4f} {test_rmse:>12.4f} {nnz:>15}")

The Bayesian Posterior

MLE and MAP both give point estimates. Full Bayesian inference gives the entire posterior distribution over parameters:

$P(\theta | \mathcal{D}) = \frac{P(\mathcal{D} | \theta) P(\theta)}{P(\mathcal{D})}$

The normalising constant $P(\mathcal{D}) = \int P(\mathcal{D}|\theta)P(\theta)d\theta$ is the model evidence - intractable for deep neural networks, motivating approximate inference methods (variational inference, Monte Carlo).

Part 5 - Probability Calibration

A model's predicted probability $\hat{p}$ should match the empirical frequency of the event. If your classifier says 80% probability, it should be correct 80% of the time among all predictions with that score.

Well-calibrated model:
  For all p ∈ [0,1]:  P(Y=1 | ŷ = p) = p
  "When I say 70%, it happens 70% of the time"

Overconfident model:
  Predicts p=0.9 when actual frequency is 0.7
  "When I say 90%, only 70% is right"
  → Neural networks are typically overconfident

Underconfident model:
  Predicts p=0.6 when actual frequency is 0.8
  → Random forests tend to be underconfident

Reliability Diagrams

import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import log_loss, brier_score_loss

X, y = make_classification(n_samples=3000, n_features=20,
                            n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

models = {
    'Logistic Regression': LogisticRegression(max_iter=500),
    'Random Forest':       RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM (uncalibrated)':  SVC(probability=True),
    'Gradient Boosting':   GradientBoostingClassifier(n_estimators=100, random_state=42),
}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
ax1.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')

for name, model in models.items():
    model.fit(X_train, y_train)
    p_pred = model.predict_proba(X_test)[:, 1]

    fraction_pos, mean_pred = calibration_curve(y_test, p_pred, n_bins=10)

    brier = brier_score_loss(y_test, p_pred)
    logloss = log_loss(y_test, p_pred)

    ax1.plot(mean_pred, fraction_pos, 's-', label=f'{name} (Brier={brier:.3f})')
    print(f"{name}:")
    print(f"  Log Loss: {logloss:.4f}")
    print(f"  Brier Score: {brier:.4f}")

ax1.set_xlabel('Mean Predicted Probability')
ax1.set_ylabel('Fraction of Positives')
ax1.set_title('Calibration Curves (Reliability Diagram)')
ax1.legend(loc='upper left')

# Show calibration improvement
rf_uncal = RandomForestClassifier(n_estimators=100, random_state=42)
rf_cal   = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=42),
    method='isotonic', cv=3
)

rf_uncal.fit(X_train, y_train)
rf_cal.fit(X_train, y_train)

for name, clf in [('RF Uncalibrated', rf_uncal), ('RF Calibrated', rf_cal)]:
    p = clf.predict_proba(X_test)[:, 1]
    frac, mean = calibration_curve(y_test, p, n_bins=10)
    ax2.plot(mean, frac, 's-', label=name)

ax2.plot([0, 1], [0, 1], 'k--', label='Perfect')
ax2.set_xlabel('Mean Predicted Probability')
ax2.set_ylabel('Fraction of Positives')
ax2.set_title('Before vs After Calibration (Random Forest)')
ax2.legend()
plt.tight_layout()

Calibration Methods

from sklearn.calibration import CalibratedClassifierCV

# 1. Platt Scaling (sigmoid calibration)
# Fits a logistic regression on top of raw scores
# Works well when calibration error is systematic (monotone)
platt_calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(n_estimators=100),
    method='sigmoid',  # Platt scaling
    cv=5
)
platt_calibrated.fit(X_train, y_train)

# 2. Isotonic Regression
# Non-parametric, more flexible than Platt
# Risk of overfitting with small datasets
isotonic_calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(n_estimators=100),
    method='isotonic',  # Isotonic regression
    cv=5
)
isotonic_calibrated.fit(X_train, y_train)

for name, clf in [('Platt', platt_calibrated), ('Isotonic', isotonic_calibrated)]:
    p = clf.predict_proba(X_test)[:, 1]
    print(f"{name} calibration - Brier: {brier_score_loss(y_test, p):.4f}, "
          f"LogLoss: {log_loss(y_test, p):.4f}")

Brier Score - The Calibration Metric

$\text{Brier} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{p}_i)^2$

The Brier score is like MSE for probabilities. A perfect model has Brier = 0. A model that always predicts 0.5 has Brier = 0.25. It decomposes into calibration and refinement:

$\text{Brier} = \underbrace{\text{Calibration}}_{\text{Bias in probabilities}} + \underbrace{\text{Resolution}}_{\text{Spread of predictions}} - \underbrace{\text{Uncertainty}}_{\text{Inherent difficulty}}$

from sklearn.metrics import brier_score_loss

# Brier score decomposition
def brier_decomposition(y_true, y_prob, n_bins=10):
    """Decompose Brier score into calibration + resolution - uncertainty."""
    bins = np.linspace(0, 1, n_bins + 1)
    bin_centers = (bins[:-1] + bins[1:]) / 2
    p_bar = y_true.mean()

    calibration = 0
    resolution  = 0
    n = len(y_true)

    for i in range(n_bins):
        in_bin = (y_prob >= bins[i]) & (y_prob < bins[i+1])
        if in_bin.sum() == 0:
            continue
        n_k   = in_bin.sum()
        o_k   = y_true[in_bin].mean()  # fraction of positives in bin
        f_k   = y_prob[in_bin].mean()  # mean predicted prob in bin

        calibration += n_k * (f_k - o_k)**2
        resolution  += n_k * (o_k - p_bar)**2

    calibration /= n
    resolution  /= n
    uncertainty  = p_bar * (1 - p_bar)

    print(f"Brier Decomposition:")
    print(f"  Calibration (lower=better):  {calibration:.4f}")
    print(f"  Resolution  (higher=better): {resolution:.4f}")
    print(f"  Uncertainty (baseline):      {uncertainty:.4f}")
    print(f"  Brier Score = Cal - Res + Unc = {calibration - resolution + uncertainty:.4f}")
    print(f"  Direct Brier Score:          {brier_score_loss(y_true, y_prob):.4f}")

rf_preds = rf_uncal.predict_proba(X_test)[:, 1]
brier_decomposition(y_test, rf_preds)

Part 6 - Aleatoric vs Epistemic Uncertainty

Understanding uncertainty types is critical for real-world ML deployment:

ALEATORIC UNCERTAINTY (irreducible):
  → Inherent randomness in the data
  → Cannot be reduced with more data
  → Example: rolling a fair die - we can never be certain of the outcome
  → Example: predicting a patient's exact blood pressure from demographics
    (there's inherent biological variation)
  → In regression: captured by σ in N(μ(x), σ(x)²) - heteroscedastic models

EPISTEMIC UNCERTAINTY (reducible):
  → Uncertainty due to limited data / unknown parameters
  → Can be reduced with more training data
  → Example: predicting outcomes in an unusual region of input space
  → Example: edge cases / rare events in training data
  → In Bayesian models: width of the posterior P(θ|D)

The key question: "Am I uncertain because the world is noisy,
                   or because I haven't seen enough data?"

import numpy as np
import matplotlib.pyplot as plt

# Illustrate: aleatoric vs epistemic uncertainty
np.random.seed(42)

def true_function(x):
    return np.sin(x) + 0.1 * x

# Sparse training data in [0, 3] and [7, 10], none in [4, 6]
x_train_1 = np.linspace(0, 3, 30)
x_train_2 = np.linspace(7, 10, 30)
x_train   = np.concatenate([x_train_1, x_train_2])
y_train   = true_function(x_train) + np.random.normal(0, 0.3, len(x_train))

x_test = np.linspace(0, 10, 200)
y_true = true_function(x_test)

# Simple uncertainty: predict with ensemble of models
# (epistemic uncertainty → ensemble members disagree in data-sparse regions)
from sklearn.linear_model import BayesianRidge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Bootstrap ensemble
n_ensemble = 20
ensemble_preds = np.zeros((n_ensemble, len(x_test)))

for i in range(n_ensemble):
    boot_idx = np.random.choice(len(x_train), len(x_train), replace=True)
    x_b = x_train[boot_idx]
    y_b = y_train[boot_idx]
    pipe = Pipeline([
        ('poly', PolynomialFeatures(degree=5)),
        ('ridge', BayesianRidge())
    ])
    pipe.fit(x_b.reshape(-1, 1), y_b)
    ensemble_preds[i] = pipe.predict(x_test.reshape(-1, 1))

mean_pred = ensemble_preds.mean(axis=0)
std_pred  = ensemble_preds.std(axis=0)  # epistemic uncertainty

fig, ax = plt.subplots(figsize=(14, 6))
ax.plot(x_test, y_true, 'k-', linewidth=2, label='True function', zorder=5)
ax.scatter(x_train, y_train, c='black', s=20, zorder=10, label='Training data')
ax.plot(x_test, mean_pred, 'b-', linewidth=2, label='Ensemble mean')
ax.fill_between(x_test, mean_pred - 2*std_pred, mean_pred + 2*std_pred,
                alpha=0.3, color='blue', label='±2σ (epistemic uncertainty)')

# Mark data-sparse region (high epistemic uncertainty)
ax.axvspan(3.5, 6.5, alpha=0.1, color='red', label='No training data here')
ax.set_title('Epistemic Uncertainty: High in Data-Sparse Regions, Low Where Data Is Dense')
ax.legend()
ax.set_xlabel('x')
plt.tight_layout()

print("Uncertainty comparison:")
print(f"  Mean std in [0,3] (data-rich):   {std_pred[(x_test < 3)].mean():.4f}")
print(f"  Mean std in [4,6] (data-poor):   {std_pred[(x_test > 4) & (x_test < 6)].mean():.4f}")
print(f"  Mean std in [7,10] (data-rich):  {std_pred[(x_test > 7)].mean():.4f}")

Part 7 - Bayesian Deep Learning: Practical Approximations

Full Bayesian inference over neural network weights is intractable. In practice, two approximations dominate:

MC Dropout

Gal & Ghahramani (2016) showed that dropout at inference time approximates Bayesian inference:

import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

class BayesianMLP(nn.Module):
    """
    MLP with dropout at inference time for Bayesian uncertainty estimation.
    (MC Dropout - approximates Bayesian inference)
    """
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int,
                 dropout_rate: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout_rate),      # active at BOTH train and test
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def mc_dropout_predict(model: nn.Module, x: torch.Tensor,
                        n_samples: int = 50) -> tuple:
    """
    Enable dropout at inference (train mode) for uncertainty estimation.
    Returns: (mean_prediction, uncertainty_std)
    """
    model.train()  # keep dropout active
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    model.eval()

    mean = preds.mean(dim=0)
    uncertainty = preds.std(dim=0)
    return mean, uncertainty

# Build and train on sparse 1D data
np.random.seed(42)
torch.manual_seed(42)

n_train = 100
x_data  = torch.linspace(-3, 3, n_train).unsqueeze(1)
y_data  = torch.sin(x_data) + 0.1 * torch.randn_like(x_data)

bayesian_model = BayesianMLP(input_dim=1, hidden_dim=64, output_dim=1, dropout_rate=0.2)
optimizer = torch.optim.Adam(bayesian_model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

bayesian_model.train()
for epoch in range(500):
    optimizer.zero_grad()
    y_pred = bayesian_model(x_data)
    loss   = criterion(y_pred, y_data)
    loss.backward()
    optimizer.step()

# Predict on test range (including extrapolation)
x_test_mc = torch.linspace(-5, 5, 200).unsqueeze(1)
mean_pred, uncertainty = mc_dropout_predict(bayesian_model, x_test_mc, n_samples=100)

x_np   = x_test_mc.numpy().flatten()
mean_np = mean_pred.numpy().flatten()
std_np  = uncertainty.numpy().flatten()

fig, ax = plt.subplots(figsize=(14, 6))
ax.scatter(x_data.numpy(), y_data.numpy(), c='black', s=20, alpha=0.6, label='Train data', zorder=5)
ax.plot(x_np, mean_np, 'b-', linewidth=2, label='MC Dropout mean')
ax.fill_between(x_np, mean_np - 2*std_np, mean_np + 2*std_np,
                alpha=0.3, label='±2σ (MC Dropout uncertainty)')
ax.axvspan(3.1, 5.0, alpha=0.1, color='red', label='Extrapolation region')
ax.set_title('MC Dropout: Uncertainty grows outside training region')
ax.legend()
plt.tight_layout()

Deep Ensembles

Training multiple independent models and averaging their predictions gives a practical and effective uncertainty estimate:

import torch
import torch.nn as nn
import numpy as np
from typing import List, Tuple

class NNEnsemble:
    """
    Deep ensemble for uncertainty estimation.
    Each model is trained from a different random initialisation.
    Epistemic uncertainty = disagreement between ensemble members.
    """

    def __init__(self, n_models: int = 5, hidden_dim: int = 64, dropout: float = 0.1):
        self.n_models   = n_models
        self.hidden_dim = hidden_dim
        self.models: List[nn.Module] = []

    def _build_model(self, input_dim: int) -> nn.Module:
        return nn.Sequential(
            nn.Linear(input_dim, self.hidden_dim),
            nn.ReLU(),
            nn.Linear(self.hidden_dim, self.hidden_dim),
            nn.ReLU(),
            nn.Linear(self.hidden_dim, 2)  # predict mean + log_variance
        )

    def fit(self, X: torch.Tensor, y: torch.Tensor, epochs: int = 300):
        input_dim = X.shape[1]
        self.models = []

        for i in range(self.n_models):
            torch.manual_seed(i * 42)
            model = self._build_model(input_dim)
            opt   = torch.optim.Adam(model.parameters(), lr=1e-3)

            for _ in range(epochs):
                opt.zero_grad()
                out  = model(X)
                mu   = out[:, 0:1]
                logv = out[:, 1:2]
                # Gaussian NLL loss (learns both mean and variance)
                loss = 0.5 * (logv + ((y - mu)**2) / logv.exp()).mean()
                loss.backward()
                opt.step()

            model.eval()
            self.models.append(model)

    def predict(self, X: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """Returns: (mean_prediction, total_uncertainty_std)"""
        means = []
        vars_ = []

        with torch.no_grad():
            for model in self.models:
                out = model(X)
                means.append(out[:, 0])
                vars_.append(out[:, 1].exp())  # aleatoric variance per model

        means_t = torch.stack(means)  # (n_models, n_samples)
        vars_t  = torch.stack(vars_)  # (n_models, n_samples)

        # Mixture of Gaussians approximation
        ensemble_mean = means_t.mean(dim=0)
        aleatoric_var = vars_t.mean(dim=0)                       # avg aleatoric
        epistemic_var = means_t.var(dim=0)                        # disagreement
        total_std     = (aleatoric_var + epistemic_var).sqrt()

        return ensemble_mean, total_std, aleatoric_var.sqrt(), epistemic_var.sqrt()

Part 8 - The KL Divergence and Entropy Connection

KL Divergence

$D_{\text{KL}}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} = \mathbb{E}_P\left[\log \frac{P(X)}{Q(X)}\right]$

Interpretation: how much information is lost when we use $Q$ to approximate $P$ . Always $\geq 0$ , equals 0 iff $P = Q$ .

Cross-Entropy = Entropy + KL Divergence

$H(P, Q) = H(P) + D_{\text{KL}}(P \| Q)$

$H(P, Q) = -\sum_x P(x) \log Q(x)$

When minimising cross-entropy loss, $P$ = true label distribution (one-hot), $Q$ = model's predicted probabilities. Since $H(P)$ is fixed (the true labels don't change), minimising $H(P, Q)$ is equivalent to minimising $D_{\text{KL}}(P \| Q)$ :

$\boxed{\text{Minimising cross-entropy} \equiv \text{Minimising KL from true distribution to model}}$

import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, epsilon: float = 1e-10) -> float:
    """KL(P||Q) - KL divergence from P to Q."""
    p = np.clip(p, epsilon, 1)
    q = np.clip(q, epsilon, 1)
    return np.sum(p * np.log(p / q))

def cross_entropy(p: np.ndarray, q: np.ndarray, epsilon: float = 1e-10) -> float:
    """H(P, Q) = -sum P * log Q"""
    q = np.clip(q, epsilon, 1)
    return -np.sum(p * np.log(q))

def entropy(p: np.ndarray, epsilon: float = 1e-10) -> float:
    """H(P) = -sum P * log P"""
    p = np.clip(p, epsilon, 1)
    return -np.sum(p * np.log(p))

# Verify: H(P, Q) = H(P) + KL(P||Q)
p_true = np.array([1.0, 0.0, 0.0])   # one-hot: class 0
q_model = np.array([0.7, 0.2, 0.1])  # model predictions

ce  = cross_entropy(p_true, q_model)
h   = entropy(p_true)
kl  = kl_divergence(p_true, q_model)

print(f"H(P):        {h:.4f}")
print(f"KL(P||Q):    {kl:.4f}")
print(f"H(P,Q):      {ce:.4f}")
print(f"H(P)+KL(P||Q):{h + kl:.4f}")  # should equal H(P,Q)

# One-hot → H(P) = 0 → H(P,Q) = KL(P||Q) = -log q[true_class]
print(f"\nFor one-hot labels: H(P)=0, so H(P,Q) = KL(P||Q) = -log(q[true]) = {-np.log(0.7):.4f}")

Part 9 - Label Smoothing

Hard one-hot labels create overconfident models. Label smoothing prevents this by softening the target distribution:

$y_{\text{smooth}} = (1 - \epsilon) \cdot y_{\text{one-hot}} + \frac{\epsilon}{K}$

import torch
import torch.nn.functional as F

def label_smoothing_loss(logits: torch.Tensor, targets: torch.Tensor,
                          smoothing: float = 0.1) -> torch.Tensor:
    """
    Cross-entropy with label smoothing.
    Prevents model from being overconfident (probabilities near 0 or 1).
    Equivalent to regularising the KL divergence from a uniform distribution.
    """
    n_classes = logits.shape[-1]
    log_probs  = F.log_softmax(logits, dim=-1)

    # Hard targets: one-hot
    nll_loss = F.nll_loss(log_probs, targets)

    # Smooth targets: uniform over all classes
    smooth_loss = -log_probs.mean(dim=-1).mean()

    loss = (1 - smoothing) * nll_loss + smoothing * smooth_loss
    return loss

# Compare standard vs label smoothing
logits = torch.randn(32, 10)    # batch=32, classes=10
targets = torch.randint(0, 10, (32,))

ce_standard = F.cross_entropy(logits, targets)
ce_smoothed = label_smoothing_loss(logits, targets, smoothing=0.1)

print(f"Standard CE loss:       {ce_standard.item():.4f}")
print(f"Label-smoothed CE loss: {ce_smoothed.item():.4f}")

# In PyTorch 1.10+, use built-in label smoothing
ce_builtin = F.cross_entropy(logits, targets, label_smoothing=0.1)
print(f"PyTorch built-in:       {ce_builtin.item():.4f}")

YouTube Resources

Video	Channel	Focus
MLE and MAP estimation	StatQuest	MLE intuition and derivation
Cross Entropy Loss	StatQuest	Cross-entropy as NLL
Probability Calibration	ritvikmath	Calibration curves
Bayesian Deep Learning	Yannic Kilcher	MC Dropout and uncertainty
KL Divergence	Intuitively Explained	KL divergence visual intuition

Interview Questions

Q1: What probability distribution does minimising MSE implicitly assume? What are the consequences if that assumption is violated?

Minimising MSE is equivalent to MLE under a Gaussian noise assumption: $y = f_\theta(x) + \varepsilon$ , $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ . Consequences when violated: (1) If errors are heavy-tailed (Laplace or t-distributed), MSE over-weights outliers - Huber loss or MAE would be more appropriate. (2) If errors are heteroscedastic (variance depends on $x$ ), MSE treats all samples equally but they shouldn't be - Weighted Least Squares or a model that predicts variance is better. (3) If the target is bounded (e.g., probabilities in [0,1]), MSE can produce predictions outside the valid range - use a model with appropriate output activation (sigmoid/softmax) and corresponding NLL loss.

Q2: What is the probabilistic interpretation of L2 regularisation?

L2 regularisation (Ridge) is MAP estimation with a Gaussian prior over parameters: $P(\theta) = \mathcal{N}(0, \tau^2 I)$ . The MAP objective is: $\arg\max_\theta [\log P(D|\theta) + \log P(\theta)] = \arg\min_\theta [\text{NLL} + \frac{\lambda}{2}\|\theta\|_2^2]$ where $\lambda = \sigma^2/\tau^2$ (ratio of noise variance to prior variance). A smaller prior variance $\tau^2$ (stronger belief that parameters are near zero) corresponds to larger $\lambda$ (stronger regularisation). Similarly, L1 regularisation corresponds to a Laplace prior, which has heavier tails and a sharper peak at zero - this is why L1 promotes sparsity: the Laplace distribution assigns high probability to weights being exactly zero.

Q3: Why are neural networks typically poorly calibrated, and how do you fix it?

Neural networks trained with cross-entropy on modern deep architectures (with batch normalisation, weight decay) tend to be overconfident - they produce probabilities near 0 or 1 more often than the empirical frequency justifies. The reasons: (1) Cross-entropy training encourages the model to be confident to reduce loss; (2) Batch normalisation and weight decay interact in ways that push logit magnitudes high; (3) Overfitting to training distribution affects probability scaling. Fixes: (1) Temperature scaling - divide logits by a temperature $T > 1$ before softmax (single parameter, cheap); (2) Platt scaling - fit a logistic regression on top of model scores; (3) Isotonic regression - non-parametric, more flexible; (4) Label smoothing - prevents the model from being maximally confident during training; (5) MC Dropout / deep ensembles - produce better-calibrated uncertainty estimates.

Q4: What is the difference between aleatoric and epistemic uncertainty? Give an example of each in a medical ML context.

Aleatoric uncertainty is irreducible - it's inherent randomness in the data that cannot be resolved with more data. Epistemic uncertainty is reducible - it arises from limited data or model assumptions, and decreases as more relevant data is collected. In medical ML: aleatoric - two identical patients (same age, labs, vitals) may have different outcomes due to genetic variation, unobserved factors, or random biological processes; no amount of data resolves this. Epistemic - a rare disease with 50 training cases: the model is uncertain about rare presentations because it hasn't seen enough examples; gathering 5,000 cases would dramatically reduce this uncertainty. Models that distinguish both types (heteroscedastic NNs for aleatoric, ensembles/MC Dropout for epistemic) are more trustworthy for deployment, especially in safety-critical applications where knowing "I don't know" is crucial.

Q5: Derive why binary cross-entropy is the correct loss for logistic regression from a probabilistic perspective.

Logistic regression assumes $P_\theta(y=1|x) = \sigma(x^T\theta)$ where $\sigma$ is the sigmoid function. The likelihood of the entire dataset under Bernoulli independence is:

$P(D|\theta) = \prod_{i=1}^n \hat{p}_i^{y_i}(1-\hat{p}_i)^{1-y_i}$

Taking the negative log-likelihood (NLL):

$-\log P(D|\theta) = -\sum_{i=1}^n [y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)]$

This is exactly the binary cross-entropy loss. Minimising BCE = finding the MLE solution for the Bernoulli model. There's no "arbitrary" choice here - BCE is the theoretically correct objective given the Bernoulli assumption. Choosing MSE instead for classification would correspond to assuming Gaussian noise on binary outputs, which doesn't match the data-generating process and leads to worse probability estimates.

Q6: What is the Brier score and how does it differ from log loss for evaluating probabilistic predictions?

The Brier score is $\frac{1}{n}\sum_i(y_i - \hat{p}_i)^2$ - the MSE between predicted probabilities and binary outcomes. Log loss is $-\frac{1}{n}\sum_i[y_i\log\hat{p}_i + (1-y_i)\log(1-\hat{p}_i)]$ . Key differences: (1) Log loss is unbounded - a confident wrong prediction (predicts 0.99, true label is 0) gives loss $\log(0.01) \approx 4.6$ while Brier gives $(0-0.99)^2 \approx 0.98$ . Log loss penalises confident mistakes much more severely. (2) Brier score is bounded [0, 1] and more interpretable. (3) Log loss has a direct probabilistic interpretation (NLL of Bernoulli), while Brier score is a proper scoring rule from a more general family. For model comparison: use log loss to penalise confident errors; use Brier score for overall probability accuracy. Both decompose into calibration and resolution components.

Q7: Explain label smoothing from a probabilistic perspective. When is it useful and when does it hurt?

Label smoothing replaces one-hot targets $y_{\text{one-hot}}$ with $y_{\text{smooth}} = (1-\epsilon) \cdot y_{\text{one-hot}} + \epsilon/K$ . Probabilistically, it changes the target distribution from the empirical (true class = 100% probability) to a mixture: mostly the true class but with $\epsilon$ probability spread uniformly. This prevents the model from minimising CE by pushing logit margins to infinity (which it could with hard labels). Equivalently, label smoothing adds a KL divergence term from the model's distribution to a uniform distribution: $\mathcal{L} = (1-\epsilon) H(P_{\text{hard}}, Q) + \epsilon H(P_{\text{uniform}}, Q)$ . Useful when: the model tends to be overconfident, for knowledge distillation (teacher labels are already soft), for datasets with label noise. Hurts when: you use the model for calibration-sensitive applications (label smoothing systematically changes the temperature of predictions); for tasks requiring calibrated probabilities, apply temperature scaling after training instead.

Q8: How would you implement predictive uncertainty estimation for a production classification model without retraining from scratch?

Three practical approaches in order of implementation cost: (1) Temperature scaling: fit a single temperature parameter $T$ on validation data such that $\sigma(\text{logit}/T)$ is well-calibrated. One parameter, no retraining, takes seconds. Best for reducing overconfidence without changing ranking. (2) MC Dropout: if the model has dropout layers, enable dropout at inference and run $N=50$ forward passes. Average predictions for the mean, use standard deviation as uncertainty. Works without retraining if dropout is already present; slightly increases inference latency by $N\times$ . (3) Deep ensemble: train $M=5$ models from different random seeds. Average their softmax outputs. High-quality uncertainty estimates, but requires $M\times$ training compute and serving infrastructure. For production, I'd implement (1) immediately (calibration fix), then evaluate if (3) is worth the cost given the latency/cost constraints.

:::tip Role-Specific Angles MLE Interview: MLE derivation for regression/classification, connection between loss functions and distributions, MAP = MLE + prior = regularisation, calibration and Brier score Research Engineer: Aleatoric vs epistemic uncertainty, MC Dropout theory, deep ensembles, label smoothing, KL divergence decomposition Data Scientist: Probability calibration, reliability diagrams, calibration curves, Platt vs isotonic scaling AI Engineer: Label smoothing implementation in PyTorch, uncertainty quantification in production APIs, confidence thresholds for rejection :::

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Probability Distributions demo on the EngineersOfAI Playground - no code required.

:::

What You Will Learn​

Part 1 - The Probabilistic Framework​

From Patterns to Distributions​

Part 2 - Maximum Likelihood Estimation (MLE)​

The Core Idea​

Case 1: Regression - Gaussian Likelihood → MSE​

Case 2: Binary Classification - Bernoulli Likelihood → Cross-Entropy​

Case 3: Multi-Class - Categorical Likelihood → Softmax Cross-Entropy​

Part 3 - Loss Functions and Their Implied Distributions​

When the Assumed Distribution Doesn't Match​

Part 4 - MAP Estimation: Adding a Prior​

L2 Regularisation ≡ Gaussian Prior​

L1 Regularisation ≡ Laplace Prior​

The Bayesian Posterior​

Part 5 - Probability Calibration​

Reliability Diagrams​

Calibration Methods​

Brier Score - The Calibration Metric​

Part 6 - Aleatoric vs Epistemic Uncertainty​

Part 7 - Bayesian Deep Learning: Practical Approximations​

MC Dropout​

Deep Ensembles​

Part 8 - The KL Divergence and Entropy Connection​

KL Divergence​

Cross-Entropy = Entropy + KL Divergence​

Part 9 - Label Smoothing​

YouTube Resources​

Interview Questions​