Estimation Theory: The Math Behind How Models Learn
Reading time: ~35 min | Interview relevance: Very High | Target roles: MLE, AI Engineer, Research Engineer
The Production Scenario
Your team is training a language model. The loss function in your training script reads:
loss = F.cross_entropy(logits, targets)
A new grad joins and asks: "Why cross-entropy? Why not mean squared error? Why not something else?"
Most engineers say "that's just standard for classification" and move on. But a senior ML engineer knows the real answer: cross-entropy IS maximum likelihood estimation. The loss function you minimise during training is the negative log-likelihood of your training data under the model. This is not a coincidence or a convention - it is the mathematically principled answer to the question "what parameters best explain the data I observed?"
Estimation theory gives you this deeper understanding. It also tells you exactly where L2 regularisation comes from (it is a Bayesian prior), and it gives you the formal language to reason about whether your model's parameter estimates are good ones.
What Is an Estimator?
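You observe data $x_1, \dots, x_n$ drawn from some distribution $p(x \mid \theta)$, where $\theta$ is an unknown parameter (or parameter vector). An estimator $\hat{\theta}$ is any function of the data that produces a guess for $\theta$:

$$\hat{\theta} = g(x_1, \dots, x_n)$$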
Examples:
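- Sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ estimates the population mean $\mu$
- Sample variance $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$ estimates the population variance $\sigma^2$
- The weights $\hat{w}$ from gradient descent estimate the "true" best weights $w^*$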
Different estimators have different properties. Two of the most important: bias and variance.
Bias and Variance of Estimators
Bias
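The bias of an estimator is the systematic error - how far off, on average, is our estimate from the true value?

$$\text{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta$$

An estimator is unbiased if $\mathbb{E}[\hat{\theta}] = \theta$.
Example - why divide by $n-1$ for sample variance?
Consider two estimators for variance:

$$\hat{\sigma}^2_{\text{biased}} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad \hat{\sigma}^2_{\text{unbiased}} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

The biased version is the MLE solution, but it systematically underestimates variance because $\mu$ is itself estimated from the data by $\bar{x}$, which always sits at least as close to the sample as the true mean does. The correction factor $\frac{n}{n-1}$ compensates. In NumPy: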
import numpy as np
data = np.array([2.1, 2.5, 3.0, 2.8, 2.3, 3.1, 2.7])
# Biased (MLE) - divides by n
var_biased = np.var(data, ddof=0)
# Unbiased (Bessel's correction) - divides by n-1
var_unbiased = np.var(data, ddof=1)
print(f"Biased variance: {var_biased:.4f}")
print(f"Unbiased variance: {var_unbiased:.4f}")
print(f"Ratio (n/(n-1)): {len(data)/(len(data)-1):.4f}")
# Verify: unbiased = biased * n/(n-1)
print(f"Check: {var_biased * len(data)/(len(data)-1):.4f}")
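The size of the underestimate can be derived in a few lines. The following is a sketch of the standard argument, assuming i.i.d. samples with mean $\mu$ and finite variance $\sigma^2$ (the symbols $\bar{x}$, $\mu$ and $\sigma^2$ are the usual notation for sample mean, population mean and population variance):

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum_{i=1}^{n} (x_i - \mu)^2 - n(\bar{x} - \mu)^2 \\
\mathbb{E}\Big[\sum_{i=1}^{n} (x_i - \bar{x})^2\Big]
  &= n\sigma^2 - n \cdot \frac{\sigma^2}{n} = (n-1)\,\sigma^2 \\
\mathbb{E}\big[\hat{\sigma}^2_{\text{biased}}\big]
  &= \frac{n-1}{n}\,\sigma^2
\end{align*}
```

The second line uses $\text{Var}(\bar{x}) = \sigma^2 / n$. Dividing the sum by $n-1$ instead of $n$ therefore removes the bias exactly.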
Variance of an Estimator
The variance of an estimator measures how much it fluctuates across different datasets:

$$\text{Var}(\hat{\theta}) = \mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big]$$
A low-variance estimator gives similar answers no matter which random sample you happen to observe. This is exactly what you want in ML - a training procedure that gives you roughly the same model regardless of which mini-batches you sampled.
Mean Squared Error Decomposition
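The Mean Squared Error (MSE) of an estimator decomposes into bias and variance:

$$\text{MSE}(\hat{\theta}) = \mathbb{E}\big[(\hat{\theta} - \theta)^2\big] = \text{Bias}(\hat{\theta})^2 + \text{Var}(\hat{\theta})$$

Proof: add and subtract $\mathbb{E}[\hat{\theta}]$ inside the square and expand.

$$
\begin{aligned}
\mathbb{E}\big[(\hat{\theta} - \theta)^2\big]
&= \mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}] + \mathbb{E}[\hat{\theta}] - \theta)^2\big] \\
&= \mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big] + 2\,(\mathbb{E}[\hat{\theta}] - \theta)\,\mathbb{E}\big[\hat{\theta} - \mathbb{E}[\hat{\theta}]\big] + (\mathbb{E}[\hat{\theta}] - \theta)^2 \\
&= \text{Var}(\hat{\theta}) + \text{Bias}(\hat{\theta})^2
\end{aligned}
$$

The cross term vanishes because $\mathbb{E}\big[\hat{\theta} - \mathbb{E}[\hat{\theta}]\big] = 0$.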
This is the bias-variance tradeoff at the estimator level - the same tradeoff you encounter in ML models.
:::tip ML Engineering Connection The bias-variance tradeoff of ML models (underfitting vs overfitting) is literally the bias-variance tradeoff of estimators. A model with high bias (underfitting) has a biased estimator for the true function. A model with high variance (overfitting) is a high-variance estimator that is sensitive to which training set you happened to sample. :::
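The decomposition can be checked by simulation. The sketch below is illustrative only: the true variance, sample size and number of trials are arbitrary choices, and the data are assumed Gaussian. It estimates bias, variance and MSE for the divide-by-$n$ and divide-by-$(n-1)$ variance estimators across many simulated datasets.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2_true = 4.0      # true variance (illustrative choice)
n = 10                 # small sample size so the bias is visible
n_trials = 200_000     # number of simulated datasets

# Each row is one dataset of n draws from N(0, sigma2_true)
samples = rng.normal(0.0, np.sqrt(sigma2_true), size=(n_trials, n))

results = {}
for name, ddof in [("biased (ddof=0)", 0), ("unbiased (ddof=1)", 1)]:
    est = samples.var(axis=1, ddof=ddof)         # one variance estimate per dataset
    bias = est.mean() - sigma2_true              # empirical bias
    var = est.var()                              # empirical variance of the estimator
    mse = np.mean((est - sigma2_true) ** 2)      # empirical mean squared error
    results[name] = (bias, var, mse)
    print(f"{name:>18} | bias={bias:+.4f} var={var:.4f} "
          f"mse={mse:.4f} bias^2+var={bias**2 + var:.4f}")
```

In this Gaussian setting the divide-by-$n$ estimator is biased but has lower variance, and its MSE comes out lower than that of the unbiased estimator. Unbiasedness alone does not make an estimator the best choice.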
Visualising Bias-Variance
[Figure: four panels, each showing estimates from different data samples (*) scattered around the true parameter value ([X]). Panels: Low Bias / Low Variance; Low Bias / High Variance; High Bias / Low Variance; High Bias / High Variance.]
Maximum Likelihood Estimation (MLE)
MLE answers the question: which parameter value makes the observed data most probable?
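Given data $D = \{x_1, \dots, x_n\}$ assumed i.i.d. from $p(x \mid \theta)$, the likelihood is:

$$L(\theta) = p(D \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$

The MLE is:

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} L(\theta)$$

In practice, we maximise the log-likelihood (which has the same maximiser, but products become sums - much better for numerical stability and calculus):

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$$

And since minimisation is more familiar in ML, we equivalently minimise the negative log-likelihood (NLL):

$$\hat{\theta}_{\text{MLE}} = \arg\min_{\theta} \Big[ -\sum_{i=1}^{n} \log p(x_i \mid \theta) \Big]$$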
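As a small numerical illustration of minimising an NLL, the sketch below evaluates the Gaussian NLL on a grid of candidate means and picks the minimiser. The data, the known `sigma` and the grid are assumptions made for the example, and `gaussian_nll` is a hypothetical helper. Grid search is used only to make the idea visible and is not how one would fit a real model.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0                                   # assumed known, for illustration
data = rng.normal(loc=5.0, scale=sigma, size=200)

def gaussian_nll(mu, x, sigma):
    """Negative log-likelihood of i.i.d. N(mu, sigma^2) data."""
    n = x.size
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + np.sum((x - mu) ** 2) / (2 * sigma**2)

# Evaluate the NLL on a grid of candidate means and pick the minimiser
grid = np.linspace(0.0, 10.0, 10_001)         # step 0.001
nll_values = np.array([gaussian_nll(mu, data, sigma) for mu in grid])
mu_grid = grid[np.argmin(nll_values)]

print(f"Grid-search minimiser of NLL: {mu_grid:.3f}")
print(f"Sample mean:                  {data.mean():.3f}")
```

The two numbers agree up to the grid resolution, which matches the closed-form result that the Gaussian MLE of the mean is the sample mean.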
MLE for the Gaussian Distribution
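Suppose $x_i \sim \mathcal{N}(\mu, \sigma^2)$. The log-likelihood is:

$$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (x_i - \mu)^2$$

Deriving $\hat{\mu}$: Take the partial derivative with respect to $\mu$ and set to zero:

$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n} (x_i - \mu) = 0 \implies \hat{\mu}_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}$$

The MLE for $\mu$ is the sample mean. This is the intuitive answer - and now we know it is the principled MLE answer.
Deriving $\hat{\sigma}^2$: Take the partial derivative with respect to $\sigma^2$ (let $v = \sigma^2$):

$$\frac{\partial \ell}{\partial v} = -\frac{n}{2v} + \frac{1}{2v^2}\sum_{i=1}^{n} (x_i - \hat{\mu})^2 = 0 \implies \hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

Note: this is the biased estimator (divides by $n$, not $n-1$). MLE is consistent but not always unbiased.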
import numpy as np
from scipy.stats import norm
np.random.seed(42)
mu_true, sigma_true = 3.5, 1.2
n = 1000
data = np.random.normal(mu_true, sigma_true, n)
# MLE estimates (manual)
mu_mle = np.mean(data)
sigma2_mle = np.mean((data - mu_mle)**2) # biased (divides by n)
sigma2_unbiased = np.var(data, ddof=1) # unbiased (divides by n-1)
print(f"True mu: {mu_true}")
print(f"MLE mu: {mu_mle:.4f}")
print(f"True sigma^2: {sigma_true**2:.4f}")
print(f"MLE sigma^2: {sigma2_mle:.4f} (biased)")
print(f"Unbiased sigma^2: {sigma2_unbiased:.4f} (Bessel's correction)")
# Verify with scipy MLE
mu_scipy, sigma_scipy = norm.fit(data)
print(f"\nScipy MLE mu: {mu_scipy:.4f}")
print(f"Scipy MLE sigma: {sigma_scipy:.4f}")
MLE for the Bernoulli Distribution
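If $x_1, \dots, x_n \sim \text{Bernoulli}(p)$ with $k = \sum_{i=1}^{n} x_i$ successes:

$$\ell(p) = k \log p + (n - k)\log(1 - p)$$

$$\frac{d\ell}{dp} = \frac{k}{p} - \frac{n - k}{1 - p} = 0 \implies \hat{p}_{\text{MLE}} = \frac{k}{n}$$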
The MLE for a Bernoulli proportion is just the sample proportion. Again: the intuitive answer IS the principled MLE answer.
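A quick way to see this is to evaluate the Bernoulli log-likelihood over a grid of candidate values and check where it peaks. The sketch below assumes simulated data with an arbitrary `p_true`. `bernoulli_log_lik` is a hypothetical helper written for this example.

```python
import numpy as np

rng = np.random.default_rng(2)
p_true = 0.3                                   # illustrative value
x = (rng.random(500) < p_true).astype(float)   # Bernoulli(p_true) draws

def bernoulli_log_lik(p, x):
    """Log-likelihood of i.i.d. Bernoulli(p) observations x in {0, 1}."""
    k = x.sum()
    return k * np.log(p) + (x.size - k) * np.log(1 - p)

p_grid = np.linspace(0.001, 0.999, 999)        # step 0.001, avoids log(0)
p_best = p_grid[np.argmax(bernoulli_log_lik(p_grid, x))]

print(f"Sample proportion:   {x.mean():.3f}")
print(f"Grid-search argmax:  {p_best:.3f}")
```

The grid maximiser lands on the sample proportion (up to the grid step), as the closed-form derivation predicts.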
Cross-Entropy Loss IS MLE - The Critical Connection
This is one of the most important connections in all of ML engineering.
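Suppose you have a classification model that outputs probabilities $\hat{y}_i = p(y_i = 1 \mid x_i; \theta)$ for binary classification. The likelihood of the training data under this model is:

$$L(\theta) = \prod_{i=1}^{n} \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{1 - y_i}$$

The log-likelihood:

$$\ell(\theta) = \sum_{i=1}^{n} \big[ y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \big]$$

Maximising log-likelihood is equivalent to minimising negative log-likelihood (averaging over the $n$ examples does not change the minimiser):

$$\mathcal{L}(\theta) = -\frac{1}{n}\sum_{i=1}^{n} \big[ y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \big]$$

This is exactly binary cross-entropy loss. For multi-class with $K$ classes, one-hot labels $y_{ik}$ and predicted probabilities $\hat{y}_{ik}$:

$$\mathcal{L}(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik} \log \hat{y}_{ik}$$

This is exactly negative log-likelihood under a categorical distribution.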
:::tip ML Engineering Connection
Every time you minimise F.cross_entropy(logits, targets) in PyTorch, you are running Maximum Likelihood Estimation. The model parameters that minimise cross-entropy are exactly the parameters that make your training data most probable under the model's probability distribution. This is not a heuristic - it is the maximum-likelihood answer to "what parameters best fit this data?"
:::
import numpy as np
def cross_entropy_loss(y_true, y_pred, eps=1e-15):
    """Binary cross-entropy = negative log-likelihood."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def negative_log_likelihood(y_true, y_pred, eps=1e-15):
    """Bernoulli negative log-likelihood."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    log_lik = np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return -log_lik / len(y_true)
# These are identical
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([0.9, 0.1, 0.8, 0.7, 0.3])
print(f"Cross-entropy loss: {cross_entropy_loss(y_true, y_pred):.6f}")
print(f"Negative log-likelihood: {negative_log_likelihood(y_true, y_pred):.6f}")
# Output will be identical
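The same identity holds in the multi-class case. The sketch below uses made-up logits and labels and a hand-written `softmax` helper (not a library function). It computes the loss once as cross-entropy against one-hot targets and once as the categorical negative log-likelihood of the observed labels.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the max subtracted for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.5,  0.3],
                   [-0.5, 0.2, 2.2],
                   [1.0, 1.0,  1.0]])          # made-up scores: 4 examples x 3 classes
targets = np.array([0, 1, 2, 1])               # integer class labels

probs = softmax(logits)
n, K = probs.shape

# (a) Cross-entropy with one-hot targets: -1/n * sum_i sum_k y_ik * log(p_ik)
one_hot = np.eye(K)[targets]
ce = -np.mean(np.sum(one_hot * np.log(probs), axis=1))

# (b) Categorical NLL: -1/n * sum_i log p(y_i | x_i)
nll = -np.mean(np.log(probs[np.arange(n), targets]))

print(f"Cross-entropy:    {ce:.6f}")
print(f"Categorical NLL:  {nll:.6f}")
```

The one-hot mask simply selects the log-probability of the true class, so the two quantities are the same number.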
MSE Loss IS MLE for Gaussian Noise
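Similarly, minimising Mean Squared Error (used for regression) is maximum likelihood estimation under an assumption of Gaussian noise:
If $y_i = f_\theta(x_i) + \epsilon_i$ where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, then:

$$p(y_i \mid x_i; \theta) = \mathcal{N}\big(y_i;\, f_\theta(x_i),\, \sigma^2\big)$$

The log-likelihood:

$$\ell(\theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} \big(y_i - f_\theta(x_i)\big)^2$$

Maximising this with respect to $\theta$ is equivalent to minimising:

$$\sum_{i=1}^{n} \big(y_i - f_\theta(x_i)\big)^2$$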
MSE loss = MLE under Gaussian noise assumption. When you switch from MSE to Huber loss, you are implicitly changing the noise distribution from Gaussian to something with heavier tails (more robust to outliers).
| Loss Function | Implicit Noise Model |
|---|---|
| MSE | Gaussian |
| MAE | Laplace distribution |
| Huber loss | Gaussian for small errors, Laplace for large |
| Cross-entropy | Bernoulli / Categorical |
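The first two rows of the table can be illustrated with the simplest possible model, a single constant prediction. Under MSE (Gaussian noise) the best constant is the sample mean. Under MAE (Laplace noise) it is the sample median. The sketch below uses a small made-up dataset with one outlier and a brute-force grid search, chosen only for clarity.

```python
import numpy as np

data = np.array([1.0, 2.0, 2.5, 3.0, 50.0])    # last point is an outlier (made-up data)

# Candidate constant predictions c; find the one minimising each loss
candidates = np.linspace(0.0, 60.0, 60_001)     # step 0.001
residuals = data[None, :] - candidates[:, None] # shape: (n_candidates, n_points)

c_mse = candidates[np.argmin(np.mean(residuals**2, axis=1))]
c_mae = candidates[np.argmin(np.mean(np.abs(residuals), axis=1))]

print(f"MSE minimiser: {c_mse:.3f}   (sample mean   = {data.mean():.3f})")
print(f"MAE minimiser: {c_mae:.3f}   (sample median = {np.median(data):.3f})")
```

The outlier drags the MSE solution far from the bulk of the data, while the MAE solution stays put. This is the practical meaning of "heavier-tailed noise model": large residuals are treated as less surprising, so they pull less.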
Maximum A Posteriori (MAP) Estimation
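MLE only uses the likelihood - it ignores any prior knowledge about $\theta$. MAP estimation incorporates a prior distribution $p(\theta)$:

$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta}\, p(\theta \mid D) = \arg\max_{\theta}\, p(D \mid \theta)\, p(\theta)$$

Taking the log:

$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \big[ \log p(D \mid \theta) + \log p(\theta) \big]$$

This is equivalent to minimising the regularised loss:

$$\hat{\theta}_{\text{MAP}} = \arg\min_{\theta} \big[ \underbrace{-\log p(D \mid \theta)}_{\text{NLL}} \; \underbrace{-\, \log p(\theta)}_{\text{regulariser}} \big]$$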
L2 Regularisation = Gaussian Prior
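If $\theta \sim \mathcal{N}(0, \tau^2 I)$ (a Gaussian prior centred at zero):

$$-\log p(\theta) = \frac{1}{2\tau^2}\|\theta\|_2^2 + \text{const}$$

The MAP objective becomes:

$$\hat{\theta}_{\text{MAP}} = \arg\min_{\theta} \big[ \text{NLL}(\theta) + \lambda \|\theta\|_2^2 \big], \qquad \lambda = \frac{1}{2\tau^2}$$

This is exactly L2 regularisation (Ridge). The regularisation strength $\lambda$ is proportional to the precision $1/\tau^2$ of the Gaussian prior - a stronger (narrower) prior means we are more confident weights should be near zero.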
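For linear regression with Gaussian noise this correspondence can be checked directly. The sketch below assumes a noise standard deviation `sigma` and a prior standard deviation `tau`, both illustrative values not taken from the text. Multiplying the negative log-posterior by $2\sigma^2$ turns it into $\|y - Xw\|^2 + (\sigma^2/\tau^2)\|w\|^2$, so in the normal-equation form the implied ridge strength is $\sigma^2 / \tau^2$. The code verifies that the ridge solution with that strength is a stationary point of the negative log-posterior.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 5
sigma, tau = 0.5, 1.0                 # noise std and prior std (illustrative values)

X = rng.normal(size=(n, d))
w_true = rng.normal(scale=tau, size=d)
y = X @ w_true + sigma * rng.normal(size=n)

def neg_log_posterior_grad(w):
    """Gradient of -log p(y | X, w) - log p(w); additive constants drop out."""
    return -(X.T @ (y - X @ w)) / sigma**2 + w / tau**2

# Ridge solution with the strength implied by the prior: lambda = sigma^2 / tau^2
lam = sigma**2 / tau**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

grad_norm = np.linalg.norm(neg_log_posterior_grad(w_ridge))
print(f"lambda implied by prior: {lam:.3f}")
print(f"||grad of -log posterior|| at ridge solution: {grad_norm:.2e}")
```

The gradient vanishes up to floating-point error, so the ridge estimate and the MAP estimate coincide in this setting. A wider prior (larger `tau`) gives a smaller `lam` and moves the estimate back toward ordinary least squares.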
L1 Regularisation = Laplace Prior
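If each weight has an independent prior $p(\theta_j) = \frac{1}{2b}\exp\!\big(-\frac{|\theta_j|}{b}\big)$ (a Laplace prior):

$$-\log p(\theta) = \frac{1}{b}\|\theta\|_1 + \text{const}$$

The MAP objective:

$$\hat{\theta}_{\text{MAP}} = \arg\min_{\theta} \big[ \text{NLL}(\theta) + \lambda \|\theta\|_1 \big], \qquad \lambda = \frac{1}{b}$$

This is L1 regularisation (Lasso). The Laplace prior has heavier tails than the Gaussian and a sharp peak at zero. The resulting penalty is not differentiable at zero, which lets the MAP estimate set weights exactly to zero and so promotes sparsity.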
import numpy as np
from numpy.linalg import norm
np.random.seed(42)
n, d = 100, 50 # n samples, d features (overdetermined: n > d)
X = np.random.randn(n, d)
w_true = np.zeros(d)
w_true[:5] = [1.0, -0.5, 0.8, -1.2, 0.3] # sparse true weights
y = X @ w_true + 0.1 * np.random.randn(n)
# MLE solution (OLS) - use lstsq for numerical stability
w_mle, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
# MAP solution (Ridge regression) - L2 prior
# Normal equations with regularization: w_map = (X^T X + lambda I)^{-1} X^T y
lambda_reg = 1.0
A = X.T @ X + lambda_reg * np.eye(d)
b = X.T @ y
w_map = np.linalg.solve(A, b)
print("MLE vs MAP estimation (n > d, sparse true weights):")
print(f"MLE weights norm: {norm(w_mle):.3f}")
print(f"MAP weights norm: {norm(w_map):.3f}")
print(f"True weights norm: {norm(w_true):.3f}")
mle_error = norm(w_mle - w_true)
map_error = norm(w_map - w_true)
print(f"\nMLE estimation error: {mle_error:.3f}")
print(f"MAP estimation error: {map_error:.3f}")
print(f"MAP improvement: {((mle_error - map_error)/mle_error)*100:.1f}%")
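:::note The Regularisation Story Many regularisation techniques in ML have a Bayesian interpretation. When you add an L2 penalty to the loss (equivalent to weight decay under plain SGD), you are performing MAP estimation with a Gaussian prior. With adaptive optimisers such as Adam, an L2 penalty and decoupled weight decay (AdamW) are no longer the same update, so the correspondence holds for the L2-penalty form. When you add L1 to a linear model, you are performing MAP with a Laplace prior. The hyperparameter $\lambda$ controls how strong your prior belief is that weights should be small. :::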
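The difference between the two priors is easiest to see in one dimension. For a single weight with unregularised estimate `z`, the penalised problems have closed-form minimisers: the L2 penalty rescales `z`, while the L1 penalty applies soft-thresholding and can return exactly zero. The sketch below uses these standard one-dimensional results. `map_l2`, `map_l1`, the values in `z` and `lam` are all hypothetical choices for illustration.

```python
import numpy as np

def map_l2(z, lam):
    """Minimiser of 0.5*(w - z)**2 + lam * w**2 (Gaussian prior): pure shrinkage."""
    return z / (1.0 + 2.0 * lam)

def map_l1(z, lam):
    """Minimiser of 0.5*(w - z)**2 + lam * |w| (Laplace prior): soft-thresholding."""
    # Adding 0.0 turns any -0.0 into 0.0 so the printout is clean
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0) + 0.0

z = np.array([-2.0, -0.3, 0.05, 0.4, 1.5])   # hypothetical unregularised estimates
lam = 0.5

w_l2 = map_l2(z, lam)
w_l1 = map_l1(z, lam)
print("unregularised:", z)
print("L2 (Gaussian):", w_l2.round(3))
print("L1 (Laplace): ", w_l1.round(3))
print("exact zeros under L1:", int(np.sum(w_l1 == 0.0)))
```

Every weight shrinks under L2 but none becomes zero. Under L1, any weight whose magnitude is below `lam` is set exactly to zero, which is the sparsity effect attributed to the Laplace prior.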
Properties of Good Estimators
Consistency
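An estimator is consistent if it converges (in probability) to the true value as $n \to \infty$:

$$\hat{\theta}_n \xrightarrow{\;p\;} \theta \quad \text{as } n \to \infty$$

Maximum likelihood estimators are consistent under mild regularity conditions - this is why "more data helps": provided the model family contains the true distribution, your parameter estimates converge to the true values with enough data.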
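Consistency can be observed by fitting the same estimator on progressively larger samples. The sketch below assumes Gaussian data with arbitrary true parameters and tracks the MLE of the mean and variance as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(4)
mu_true, sigma2_true = 1.0, 4.0                 # illustrative ground truth

errors = {}
for n in [10, 100, 1_000, 10_000, 100_000]:
    x = rng.normal(mu_true, np.sqrt(sigma2_true), size=n)
    mu_hat = x.mean()                           # MLE of the mean
    sigma2_hat = np.mean((x - mu_hat) ** 2)     # MLE of the variance (divides by n)
    errors[n] = (abs(mu_hat - mu_true), abs(sigma2_hat - sigma2_true))
    print(f"n={n:>7} | mu_hat={mu_hat:.4f} | sigma2_hat={sigma2_hat:.4f}")
```

The errors typically shrink as the sample size grows, although not necessarily monotonically in a single run, since each row uses a fresh random sample. Note that the variance MLE is biased for every finite sample size yet still consistent: its bias of $-\sigma^2/n$ vanishes as the sample grows.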
Efficiency and the Cramér-Rao Bound
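An estimator is efficient if it achieves the lowest possible variance among all unbiased estimators. The Cramér-Rao lower bound establishes this minimum variance:

$$\text{Var}(\hat{\theta}) \ge \frac{1}{I(\theta)}$$

where $I(\theta)$ is the Fisher information of the sample:

$$I(\theta) = \mathbb{E}\Big[\Big(\frac{\partial}{\partial\theta}\log p(D \mid \theta)\Big)^2\Big] = -\,\mathbb{E}\Big[\frac{\partial^2}{\partial\theta^2}\log p(D \mid \theta)\Big]$$

For $n$ i.i.d. observations, $I(\theta)$ is $n$ times the Fisher information of a single observation. MLE is asymptotically efficient - for large $n$, the MLE achieves the Cramér-Rao bound. Under the usual regularity conditions, no other consistent estimator can do better asymptotically.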
import numpy as np
def cramer_rao_bound(sigma2, n):
    """Minimum variance from Cramér-Rao bound for Gaussian mean."""
    # Fisher information for mean: I(mu) = n/sigma^2
    return sigma2 / n
sigma2_true = 4.0
sample_sizes = [10, 50, 100, 500, 1000]
print("Cramér-Rao Bound vs Empirical Variance of Sample Mean:")
print(f"{'n':>8} | {'CRB (min var)':>14} | {'Empirical Var':>14} | {'Ratio':>8}")
print("-" * 54)
for n in sample_sizes:
    crb = cramer_rao_bound(sigma2_true, n)
    n_experiments = 10_000
    means = [np.mean(np.random.normal(0, np.sqrt(sigma2_true), n))
             for _ in range(n_experiments)]
    empirical_var = np.var(means)
    print(f"{n:>8} | {crb:>14.6f} | {empirical_var:>14.6f} | {empirical_var/crb:>8.4f}")
# Ratio should be ~1.0 - sample mean achieves the CRB
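The same check works for the Bernoulli model, where the Fisher information of $n$ draws is $n / (p(1-p))$ and the bound is therefore $p(1-p)/n$. The sketch below uses arbitrary choices for the true probability, sample size and number of simulated datasets. It compares that bound with the empirical variance of the sample proportion.

```python
import numpy as np

rng = np.random.default_rng(5)
p_true, n = 0.3, 200                    # illustrative values
n_experiments = 50_000

# Fisher information of n Bernoulli(p) draws and the resulting bound
fisher_info = n / (p_true * (1 - p_true))
crb = 1.0 / fisher_info                 # equals p(1-p)/n

# Sample proportion (the MLE) across many simulated datasets
p_hats = rng.binomial(n, p_true, size=n_experiments) / n
empirical_var = p_hats.var()

print(f"Cramér-Rao bound:   {crb:.6f}")
print(f"Empirical variance: {empirical_var:.6f}")
print(f"Ratio:              {empirical_var / crb:.4f}")
```

The ratio sits close to 1, which is consistent with the sample proportion being an unbiased estimator that attains the bound.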
Summary of Estimator Properties
| Property | Definition | ML Implication |
|---|---|---|
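| Unbiased | $\mathbb{E}[\hat{\theta}] = \theta$ | Sample mean is unbiased for $\mu$ |
| Consistent | $\hat{\theta}_n \to \theta$ in probability as $n \to \infty$ | MLE converges with more data |
| Efficient | Achieves Cramér-Rao bound | MLE is asymptotically efficient |
| Sufficient | Uses all info in data | MLE depends on the data only through sufficient statistics |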
MLE vs MAP vs Full Bayesian
| Approach | Formula | Output | Regularisation | Use Case |
|---|---|---|---|---|
| MLE | $\arg\max_\theta\, p(D \mid \theta)$ | Point estimate | None | Large data, no prior |
| MAP | $\arg\max_\theta\, p(D \mid \theta)\, p(\theta)$ | Point estimate | Yes (via prior) | Small data, prior known |
| Full Bayesian | $p(\theta \mid D)$ | Distribution | Yes (via prior) | Uncertainty needed |
MLE overfits in small-data regimes. MAP adds regularisation. Full Bayesian inference (covered in Module 06) gives a distribution over parameters - uncertainty quantification rather than just a point estimate.
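A coin-flip model makes the three approaches concrete, because a Beta prior on a Bernoulli probability gives a closed-form posterior. The sketch below assumes a hypothetical dataset of 3 successes in 4 trials and a Beta(2, 2) prior. Both are illustrative choices, not values from the text.

```python
import numpy as np

# Hypothetical small dataset: 3 successes in 4 Bernoulli trials
k, n = 3, 4
a, b = 2.0, 2.0        # Beta(a, b) prior, mildly favouring p near 0.5 (an assumption)

p_mle = k / n                                   # argmax of the likelihood
a_post, b_post = a + k, b + (n - k)             # conjugate update: posterior is Beta(a+k, b+n-k)
p_map = (a_post - 1) / (a_post + b_post - 2)    # mode of the Beta posterior
post_mean = a_post / (a_post + b_post)
post_sd = np.sqrt(a_post * b_post / ((a_post + b_post) ** 2 * (a_post + b_post + 1)))

print(f"MLE:            {p_mle:.3f}")
print(f"MAP:            {p_map:.3f}")
print(f"Posterior mean: {post_mean:.3f} (sd {post_sd:.3f})")
```

With only four observations the MLE commits fully to the observed frequency, the MAP estimate is pulled toward the prior, and the full posterior additionally reports how uncertain the estimate still is.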
:::warning Common Interview Trap When asked "what is MLE?", many candidates say "you maximise the likelihood." That is correct but incomplete. The full answer: MLE finds the parameters $\theta$ under which the observed data is most probable. It is equivalent to minimising cross-entropy loss for classification and MSE loss for regression with Gaussian noise. Adding a regularisation term that corresponds to a log-prior converts it from MLE to MAP. :::
Putting It Together: The Training Loop as MLE
import numpy as np
# This simplified training loop IS MLE
def logistic_regression_mle(X, y, learning_rate=0.01, epochs=1000):
    """
    Train logistic regression by minimising NLL = MLE.
    The loss below is the same quantity torch.nn.BCELoss() computes
    (mean reduction) from predicted probabilities.
    """
    n, d = X.shape
    theta = np.zeros(d)
    for epoch in range(epochs):
        # Forward pass: predict probabilities
        logits = X @ theta
        y_hat = 1 / (1 + np.exp(-logits)) # sigmoid
        # Loss = negative log-likelihood = cross-entropy
        eps = 1e-15
        nll = -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
        # Gradient of NLL w.r.t. theta
        grad = X.T @ (y_hat - y) / n
        # Gradient descent step (MLE update)
        theta -= learning_rate * grad
        if epoch % 100 == 0:
            print(f"Epoch {epoch:4d} | NLL: {nll:.4f}")
    return theta
# Generate synthetic data
np.random.seed(42)
n = 500
X = np.column_stack([np.ones(n), np.random.randn(n, 2)])
w_true = np.array([0.5, 1.2, -0.8])
probs = 1 / (1 + np.exp(-X @ w_true))
y = (np.random.rand(n) < probs).astype(float)
w_mle = logistic_regression_mle(X, y)
print(f"\nTrue weights: {w_true}")
print(f"MLE weights: {w_mle.round(3)}")
The training loop above is exactly MLE. When you add lambda * np.sum(theta**2) to the NLL, you convert it to MAP with a Gaussian prior - which is L2 regularised logistic regression.
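A minimal sketch of that change, assuming the penalty is applied to every weight including the intercept and using arbitrary values for `lam`, the learning rate and the number of epochs. `logistic_regression_map` is a hypothetical name, and the data follow a similar synthetic recipe to the one above but are regenerated so the block is self-contained.

```python
import numpy as np

def logistic_regression_map(X, y, lam=0.1, learning_rate=0.1, epochs=2000):
    """Gradient descent on mean NLL + lam * ||theta||^2 (MAP with a Gaussian prior)."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        y_hat = 1.0 / (1.0 + np.exp(-(X @ theta)))      # sigmoid
        grad_nll = X.T @ (y_hat - y) / n                # gradient of the mean NLL
        grad_penalty = 2.0 * lam * theta                # gradient of lam * sum(theta**2)
        theta -= learning_rate * (grad_nll + grad_penalty)
    return theta

# Synthetic data: intercept column plus two Gaussian features
rng = np.random.default_rng(42)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
w_true = np.array([0.5, 1.2, -0.8])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

w_mle = logistic_regression_map(X, y, lam=0.0)   # lam = 0 recovers plain MLE
w_map = logistic_regression_map(X, y, lam=0.1)
print(f"MLE weights (lam=0):   {w_mle.round(3)}  norm={np.linalg.norm(w_mle):.3f}")
print(f"MAP weights (lam=0.1): {w_map.round(3)}  norm={np.linalg.norm(w_map):.3f}")
```

The only change from the MLE loop is the extra `2 * lam * theta` term in the gradient. The MAP weights come out with a smaller norm because the prior pulls them toward zero. In practice the intercept is often left unpenalised, which this sketch does not do.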
Interview Q&A
Q1: Why does training a neural network with cross-entropy loss correspond to MLE?
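Cross-entropy loss is $-\frac{1}{n}\sum_{i}\sum_{k} y_{ik} \log \hat{y}_{ik}$. If the model outputs a categorical distribution $p(y \mid x; \theta)$, then the log-likelihood of the training data under this model is $\sum_{i} \log p(y_i \mid x_i; \theta)$. With one-hot labels the two expressions differ only by sign and the constant factor $\frac{1}{n}$, so minimising cross-entropy is equivalent to maximising log-likelihood, which is MLE. The model parameters that minimise cross-entropy are exactly those that make the training labels most probable under the model's predicted distribution.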
Q2: What is the difference between MLE and MAP? Give a practical example.
MLE finds the parameters that maximise the likelihood $p(D \mid \theta)$. MAP finds the parameters that maximise the posterior $p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)$, incorporating a prior belief about $\theta$. Practical example: in linear regression, MLE gives ordinary least squares (OLS). MAP with a Gaussian prior on the weights gives Ridge regression (L2 regularisation). MAP with a Laplace prior gives Lasso (L1 regularisation). The regularisation term in your loss function IS the negative log of your prior (up to an additive constant).
Q3: What does it mean for an estimator to be biased? Is the MLE for Gaussian variance biased?
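An estimator is biased if its expectation differs from the true parameter: $\mathbb{E}[\hat{\theta}] \neq \theta$. Yes, the MLE for Gaussian variance is biased - it divides by $n$ rather than $n-1$, giving $\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i} (x_i - \bar{x})^2$. The expected value is $\mathbb{E}[\hat{\sigma}^2_{\text{MLE}}] = \frac{n-1}{n}\sigma^2$. Bessel's correction multiplies by $\frac{n}{n-1}$ to produce the unbiased $s^2 = \frac{1}{n-1}\sum_{i} (x_i - \bar{x})^2$.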
Q4: How does the bias-variance tradeoff of estimators relate to the bias-variance tradeoff in ML models?
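They are the same tradeoff at different scales. For an estimator, $\text{MSE}(\hat{\theta}) = \text{Bias}(\hat{\theta})^2 + \text{Var}(\hat{\theta})$. For an ML model, expected generalisation error under squared loss decomposes the same way, plus a noise term that no model can remove: $\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \text{Bias}[\hat{f}(x)]^2 + \text{Var}[\hat{f}(x)] + \sigma^2$. A model with high bias (underfitting) is using a function class too simple to represent the true relationship. A model with high variance (overfitting) is sensitive to which training dataset was sampled - the estimator has high variance across datasets.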
Q5: What is the Cramér-Rao lower bound and why does it matter in ML?
The Cramér-Rao lower bound states that $\text{Var}(\hat{\theta}) \ge \frac{1}{I(\theta)}$ for any unbiased estimator $\hat{\theta}$, where $I(\theta)$ is the Fisher information. No unbiased estimator can be more precise than this. It matters in ML because: (1) it tells us the fundamental limits of parameter estimation from data, (2) MLE achieves this bound asymptotically (it is asymptotically efficient), and (3) the Fisher information matrix is the basis for natural gradient methods such as K-FAC, and Adam's second-moment estimate is sometimes interpreted as a rough diagonal approximation of the empirical Fisher. Understanding Fisher information connects training dynamics to statistical estimation theory.
