
Estimation Theory: The Math Behind How Models Learn

Reading time: ~35 min | Interview relevance: Very High | Target roles: MLE, AI Engineer, Research Engineer

The Production Scenario

Your team is training a language model. The loss function in your training script reads:

loss = F.cross_entropy(logits, targets)

A new grad joins and asks: "Why cross-entropy? Why not mean squared error? Why not something else?"

Most engineers say "that's just standard for classification" and move on. But a senior ML engineer knows the real answer: cross-entropy IS maximum likelihood estimation. The loss function you minimise during training is the negative log-likelihood of your training data under the model. This is not a coincidence or a convention - it is the mathematically principled answer to the question "what parameters best explain the data I observed?"

Estimation theory gives you this deeper understanding. It also tells you exactly where L2 regularisation comes from (it is a Bayesian prior), and it gives you the formal language to reason about whether your model's parameter estimates are good ones.

What Is an Estimator?

You observe data $x_1, x_2, \ldots, x_n$ drawn from some distribution $p(x \mid \theta)$, where $\theta$ is an unknown parameter (or parameter vector). An estimator $\hat{\theta}$ is any function of the data that produces a guess for $\theta$:

$$\hat{\theta} = g(x_1, x_2, \ldots, x_n)$$

Examples:

  • Sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ estimates the population mean $\mu$
  • Sample variance $s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$ estimates the population variance $\sigma^2$
  • The weights $\hat{w}$ from gradient descent estimate the "true" best weights $w^*$

Different estimators have different properties. Two of the most important: bias and variance.

Bias and Variance of Estimators

Bias

The bias of an estimator is the systematic error - how far off on average is our estimate from the true value?

$$\text{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta$$

An estimator is unbiased if $\mathbb{E}[\hat{\theta}] = \theta$.

Example - why divide by $n-1$ for sample variance?

Consider two estimators for variance:

$$\hat{\sigma}^2_{\text{biased}} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2 \quad \text{(MLE estimator)}$$

$$\hat{\sigma}^2_{\text{unbiased}} = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2 \quad \text{(Bessel's correction)}$$

The biased version is the MLE solution, but it systematically underestimates variance because $\bar{x}$ is itself estimated from the data. The correction factor $\frac{n}{n-1}$ compensates. In NumPy:

import numpy as np

data = np.array([2.1, 2.5, 3.0, 2.8, 2.3, 3.1, 2.7])

# Biased (MLE) - divides by n
var_biased = np.var(data, ddof=0)

# Unbiased (Bessel's correction) - divides by n-1
var_unbiased = np.var(data, ddof=1)

print(f"Biased variance: {var_biased:.4f}")
print(f"Unbiased variance: {var_unbiased:.4f}")
print(f"Ratio (n/(n-1)): {len(data)/(len(data)-1):.4f}")

# Verify: unbiased = biased * n/(n-1)
print(f"Check: {var_biased * len(data)/(len(data)-1):.4f}")
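The snippet above compares the two formulas on a single dataset, but bias is a statement about the average over many datasets. The following is a minimal simulation sketch of that average; the true variance, sample size, trial count, and seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2_true = 4.0   # assumed true variance for the simulation
n = 5               # a small n makes the bias easy to see
n_trials = 200_000  # number of independent datasets

# Each row is one independent dataset of size n
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2_true), size=(n_trials, n))

# Average each estimator over all datasets to approximate its expectation
mean_biased = np.var(samples, axis=1, ddof=0).mean()
mean_unbiased = np.var(samples, axis=1, ddof=1).mean()

print(f"Average of 1/n estimator:     {mean_biased:.3f} "
      f"(theory: {(n - 1) / n * sigma2_true:.3f})")
print(f"Average of 1/(n-1) estimator: {mean_unbiased:.3f} "
      f"(theory: {sigma2_true:.3f})")
```

The 1/n average should land near $\frac{n-1}{n}\sigma^2$, while the 1/(n-1) average should land near $\sigma^2$.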

Variance of an Estimator

The variance of an estimator measures how much it fluctuates across different datasets:

$$\text{Var}(\hat{\theta}) = \mathbb{E}\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\right]$$

A low-variance estimator gives similar answers no matter which random sample you happen to observe. This is exactly what you want in ML - a training procedure that gives you roughly the same model regardless of which mini-batches you sampled.

Mean Squared Error Decomposition

The Mean Squared Error (MSE) of an estimator decomposes into bias and variance:

$$\text{MSE}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \theta)^2] = \text{Bias}^2(\hat{\theta}) + \text{Var}(\hat{\theta})$$

Proof:

$$
\begin{aligned}
\text{MSE} &= \mathbb{E}[(\hat{\theta} - \theta)^2] \\
&= \mathbb{E}\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}] + \mathbb{E}[\hat{\theta}] - \theta)^2\right] \\
&= \mathbb{E}\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\right] + 2\underbrace{\mathbb{E}\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}])(\mathbb{E}[\hat{\theta}] - \theta)\right]}_{=0} + (\mathbb{E}[\hat{\theta}] - \theta)^2 \\
&= \text{Var}(\hat{\theta}) + \text{Bias}^2(\hat{\theta})
\end{aligned}
$$

The cross term vanishes because $\mathbb{E}[\hat{\theta}] - \theta$ is a constant and $\mathbb{E}\left[\hat{\theta} - \mathbb{E}[\hat{\theta}]\right] = 0$.

This is the bias-variance tradeoff at the estimator level - the same tradeoff you encounter in ML models.
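The decomposition can be checked numerically. The sketch below uses a deliberately biased estimator, a shrunken sample mean; the `shrink` factor and all simulation settings are hypothetical values picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true, sigma, n = 2.0, 3.0, 10
n_trials = 100_000
shrink = 0.8  # hypothetical shrinkage factor; makes the estimator biased

# One estimate per simulated dataset
samples = rng.normal(mu_true, sigma, size=(n_trials, n))
estimates = shrink * samples.mean(axis=1)

mse = np.mean((estimates - mu_true) ** 2)
bias = estimates.mean() - mu_true
variance = estimates.var()  # ddof=0, matching the definition of Var

print(f"MSE:            {mse:.4f}")
print(f"Bias^2 + Var:   {bias**2 + variance:.4f}")
print(f"Bias: {bias:.4f}, Var: {variance:.4f}")

# For comparison: the unbiased sample mean has zero bias but more variance
mse_plain = np.mean((samples.mean(axis=1) - mu_true) ** 2)
print(f"MSE of plain sample mean: {mse_plain:.4f}")
```

With these particular settings the shrunken estimator trades a little bias for lower variance, which is the tradeoff in miniature. Whether shrinkage helps depends on the true parameter and the noise level.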

:::tip ML Engineering Connection The bias-variance tradeoff of ML models (underfitting vs overfitting) is literally the bias-variance tradeoff of estimators. A model with high bias (underfitting) has a biased estimator for the true function. A model with high variance (overfitting) is a high-variance estimator that is sensitive to which training set you happened to sample. :::

Visualising Bias-Variance

[Diagram: four panels, each showing estimates (*) from different data samples scattered around the true parameter value [X]. Panels, left to right: Low Bias / Low Variance, Low Bias / High Variance, High Bias / Low Variance, High Bias / High Variance.]

Maximum Likelihood Estimation (MLE)

MLE answers the question: which parameter value makes the observed data most probable?

Given data $\mathcal{D} = \{x_1, \ldots, x_n\}$ assumed i.i.d. from $p(x \mid \theta)$, the likelihood is:

$$\mathcal{L}(\theta) = p(\mathcal{D} \mid \theta) = \prod_{i=1}^n p(x_i \mid \theta)$$

The MLE is:

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \mathcal{L}(\theta) = \arg\max_\theta \prod_{i=1}^n p(x_i \mid \theta)$$

In practice, we maximise the log-likelihood (which has the same maximiser, but products become sums - much better for numerical stability and calculus):

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \log \mathcal{L}(\theta) = \arg\max_\theta \sum_{i=1}^n \log p(x_i \mid \theta)$$

And since minimisation is more familiar in ML, we equivalently minimise the negative log-likelihood (NLL):

$$\hat{\theta}_{\text{MLE}} = \arg\min_\theta -\sum_{i=1}^n \log p(x_i \mid \theta)$$
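To make the "minimise the NLL" recipe concrete, here is a small sketch that fits the rate of an exponential distribution by numerical optimisation and compares it with the known closed-form MLE, $1/\bar{x}$. The exponential model, the true rate, and the search bounds are assumptions chosen for illustration and do not come from the text above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
rate_true = 1.5  # assumed true rate for the simulation
data = rng.exponential(scale=1.0 / rate_true, size=2_000)

def nll(rate):
    """Negative log-likelihood for p(x | rate) = rate * exp(-rate * x)."""
    return -(len(data) * np.log(rate) - rate * data.sum())

# Numerically minimise the NLL over a bounded interval of candidate rates
result = minimize_scalar(nll, bounds=(1e-6, 50.0), method="bounded")
rate_numeric = result.x

# Closed-form MLE for the exponential rate
rate_closed_form = 1.0 / data.mean()

print(f"Numerical MLE:   {rate_numeric:.4f}")
print(f"Closed-form MLE: {rate_closed_form:.4f}")
```

The two values should agree to within the optimiser's tolerance. This is the same pattern a training loop follows when no closed form is available.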

MLE for the Gaussian Distribution

Suppose $x_1, \ldots, x_n \sim \mathcal{N}(\mu, \sigma^2)$. The log-likelihood is:

$$\log \mathcal{L}(\mu, \sigma^2) = \sum_{i=1}^n \log \left[\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)\right]$$

$$= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2$$

Deriving $\hat{\mu}_{\text{MLE}}$: Take the partial derivative with respect to $\mu$ and set to zero:

$$\frac{\partial \log \mathcal{L}}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) = 0$$

$$\Rightarrow \sum_{i=1}^n (x_i - \mu) = 0 \Rightarrow \hat{\mu}_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}$$

The MLE for $\mu$ is the sample mean. This is the intuitive answer - and now we know it is the principled MLE answer.

Deriving $\hat{\sigma}^2_{\text{MLE}}$: Take the partial derivative with respect to $\sigma^2$ (let $v = \sigma^2$):

$$\frac{\partial \log \mathcal{L}}{\partial v} = -\frac{n}{2v} + \frac{1}{2v^2}\sum_{i=1}^n (x_i - \mu)^2 = 0$$

$$\Rightarrow \hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2$$

Note: this is the biased estimator (divides by $n$, not $n-1$). MLE is consistent but not always unbiased.

import numpy as np
from scipy.stats import norm

np.random.seed(42)
mu_true, sigma_true = 3.5, 1.2

n = 1000
data = np.random.normal(mu_true, sigma_true, n)

# MLE estimates (manual)
mu_mle = np.mean(data)
sigma2_mle = np.mean((data - mu_mle)**2) # biased (divides by n)
sigma2_unbiased = np.var(data, ddof=1) # unbiased (divides by n-1)

print(f"True mu: {mu_true}")
print(f"MLE mu: {mu_mle:.4f}")
print(f"True sigma^2: {sigma_true**2:.4f}")
print(f"MLE sigma^2: {sigma2_mle:.4f} (biased)")
print(f"Unbiased sigma^2: {sigma2_unbiased:.4f} (Bessel's correction)")

# Verify with scipy MLE
mu_scipy, sigma_scipy = norm.fit(data)
print(f"\nScipy MLE mu: {mu_scipy:.4f}")
print(f"Scipy MLE sigma: {sigma_scipy:.4f}")

MLE for the Bernoulli Distribution

If $x_i \in \{0, 1\}$ with $p(x=1) = p$:

$$\log \mathcal{L}(p) = \sum_{i=1}^n [x_i \log p + (1-x_i)\log(1-p)]$$

$$\frac{\partial \log \mathcal{L}}{\partial p} = \frac{\sum x_i}{p} - \frac{n - \sum x_i}{1-p} = 0$$

$$\hat{p}_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}$$

The MLE for a Bernoulli proportion is just the sample proportion. Again: the intuitive answer IS the principled MLE answer.
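A quick way to see this without calculus is to evaluate the log-likelihood on a grid of candidate values of $p$ and pick the largest. The coin-flip data and the grid resolution below are made up for illustration.

```python
import numpy as np

# Made-up coin flips: 7 ones out of 10
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

# Candidate values p = 0.01, 0.02, ..., 0.99
p_grid = np.linspace(0.01, 0.99, 99)

# Bernoulli log-likelihood evaluated at every candidate p
n_ones = x.sum()
n_zeros = len(x) - n_ones
log_lik = n_ones * np.log(p_grid) + n_zeros * np.log(1 - p_grid)

p_best = p_grid[np.argmax(log_lik)]
print(f"Grid-search MLE:   {p_best:.2f}")
print(f"Sample proportion: {x.mean():.2f}")
```

The grid maximiser coincides with the sample proportion, up to the grid spacing.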

Cross-Entropy Loss IS MLE - The Critical Connection

This is one of the most important connections in all of ML engineering.

Suppose you have a classification model $f_\theta$ that outputs probabilities $\hat{y}_i = p(y=1 \mid x_i, \theta)$ for binary classification. The likelihood of the training data under this model is:

$$\mathcal{L}(\theta) = \prod_{i=1}^n \hat{y}_i^{y_i}(1-\hat{y}_i)^{1-y_i}$$

The log-likelihood:

$$\log \mathcal{L}(\theta) = \sum_{i=1}^n \left[y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\right]$$

Maximising log-likelihood is equivalent to minimising negative log-likelihood:

$$-\frac{1}{n}\log \mathcal{L}(\theta) = -\frac{1}{n}\sum_{i=1}^n \left[y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\right]$$

This is exactly binary cross-entropy loss. For multi-class with $K$ classes:

$$\mathcal{L}_{\text{CE}} = -\frac{1}{n}\sum_{i=1}^n \sum_{k=1}^K y_{ik}\log \hat{y}_{ik}$$

This is exactly negative log-likelihood under a categorical distribution.
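The multi-class case can be sketched in plain NumPy. The example computes the loss twice, once with one-hot labels as in the formula and once by picking out the log-probability of the true class, and the two agree. The logits and targets are made-up values, and `log_softmax` here is a hand-written helper, not a library function.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

# Made-up logits for 3 examples and 3 classes, with integer class labels
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0],
                   [1.0, 1.0, 1.0]])
targets = np.array([0, 2, 1])

log_probs = log_softmax(logits)

# Cross-entropy with one-hot labels: -1/n * sum_i sum_k y_ik * log(y_hat_ik)
one_hot = np.eye(logits.shape[1])[targets]
ce_one_hot = -np.mean(np.sum(one_hot * log_probs, axis=1))

# Categorical NLL: mean negative log-probability of the true class
nll = -np.mean(log_probs[np.arange(len(targets)), targets])

print(f"Cross-entropy (one-hot form): {ce_one_hot:.6f}")
print(f"Categorical NLL:              {nll:.6f}")
```

As far as the PyTorch documentation describes it, `F.cross_entropy` with class-index targets and the default mean reduction computes this same quantity from raw logits.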

:::tip ML Engineering Connection Every time you write F.cross_entropy(logits, targets) in PyTorch, you are running Maximum Likelihood Estimation. The model parameters that minimise cross-entropy are exactly the parameters that make your training data most probable under the model's probability distribution. This is not a heuristic - it is the mathematically optimal answer to "what parameters best fit this data?" :::

import numpy as np

def cross_entropy_loss(y_true, y_pred, eps=1e-15):
    """Binary cross-entropy = negative log-likelihood."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def negative_log_likelihood(y_true, y_pred, eps=1e-15):
    """Bernoulli negative log-likelihood."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    log_lik = np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return -log_lik / len(y_true)

# These are identical
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([0.9, 0.1, 0.8, 0.7, 0.3])

print(f"Cross-entropy loss: {cross_entropy_loss(y_true, y_pred):.6f}")
print(f"Negative log-likelihood: {negative_log_likelihood(y_true, y_pred):.6f}")
# Output will be identical

MSE Loss IS MLE for Gaussian Noise

Similarly, Mean Squared Error (used for regression) is the MLE estimator assuming Gaussian noise:

If $y_i = f_\theta(x_i) + \epsilon_i$ where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, then:

$$p(y_i \mid x_i, \theta) = \mathcal{N}(f_\theta(x_i), \sigma^2)$$

The log-likelihood:

$$\log \mathcal{L}(\theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - f_\theta(x_i))^2$$

Maximising this with respect to $\theta$ is equivalent to minimising:

$$\sum_{i=1}^n (y_i - f_\theta(x_i))^2 = \text{MSE loss (up to a constant)}$$

MSE loss = MLE under Gaussian noise assumption. When you switch from MSE to Huber loss, you are implicitly changing the noise distribution from Gaussian to something with heavier tails (more robust to outliers).
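As a sanity check on the Gaussian case, the sketch below minimises the Gaussian NLL of a linear model with a general-purpose optimiser and compares the result with the ordinary least-squares solution. The data, the noise level `sigma`, and the choice of `scipy.optimize.minimize` with its default method are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one feature
w_true = np.array([1.0, -2.0])
sigma = 0.5  # assumed known noise standard deviation
y = X @ w_true + sigma * rng.normal(size=n)

def gaussian_nll(w):
    """Negative log-likelihood of y under N(X @ w, sigma^2)."""
    resid = y - X @ w
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + resid @ resid / (2 * sigma**2)

# MLE via numerical minimisation of the NLL
w_nll = minimize(gaussian_nll, x0=np.zeros(2)).x

# Least-squares solution (minimises the sum of squared errors directly)
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"Gaussian-NLL minimiser: {w_nll.round(4)}")
print(f"Least-squares solution: {w_ols.round(4)}")
```

The value of `sigma` only rescales and shifts the NLL, so it does not move the minimiser. That is why MSE can be used without knowing the noise variance.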

| Loss Function | Implicit Noise Model |
| --- | --- |
| MSE | Gaussian $\mathcal{N}(0, \sigma^2)$ |
| MAE | Laplace distribution |
| Huber loss | Gaussian for small errors, Laplace for large |
| Cross-entropy | Bernoulli / Categorical |
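The practical consequence of the first two rows shows up even when the "model" is a single constant prediction: minimising MSE recovers the mean, while minimising MAE recovers the median, which is far less sensitive to outliers. The data and the grid of candidate predictions below are made up for illustration.

```python
import numpy as np

# Made-up targets with one large outlier
y = np.array([1.0, 2.0, 2.5, 3.0, 50.0])

# Candidate constant predictions: 0.00, 0.01, ..., 60.00
candidates = np.linspace(0.0, 60.0, 6001)

# Loss of each candidate against all targets (rows: candidates, columns: data)
errors = y[None, :] - candidates[:, None]
mse = (errors ** 2).mean(axis=1)
mae = np.abs(errors).mean(axis=1)

best_mse = candidates[np.argmin(mse)]
best_mae = candidates[np.argmin(mae)]

print(f"MSE minimiser: {best_mse:.2f} (mean   = {y.mean():.2f})")
print(f"MAE minimiser: {best_mae:.2f} (median = {np.median(y):.2f})")
```

The outlier drags the MSE minimiser far from the bulk of the data, while the MAE minimiser stays put.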

Maximum A Posteriori (MAP) Estimation

MLE only uses the likelihood - it ignores any prior knowledge about $\theta$. MAP estimation incorporates a prior distribution $p(\theta)$:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta p(\theta \mid \mathcal{D}) = \arg\max_\theta p(\mathcal{D} \mid \theta) \cdot p(\theta)$$

Taking the log:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \left[\log \mathcal{L}(\theta) + \log p(\theta)\right]$$

This is equivalent to minimising the regularised loss:

$$\hat{\theta}_{\text{MAP}} = \arg\min_\theta \left[-\log \mathcal{L}(\theta) - \log p(\theta)\right]$$

L2 Regularisation = Gaussian Prior

If $p(\theta) = \mathcal{N}(0, \lambda^{-1}I)$ (a Gaussian prior centred at zero):

$$\log p(\theta) = -\frac{\lambda}{2}\|\theta\|^2 + \text{const}$$

The MAP objective becomes:

$$\hat{\theta}_{\text{MAP}} = \arg\min_\theta \left[\text{NLL}(\theta) + \frac{\lambda}{2}\|\theta\|^2\right]$$

This is exactly L2 regularisation (Ridge). The regularisation strength $\lambda$ is the precision of the Gaussian prior - a stronger prior means we are more confident weights should be near zero.
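A one-parameter example makes the shrinkage visible. Assume, purely for illustration, observations $x_i \sim \mathcal{N}(\theta, 1)$ with the prior $\theta \sim \mathcal{N}(0, \lambda^{-1})$. Setting the derivative of the MAP objective to zero gives $\hat{\theta}_{\text{MAP}} = \frac{\sum_i x_i}{n + \lambda}$, which the sketch checks against a numerical minimiser. The unit noise variance, the value of `lam`, and the sample size are all assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
theta_true, n, lam = 2.0, 5, 3.0  # lam is the prior precision (assumed)
x = rng.normal(theta_true, 1.0, size=n)

def map_objective(theta):
    """NLL under N(theta, 1) plus the (lam / 2) * theta^2 penalty from the prior."""
    nll = 0.5 * np.sum((x - theta) ** 2)
    penalty = 0.5 * lam * theta ** 2
    return nll + penalty

theta_mle = x.mean()                     # no prior: plain sample mean
theta_map_closed = x.sum() / (n + lam)   # closed-form MAP estimate
theta_map_numeric = minimize_scalar(map_objective).x

print(f"MLE:                {theta_mle:.4f}")
print(f"MAP (closed form):  {theta_map_closed:.4f}")
print(f"MAP (numerical):    {theta_map_numeric:.4f}")
```

The MAP estimate is the MLE scaled by $\frac{n}{n+\lambda}$, so the prior's pull towards zero fades as $n$ grows.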

L1 Regularisation = Laplace Prior

If $p(\theta) \propto \exp(-\lambda\|\theta\|_1)$ (a Laplace prior):

$$\log p(\theta) = -\lambda\|\theta\|_1 + \text{const}$$

The MAP objective:

$$\hat{\theta}_{\text{MAP}} = \arg\min_\theta \left[\text{NLL}(\theta) + \lambda\|\theta\|_1\right]$$

This is L1 regularisation (Lasso). The Laplace prior has heavier tails than the Gaussian and places more probability mass near zero, which promotes sparsity.
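The sparsity effect is easiest to see in one dimension. For the simplified objective $\frac{1}{2}(z - \theta)^2 + \lambda|\theta|$, the minimiser is the soft-thresholding rule $\text{sign}(z)\max(|z| - \lambda, 0)$, whereas the corresponding L2 penalty $\frac{\lambda}{2}\theta^2$ only rescales $z$. The sketch below applies both to a few made-up values; the one-dimensional quadratic objective and the helper names are assumptions for illustration, not a full Lasso solver.

```python
import numpy as np

def soft_threshold(z, lam):
    """Minimiser of 0.5 * (z - theta)^2 + lam * |theta| (L1 / Laplace prior)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_shrink(z, lam):
    """Minimiser of 0.5 * (z - theta)^2 + 0.5 * lam * theta^2 (L2 / Gaussian prior)."""
    return z / (1.0 + lam)

z = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])  # made-up unregularised estimates
lam = 0.5

theta_l1 = soft_threshold(z, lam)
theta_l2 = ridge_shrink(z, lam)

print(f"Unregularised: {z}")
print(f"L1 (soft threshold): {theta_l1}")
print(f"L2 (rescaling):      {theta_l2.round(3)}")
print(f"Exact zeros under L1: {np.sum(theta_l1 == 0)}, under L2: {np.sum(theta_l2 == 0)}")
```

Values with $|z| \le \lambda$ are set exactly to zero under L1, while L2 shrinks every value but leaves none at zero.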

import numpy as np
from numpy.linalg import norm

np.random.seed(42)
n, d = 100, 50 # n samples, d features (n > d, so OLS has a unique solution)

X = np.random.randn(n, d)
w_true = np.zeros(d)
w_true[:5] = [1.0, -0.5, 0.8, -1.2, 0.3] # sparse true weights
y = X @ w_true + 0.1 * np.random.randn(n)

# MLE solution (OLS) - use lstsq for numerical stability
w_mle, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

# MAP solution (Ridge regression) - L2 prior
# Normal equations with regularization: w_map = (X^T X + lambda I)^{-1} X^T y
lambda_reg = 1.0
A = X.T @ X + lambda_reg * np.eye(d)
b = X.T @ y
w_map = np.linalg.solve(A, b)

print("MLE vs MAP estimation (n > d):")
print(f"MLE weights norm: {norm(w_mle):.3f}")
print(f"MAP weights norm: {norm(w_map):.3f}")
print(f"True weights norm: {norm(w_true):.3f}")

mle_error = norm(w_mle - w_true)
map_error = norm(w_map - w_true)
print(f"\nMLE estimation error: {mle_error:.3f}")
print(f"MAP estimation error: {map_error:.3f}")
print(f"MAP improvement: {((mle_error - map_error)/mle_error)*100:.1f}%")

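:::note The Regularisation Story Many regularisation techniques in ML have a Bayesian interpretation. When you add an L2 penalty to the loss, you are performing MAP estimation with a Gaussian prior. With plain SGD this is the same as weight decay; with adaptive optimisers such as Adam, decoupled weight decay (AdamW) is no longer exactly an L2 penalty, so the correspondence is only approximate. When you add L1 to a linear model, you are performing MAP with a Laplace prior. The hyperparameter $\lambda$ controls how strong your prior belief is that weights should be small. :::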

Properties of Good Estimators

Consistency

An estimator $\hat{\theta}_n$ is consistent if it converges to the true value as $n \to \infty$:

$$\hat{\theta}_n \xrightarrow{p} \theta \quad \text{as } n \to \infty$$

MLE estimators are consistent under mild regularity conditions - this is why "more data helps": your model parameters converge to the optimal values with enough data.
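Consistency can be observed directly by repeating an experiment at increasing sample sizes and watching the typical error of the estimate shrink. The sketch below does this for the sample mean of a Gaussian; the true parameters, the sample sizes, and the number of repeats are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
mu_true, sigma = 3.5, 1.2
n_repeats = 2_000  # independent datasets per sample size

rmse_by_n = {}
for n in [10, 100, 1_000]:
    # One MLE (sample mean) per simulated dataset of size n
    estimates = rng.normal(mu_true, sigma, size=(n_repeats, n)).mean(axis=1)
    rmse_by_n[n] = np.sqrt(np.mean((estimates - mu_true) ** 2))
    print(f"n = {n:>5} | RMSE of MLE: {rmse_by_n[n]:.4f} "
          f"| theory sigma/sqrt(n): {sigma / np.sqrt(n):.4f}")
```

Each tenfold increase in $n$ should cut the error by roughly $\sqrt{10}$, which is also a reminder that the returns from more data diminish.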

Efficiency and the Cramér-Rao Bound

An estimator is efficient if it achieves the lowest possible variance among all unbiased estimators. The Cramér-Rao lower bound establishes this minimum variance:

$$\text{Var}(\hat{\theta}) \geq \frac{1}{I(\theta)}$$

where $I(\theta)$ is the Fisher information:

$$I(\theta) = -\mathbb{E}\left[\frac{\partial^2 \log p(x \mid \theta)}{\partial \theta^2}\right]$$

Here $p(x \mid \theta)$ is the density of the whole sample. For $n$ i.i.d. observations the Fisher information is $n$ times that of a single observation, so the bound shrinks as $1/n$.

MLE is asymptotically efficient - for large $n$, the MLE achieves the Cramér-Rao bound. No other consistent estimator can do better asymptotically.

import numpy as np

def cramer_rao_bound(sigma2, n):
    """Minimum variance from Cramér-Rao bound for Gaussian mean."""
    # Fisher information for mean: I(mu) = n/sigma^2
    return sigma2 / n

sigma2_true = 4.0
sample_sizes = [10, 50, 100, 500, 1000]

print("Cramér-Rao Bound vs Empirical Variance of Sample Mean:")
print(f"{'n':>8} | {'CRB (min var)':>14} | {'Empirical Var':>14} | {'Ratio':>8}")
print("-" * 54)

for n in sample_sizes:
    crb = cramer_rao_bound(sigma2_true, n)
    n_experiments = 10_000
    means = [np.mean(np.random.normal(0, np.sqrt(sigma2_true), n))
             for _ in range(n_experiments)]
    empirical_var = np.var(means)
    print(f"{n:>8} | {crb:>14.6f} | {empirical_var:>14.6f} | {empirical_var/crb:>8.4f}")
    # Ratio should be ~1.0 - sample mean achieves the CRB
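The Fisher information itself can also be estimated by simulation. For a single Bernoulli observation the exact value is $\frac{1}{p(1-p)}$, and it can be computed either as the expected squared score or as the negative expected second derivative of the log-density. The sketch below checks both against the exact value; the value of `p` and the sample count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
p = 0.3
x = rng.binomial(1, p, size=500_000).astype(float)

# Score: d/dp log p(x | p) = x/p - (1 - x)/(1 - p)
score = x / p - (1 - x) / (1 - p)

# Second derivative: d^2/dp^2 log p(x | p) = -x/p^2 - (1 - x)/(1 - p)^2
second_deriv = -x / p**2 - (1 - x) / (1 - p) ** 2

fisher_from_score = np.mean(score ** 2)          # E[score^2]
fisher_from_curvature = -np.mean(second_deriv)   # -E[second derivative]
fisher_exact = 1.0 / (p * (1 - p))

print(f"E[score^2]:           {fisher_from_score:.4f}")
print(f"-E[second derivative]: {fisher_from_curvature:.4f}")
print(f"Exact 1/(p(1-p)):      {fisher_exact:.4f}")
```

For binary data the two Monte Carlo estimates agree sample by sample, so they print the same number. In general they only agree in expectation.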

Summary of Estimator Properties

| Property | Definition | ML Implication |
| --- | --- | --- |
| Unbiased | $\mathbb{E}[\hat{\theta}] = \theta$ | Sample mean is unbiased for $\mu$ |
| Consistent | $\hat{\theta}_n \to \theta$ as $n \to \infty$ | MLE converges with more data |
| Efficient | Achieves Cramér-Rao bound | MLE is asymptotically efficient |
| Sufficient | Uses all info in data | MLE uses sufficient statistics |

MLE vs MAP vs Full Bayesian

| Approach | Formula | Output | Regularisation | Use Case |
| --- | --- | --- | --- | --- |
| MLE | $\arg\max_\theta p(\mathcal{D} \mid \theta)$ | Point estimate | None | Large data, no prior |
| MAP | $\arg\max_\theta p(\mathcal{D} \mid \theta)\,p(\theta)$ | Point estimate | Yes (via prior) | Small data, prior known |
| Full Bayesian | $p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\,p(\theta)$ | Distribution | Yes (via prior) | Uncertainty needed |

MLE overfits in small-data regimes. MAP adds regularisation. Full Bayesian inference (covered in Module 06) gives a distribution over parameters - uncertainty quantification rather than just a point estimate.
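The three approaches can be compared side by side on a coin-flip problem, where a Beta prior makes the posterior available in closed form. The data (7 heads in 10 flips) and the Beta(2, 2) prior are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

heads, n = 7, 10               # made-up data: 7 heads in 10 flips
a_prior, b_prior = 2.0, 2.0    # assumed Beta(2, 2) prior, mildly favouring p near 0.5

# MLE: the sample proportion
p_mle = heads / n

# Conjugate update: Beta prior + Bernoulli data gives a Beta posterior
a_post = a_prior + heads
b_post = b_prior + (n - heads)

# MAP: mode of the Beta posterior (formula valid when a_post, b_post > 1)
p_map = (a_post - 1) / (a_post + b_post - 2)

# Full Bayesian: the whole posterior, summarised by its mean and a 95% interval
posterior = stats.beta(a_post, b_post)
ci_low, ci_high = posterior.ppf([0.025, 0.975])

print(f"MLE:            {p_mle:.3f}")
print(f"MAP:            {p_map:.3f}")
print(f"Posterior mean: {posterior.mean():.3f}")
print(f"95% credible interval: [{ci_low:.3f}, {ci_high:.3f}]")
```

MLE and MAP each return one number, with MAP pulled slightly towards the prior. Only the posterior reports how uncertain that number is.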

:::warning Common Interview Trap When asked "what is MLE?", many candidates say "you maximise the likelihood." That is correct but incomplete. The full answer: MLE finds the parameters $\hat{\theta}$ such that the observed data is most probable. It is equivalent to minimising cross-entropy loss for classification and MSE loss for regression with Gaussian noise. The regularisation you add converts it from MLE to MAP. :::

Putting It Together: The Training Loop as MLE

import numpy as np

# This simplified training loop IS MLE
def logistic_regression_mle(X, y, learning_rate=0.01, epochs=1000):
    """
    Train logistic regression by minimising NLL = MLE.
    The loss is the same quantity torch.nn.BCELoss() computes
    (mean binary cross-entropy on predicted probabilities).
    """
    n, d = X.shape
    theta = np.zeros(d)

    for epoch in range(epochs):
        # Forward pass: predict probabilities
        logits = X @ theta
        y_hat = 1 / (1 + np.exp(-logits)) # sigmoid

        # Loss = negative log-likelihood = cross-entropy
        eps = 1e-15
        nll = -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

        # Gradient of NLL w.r.t. theta
        grad = X.T @ (y_hat - y) / n

        # Gradient descent step (MLE update)
        theta -= learning_rate * grad

        if epoch % 100 == 0:
            print(f"Epoch {epoch:4d} | NLL: {nll:.4f}")

    return theta

# Generate synthetic data
np.random.seed(42)
n = 500
X = np.column_stack([np.ones(n), np.random.randn(n, 2)])
w_true = np.array([0.5, 1.2, -0.8])
probs = 1 / (1 + np.exp(-X @ w_true))
y = (np.random.rand(n) < probs).astype(float)

w_mle = logistic_regression_mle(X, y)
print(f"\nTrue weights: {w_true}")
print(f"MLE weights: {w_mle.round(3)}")

The training loop above is exactly MLE. When you add lambda * np.sum(theta**2) to the NLL, you convert it to MAP with a Gaussian prior - which is L2 regularised logistic regression.
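A minimal sketch of that change is shown below: the only difference from an MLE loop is the extra gradient term `2 * lam * theta`, which is the derivative of `lam * np.sum(theta**2)`. The function name `fit_logistic`, the value of `lam`, the learning rate, and the synthetic data are all assumptions for illustration. Note that this simple version also penalises the intercept, which is often excluded in practice.

```python
import numpy as np

def fit_logistic(X, y, lam=0.0, learning_rate=0.1, epochs=3000):
    """Gradient descent on mean NLL + lam * ||theta||^2 (lam=0 recovers MLE)."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        y_hat = 1.0 / (1.0 + np.exp(-(X @ theta)))  # sigmoid
        grad_nll = X.T @ (y_hat - y) / n            # gradient of the mean NLL
        grad_penalty = 2.0 * lam * theta            # gradient of lam * sum(theta**2)
        theta -= learning_rate * (grad_nll + grad_penalty)
    return theta

# Synthetic data with an intercept column and two features
rng = np.random.default_rng(7)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
w_true = np.array([0.5, 1.2, -0.8])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

theta_mle = fit_logistic(X, y, lam=0.0)    # no prior
theta_map = fit_logistic(X, y, lam=0.05)   # Gaussian prior via L2 penalty

print(f"True weights: {w_true}")
print(f"MLE weights:  {theta_mle.round(3)} (norm {np.linalg.norm(theta_mle):.3f})")
print(f"MAP weights:  {theta_map.round(3)} (norm {np.linalg.norm(theta_map):.3f})")
```

The MAP weights come out with a smaller norm than the MLE weights, which is the prior pulling the solution towards zero.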

Interview Q&A

Q1: Why does training a neural network with cross-entropy loss correspond to MLE?

Cross-entropy loss is $-\frac{1}{n}\sum_i \sum_k y_{ik}\log\hat{y}_{ik}$. If the model outputs a categorical distribution $\hat{y}_{ik} = p(y=k \mid x_i;\theta)$, then the log-likelihood of the training data under this model is $\sum_i \sum_k y_{ik}\log p(y=k \mid x_i;\theta)$. Minimising cross-entropy is equivalent to maximising log-likelihood, which is MLE. The model parameters that minimise cross-entropy are exactly those that make the training labels most probable under the model's predicted distribution.

Q2: What is the difference between MLE and MAP? Give a practical example.

MLE finds the parameters that maximise the likelihood $p(\mathcal{D} \mid \theta)$. MAP finds the parameters that maximise the posterior $p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\,p(\theta)$, incorporating a prior belief about $\theta$. Practical example: in linear regression, MLE gives ordinary least squares (OLS). MAP with a Gaussian prior on the weights gives Ridge regression (L2 regularisation). MAP with a Laplace prior gives Lasso (L1 regularisation). The regularisation term in your loss function IS the negative log of your prior.

Q3: What does it mean for an estimator to be biased? Is the MLE for Gaussian variance biased?

An estimator is biased if its expectation differs from the true parameter: $\mathbb{E}[\hat{\theta}] \neq \theta$. Yes, the MLE for Gaussian variance is biased - it divides by $n$ rather than $n-1$, giving $\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum(x_i-\bar{x})^2$. The expected value is $\frac{n-1}{n}\sigma^2 < \sigma^2$. Bessel's correction multiplies by $\frac{n}{n-1}$ to produce the unbiased $\frac{1}{n-1}\sum(x_i-\bar{x})^2$.

Q4: How does the bias-variance tradeoff of estimators relate to the bias-variance tradeoff in ML models?

They are the same tradeoff at different scales. For an estimator, $\text{MSE}(\hat{\theta}) = \text{Bias}^2(\hat{\theta}) + \text{Var}(\hat{\theta})$. For an ML model, expected generalisation error decomposes the same way: $\text{Bias}^2 + \text{Variance} + \text{Irreducible noise}$. A model with high bias (underfitting) is using a function class too simple to represent the true relationship. A model with high variance (overfitting) is sensitive to which training dataset was sampled - the estimator has high variance across datasets.

Q5: What is the Cramér-Rao lower bound and why does it matter in ML?

The Cramér-Rao lower bound states that for any unbiased estimator $\hat{\theta}$, $\text{Var}(\hat{\theta}) \geq 1/I(\theta)$ where $I(\theta)$ is the Fisher information. No unbiased estimator can be more precise than this. It matters in ML because: (1) it tells us the fundamental limits of parameter estimation from data, (2) MLE achieves this bound asymptotically (it is asymptotically efficient), and (3) the Fisher information matrix is the basis for natural gradient methods: K-FAC approximates it with Kronecker-factored blocks, and Adam's second-moment estimate is often loosely interpreted as a diagonal approximation to the empirical Fisher. Understanding Fisher information connects training dynamics to statistical estimation theory.


© 2026 EngineersOfAI. All rights reserved.