Expectation, Variance, and Moments
Reading time: ~45 minutes | Interview relevance: Very High | Target roles: ML Engineer, AI Engineer, Research Scientist, Data Scientist, MLOps Engineer
The ML Scenario That Motivates This Lesson
You are training a transformer model and you notice the training loss oscillates wildly. Someone suggests increasing the batch size. Why does that help?
The answer requires understanding variance of estimators. The gradient of the true loss function (over the full training set) is the expected gradient over the data distribution. The mini-batch gradient is an estimate of this expectation. Its variance decreases as where is the batch size - doubling the batch size halves the gradient variance, leading to more stable updates.
Similarly, Batch Normalization works by computing the mean and variance of activations within each mini-batch and normalizing by them. These are moment statistics. The learnable parameters and re-scale and re-shift to optimal mean and variance. Without understanding expectation and variance, BatchNorm is a mystery; with these tools, it's a natural solution.
1. Expected Value
The expected value (or expectation, or mean) is the probability-weighted average of all possible values of .
Discrete RV
Example: Fair die:
Continuous RV
Example: Standard normal: (by symmetry)
Expected Value of a Function
For any function :
This is called the Law of the Unconscious Statistician (LOTUS) - you don't need the distribution of to compute .
import numpy as np
from scipy.stats import norm, binom
# Discrete: expected value of a die
die_faces = np.arange(1, 7)
die_probs = np.ones(6) / 6
ev_die = np.sum(die_faces * die_probs)
print(f"E[X] for fair die: {ev_die:.4f}") # 3.5
# E[X^2] for a fair die (using LOTUS)
ev_x_squared = np.sum(die_faces**2 * die_probs)
print(f"E[X^2] for fair die: {ev_x_squared:.4f}") # 15.1667
# Continuous: verify E[X] = 0 for N(0,1) via integration
from scipy import integrate
integrand = lambda x: x * norm.pdf(x)
ev_normal, _ = integrate.quad(integrand, -np.inf, np.inf)
print(f"E[X] for N(0,1): {ev_normal:.8f}") # ~0
2. Linearity of Expectation
The most important property of expectation:
This holds even when and are dependent - a remarkable fact.
More generally, for any constants and random variables :
Why This Is Powerful
Example 1: Expected number of heads in flips
Instead of computing from the Binomial PMF, write where each . Then:
Example 2: Expected loss over a dataset
If the total loss is , then:
By the Law of Large Numbers (Lesson 07), as , the empirical average converges to this expectation.
:::tip Linearity of Expectation in SGD The stochastic gradient is an unbiased estimator of the full gradient:
This unbiasedness is guaranteed by linearity of expectation. It is why SGD converges to a minimum of the true loss function (under appropriate learning rate schedules), not just a minimum of the training batch. :::
3. Variance
Variance measures the spread of a distribution around its mean:
Computational Formula
A simpler formula for computing variance:
Proof:
Standard Deviation
Standard deviation has the same units as , making it more interpretable than variance.
Properties of Variance
| Property | Formula | Notes |
|---|---|---|
| Constant shift | Shifting doesn't change spread | |
| Scaling | Scaling squares the variance | |
| Sum of independent | Only if | |
| Non-negative | Zero iff is a constant |
import numpy as np
np.random.seed(42)
n = 1_000_000
X = np.random.randn(n) # N(0,1)
# Variance properties
c = 5.0
a = 3.0
print(f"Var(X) = {X.var():.4f} (expected: 1.0)")
print(f"Var(X + c) = {(X + c).var():.4f} (expected: 1.0)")
print(f"Var(a*X) = {(a*X).var():.4f} (expected: {a**2:.1f})")
print(f"Std(X) = {X.std():.4f} (expected: 1.0)")
# Computational formula: Var(X) = E[X^2] - E[X]^2
ev_x2 = (X**2).mean()
ev_x = X.mean()
var_formula = ev_x2 - ev_x**2
print(f"\nE[X^2] - E[X]^2 = {var_formula:.6f} (direct: {X.var():.6f})")
# Variance of sum: independent vs dependent
Y = np.random.randn(n) # independent of X
Z_indep = X + Y
print(f"\nVar(X+Y) where X,Y independent:")
print(f" Direct: {Z_indep.var():.4f}, Formula: {X.var() + Y.var():.4f}")
# Make Y dependent on X
Y_dep = 0.8 * X + 0.6 * np.random.randn(n)
Z_dep = X + Y_dep
print(f"\nVar(X+Y) where Y = 0.8X + noise:")
print(f" Direct: {Z_dep.var():.4f}")
print(f" (higher than {X.var() + Y_dep.var():.4f} because of positive covariance)")
4. Covariance and Correlation
Covariance
Covariance measures the joint variability of two random variables:
Computational formula:
| Covariance Value | Interpretation |
|---|---|
| and tend to increase together | |
| and are uncorrelated | |
| When increases, tends to decrease |
Key identity:
Sum of correlated variables:
Correlation Coefficient
Normalized covariance that lives in :
- : perfect positive linear relationship
- : perfect negative linear relationship
- : uncorrelated (but may still be dependent!)
:::warning Uncorrelated Does Not Mean Independent means no linear relationship. Two random variables can be perfectly dependent but uncorrelated. Example: , . Then (odd moments of symmetric distributions are zero). Yet is completely determined by .
In ML: features can be uncorrelated but still redundant or dependent in nonlinear ways. :::
import numpy as np
np.random.seed(42)
n = 100_000
# X and Y with controlled correlation
rho = 0.8
X = np.random.randn(n)
Y = rho * X + np.sqrt(1 - rho**2) * np.random.randn(n)
print(f"Target rho = {rho}")
print(f"Empirical Cov(X,Y) = {np.cov(X, Y)[0,1]:.4f}")
print(f"Empirical Corr(X,Y) = {np.corrcoef(X, Y)[0,1]:.4f}")
# Demonstrate: uncorrelated but dependent
X2 = np.random.randn(n)
Y2 = X2**2
print(f"\nX and Y=X^2:")
print(f"Correlation = {np.corrcoef(X2, Y2)[0,1]:.4f} (near 0 - uncorrelated)")
print(f"But Y is completely determined by X (dependent!)")
Covariance Matrix
For a random vector , the covariance matrix is:
Properties:
- is symmetric:
- is positive semi-definite: for all
- Diagonal entries are variances:
import numpy as np
# Compute empirical covariance matrix
np.random.seed(42)
n_samples, n_features = 1000, 4
# Create correlated features
A = np.array([[1.0, 0.5, 0.0, -0.3],
[0.5, 1.0, 0.2, 0.0],
[0.0, 0.2, 1.0, 0.4],
[-0.3, 0.0, 0.4, 1.0]])
# Generate data with this covariance structure
L = np.linalg.cholesky(A) # Cholesky decomposition
X = (L @ np.random.randn(n_features, n_samples)).T # (n_samples, n_features)
Sigma_empirical = np.cov(X.T)
print("Empirical covariance matrix:")
print(Sigma_empirical.round(3))
print(f"\nSymmetric: {np.allclose(Sigma_empirical, Sigma_empirical.T)}")
print(f"PSD (all eigenvalues >= 0): {np.all(np.linalg.eigvals(Sigma_empirical) >= 0)}")
5. Higher Moments
The -th moment of about the origin:
The -th central moment (about the mean ):
| Moment | Quantity | Formula |
|---|---|---|
| 1st central | Mean | |
| 2nd central | Variance | |
| 3rd standardized | Skewness | |
| 4th standardized | Kurtosis |
Skewness
- Positive skewness: long right tail (e.g., income distributions, word frequencies)
- Negative skewness: long left tail
- Zero skewness: symmetric distribution (e.g., Gaussian)
Kurtosis
- Gaussian: kurtosis = 3 (or 0 in excess kurtosis convention)
- Higher kurtosis ("leptokurtic"): heavier tails, more extreme outliers
- Lower kurtosis ("platykurtic"): lighter tails
Skewness vs Kurtosis:
Normal: Positive Skew: High Kurtosis:
┌───┐ ┌─┐ ┌─┐
│ │ │ \ ││ │
─┤ ├─ ─┤ \─ ──┤ ├──
└───┘ └───── └─┘
Symmetric Long right tail Heavy tails
from scipy.stats import norm, expon, t as t_dist
import numpy as np
np.random.seed(42)
n = 100_000
distributions = {
'Normal N(0,1)': np.random.randn(n),
'Exponential(1)': np.random.exponential(1, n),
't-distribution(df=3)': np.random.standard_t(df=3, size=n),
}
from scipy.stats import skew, kurtosis
print(f"{'Distribution':<25} {'Skewness':>10} {'Kurtosis':>10}")
print("-" * 47)
for name, samples in distributions.items():
sk = skew(samples)
ku = kurtosis(samples, fisher=False) # Fisher=False gives Pearson kurtosis
print(f"{name:<25} {sk:>10.4f} {ku:>10.4f}")
# Normal: skewness≈0, kurtosis≈3
# Exponential: skewness≈2, kurtosis≈9 (right-skewed, heavy tails)
# t(df=3): skewness≈0 (symmetric), kurtosis>>3 (very heavy tails)
6. Batch Normalization: Moments in Practice
Batch Normalization normalizes activations using the empirical mean and variance of each feature across the mini-batch:
where:
Then it applies learnable affine transform:
BatchNorm steps:
x_i (activations)
│
▼ Compute batch mean μ and variance σ²
│
▼ Normalize: x_hat = (x - μ) / sqrt(σ² + ε)
│
▼ Rescale: y = γ * x_hat + β (learned γ, β)
│
▼ y_i (normalized activations)
Why does it work? Deep networks suffer from internal covariate shift - the distribution of activations changes as weights update. BatchNorm forces each layer's input to have approximately zero mean and unit variance (before the transform), stabilizing training.
import numpy as np
def batch_norm(X, gamma=1.0, beta=0.0, eps=1e-5):
"""
X: shape (batch_size, features)
Computes batch normalization along the batch dimension.
"""
mu = X.mean(axis=0) # mean over batch: shape (features,)
sigma2 = X.var(axis=0) # variance over batch
X_norm = (X - mu) / np.sqrt(sigma2 + eps) # normalize
return gamma * X_norm + beta # scale and shift
# Simulate a mini-batch of activations
np.random.seed(42)
batch = np.random.randn(32, 64) * 5 + 3 # Non-standard: mean=3, std=5
print(f"Before BatchNorm: mean={batch.mean():.2f}, std={batch.std():.2f}")
normed = batch_norm(batch, gamma=1.0, beta=0.0)
print(f"After BatchNorm: mean={normed.mean():.4f}, std={normed.std():.4f}")
# After BN with gamma=2, beta=0.5
normed_gs = batch_norm(batch, gamma=2.0, beta=0.5)
print(f"With gamma=2, beta=0.5: mean={normed_gs.mean():.4f}, std={normed_gs.std():.4f}")
7. Gradient Variance and Batch Size
Recall from SGD: the stochastic gradient is:
By linearity of expectation: (unbiased).
What about variance? Assuming the per-sample gradients are independent:
Gradient variance scales as .
This explains:
- Larger batches → lower gradient variance → more stable, larger learning rates possible
- Very small batches → high variance → acts as noise/regularization (can help escape local minima)
- The "linear scaling rule": if you multiply batch size by , multiply learning rate by (to keep the effective step size consistent with the variance reduction)
import numpy as np
np.random.seed(42)
n_params = 1
# Simulate per-sample gradients (iid draws from N(1, 4))
true_gradient = 1.0
sample_std = 2.0
N = 10_000 # full dataset
per_sample_grads = true_gradient + sample_std * np.random.randn(N)
def estimate_gradient_variance(per_sample_grads, batch_size, n_estimates=5000):
"""Simulate gradient variance for given batch size."""
estimates = []
for _ in range(n_estimates):
batch_idx = np.random.choice(len(per_sample_grads), batch_size, replace=False)
batch_grad = per_sample_grads[batch_idx].mean()
estimates.append(batch_grad)
return np.var(estimates)
print(f"True gradient: {true_gradient}, Per-sample std: {sample_std}")
print(f"\nBatch size | Gradient Var | Expected Var (σ²/B)")
print("-" * 50)
for B in [1, 4, 16, 64, 256]:
empirical_var = estimate_gradient_variance(per_sample_grads, B, n_estimates=2000)
expected_var = sample_std**2 / B
print(f" B={B:<5} | {empirical_var:.5f} | {expected_var:.5f}")
8. ML Connection Summary
| ML Concept | Probabilistic Quantity | Lesson |
|---|---|---|
| Training loss | Expectation | |
| SGD gradient | Unbiased estimate of | Linearity of expectation |
| Gradient stability | Variance | |
| Batch Normalization | Normalize by and | Mean and variance |
| Feature redundancy | High | Covariance |
| PCA | Eigendecomposition of covariance matrix | Covariance matrix |
| Activation distribution | Skewness, kurtosis | Higher moments |
| Weight initialization | Variance of weights → variance of activations | Variance propagation |
9. Interview Q&A
Q1: What is the linearity of expectation and why does it matter for SGD?
A: Linearity of expectation states that for any random variables (even if they are dependent) and any constants . For SGD, the mini-batch gradient is . By linearity: . This means the mini-batch gradient is an unbiased estimator of the true gradient - in expectation, we move in exactly the right direction. Without this property, SGD would not converge to the correct optimum.
Q2: What is the relationship between variance and batch size? What does this imply for training?
A: The variance of the mini-batch gradient estimate is , where is the per-sample gradient variance. Gradient variance decreases linearly with batch size. This implies: (1) small batches have high gradient variance - updates are noisy but this noise can help escape local minima; (2) large batches have low variance - updates are more accurate but the model may converge to sharper minima that generalize worse ("generalization gap"); (3) the linear scaling rule for learning rate: if batch size increases by , multiply learning rate by to maintain the same effective update magnitude relative to gradient variance; (4) gradient accumulation (simulating large batches by accumulating gradients over multiple small batches) reduces variance exactly as larger batches would.
Q3: What is the computational formula for variance and why is it preferred numerically?
A: The computational formula allows you to compute mean and mean-of-squares in a single pass, then subtract. The definitional formula requires two passes (first compute , then compute squared deviations). However, the computational formula can have catastrophic cancellation issues: if is very large and is small, subtracting two large numbers of similar magnitude causes loss of significant digits. In practice, Welford's online algorithm computes a numerically stable variance in a single pass by maintaining running mean and sum of squared deviations. NumPy's np.var() uses a two-pass algorithm to avoid this issue.
Q4: What is the covariance matrix and why is it important in ML?
A: The covariance matrix has . It captures all pairwise linear relationships between features. It matters in ML for several reasons: (1) PCA finds the eigenvectors of the covariance matrix of the data - they are the principal directions of variance; (2) Gaussian distributions are fully characterized by their mean vector and covariance matrix - so the covariance matrix completely determines a multivariate Gaussian; (3) Linear regression - the covariance matrix of features appears in the normal equations , and its condition number determines numerical stability; (4) Mahalanobis distance uses the inverse covariance matrix to measure distances in a feature space that accounts for correlations and different scales.
Q5: How does Batch Normalization use moments, and what problem does it solve?
A: Batch Normalization computes the empirical mean and variance over each mini-batch, then normalizes: . Finally, it applies learnable scale and shift . The problem it solves is internal covariate shift: as weights update during training, the distribution of activations at each layer shifts, forcing subsequent layers to constantly re-adapt. By forcing each layer's inputs to have approximately zero mean and unit variance (before correction), BatchNorm stabilizes these distributions, allows higher learning rates, reduces sensitivity to weight initialization, and acts as mild regularization (the mean/variance statistics are computed from the batch, introducing noise). At inference time, running statistics (accumulated exponential moving averages of and over training batches) are used instead of batch statistics.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Moments Explorer demo on the EngineersOfAI Playground - no code required.
:::
