Bias-Variance Tradeoff

Reading time: ~22 minutes | Level: Statistical Learning Theory → Foundational Deep Learning

The Interview Setup

You're asked in an ML interview: "Explain the bias-variance tradeoff."

The weak answer: "Bias is when the model is too simple and underfits. Variance is when the model is too complex and overfits. There's a tradeoff between them."

This is qualitatively correct, but it lacks the mathematical structure that makes the concept precise and actionable.

The strong answer involves a formal decomposition of the mean squared error into three components, an understanding of where each comes from, and - critically - an explanation of why this classical picture breaks down for modern deep learning (double descent).

This lesson gives you all of it.

What You Will Learn

The formal MSE decomposition into bias², variance, and irreducible noise - with full derivation
Intuition via polynomial regression, with code to compute each component empirically
How model choices (capacity, regularization, data, ensembles) affect bias and variance
Variance reduction via ensembling: formal analysis and the correlation limit
Learning curve interpretation: how to diagnose high-bias vs high-variance from a plot
Double descent: why the classical U-shaped curve breaks for modern overparameterized models
The implicit regularization story: why SGD in the overparameterized regime still generalizes
Five interview questions with full worked answers

The Mathematical Decomposition

Consider regression: predict $y \in \mathbb{R}$ from $x \in \mathcal{X}$ .

Setup:

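True relationship: $y = f(x) + \epsilon$ , where $\epsilon \sim (0, \sigma^2)$ denotes zero-mean noise with variance $\sigma^2$ , independent of $x$ and of the training set (no distributional form is assumed)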
Hypothesis class: $\mathcal{H}$ (e.g., polynomials of degree $k$ )
Training set: $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ drawn i.i.d.
Learned model: $\hat{f}_\mathcal{D}(x)$ - depends on the specific training set $\mathcal{D}$
Loss: mean squared error

Decomposition of the expected test MSE at a fixed test point $x_0$ , with $y_0 = f(x_0) + \epsilon_0$ :

$\underbrace{E_{\mathcal{D},\epsilon_0}[(y_0 - \hat{f}_\mathcal{D}(x_0))^2]}_{\text{expected test MSE}} = \underbrace{(E_\mathcal{D}[\hat{f}_\mathcal{D}(x_0)] - f(x_0))^2}_{\text{Bias}^2} + \underbrace{E_\mathcal{D}[(\hat{f}_\mathcal{D}(x_0) - E_\mathcal{D}[\hat{f}_\mathcal{D}(x_0)])^2]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Noise}}$

where the expectation $E_\mathcal{D}$ is over all possible training sets drawn from the same distribution, and the left-hand side additionally averages over the test noise $\epsilon_0$ .

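Derivation: Let $\bar{f}(x) = E_\mathcal{D}[\hat{f}_\mathcal{D}(x)]$ (the average prediction across all training sets). Then:

$E[(y - \hat{f})^2] = E[(y - f + f - \bar{f} + \bar{f} - \hat{f})^2]$

Expanding the square gives three squared terms and three cross terms. The two cross terms involving $\epsilon = y - f$ vanish because the test noise has zero mean and is independent of $\hat{f}$ . The remaining cross term $2(f - \bar{f})E[\bar{f} - \hat{f}]$ vanishes because $f - \bar{f}$ is a constant and $E[\hat{f}] = \bar{f}$ . This leaves:

$= E[(y - f)^2] + (f - \bar{f})^2 + E[(\bar{f} - \hat{f})^2]$

$= \sigma^2 + \text{Bias}^2 + \text{Variance}$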

:::note Key Definitions

Bias: $\text{Bias}(\hat{f}(x)) = E_\mathcal{D}[\hat{f}_\mathcal{D}(x)] - f(x)$

The systematic error: how far the average prediction (over all training sets) is from the truth. A model that can only represent linear functions has high bias when the truth is nonlinear - no matter how much data you give it.

Variance: $\text{Var}(\hat{f}(x)) = E_\mathcal{D}[(\hat{f}_\mathcal{D}(x) - \bar{f}(x))^2]$

The estimation error: how much the prediction varies across different training sets. A complex model fits noise in the training data, so its predictions vary wildly depending on which $n$ training examples were drawn.

Noise: $\sigma^2 = E[(y - f(x))^2]$

Irreducible error: the inherent noise in the data-generating process. No model, regardless of complexity, can reduce the expected squared error below $\sigma^2$ .

:::
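The identity can be checked numerically at a single test point. The sketch below is a minimal Monte Carlo check, not part of the derivation: it assumes a sine ground truth, a cubic fit via `np.polyfit`, and illustrative values for `sigma`, `n`, and `x0`. It estimates the left-hand side directly and compares it with the sum of the three terms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: ground truth, noise level, sample size, test point
f = lambda x: np.sin(2 * np.pi * x)
sigma, n, degree, x0 = 0.5, 30, 3, 0.3
n_trials = 5000

preds = np.empty(n_trials)    # f_hat_D(x0) for each training set D
sq_err = np.empty(n_trials)   # (y0 - f_hat_D(x0))^2 for each D and fresh y0

for t in range(n_trials):
    # Draw a fresh training set and fit the model
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    coeffs = np.polyfit(x, y, degree)
    preds[t] = np.polyval(coeffs, x0)

    # Draw a fresh noisy test label at x0
    y0 = f(x0) + rng.normal(0, sigma)
    sq_err[t] = (y0 - preds[t]) ** 2

bias_sq = (preds.mean() - f(x0)) ** 2
variance = preds.var()
mse_direct = sq_err.mean()                 # left-hand side, estimated directly
mse_sum = bias_sq + variance + sigma ** 2  # right-hand side

print(f"Bias^2={bias_sq:.4f}  Variance={variance:.4f}  Noise={sigma**2:.4f}")
print(f"Direct MSE={mse_direct:.4f}  Sum of terms={mse_sum:.4f}")
```

The two printed totals should agree up to Monte Carlo error; increasing `n_trials` tightens the match.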

Intuition via Polynomial Regression

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

def polynomial_regressor(degree):
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree, include_bias=True)),
        ('lr', LinearRegression(fit_intercept=False))
    ])

def compute_bias_variance(degree, n_train=30, n_test=1000,
                          n_datasets=200, noise_std=0.5):
    """
    Estimate bias^2 and variance of degree-k polynomial regression
    over many random training sets.
    """
    np.random.seed(42)
    true_f = lambda x: np.sin(2 * np.pi * x)

    # Fixed test points
    X_test = np.linspace(0, 1, n_test).reshape(-1, 1)
    f_true = true_f(X_test.ravel())

    # Collect predictions from many training sets
    predictions = []  # shape: (n_datasets, n_test)

    for _ in range(n_datasets):
        X_train = np.random.uniform(0, 1, n_train).reshape(-1, 1)
        y_train = true_f(X_train.ravel()) + np.random.normal(0, noise_std, n_train)

        model = polynomial_regressor(degree)
        try:
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
            predictions.append(y_pred)
        except np.linalg.LinAlgError:
            continue

    predictions = np.array(predictions)  # (n_datasets, n_test)
    f_bar = predictions.mean(axis=0)     # average prediction

    bias_sq = np.mean((f_bar - f_true)**2)
    variance = np.mean(np.var(predictions, axis=0))
    noise = noise_std**2
    total = bias_sq + variance + noise

    return bias_sq, variance, noise, total

# Compute for different polynomial degrees
degrees = [1, 2, 3, 5, 7, 10, 15]
results = {}
for d in degrees:
    b2, v, noise, total = compute_bias_variance(d, n_train=30)
    results[d] = (b2, v, noise, total)
    print(f"Degree {d:2d}: Bias²={b2:.4f}, Variance={v:.4f}, "
          f"Noise={noise:.4f}, Total={total:.4f}")

# Plot
fig, ax = plt.subplots(figsize=(11, 5))
d_vals = list(results.keys())
bias_sq_vals = [results[d][0] for d in d_vals]
var_vals = [results[d][1] for d in d_vals]
noise_vals = [results[d][2] for d in d_vals]
total_vals = [results[d][3] for d in d_vals]

ax.plot(d_vals, bias_sq_vals, 'b-o', linewidth=2, label='Bias²', markersize=8)
ax.plot(d_vals, var_vals, 'r-s', linewidth=2, label='Variance', markersize=8)
ax.plot(d_vals, total_vals, 'g-^', linewidth=2, label='Total MSE', markersize=8)
ax.axhline(noise_vals[0], color='gray', linestyle='--', label=f'Noise floor (σ²={noise_vals[0]:.3f})')
ax.set_xlabel('Polynomial degree (model complexity)')
ax.set_ylabel('MSE component')
ax.set_title('Bias-Variance Tradeoff: Polynomial Regression')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('bias_variance_polynomial.png', dpi=150)

The U-shaped total error curve shows:

Low degree: high bias (model too simple to fit the sine curve), low variance
High degree: low bias (can fit any smooth curve), high variance (overfits noise)
Optimal degree: minimizes bias² + variance

The Formal Bias-Variance for Classification

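For binary classification with 0-1 loss, the bias-variance decomposition is more complex: the terms do not simply add. One formulation (Domingos, 2000) writes the expected loss at $x$ as:

$E[L] = c_1 N(x) + B(x) + c_2 V(x)$

where $N$ , $B$ , and $V$ are noise, bias, and variance terms redefined for discrete predictions (bias compares the most frequent prediction across training sets with the optimal prediction; variance measures how often a prediction deviates from that most frequent one), and $c_1$ , $c_2$ are loss-dependent coefficients. For 0-1 loss, $c_2$ is negative at points where the model is biased, so variance there can lower the error rather than raise it. The qualitative picture is usually the same as in regression: more complex models tend to have lower bias but higher variance.

For cross-entropy loss (used in neural networks), the decomposition does not take the same simple form, but the intuition holds: test error above the noise floor arises from a combination of approximation error (bias) and estimation error (variance).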
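To make the discrete definitions concrete, here is a minimal sketch for the noise-free, two-class case. The function name `zero_one_bias_variance` and the toy prediction matrix are hypothetical, and the definitions follow the Domingos-style description above (main prediction = majority vote across training sets) rather than any library API.

```python
import numpy as np

def zero_one_bias_variance(preds, y_star):
    """
    preds:  (n_datasets, n_points) array of 0/1 predictions, one row per
            model trained on a different training set.
    y_star: (n_points,) array of optimal (noise-free) labels.
    Returns the main prediction, bias, and variance at each point.
    """
    preds = np.asarray(preds)
    y_star = np.asarray(y_star)
    # Main prediction: the majority vote across training sets
    main = (preds.mean(axis=0) >= 0.5).astype(int)
    # Bias: 1 where the main prediction disagrees with the optimal label
    bias = (main != y_star).astype(int)
    # Variance: how often an individual model deviates from the main prediction
    variance = (preds != main).mean(axis=0)
    return main, bias, variance

# Toy example: 4 training sets, 3 test points (made-up values)
preds = np.array([[1, 0, 1],
                  [1, 0, 0],
                  [1, 1, 0],
                  [0, 0, 0]])
y_star = np.array([1, 0, 1])

main, bias, variance = zero_one_bias_variance(preds, y_star)
error = (preds != y_star).mean(axis=0)   # per-point error rate
c2 = np.where(bias == 0, 1, -1)          # sign of the variance term
print("bias:", bias, "variance:", variance, "error:", error)
print("bias + c2 * variance:", bias + c2 * variance)
```

At the third point the model is biased, and the error rate equals bias minus variance: the occasional deviations from the (wrong) main prediction are the only correct predictions there.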

Practical Consequences for ML Engineering

How Model Choices Affect Bias and Variance

| Model Choice | Effect on Bias | Effect on Variance |
| --- | --- | --- |
| Increase model capacity (more layers, neurons) | Decreases bias | Increases variance |
| Add regularization (L2, L1, dropout) | Increases bias slightly | Decreases variance |
| Get more training data | Little or no effect on bias | Decreases variance (roughly $\propto 1/n$ ) |
| Ensemble of models (averaging) | No effect on bias | Decreases variance (to $\sigma^2_f/M$ for independent models) |
| Feature engineering (adding good features) | Decreases bias | May increase variance slightly |
| Add batch normalization | Slight regularization effect | Slight variance reduction |
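The regularization row can be illustrated with a small experiment. This is a sketch under stated assumptions: a sine ground truth, degree-9 polynomial features, and closed-form ridge regression written directly in NumPy (the helper names and the `alpha` grid are illustrative, and the intercept is penalized along with the other coefficients for simplicity).

```python
import numpy as np

def poly_features(x, degree):
    # Columns 1, x, x^2, ..., x^degree
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit_predict(X_tr, y_tr, X_te, alpha):
    # Closed-form ridge: w = (X^T X + alpha I)^{-1} X^T y
    d = X_tr.shape[1]
    w = np.linalg.solve(X_tr.T @ X_tr + alpha * np.eye(d), X_tr.T @ y_tr)
    return X_te @ w

def bias_variance_for_alpha(alpha, degree=9, n_train=30,
                            n_datasets=300, noise_std=0.5, seed=0):
    rng = np.random.default_rng(seed)   # same datasets for every alpha
    f = lambda x: np.sin(2 * np.pi * x)
    x_te = np.linspace(0, 1, 200)
    X_te = poly_features(x_te, degree)

    preds = np.empty((n_datasets, x_te.size))
    for i in range(n_datasets):
        x_tr = rng.uniform(0, 1, n_train)
        y_tr = f(x_tr) + rng.normal(0, noise_std, n_train)
        preds[i] = ridge_fit_predict(poly_features(x_tr, degree),
                                     y_tr, X_te, alpha)

    bias_sq = np.mean((preds.mean(axis=0) - f(x_te)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

for alpha in [1e-8, 1e-4, 1e-2, 1.0, 10.0]:
    b2, v = bias_variance_for_alpha(alpha)
    print(f"alpha={alpha:8.0e}: Bias^2={b2:.4f}, Variance={v:.4f}")
```

Comparing the weakest and strongest penalties shows the direction stated in the table: stronger regularization trades lower variance for higher bias.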

Variance Reduction via Ensembles

For $M$ independent models each with variance $\sigma^2_f$ :

$\text{Var}\left[\frac{1}{M}\sum_{m=1}^M \hat{f}_m(x)\right] = \frac{\sigma^2_f}{M}$

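This is why ensembles work: averaging reduces variance by a factor of $M$ , while bias stays the same (the average of identically distributed predictors has the same expected value as each one). In practice, models are correlated, so the ensemble variance is $\rho\sigma^2_f + \frac{1-\rho}{M}\sigma^2_f = \frac{\sigma^2_f}{M}(1 + (M-1)\rho)$ , where $\rho$ is the pairwise correlation between models. The first term does not shrink with $M$ , so minimizing $\rho$ (diversity in the ensemble) is key.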

import numpy as np

def simulate_ensemble_benefit(true_f, n_train=50, n_test=1000,
                               n_models=100, degree=5, noise=0.3):
    """Demonstrate variance reduction through ensembling."""
    np.random.seed(42)
    X_test = np.linspace(0, 1, n_test)
    y_test_true = true_f(X_test)

    # Fit n_models polynomials, each on its own independently drawn training set
    single_model_preds = []
    for _ in range(n_models):
        X_train = np.random.uniform(0, 1, n_train)
        y_train = true_f(X_train) + np.random.normal(0, noise, n_train)
        coeffs = np.polyfit(X_train, y_train, degree)
        y_pred = np.polyval(coeffs, X_test)
        single_model_preds.append(y_pred)

    single_model_preds = np.array(single_model_preds)

    # Single model MSE
    single_mse = np.mean((single_model_preds - y_test_true)**2)

    # Ensemble of M models
    ensemble_sizes = [1, 2, 5, 10, 20, 50, 100]
    ensemble_mses = []
    for M in ensemble_sizes:
        # Average the first M models (each trained on an independent training set)
        if M > len(single_model_preds):
            break
        ensemble_pred = single_model_preds[:M].mean(axis=0)
        ensemble_mse = np.mean((ensemble_pred - y_test_true)**2)
        ensemble_mses.append((M, ensemble_mse))

    print(f"Single model MSE: {single_mse:.4f}")
    print(f"Ensemble MSE by size:")
    for M, mse in ensemble_mses:
        print(f"  M={M:3d}: MSE={mse:.4f} (reduction: {(single_mse-mse)/single_mse*100:.1f}%)")

simulate_ensemble_benefit(true_f=lambda x: np.sin(4*np.pi*x))
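The correlated-ensemble variance $\frac{\sigma^2_f}{M}(1 + (M-1)\rho)$ can also be checked in isolation. The sketch below assumes each model's prediction is a shared Gaussian component plus a private one, which is one simple way to obtain a pairwise correlation of exactly $\rho$ ; the function names are illustrative.

```python
import numpy as np

def ensemble_variance_theory(sigma2_f, rho, M):
    # sigma_f^2 / M * (1 + (M - 1) * rho), written as rho-floor + shrinking part
    return sigma2_f * (rho + (1 - rho) / M)

def ensemble_variance_simulated(sigma2_f, rho, M, n_draws=50_000, seed=0):
    rng = np.random.default_rng(seed)
    shared = rng.normal(size=(n_draws, 1))    # component common to all models
    private = rng.normal(size=(n_draws, M))   # component unique to each model
    # Each member has variance sigma2_f and pairwise correlation rho
    members = np.sqrt(sigma2_f) * (np.sqrt(rho) * shared
                                   + np.sqrt(1 - rho) * private)
    return members.mean(axis=1).var()

for rho in [0.0, 0.3, 0.7]:
    for M in [1, 5, 50]:
        theory = ensemble_variance_theory(1.0, rho, M)
        sim = ensemble_variance_simulated(1.0, rho, M)
        print(f"rho={rho:.1f}, M={M:2d}: theory={theory:.4f}, simulated={sim:.4f}")
```

With $\rho = 0$ the variance falls as $1/M$ ; with $\rho > 0$ it levels off at $\rho\sigma^2_f$ no matter how many models are added.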

The Bias-Variance-Complexity Curve

Learning curves tell a story about bias and variance:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.model_selection import learning_curve

def plot_learning_curves_bias_variance():
    """
    Learning curves reveal bias (train and validation error both plateau
    at a high value at large n) and variance (a large gap between train
    and validation error that narrows as n grows).
    """
    np.random.seed(42)
    n_total = 5000
    X = np.random.uniform(-3, 3, n_total).reshape(-1, 1)
    y = np.sin(X.ravel()) + 0.2 * np.random.randn(n_total)

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    for ax, (name, model) in zip(axes, [
        ('Low capacity (degree=1, high bias)', Pipeline([
            ('poly', PolynomialFeatures(1)),
            ('ridge', Ridge(alpha=1.0))
        ])),
        ('High capacity (degree=8, high variance)', Pipeline([
            ('poly', PolynomialFeatures(8)),
            ('ridge', Ridge(alpha=0.001))
        ])),
    ]):
        train_sizes, train_scores, val_scores = learning_curve(
            model, X, y,
            train_sizes=np.logspace(1.2, 3.3, 15).astype(int),
            scoring='neg_mean_squared_error',
            cv=5, n_jobs=-1
        )
        train_mse = -train_scores.mean(axis=1)
        val_mse = -val_scores.mean(axis=1)

        ax.semilogx(train_sizes, train_mse, 'b-o', label='Training MSE')
        ax.semilogx(train_sizes, val_mse, 'r-s', label='Validation MSE')
        ax.fill_between(train_sizes, train_mse, val_mse,
                       alpha=0.2, color='orange', label='Generalization gap')

        ax.set_xlabel('Training set size n')
        ax.set_ylabel('Mean Squared Error')
        ax.set_title(name)
        ax.legend()
        ax.grid(True, alpha=0.3)

    plt.suptitle('Learning Curves: Diagnosing Bias vs Variance', fontsize=13)
    plt.tight_layout()
    plt.savefig('bias_variance_learning_curves.png', dpi=150)
    print("Learning curves saved")

plot_learning_curves_bias_variance()

Reading learning curves:

Large gap between train and validation error across the range of $n$ you have: high variance - model is overfitting. Add regularization, reduce complexity, or get more data.
Both train and validation error are high at large $n$ , with only a small gap between them: high bias - model is underfitting. Increase complexity or add features.
Convergence to an asymptote that does not drop when capacity is increased: irreducible error (noise). Neither curve can converge below $\sigma^2$ regardless of data or model.

Double Descent: The Classical Picture Breaks

In 2019, Belkin et al. published "Reconciling Modern Machine Learning Practice and the Classical Bias-Variance Tradeoff," demonstrating empirically what theorists had begun to notice: for modern ML, the bias-variance tradeoff looks wrong.

The classical picture predicts a U-shaped test error curve. The modern picture shows double descent: test error decreases, then increases (classical overfitting), then decreases again as model capacity goes far beyond the interpolation threshold.

Classical bias-variance:
    Test error
    │ ╲              ╱
    │  ╲            ╱
    │   ╲          ╱
    │    ╲__    __╱
    │       ╲__╱
    └────────────────────> Model complexity
              ↑
           optimal
          complexity

Modern (double descent):
    Test error
    │ ╲             ╱╲
    │  ╲           ╱  ╲
    │   ╲__     __╱    ╲
    │      ╲___╱        ╲___
    │                       ╲________
    └──────────────────────────────────> Model complexity
                    ↑
          interpolation threshold
          (number of parameters ≈ number of training points)

The "interpolation threshold" is where the model has just enough capacity to fit the training data perfectly. Test error peaks near this point, which is the disaster classical theory predicts. Modern theory and empirical evidence show that going far beyond this threshold (overparameterization) can actually improve generalization.
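The peak at the interpolation threshold can be reproduced in a toy linear model. This is a sketch under assumptions that are not from the lesson: Gaussian inputs with `D` informative features, a model that uses only the first `p` of them, and minimum-norm least squares via `np.linalg.pinv`. For isotropic Gaussian inputs the test risk has a closed form, used in the code, so no test set needs to be sampled.

```python
import numpy as np

def min_norm_test_mse(p, n=20, D=100, noise_std=0.5, n_trials=50, seed=0):
    """Median test MSE of min-norm least squares using the first p of D features."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_trials):
        beta = rng.normal(size=D) / np.sqrt(D)    # true coefficients, norm ~ 1
        X = rng.normal(size=(n, D))
        y = X @ beta + rng.normal(0, noise_std, n)

        # Least squares if p <= n; minimum-norm interpolant if p > n
        w = np.linalg.pinv(X[:, :p]) @ y

        # Closed-form risk for x ~ N(0, I): coefficient error on the used
        # features + signal in the unused features + label noise
        err = (np.sum((beta[:p] - w) ** 2)
               + np.sum(beta[p:] ** 2)
               + noise_std ** 2)
        errs.append(err)
    return float(np.median(errs))   # median: the risk is heavy-tailed near p = n

for p in [2, 5, 10, 15, 20, 25, 40, 100]:
    print(f"p={p:3d} (n=20): median test MSE = {min_norm_test_mse(p):.3f}")
```

In this setup the error rises sharply as `p` approaches `n = 20` and falls again as `p` grows well past it, which is the shape sketched in the diagram above.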

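Why? In the overparameterized regime, there are infinitely many zero-training-error solutions, and the training procedure determines which one is found. For linear least squares, gradient descent started from zero with a suitable step size converges to the minimum-norm solution, an effect known as implicit regularization; for deep networks, a similar preference for low-complexity solutions is observed and remains an active research topic. This minimum-norm interpolating solution can generalize well despite having zero training error.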
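The linear case of this claim is easy to verify. The sketch below assumes an underdetermined least-squares problem with random Gaussian data (sizes and step count are illustrative) and compares plain gradient descent from a zero initialization against the minimum-norm solution given by the pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                      # more parameters than data points
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Gradient descent on 0.5 * ||Xw - y||^2, starting from w = 0
w = np.zeros(p)
lr = 1.0 / np.linalg.norm(X, 2) ** 2   # step size 1 / (largest singular value)^2
for _ in range(1000):
    w -= lr * X.T @ (X @ w - y)

# Minimum-norm interpolating solution
w_min_norm = np.linalg.pinv(X) @ y

print("training residual:", np.linalg.norm(X @ w - y))
print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))
print("norm of GD solution:", np.linalg.norm(w))
```

Every update is a linear combination of the rows of `X`, so the iterate starting from zero never leaves the row space; the only interpolating solution in that subspace is the minimum-norm one. A nonzero initialization would generally converge to a different interpolant.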

This phenomenon is studied rigorously in Lessons 05 (Regularization Theory) and 07 (Generalization in Deep Learning).

Connections and Role-Specific Relevance

:::note Role-specific relevance

ML Engineers: The practical debugging workflow - learning curves + bias/variance diagnosis - is a fundamental skill. When your model generalizes poorly, the bias-variance decomposition tells you exactly what to try next: more data (variance), more capacity (bias), regularization (variance), or feature engineering (bias).

Research Engineers / Scientists: Double descent is an active research area at ICML/NeurIPS. Understanding the connection between implicit SGD regularization and the minimum-norm interpolating solution (studied under "benign overfitting") is essential for following modern generalization theory literature.

AI Engineers deploying models: Ensembling (averaging multiple model runs with different seeds) is one of the highest-ROI techniques in production. The formal variance reduction analysis explains why it works: $M$ uncorrelated models reduce variance by a factor of $M$ , and correlated models reduce it by less. Quantify expected improvement before deciding whether ensemble inference cost is worth it.

Data Scientists: The "more data reduces variance but not bias" insight directly informs when to collect more data vs when to change the model. If your learning curve flatlines at high error, more data won't help - change the model.

:::

:::tip The Single Most Useful Diagnostic Plot

Always plot your learning curve before tuning hyperparameters:

x-axis: training set size $n$ (log scale)
y-axis: training error and validation error

High-bias signature: both curves converge to a high plateau regardless of $n$ .

High-variance signature: large gap between train and val that closes slowly as $n$ grows.

This plot tells you whether to collect more data, change the model, or tune regularization - before you spend time optimizing the wrong thing.

:::

Interview Questions

Q1: Formally derive the bias-variance decomposition of the expected test MSE.

For regression with true $f$ and noise $\epsilon \sim (0, \sigma^2)$ : let $\bar{f}(x) = E_\mathcal{D}[\hat{f}_\mathcal{D}(x)]$ . The expected MSE at $x$ :

$E[(y - \hat{f})^2] = E[(f + \epsilon - \hat{f})^2] = E[(f - \hat{f})^2] + \sigma^2$

(cross term vanishes because $\epsilon$ is independent and mean zero).

$E[(f - \hat{f})^2] = E[(f - \bar{f} + \bar{f} - \hat{f})^2] = (f - \bar{f})^2 + E[(\bar{f} - \hat{f})^2]$

(cross term $2(f - \bar{f})E[\bar{f} - \hat{f}] = 0$ since $E[\hat{f}] = \bar{f}$ ).

So: $E[(y-\hat{f})^2] = (f(x) - \bar{f}(x))^2 + E[(\hat{f}(x) - \bar{f}(x))^2] + \sigma^2 = \text{Bias}^2 + \text{Variance} + \text{Noise}$ .

Q2: Why does getting more training data reduce variance but not bias?

Variance measures how much the model's prediction changes across different training sets. As $n \to \infty$ , each training set becomes a more representative sample of the population - different training sets look more alike, so the models trained on them look more alike. Formally, for a sample mean estimator, Var = $\sigma^2/n$ , which goes to zero. But bias is the systematic error of the average predictor, $\bar{f}(x) - f(x)$ . If the hypothesis class $\mathcal{H}$ cannot represent $f$ (e.g., $\mathcal{H}$ is linear but $f$ is sinusoidal), the average predictor still has non-zero systematic error regardless of $n$ . More data just gives a better estimate of the best linear approximation to $f$ - but the approximation error itself cannot be reduced without changing the model class. This is why more data alone doesn't fix underfitting.
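The linear-versus-sinusoidal example can be run directly. This sketch assumes $f(x) = \sin(2\pi x)$ , uniform inputs on $[0, 1]$ , and a straight-line fit via `np.polyfit`; the function name and sample sizes are illustrative.

```python
import numpy as np

def linear_fit_bias_variance(n_train, n_datasets=500, noise_std=0.5, seed=0):
    """Bias^2 and variance of a straight-line fit to a sine, averaged over x."""
    rng = np.random.default_rng(seed)
    f = lambda x: np.sin(2 * np.pi * x)
    x_te = np.linspace(0, 1, 200)

    preds = np.empty((n_datasets, x_te.size))
    for i in range(n_datasets):
        x_tr = rng.uniform(0, 1, n_train)
        y_tr = f(x_tr) + rng.normal(0, noise_std, n_train)
        slope, intercept = np.polyfit(x_tr, y_tr, 1)   # highest power first
        preds[i] = slope * x_te + intercept

    bias_sq = np.mean((preds.mean(axis=0) - f(x_te)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

for n in [20, 80, 320]:
    b2, v = linear_fit_bias_variance(n)
    print(f"n={n:4d}: Bias^2={b2:.4f}, Variance={v:.4f}")
```

Variance shrinks roughly in proportion to $1/n$ , while bias² stays close to the error of the best straight-line approximation to the sine.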

Q3: Explain why ensembles reduce variance. When does ensemble averaging fail?

For $M$ models with predictions $\hat{f}_1, \ldots, \hat{f}_M$ , the ensemble is $\hat{f}_{ens} = \frac{1}{M}\sum_m \hat{f}_m$ . If models are independent with variance $\sigma^2_f$ : $\text{Var}[\hat{f}_{ens}] = \sigma^2_f / M$ . With pairwise correlation $\rho$ between models: $\text{Var}[\hat{f}_{ens}] = \frac{\sigma^2_f}{M}(1 + (M-1)\rho)$ . As $M \to \infty$ : $\text{Var} \to \rho \sigma^2_f$ . So the minimum variance achievable through ensembling is $\rho \sigma^2_f$ - determined by the correlation between models.

Ensemble averaging fails when: (1) all models are trained on the same data with the same algorithm - they're too correlated; (2) the models all share the same high bias - averaging does not reduce bias; (3) models are correlated because they memorize the same training examples.

Diversity is key: bagging trains each model on a bootstrap resample, and Random Forests additionally use random feature subsets to decorrelate trees. Boosting also combines many models, but it fits them sequentially to the current errors and mainly reduces bias rather than averaging away variance.

Q4: What is double descent and how does it challenge classical bias-variance thinking?

Classical bias-variance theory predicts that test error first decreases as model complexity grows (bias reduction), reaches a minimum, then increases (variance increase from overfitting). This U-shaped curve implies you should choose the "sweet spot" of model complexity.

Double descent (Belkin et al., 2019; Advani & Saxe, 2017) shows a more complex picture for modern models: as complexity approaches the interpolation threshold (where the model can perfectly fit training data), test error peaks, then decreases again with further overparameterization. This "second descent" happens because in the overparameterized regime there are infinitely many zero-training-error solutions, and optimization (gradient descent with small init) selects a low-norm one (in linear models, exactly the minimum-norm one), which often generalizes well due to implicit regularization.

Double descent challenges classical theory by showing that overparameterization is not always bad. The resolution: the classical picture ties complexity to parameter count, but among the many interpolating solutions gradient descent tends toward low-norm ones, so effective complexity need not keep growing with the number of parameters.

Q5: How would you use the bias-variance decomposition to diagnose and fix a model that generalizes poorly?

Step 1: Compute training error and validation error at large n.
Step 2: If both are high, it's high bias (underfitting). Fix: increase model capacity (more layers/neurons), add features, reduce regularization strength.
Step 3: If training error is low but validation error is much higher, it's high variance (overfitting). Fix: get more data (reduces variance $\propto 1/n$ ), add regularization (L2/dropout), use ensemble methods, or reduce model complexity.
Step 4: If both train and val errors are low but you still need better performance, it's noise (irreducible error). Fix: better data quality (less labeling noise), feature engineering, or redesign the prediction target.
Step 5: Plot learning curves (error vs n) to distinguish bias and variance: high-bias models converge to high error even at large n; high-variance models have large gaps between train and val at small n that close as n grows.
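Steps 2 to 4 can be written down as a rough triage helper. This is only a heuristic sketch: the function name `diagnose`, the `target_err` argument (the error you would consider acceptable), and the `gap_ratio` threshold are all assumptions for illustration, and real projects need thresholds chosen for the task.

```python
def diagnose(train_err, val_err, target_err, gap_ratio=0.2):
    """
    Heuristic bias/variance triage from errors measured at large n.

    train_err, val_err: training and validation error (same metric).
    target_err: error level considered acceptable (assumed known).
    gap_ratio: the gap counts as large if it exceeds this fraction of
               val_err (an arbitrary illustrative threshold).
    """
    high_bias = train_err > target_err
    high_variance = (val_err - train_err) > gap_ratio * val_err

    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"        # more capacity, better features, less regularization
    if high_variance:
        return "high variance"    # more data, more regularization, ensembling
    return "ok"                   # remaining error is likely noise

print(diagnose(train_err=0.30, val_err=0.32, target_err=0.10))
print(diagnose(train_err=0.02, val_err=0.25, target_err=0.10))
print(diagnose(train_err=0.08, val_err=0.09, target_err=0.10))
```

A single pair of numbers cannot replace the learning curve from Step 5, which shows whether the gap is still closing as `n` grows.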

