Generalization, Overfitting, and Underfitting

Reading time: ~26 minutes | Level: ML Foundations | Role: MLE, ML Engineer, Data Scientist, Research Engineer

A healthcare ML team trained a sepsis prediction model on patient data from a single hospital system. Offline AUC: 0.94. They deployed to a partner hospital. AUC dropped to 0.71.

The model had learned to use the time since last lab test as a strong predictor of sepsis risk - but only because in the training hospital, very sick patients were tested more frequently. The partner hospital used different protocols. The model had not learned medicine; it had learned a hospital's administrative patterns.

This is not "overfitting" in the traditional sense. The training error was not dramatically low. This was distribution shift - the model generalized well to its training distribution and degraded sharply on the deployment distribution.

Understanding generalization deeply - not just as "train error vs test error" but as the full picture of when and why models fail - is one of the most important skills an ML engineer can develop.

What You Will Learn

The formal definition of generalization gap and why it matters more than test accuracy alone
Overfitting: causes, detection, and the most effective remedies
Underfitting: causes, detection, and fixes
The regularization toolkit: L1, L2, dropout, early stopping, data augmentation - and the theory behind each
Distribution shift: covariate shift, label shift, concept drift - and detection strategies
How to set up evaluation pipelines that actually measure generalization
Five interview questions at senior ML engineer level

Part 1 - What Generalization Means, Precisely

The Generalization Gap

A model generalizes if it performs well on unseen data drawn from the same distribution as its training data.

Formally:

$\hat{R}(h) = \frac{1}{n}\sum_{i=1}^n \mathcal{L}(h(x_i), y_i)$ - empirical risk (training error)
$R(h) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\mathcal{L}(h(x), y)]$ - true risk (expected test error)

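The generalization gap is $R(h) - \hat{R}(h)$.

From PAC theory, for a finite hypothesis class $\mathcal{H}$ and a loss bounded in $[0, 1]$, with probability $\geq 1-\delta$ over the draw of the training set, every $h \in \mathcal{H}$ satisfies: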

$R(h) \leq \hat{R}(h) + \sqrt{\frac{\ln|\mathcal{H}| + \ln(1/\delta)}{2n}}$

The gap shrinks as $n$ grows and as the hypothesis class complexity decreases. Regularization directly reduces the complexity term.
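To get a feel for the bound, the complexity term can be evaluated numerically. The sketch below assumes a finite hypothesis class and a loss bounded in $[0, 1]$; `pac_gap_bound` is an illustrative helper name, not a library function.

```python
import math

def pac_gap_bound(num_hypotheses: int, n: int, delta: float = 0.05) -> float:
    """Upper bound on R(h) - R_hat(h) for a finite class with loss in [0, 1]."""
    return math.sqrt((math.log(num_hypotheses) + math.log(1.0 / delta)) / (2.0 * n))

# The bound tightens as n grows and loosens as the class gets larger.
for n in (100, 1_000, 10_000, 100_000):
    small_class = pac_gap_bound(10**3, n)
    large_class = pac_gap_bound(10**9, n)
    print(f"n={n:>7}  |H|=1e3: {small_class:.4f}   |H|=1e9: {large_class:.4f}")
```

Because $|\mathcal{H}|$ enters only through its logarithm, a million-fold larger class costs far less than a 10x smaller training set.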

Training distribution D     Test distribution D' ≠ D
      ┌──────────┐               ┌──────────┐
      │  ●  ●    │               │  ●  ●    │
      │     ●   ●│               │     ●   ●│
      │  ●       │               │  ●       │
      └──────────┘               └──────────┘
         same D:                    D ≠ D':
      generalization gap         distribution shift
      tells you everything       - gap is misleading

Why The Test Set Must Be Sacred

Your test set is the only honest estimate of generalization. Every time you evaluate on it and use the result to make a decision (pick the better model, tune a threshold), you contaminate it. After $k$ evaluations on the same test set, the best score you have seen is optimistically biased by roughly $O(\sqrt{\ln k / n_{test}})$.

Use the test set exactly once - at the very end - to report final performance.
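A small simulation makes the contamination effect concrete. This is a sketch under a deliberately extreme assumption: every "model" is a coin flip with true accuracy 0.5, so any best-of-$k$ score above 0.5 is pure selection bias. The helper name `best_of_k_accuracy` is made up for this example.

```python
import math
import random

def best_of_k_accuracy(k: int, n_test: int, rng: random.Random) -> float:
    """Best observed test accuracy among k models that all guess at random."""
    best = 0.0
    for _ in range(k):
        correct = sum(rng.random() < 0.5 for _ in range(n_test))
        best = max(best, correct / n_test)
    return best

rng = random.Random(0)
n_test = 500
for k in (1, 10, 100, 1000):
    best = best_of_k_accuracy(k, n_test, rng)
    scale = math.sqrt(math.log(k) / n_test)  # order of the optimism, up to constants
    print(f"k={k:>4}  best observed acc={best:.3f}  sqrt(ln k / n_test)={scale:.3f}")
```

The true accuracy never changes, yet the reported "best" score climbs with $k$.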

Part 2 - Overfitting: Causes and Diagnosis

What Overfitting Actually Means

Overfitting occurs when a model learns patterns specific to the training sample that don't generalize. The model has excess capacity relative to data volume and noise level.

True signal: y = sin(x) + noise

Underfitting (degree 1):    Good fit (degree 3):   Overfitting (degree 15):
      ●                          ●                       ●
  ●       ●                  ●     ●              ●   ╱╲    ●
       ●              ●         ●           ●  ╱  ╲  ╱╲  ╱╲
   ─────────────       ╰──────────╯          ╱    ╲╱  ╲╱  ╲╱
  misses curve         smooth fit            follows every noise point

Causes of Overfitting

Too little data relative to model complexity
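Too many parameters relative to the number of training examples $n$ (a rough heuristic; heavily overparameterized networks can still generalize when regularized)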
Training too many iterations - SGD finds noise-fitting solutions
Noisy labels - model memorizes labeling errors
Leaky features - features that encode the label directly
No regularization on an expressive model

Detecting Overfitting

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

np.random.seed(42)
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5,
    n_redundant=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

depths = range(1, 25)
train_accs, test_accs = [], []

for depth in depths:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    train_accs.append(clf.score(X_train, y_train))
    test_accs.append(clf.score(X_test, y_test))

plt.figure(figsize=(10, 5))
plt.plot(depths, train_accs, 'b-o', lw=2, label='Train accuracy', ms=5)
plt.plot(depths, test_accs,  'r-s', lw=2, label='Test accuracy',  ms=5)
# Illustration only: in real work, pick the depth on a validation split, not the test set.
best_depth = depths[np.argmax(test_accs)]
plt.axvline(best_depth, color='green', ls='--', lw=1.5, label=f'Best depth={best_depth}')
plt.xlabel('Max depth of decision tree')
plt.ylabel('Accuracy')
plt.title('Overfitting in Decision Trees: Train vs Test Accuracy')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('overfitting_depth.png', dpi=150)

print(f"At depth=1:  train={train_accs[0]:.3f}, test={test_accs[0]:.3f}  ← underfitting")
print(f"At depth={best_depth}: train={train_accs[best_depth-1]:.3f}, test={test_accs[best_depth-1]:.3f}  ← optimal")
print(f"At depth=24: train={train_accs[-1]:.3f}, test={test_accs[-1]:.3f}  ← overfitting")

Part 3 - The Regularization Toolkit

Regularization reduces the generalization gap by discouraging the model from fitting noise.

L2 Regularization (Ridge / Weight Decay)

$\mathcal{L}_{reg}(\theta) = \mathcal{L}_{train}(\theta) + \lambda\|\theta\|_2^2$

Shrinks all weights toward zero. Equivalent to MAP estimation under a zero-mean Gaussian prior $P(\theta) \propto e^{-\lambda\|\theta\|_2^2}$. Constraining the weight norm also bounds the Rademacher complexity of the hypothesis class, which scales with that norm bound - see Lesson 05.

L1 Regularization (Lasso)

$\mathcal{L}_{reg}(\theta) = \mathcal{L}_{train}(\theta) + \lambda\|\theta\|_1$

Drives some weights exactly to zero - automatic feature selection. Equivalent to a Laplace prior. The geometry: the L1 constraint set (hyperdiamond) has corners on the coordinate axes, so the optimal point often lies there (sparse solution).

Property	L1 (Lasso)	L2 (Ridge)
Effect	Sparse (some weights = 0)	Shrinks all weights, none exactly to 0
Feature selection	Yes - automatic	No
Correlated features	Picks one arbitrarily	Spreads weight across them
When to use	Many irrelevant features	All features likely matter
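The sparse-versus-shrunk contrast in the table can be seen in closed form on a one-dimensional problem. The sketch below assumes the simplified objective $\frac{1}{2}(\theta - b)^2$ plus a penalty, where $b$ is the unregularized solution; `lasso_1d` and `ridge_1d` are illustrative names, not library functions.

```python
def lasso_1d(b: float, lam: float) -> float:
    """argmin over theta of 0.5*(theta - b)**2 + lam*abs(theta): soft-thresholding."""
    shrunk = max(abs(b) - lam, 0.0)
    if shrunk == 0.0:
        return 0.0
    return shrunk if b > 0 else -shrunk

def ridge_1d(b: float, lam: float) -> float:
    """argmin over theta of 0.5*(theta - b)**2 + lam*theta**2: proportional shrinkage."""
    return b / (1.0 + 2.0 * lam)

unregularized = [3.0, -1.5, 0.4, -0.2, 0.05]
lam = 0.5
print(f"{'b':>6} {'L1':>8} {'L2':>8}")
for b in unregularized:
    print(f"{b:>6.2f} {lasso_1d(b, lam):>8.3f} {ridge_1d(b, lam):>8.3f}")
```

L1 subtracts a fixed amount and clips at zero, so small coefficients vanish; L2 divides by a constant, so every coefficient survives at reduced size.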

import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

np.random.seed(42)
n = 80
X_raw = np.random.randn(n, 1)
y = 0.5 * X_raw.ravel()**3 - 2*X_raw.ravel() + np.random.randn(n) * 0.5

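def make_poly_model(degree, model):
    # Expand x into polynomial features, standardize them, then fit the regularized model.
    # include_bias=False: Ridge and Lasso fit their own intercept, so a constant column is redundant.
    return Pipeline([
        ('poly', PolynomialFeatures(degree, include_bias=False)),
        ('scale', StandardScaler()),
        ('reg', model)
    ])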

print(f"{'Model':<45} {'CV R² mean':>12} {'CV R² std':>12}")
print('-' * 70)
for name, model in [
    ('Poly-10, no reg   (alpha=0.0001)', Ridge(alpha=0.0001)),
    ('Poly-10, L2 weak  (alpha=0.1)',    Ridge(alpha=0.1)),
    ('Poly-10, L2 med   (alpha=1.0)',    Ridge(alpha=1.0)),
    ('Poly-10, L2 strong(alpha=100)',    Ridge(alpha=100.0)),
    ('Poly-10, L1 Lasso (alpha=0.01)',   Lasso(alpha=0.01, max_iter=5000)),
]:
    pipe = make_poly_model(10, model)
    scores = cross_val_score(pipe, X_raw, y, cv=10, scoring='r2')
    print(f"{name:<45} {scores.mean():>12.4f} {scores.std():>12.4f}")

Dropout

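During training, randomly zero out each unit with probability $p$:

$\tilde{h} = h \odot m, \quad m_i \sim \text{Bernoulli}(1-p)$

At inference, keep all units and scale activations by the keep probability $(1-p)$ so expected magnitudes match training. Equivalently, inverted dropout divides the surviving activations by $(1-p)$ during training and leaves inference untouched; this is what PyTorch's `nn.Dropout` does.

Why it works: Dropout trains an exponentially large ensemble of thinned sub-networks that share weights, and approximates their average at test time. It also prevents co-adaptation - neurons can't rely on specific other neurons always being present.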
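The rescaling rule is easy to check numerically. Below is a minimal sketch of inverted dropout on a plain Python list (no framework), showing that the mean activation is preserved in expectation; `inverted_dropout` is a hypothetical helper written for this example.

```python
import random

def inverted_dropout(h, p, rng):
    """Zero each unit with probability p and rescale survivors by 1 / (1 - p)."""
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in h]

rng = random.Random(0)
h = [1.0] * 100_000          # constant activations make the effect easy to read
p = 0.4
dropped = inverted_dropout(h, p, rng)

mean_after = sum(dropped) / len(dropped)
zero_fraction = sum(1 for x in dropped if x == 0.0) / len(dropped)
print(f"mean before: 1.000, mean after dropout: {mean_after:.3f}")
print(f"fraction of units zeroed: {zero_fraction:.3f} (target p = {p})")
```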

import torch
import torch.nn as nn
import torch.optim as optim

class MLPWithDropout(nn.Module):
    def __init__(self, dropout_rate=0.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(20, 256), nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(256, 1), nn.Sigmoid()
        )
    def forward(self, x):
        # Squeeze only the last dim so a batch of size 1 keeps its batch dimension
        return self.net(x).squeeze(-1)

def train_eval(dropout_rate, n_epochs=150, seed=42):
    torch.manual_seed(seed)
    X = torch.randn(250, 20)
    y = (X[:, 0] + X[:, 1] > 0).float()
    X_tr, X_te = X[:200], X[200:]
    y_tr, y_te = y[:200], y[200:]

    model = MLPWithDropout(dropout_rate)
    opt = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
    for _ in range(n_epochs):
        model.train()
        opt.zero_grad()
        nn.BCELoss()(model(X_tr), y_tr).backward()
        opt.step()

    model.eval()
    with torch.no_grad():
        train_acc = ((model(X_tr) > 0.5) == y_tr.bool()).float().mean().item()
        test_acc  = ((model(X_te) > 0.5) == y_te.bool()).float().mean().item()
    return train_acc, test_acc

print(f"{'Dropout rate':<15} {'Train acc':>12} {'Test acc':>12} {'Gap':>10}")
print('-' * 52)
for rate in [0.0, 0.2, 0.4, 0.6]:
    tr, te = train_eval(rate)
    print(f"{rate:<15.1f} {tr:>12.4f} {te:>12.4f} {tr-te:>10.4f}")

Early Stopping

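Stop training when validation error stops improving. Two ways to see why this acts as a regularizer:

For linear models trained by gradient descent on squared loss: early stopping behaves approximately like L2 regularization, with fewer steps corresponding to a larger $\lambda$
For neural networks: it limits how far the weights move from initialization, which restricts the effective complexity of the functions the network can reach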
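In practice the stopping rule is usually implemented with a patience counter. The following is a minimal sketch operating on a precomputed list of validation losses; the function name, `patience`, and `min_delta` are illustrative choices rather than any specific framework's API.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return (best_epoch, stop_epoch), both 1-indexed, for a patience-based rule."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            # New best: remember it (a real loop would also save a checkpoint here)
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop
            return best_epoch, epoch
    return best_epoch, len(val_losses)

val_losses = [0.90, 0.70, 0.55, 0.50, 0.48, 0.49, 0.50, 0.52, 0.55, 0.60]
best, stopped = early_stopping_epoch(val_losses, patience=3)
print(f"restore checkpoint from epoch {best}, training stopped at epoch {stopped}")
```

The model you keep is the checkpoint from the best epoch, not the weights at the epoch where training halted.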

import numpy as np
import matplotlib.pyplot as plt

epochs = np.arange(1, 201)
train_loss = 0.8 * np.exp(-epochs / 40) + 0.02
val_loss   = train_loss + 0.0003 * (epochs - 70).clip(0) ** 1.6

best_epoch = np.argmin(val_loss) + 1

plt.figure(figsize=(10, 4))
plt.plot(epochs, train_loss, 'b-', lw=2, label='Train loss')
plt.plot(epochs, val_loss,   'r-', lw=2, label='Val loss')
plt.axvline(best_epoch, color='green', ls='--', lw=2,
            label=f'Early stop at epoch {best_epoch}')
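# best_epoch is 1-indexed, so slice from best_epoch - 1 to start shading at the best checkpoint
plt.fill_between(epochs[best_epoch - 1:], train_loss[best_epoch - 1:], val_loss[best_epoch - 1:],
                 alpha=0.15, color='red', label='Overfitting zone')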
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Early Stopping: Optimal Training Checkpoint')
plt.legend(); plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('early_stopping.png', dpi=150)
print(f"Best checkpoint: epoch {best_epoch}")

Part 4 - Underfitting: Causes and Fixes

Underfitting: model too simple to capture the true signal. Both train and test error are high.

Signature on learning curve:
  Train error
  Val error
     │
     │  val  ─────────────────────
     │  train ────────────────────
     │  (both plateau at HIGH error regardless of n)
     └──────────────────────────→ n

Fix: More capacity, better features, less regularization.

Cause	Fix
Model too simple	Increase capacity (depth, width, polynomial degree)
Insufficient training	Train longer; tune the learning rate (too small converges slowly, too large fails to settle)
Too strong regularization	Lower λ, reduce dropout rate
Wrong model family	Use CNN for images, LSTM/Transformer for sequences
Poor features	Feature engineering or representation learning
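The train/validation signatures for underfitting and overfitting can be turned into a rough triage rule. This is a sketch only: `target_error` and `gap_tolerance` are hypothetical thresholds you would set per problem, not standard values.

```python
def diagnose_fit(train_error, val_error, target_error, gap_tolerance=0.05):
    """Rough label from train/val error; thresholds are illustrative assumptions."""
    high_bias = train_error > target_error                   # cannot even fit the training set
    high_variance = (val_error - train_error) > gap_tolerance  # large generalization gap
    if high_bias and high_variance:
        return "underfitting and overfitting"
    if high_bias:
        return "underfitting"
    if high_variance:
        return "overfitting"
    return "ok"

for train_err, val_err in [(0.30, 0.32), (0.01, 0.32), (0.08, 0.10)]:
    label = diagnose_fit(train_err, val_err, target_error=0.10)
    print(f"train={train_err:.2f}  val={val_err:.2f}  ->  {label}")
```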

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_friedman1

np.random.seed(42)
X, y = make_friedman1(n_samples=500, n_features=10, noise=1.0, random_state=42)

models = {
    'Linear Ridge (underfits - Friedman is nonlinear)': Pipeline([
        ('s', StandardScaler()), ('m', Ridge(alpha=1.0))
    ]),
    'Gradient Boosting depth=2': GradientBoostingRegressor(max_depth=2, random_state=42),
    'Gradient Boosting depth=4': GradientBoostingRegressor(max_depth=4, random_state=42),
    'Gradient Boosting depth=6': GradientBoostingRegressor(max_depth=6, random_state=42),
}

print(f"{'Model':<48} {'CV MSE':>10} {'Std':>8}")
print('-' * 68)
for name, model in models.items():
    scores = -cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    print(f"{name:<48} {scores.mean():>10.3f} {scores.std():>8.3f}")

print("\nLinear model cannot capture Friedman's nonlinear interactions → underfitting")

Part 5 - Distribution Shift: The Production Generalization Problem

Classical generalization theory assumes training and deployment data come from the same distribution. In production, this assumption fails constantly.

Covariate shift - the feature distribution $P(X)$ changes while $P(Y|X)$ stays the same:

Train on summer data, deploy in winter; train on US, deploy in EU
Fix: importance weighting - reweight training samples by $P_{deploy}(x)/P_{train}(x)$
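Importance weighting can be sketched on a toy problem where both densities are known. That is a strong assumption made purely for illustration: here $P_{train} = \mathcal{N}(0, 1)$ and $P_{deploy} = \mathcal{N}(0.5, 1)$, and $x^2$ stands in for a per-sample loss. In real systems the density ratio has to be estimated.

```python
import math
import random

def gaussian_pdf(x, mean, std=1.0):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def weighted_mean(values, weights):
    # Self-normalized importance-weighted average
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

rng = random.Random(0)
train_x = [rng.gauss(0.0, 1.0) for _ in range(20_000)]   # samples from P_train

# w(x) = P_deploy(x) / P_train(x), using the (assumed known) densities
weights = [gaussian_pdf(x, 0.5) / gaussian_pdf(x, 0.0) for x in train_x]

losses = [x * x for x in train_x]                          # stand-in per-sample loss
naive = sum(losses) / len(losses)
reweighted = weighted_mean(losses, weights)

print(f"naive training average:      {naive:.3f}")
print(f"importance-weighted average: {reweighted:.3f}")
print("analytic value under P_deploy: 1.250")
```

The reweighted average estimates the loss the model would see after deployment using only training samples.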

Label shift - class proportions $P(Y)$ change while $P(X|Y)$ stays the same:

Fraud rate drops from 5% (training) to 0.1% (production)
Fix: recalibrate posterior probabilities using new prior
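For a binary classifier, the prior recalibration follows from Bayes' rule. The sketch below assumes pure label shift ($P(X|Y)$ unchanged) and that the deployment prior is known or has been estimated separately; `adjust_for_new_prior` is an illustrative name.

```python
def adjust_for_new_prior(p, train_prior, deploy_prior):
    """Rescale P(y=1|x) from train_prior to deploy_prior, assuming P(x|y) is unchanged."""
    pos = p * deploy_prior / train_prior
    neg = (1.0 - p) * (1.0 - deploy_prior) / (1.0 - train_prior)
    return pos / (pos + neg)

# Fraud example from above: 5% positives in training, 0.1% in production
for score in (0.10, 0.50, 0.90, 0.99):
    adjusted = adjust_for_new_prior(score, train_prior=0.05, deploy_prior=0.001)
    print(f"model score {score:.2f} -> recalibrated {adjusted:.4f}")
```

Ranking is unchanged by this correction, but any fixed decision threshold on the raw score would flag far too many cases at the new base rate.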

Concept drift - $P(Y|X)$ changes as the world evolves:

User preferences change; fraud patterns adapt; language usage shifts
Fix: periodic retraining, online learning, sliding-window models

import numpy as np
from scipy.stats import ks_2samp

np.random.seed(42)
X_train = np.random.randn(1000, 10)

# Covariate shift: features shift by 0.5 on average
X_deploy_shifted = np.random.randn(200, 10) + 0.5

print("KS Test for Distribution Shift Per Feature")
print(f"{'Feature':<12} {'KS stat':>10} {'p-value':>12} {'Shift?':>10}")
print('-' * 46)
for i in range(X_train.shape[1]):
    stat, p = ks_2samp(X_train[:, i], X_deploy_shifted[:, i])
    shift = 'YES ⚠' if p < 0.05 else 'no'
    print(f"Feature {i:<4} {stat:>10.4f} {p:>12.4f} {shift:>10}")

print("\nFeatures with p < 0.05 show statistically significant distribution shift.")

Monitoring for Drift in Production

from collections import deque
import numpy as np
from scipy.stats import ks_2samp

class DriftMonitor:
    """
    Sliding-window drift detector for production ML systems.
    Compares recent predictions to a baseline using KS test.
    """
    def __init__(self, window_size=500, alpha=0.05):
        self.window = deque(maxlen=window_size)
        self.baseline = None
        self.alpha = alpha

    def set_baseline(self, predictions):
        self.baseline = np.array(predictions)

    def log(self, prediction):
        self.window.append(prediction)

    def check(self):
        # Need a baseline and enough recent samples for a meaningful test
        if len(self.window) < 50 or self.baseline is None:
            return {'ready': False}
        stat, p = ks_2samp(self.baseline, list(self.window))
        return {
            'ready': True,
            'drift_detected': bool(p < self.alpha),
            'ks_stat': round(float(stat), 4),
            'p_value': round(float(p), 4),
            'baseline_mean': round(float(self.baseline.mean()), 4),
            'current_mean': round(float(np.mean(self.window)), 4),
        }

# Simulate stable period, then shift
monitor = DriftMonitor(window_size=300)
monitor.set_baseline(np.random.beta(2, 5, 1000))     # stable baseline

for _ in range(100): monitor.log(np.random.beta(2, 5))   # stable
print("After 100 stable predictions:")
print(monitor.check())

for _ in range(300): monitor.log(np.random.beta(5, 2))   # shifted
print("\nAfter 300 shifted predictions:")
print(monitor.check())

Part 6 - Evaluation Pipelines That Catch Generalization Problems

Wrong evaluation:                    Right evaluation:
  All data → random 80/20 split         Temporal split:
  → Works for i.i.d. data               Train: t < cutoff
  → FAILS for:                          Val:   cutoff ≤ t < cutoff+Δ
    - Time series (data leakage)        Test:  t ≥ cutoff+Δ
    - User data (group leakage)
    - Geographic data                   Group split:
    - Medical data (patient leakage)    Train: users A–G
                                        Test:  users H–Z
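Both split strategies in the diagram can be sketched in a few lines of plain Python. The record layout (`"user"` and `"t"` keys) and the function names are assumptions made for this example; hashing the group id is one possible way to get a stable assignment, not the only one.

```python
import hashlib

def group_split(records, group_key, test_fraction=0.2):
    """Assign whole groups to train or test via a stable hash of the group id."""
    train, test = [], []
    for rec in records:
        digest = hashlib.sha256(str(rec[group_key]).encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 100          # same group always lands in the same bucket
        (test if bucket < test_fraction * 100 else train).append(rec)
    return train, test

def temporal_split(records, time_key, cutoff, delta):
    """Train on t < cutoff, validate on [cutoff, cutoff + delta), test on the rest."""
    train = [r for r in records if r[time_key] < cutoff]
    val = [r for r in records if cutoff <= r[time_key] < cutoff + delta]
    test = [r for r in records if r[time_key] >= cutoff + delta]
    return train, val, test

records = [{"user": f"u{i % 50}", "t": i} for i in range(1000)]

train_g, test_g = group_split(records, "user")
shared = {r["user"] for r in train_g} & {r["user"] for r in test_g}
print(f"group split: {len(train_g)} train rows, {len(test_g)} test rows, {len(shared)} shared users")

tr, va, te = temporal_split(records, "t", cutoff=700, delta=150)
print(f"temporal split: {len(tr)} train, {len(va)} val, {len(te)} test")
```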

Split	Purpose	How to set size
Train	Fit parameters	60–80% of data
Validation	Select hyperparameters	10–20%
Test	Final estimate, used ONCE	10–20%; minimum 1,000 examples for stable estimates

Data leakage checklist (Lesson 10 covers in depth):

No future information in features for time-series tasks
No ID columns or proxy-ID columns as features
Scaling/normalization fit on train-only, applied to val/test
No rows from the same entity (user, patient) in both train and test
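The train-only scaling item is the one most often violated by accident. A minimal sketch of the correct order of operations, using hand-rolled helpers (`fit_scaler` and `transform` are illustrative names) so the data flow is explicit:

```python
import statistics

def fit_scaler(values):
    """Compute mean and standard deviation from the training split only."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return mean, (std if std > 0 else 1.0)

def transform(values, mean, std):
    """Apply previously fitted statistics; never refit on val/test data."""
    return [(v - mean) / std for v in values]

train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [10.0, 11.0]

mean, std = fit_scaler(train)                  # statistics come from train only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)       # test reuses the train statistics

print(f"train mean={mean:.2f}, std={std:.4f}")
print("scaled test values:", [round(v, 3) for v in test_scaled])
```

Fitting the scaler on train and test together would let test-set statistics influence the features the model is trained on.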

Recommended Resources

:::tip Video Resources

StatQuest - Overfitting: Fitting the noise. Visual walkthrough of overfitting in decision trees and polynomial regression. (~7 min)

StatQuest - Regularization Part 1 - Ridge. The clearest explanation of L2 regularization and why it works. (~10 min)

StatQuest - Regularization Part 2 - Lasso. L1 sparsity geometry and the diamond constraint set. (~8 min)

Andrej Karpathy - Lecture: Training Neural Networks (Stanford CS231n). Covers dropout, batch norm, weight decay in practical depth. (~1h 20min)

:::

Interview Questions

Q1: What is the generalization gap and what determines its size?

The generalization gap is $R(h) - \hat{R}(h)$ - the difference between true expected error and empirical training error. PAC theory bounds it as $O(\sqrt{\ln|\mathcal{H}|/n})$ for finite classes, or using Rademacher complexity for infinite classes. Practically, the gap is larger when: (1) model complexity is high relative to training data, (2) labels are noisy (high $\sigma^2$), (3) training data is small, (4) the test set was used to make model decisions (contamination). It is reduced by: more training data, regularization, ensembling, and careful evaluation protocol.

Q2: Explain the difference between L1 and L2 regularization geometrically and practically.

Both add a constraint on parameters: L2 constrains $\|\theta\|_2^2 \leq B^2$ (a ball); L1 constrains $\|\theta\|_1 \leq B$ (a hyperdiamond). The optimal solution is where the loss function's level sets first touch the constraint set. The L2 ball has no corners - solutions can land anywhere on its surface, giving small-but-nonzero weights. The L1 hyperdiamond has corners on the coordinate axes - the optimum often lies at a corner, giving exact zeros (sparsity). Use L1 when many features are irrelevant (automatic feature selection); use L2 when all features matter and you want smooth shrinkage. ElasticNet combines both: L1 for sparsity + L2 for grouping correlated features.

Q3: You have training accuracy 99%, test accuracy 68%. Walk through diagnosis and remediation.

Step 1: Check for data leakage and split problems. Is future information in the features? Is the test split representative? Is preprocessing fit on the full dataset including test?

Step 2: Confirm distributions match - run KS tests on key features.

Step 3: If both checks pass, this is high variance (overfitting); the 31-point gap is the signature. Remediation in priority order: (1) More training data - most reliable; (2) Add regularization - L2 weight decay, dropout; (3) Reduce model complexity - shallower tree, fewer layers; (4) Ensembling - bagging reduces variance with little increase in bias; (5) Feature selection - remove noisy features.

Measure each fix using cross-validation, not the test set.

Q4: What is distribution shift and how do you detect it in production?

Distribution shift: $P_{train}(X, Y) \neq P_{deploy}(X, Y)$. Types: covariate shift ($P(X)$ changes, $P(Y|X)$ stable), label shift ($P(Y)$ changes, $P(X|Y)$ stable), concept drift ($P(Y|X)$ changes). Detection: (1) Feature drift - KS test comparing training vs recent production feature distributions; (2) Prediction drift - monitor the model's output score distribution; (3) Performance monitoring - if delayed labels are available, track accuracy/AUC over time windows; (4) Data quality checks - missing rates, cardinality changes, value range violations. Automatic alerting when the KS statistic exceeds a threshold or PSI (Population Stability Index) > 0.2 is a common production setup.
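The PSI check mentioned in the answer can be sketched as follows. This is one common formulation (decile bins taken from the baseline, with a small floor to avoid $\ln 0$); the binning scheme, `n_bins`, and `eps` are assumptions, and production implementations vary.

```python
import math
import random

def psi(baseline, current, n_bins=10, eps=1e-6):
    """Population Stability Index with bin edges taken from baseline quantiles."""
    ordered = sorted(baseline)
    edges = [ordered[int(len(ordered) * i / n_bins)] for i in range(1, n_bins)]

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            counts[sum(v > e for e in edges)] += 1   # bin index = number of edges below v
        return [max(c / len(values), eps) for c in counts]

    expected, actual = proportions(baseline), proportions(current)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

rng = random.Random(0)
baseline = [rng.gauss(0.0, 1.0) for _ in range(5000)]
stable = [rng.gauss(0.0, 1.0) for _ in range(5000)]    # same distribution
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]   # mean moved by one std

print(f"PSI baseline vs stable:  {psi(baseline, stable):.4f}")
print(f"PSI baseline vs shifted: {psi(baseline, shifted):.4f}  (alert if > 0.2)")
```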

Q5: What does "overfitting to the validation set" mean, and how does it happen in practice?

When hyperparameters are selected based on validation performance across many iterations (grid search, random search, Bayesian optimization), the selected configuration is partially "lucky" on that specific validation partition. After $k$ evaluations, the best-seen validation score is optimistically biased by roughly $O(\sqrt{\ln k / n_{val}})$. This is why: (1) the test set must remain untouched until the very end; (2) nested cross-validation is needed when both model selection and performance estimation are required; (3) search strategies that evaluate fewer configurations, such as Bayesian HPO, tend to overfit the validation set less than an exhaustive grid search. In Kaggle competitions, public leaderboard overfitting is rampant - teams submit many times until lucky on the public test shard, then fail on the private leaderboard.

