
Generalization, Overfitting, and Underfitting

Reading time: ~26 minutes | Level: ML Foundations | Role: MLE, ML Engineer, Data Scientist, Research Engineer

A healthcare ML team trained a sepsis prediction model on patient data from a single hospital system. Offline AUC: 0.94. They deployed to a partner hospital. AUC dropped to 0.71.

The model had learned to use the time since last lab test as a strong predictor of sepsis risk - but only because in the training hospital, very sick patients were tested more frequently. The partner hospital used different protocols. The model had not learned medicine; it had learned a hospital's administrative patterns.

This is not "overfitting" in the traditional sense. The training error was not dramatically low. This was distribution shift - the model generalized well within its training distribution and degraded sharply on the deployment distribution.

Understanding generalization deeply - not just as "train error vs test error" but as the full picture of when and why models fail - is one of the most important skills an ML engineer can develop.

What You Will Learn

  • The formal definition of generalization gap and why it matters more than test accuracy alone
  • Overfitting: causes, detection, and the most effective remedies
  • Underfitting: causes, detection, and fixes
  • The regularization toolkit: L1, L2, dropout, early stopping, data augmentation - and the theory behind each
  • Distribution shift: covariate shift, label shift, concept drift - and detection strategies
  • How to set up evaluation pipelines that actually measure generalization
  • Five interview questions at senior ML engineer level

Part 1 - What Generalization Means, Precisely

The Generalization Gap

A model generalizes if it performs well on unseen data drawn from the same distribution as its training data.

Formally:

  • $\hat{R}(h) = \frac{1}{n}\sum_{i=1}^n \mathcal{L}(h(x_i), y_i)$ - empirical risk (training error)
  • $R(h) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\mathcal{L}(h(x), y)]$ - true risk (expected test error)

The generalization gap is $R(h) - \hat{R}(h)$.

From PAC theory, for a finite hypothesis class $\mathcal{H}$ and a loss bounded in $[0, 1]$, with probability $\geq 1-\delta$:

$$R(h) \leq \hat{R}(h) + \sqrt{\frac{\ln|\mathcal{H}| + \ln(1/\delta)}{2n}}$$

The gap shrinks as $n$ grows and as the hypothesis class complexity decreases. Regularization directly reduces the complexity term.
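To get a feel for the bound above, the sketch below evaluates its second term for a few sample sizes. The function name `pac_gap_bound` and the choice of one million hypotheses are illustrative assumptions, not part of the lesson.

```python
import math

def pac_gap_bound(n_samples: int, hypothesis_count: float, delta: float = 0.05) -> float:
    """Complexity term sqrt((ln|H| + ln(1/delta)) / (2n)) of the finite-class PAC bound."""
    return math.sqrt(
        (math.log(hypothesis_count) + math.log(1.0 / delta)) / (2.0 * n_samples)
    )

# Hypothetical setting: a finite class with one million hypotheses, 95% confidence
for n in (100, 1_000, 10_000, 100_000):
    print(f"n={n:>7}  gap bound={pac_gap_bound(n, 1e6):.4f}")
```

The bound falls as $1/\sqrt{n}$: ten times more data shrinks it by roughly a factor of three, while the hypothesis count only enters through a logarithm.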

Training distribution D        Test distribution D' ≠ D
┌──────────┐                   ┌──────────┐
│ ●  ●     │                   │ ●  ●     │
│   ●    ● │                   │   ●    ● │
│  ●       │                   │  ●       │
└──────────┘                   └──────────┘

  • Test data from the same D: the generalization gap tells you everything
  • Test data from D' ≠ D (distribution shift): the gap is misleading

Why The Test Set Must Be Sacred

Your test set is the only honest estimate of generalization. Every time you evaluate on it and use the result to make a decision (pick the better model, tune a threshold), you contaminate it. After $k$ evaluations on the same test set, your best-seen performance overfits the test partition by $O(\sqrt{\ln k / n_{test}})$.

Use the test set exactly once - at the very end - to report final performance.
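A small simulation makes the contamination effect concrete. It is a sketch under a deliberately extreme assumption: every "model" is a coin flip with true accuracy 0.5, so any best-of-$k$ score above 0.5 is pure selection bias. The helper name `best_of_k_accuracy` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_test = 500
y_test = rng.integers(0, 2, size=n_test)

def best_of_k_accuracy(k: int) -> float:
    """Best test accuracy among k models that all guess at random."""
    # Each row is one model's predictions; true accuracy of every model is 0.5
    preds = rng.integers(0, 2, size=(k, n_test))
    accs = (preds == y_test).mean(axis=1)
    return float(accs.max())

for k in (1, 10, 100, 1000):
    print(f"k={k:>4}  best test accuracy={best_of_k_accuracy(k):.3f}  "
          f"sqrt(ln k / n_test)={np.sqrt(np.log(k) / n_test):.3f}")
```

The best-seen accuracy creeps upward with $k$ even though no model is better than chance, which is the reason to keep the final test set out of the selection loop.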

Part 2 - Overfitting: Causes and Diagnosis

What Overfitting Actually Means

Overfitting occurs when a model learns patterns specific to the training sample that don't generalize. The model has excess capacity relative to data volume and noise level.

True signal: y = sin(x) + noise

  • Underfitting (degree 1): a straight line that misses the curve
  • Good fit (degree 3): a smooth fit that tracks the signal
  • Overfitting (degree 15): a wiggly curve that follows every noise point
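The three regimes above can be reproduced numerically. The following is a minimal sketch, assuming 30 training points on $[0, 2\pi]$ and Gaussian noise with standard deviation 0.2; those settings and the helper `sample` are illustrative choices, not taken from the lesson.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def sample(n):
    """Draw n points from y = sin(x) + noise on [0, 2*pi]."""
    x = rng.uniform(0.0, 2.0 * np.pi, size=n)
    y = np.sin(x) + rng.normal(0.0, 0.2, size=n)
    return x, y

x_train, y_train = sample(30)
x_test, y_test = sample(1000)

results = {}
for degree in (1, 3, 15):
    # Polynomial.fit rescales x internally, which keeps high degrees better conditioned
    poly = Polynomial.fit(x_train, y_train, deg=degree)
    train_mse = float(np.mean((poly(x_train) - y_train) ** 2))
    test_mse = float(np.mean((poly(x_test) - y_test) ** 2))
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree:>2}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```

Training error can only go down as the degree grows, because each polynomial class contains the previous one. The comparison that matters is the test column.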

Causes of Overfitting

  1. Too little data relative to model complexity
  2. Too many parameters (more than $O(n)$ for many model classes)
  3. Training for too many iterations - SGD eventually finds noise-fitting solutions
  4. Noisy labels - model memorizes labeling errors
  5. Leaky features - features that encode the label directly
  6. No regularization on an expressive model

Detecting Overfitting

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

np.random.seed(42)
# Synthetic binary task: 20 features, of which 5 are informative and 5 redundant
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5,
    n_redundant=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Sweep tree depth and record accuracy on both splits
depths = range(1, 25)
train_accs, test_accs = [], []

for depth in depths:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    train_accs.append(clf.score(X_train, y_train))
    test_accs.append(clf.score(X_test, y_test))

plt.figure(figsize=(10, 5))
plt.plot(depths, train_accs, 'b-o', lw=2, label='Train accuracy', ms=5)
plt.plot(depths, test_accs, 'r-s', lw=2, label='Test accuracy', ms=5)
# Illustration only: choosing the depth from this split makes it act as a
# validation set, so it no longer gives an unbiased final estimate
best_depth = depths[np.argmax(test_accs)]
plt.axvline(best_depth, color='green', ls='--', lw=1.5, label=f'Best depth={best_depth}')
plt.xlabel('Max depth of decision tree')
plt.ylabel('Accuracy')
plt.title('Overfitting in Decision Trees: Train vs Test Accuracy')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('overfitting_depth.png', dpi=150)

# depths starts at 1, so depth d lives at index d-1
print(f"At depth=1: train={train_accs[0]:.3f}, test={test_accs[0]:.3f} ← underfitting")
print(f"At depth={best_depth}: train={train_accs[best_depth-1]:.3f}, test={test_accs[best_depth-1]:.3f} ← optimal")
print(f"At depth=24: train={train_accs[-1]:.3f}, test={test_accs[-1]:.3f} ← overfitting")

Part 3 - The Regularization Toolkit

Regularization reduces the generalization gap by discouraging the model from fitting noise.

L2 Regularization (Ridge / Weight Decay)

$$\mathcal{L}_{reg}(\theta) = \mathcal{L}_{train}(\theta) + \lambda\|\theta\|_2^2$$

Shrinks all weights toward zero. Equivalent to MAP estimation under a Gaussian prior $P(\theta) \propto e^{-\lambda\|\theta\|^2}$. Bounding the weight norm also bounds the Rademacher complexity, which scales with the norm bound - see Lesson 05.

L1 Regularization (Lasso)

$$\mathcal{L}_{reg}(\theta) = \mathcal{L}_{train}(\theta) + \lambda\|\theta\|_1$$

Drives some weights exactly to zero - automatic feature selection. Equivalent to a Laplace prior. The geometry: the L1 constraint set (hyperdiamond) has corners on the coordinate axes, so the optimal point often lies there (sparse solution).

| | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Effect | Sparse (some weights = 0) | Shrinks all weights, none exactly to zero |
| Feature selection | Yes - automatic | No |
| Correlated features | Picks one arbitrarily | Averages them |
| When to use | Many irrelevant features | All features likely matter |

import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

np.random.seed(42)
n = 80
X_raw = np.random.randn(n, 1)
# True signal is a cubic; a degree-10 polynomial has far more capacity than needed
y = 0.5 * X_raw.ravel()**3 - 2*X_raw.ravel() + np.random.randn(n) * 0.5

def make_poly_model(degree, model):
    """Polynomial expansion -> standardization -> regularized linear model."""
    return Pipeline([
        ('poly', PolynomialFeatures(degree)),
        ('scale', StandardScaler()),
        ('reg', model)
    ])

print(f"{'Model':<45} {'CV R² mean':>12} {'CV R² std':>12}")
print('-' * 70)
for name, model in [
    ('Poly-10, no reg (alpha=0.0001)', Ridge(alpha=0.0001)),
    ('Poly-10, L2 weak (alpha=0.1)', Ridge(alpha=0.1)),
    ('Poly-10, L2 med (alpha=1.0)', Ridge(alpha=1.0)),
    ('Poly-10, L2 strong(alpha=100)', Ridge(alpha=100.0)),
    ('Poly-10, L1 Lasso (alpha=0.01)', Lasso(alpha=0.01, max_iter=5000)),
]:
    pipe = make_poly_model(10, model)
    # The pipeline is refit inside each fold, so scaling never sees held-out rows
    scores = cross_val_score(pipe, X_raw, y, cv=10, scoring='r2')
    print(f"{name:<45} {scores.mean():>12.4f} {scores.std():>12.4f}")
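The sparsity contrast between L1 and L2 can also be checked directly by counting coefficients that are exactly zero. This is a minimal sketch on made-up data: 20 features of which only the first three carry signal, with `alpha` values chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n_samples, n_features = 200, 20
X = rng.normal(size=(n_samples, n_features))

# Assumed ground truth: only the first three features matter
true_coef = np.zeros(n_features)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))
print(f"Lasso: {n_zero_lasso} of {n_features} coefficients are exactly zero")
print(f"Ridge: {n_zero_ridge} of {n_features} coefficients are exactly zero")
```

Ridge leaves the irrelevant coefficients small but nonzero, while Lasso removes most of them outright, which matches the corner geometry described for the L1 constraint set.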

Dropout

During training, randomly zero out units with probability $p$:

$$\tilde{h} = h \odot m, \quad m_i \sim \text{Bernoulli}(1-p)$$

At inference: scale outputs by $(1-p)$ (or equivalently use inverted dropout, which divides by $(1-p)$ at train time so inference needs no scaling).

Why it works: Dropout trains an exponential ensemble of thinned sub-networks and approximates their average at test time. Prevents co-adaptation - neurons can't rely on specific other neurons always being present.
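The inverted-dropout variant mentioned above is short enough to write out. The sketch below uses NumPy rather than a deep learning framework, and the function name `inverted_dropout` is a hypothetical helper; it shows that dividing by the keep probability preserves the expected activation.

```python
import numpy as np

def inverted_dropout(h, p, rng):
    """Zero each unit with probability p and rescale survivors by 1 / (1 - p)."""
    keep_prob = 1.0 - p
    mask = rng.random(h.shape) < keep_prob  # True with probability 1 - p
    return h * mask / keep_prob

rng = np.random.default_rng(0)
h = np.ones(100_000)
h_tilde = inverted_dropout(h, p=0.4, rng=rng)

print(f"fraction of units dropped: {(h_tilde == 0).mean():.3f}")
print(f"mean activation after dropout: {h_tilde.mean():.3f}")
```

Because the rescaling happens at train time, the same layer can be used at inference with the mask simply switched off.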

import torch
import torch.nn as nn
import torch.optim as optim

class MLPWithDropout(nn.Module):
    """Two-hidden-layer MLP for binary classification with dropout after each hidden layer."""
    def __init__(self, dropout_rate=0.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(20, 256), nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(256, 1), nn.Sigmoid()
        )

    def forward(self, x):
        # Drop the trailing singleton dimension: (batch, 1) -> (batch,)
        return self.net(x).squeeze(-1)

def train_eval(dropout_rate, n_epochs=150, seed=42):
    torch.manual_seed(seed)
    # Only the first two of 20 features determine the label; the rest are noise
    X = torch.randn(250, 20)
    y = (X[:, 0] + X[:, 1] > 0).float()
    X_tr, X_te = X[:200], X[200:]
    y_tr, y_te = y[:200], y[200:]

    model = MLPWithDropout(dropout_rate)
    opt = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
    loss_fn = nn.BCELoss()
    for _ in range(n_epochs):
        model.train()  # dropout active
        opt.zero_grad()
        loss_fn(model(X_tr), y_tr).backward()
        opt.step()

    model.eval()  # dropout disabled for evaluation
    with torch.no_grad():
        train_acc = ((model(X_tr) > 0.5) == y_tr.bool()).float().mean().item()
        test_acc = ((model(X_te) > 0.5) == y_te.bool()).float().mean().item()
    return train_acc, test_acc

print(f"{'Dropout rate':<15} {'Train acc':>12} {'Test acc':>12} {'Gap':>10}")
print('-' * 52)
for rate in [0.0, 0.2, 0.4, 0.6]:
    tr, te = train_eval(rate)
    print(f"{rate:<15.1f} {tr:>12.4f} {te:>12.4f} {tr-te:>10.4f}")

Early Stopping

Stop training when validation error stops improving. Two ways to see why this acts as regularization:

  • For linear models trained with gradient descent on a quadratic loss: early stopping is approximately equivalent to L2 regularization, with fewer steps playing the role of a larger $\lambda$
  • For neural networks: it stops weights from moving far from initialization, which limits the effective complexity of the solution

import numpy as np
import matplotlib.pyplot as plt

# Synthetic curves for illustration: train loss decays, and the gap between
# val loss and train loss starts growing after epoch 70
epochs = np.arange(1, 201)
train_loss = 0.8 * np.exp(-epochs / 40) + 0.02
val_loss = train_loss + 0.0003 * (epochs - 70).clip(0) ** 1.6

# Epoch (1-indexed) with the lowest validation loss
best_epoch = np.argmin(val_loss) + 1

plt.figure(figsize=(10, 4))
plt.plot(epochs, train_loss, 'b-', lw=2, label='Train loss')
plt.plot(epochs, val_loss, 'r-', lw=2, label='Val loss')
plt.axvline(best_epoch, color='green', ls='--', lw=2,
            label=f'Early stop at epoch {best_epoch}')
plt.fill_between(epochs[best_epoch:], train_loss[best_epoch:], val_loss[best_epoch:],
                 alpha=0.15, color='red', label='Overfitting zone')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Early Stopping: Optimal Training Checkpoint')
plt.legend(); plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('early_stopping.png', dpi=150)
print(f"Best checkpoint: epoch {best_epoch}")
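In a real training loop the full validation curve is not known in advance, so early stopping is usually implemented with a patience counter. The following is a minimal sketch of that rule; the function `early_stopping_epoch` and its `patience` and `min_delta` parameters are illustrative names, not a specific library's API.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return (best_epoch, stopped_at), both 1-indexed.

    Training stops once `patience` consecutive epochs fail to improve the
    best validation loss by more than `min_delta`.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)

# Made-up validation losses: improvement stalls after epoch 4
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.56, 0.60]
best_epoch, stopped_at = early_stopping_epoch(val_losses, patience=3)
print(f"best epoch={best_epoch}, stopped at epoch={stopped_at}")
```

In practice the weights from `best_epoch` are the ones to keep, so a checkpoint should be saved whenever the best loss improves.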

Part 4 - Underfitting: Causes and Fixes

Underfitting: model too simple to capture the true signal. Both train and test error are high.

Signature on learning curve:

error
│ val   ─────────────────────
│ train ────────────────────
│ (both plateau at HIGH error regardless of n)
└──────────────────────────→ n

Fix: More capacity, better features, less regularization.

| Cause | Fix |
|---|---|
| Model too simple | Increase capacity (depth, width, polynomial degree) |
| Insufficient training | Train longer; tune the learning rate |
| Too strong regularization | Lower λ, reduce dropout rate |
| Wrong model family | Use CNN for images, LSTM/Transformer for sequences |
| Poor features | Feature engineering or representation learning |

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_friedman1

np.random.seed(42)
# Friedman #1 is a standard nonlinear regression benchmark
X, y = make_friedman1(n_samples=500, n_features=10, noise=1.0, random_state=42)

models = {
    'Linear Ridge (underfits - Friedman is nonlinear)': Pipeline([
        ('s', StandardScaler()), ('m', Ridge(alpha=1.0))
    ]),
    'Gradient Boosting depth=2': GradientBoostingRegressor(max_depth=2, random_state=42),
    'Gradient Boosting depth=4': GradientBoostingRegressor(max_depth=4, random_state=42),
    'Gradient Boosting depth=6': GradientBoostingRegressor(max_depth=6, random_state=42),
}

print(f"{'Model':<48} {'CV MSE':>10} {'Std':>8}")
print('-' * 68)
for name, model in models.items():
    # scikit-learn returns negated MSE so that higher is better; flip the sign back
    scores = -cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    print(f"{name:<48} {scores.mean():>10.3f} {scores.std():>8.3f}")

print("\nLinear model cannot capture Friedman's nonlinear interactions → underfitting")

Part 5 - Distribution Shift: The Production Generalization Problem

Classical generalization theory assumes training and deployment data come from the same distribution. In production, this assumption fails constantly.

Covariate shift - the feature distribution $P(X)$ changes, but labels still follow the same mechanism $P(Y|X)$:

  • Train on summer data, deploy in winter; train on US, deploy in EU
  • Fix: importance weighting - reweight training samples by $P_{deploy}(x)/P_{train}(x)$

Label shift - class proportions $P(Y)$ change:

  • Fraud rate drops from 5% (training) to 0.1% (production)
  • Fix: recalibrate posterior probabilities using the new prior

Concept drift - $P(Y|X)$ changes as the world evolves:

  • User preferences change; fraud patterns adapt; language usage shifts
  • Fix: periodic retraining, online learning, sliding-window models

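The importance-weighting fix for covariate shift needs the density ratio $P_{deploy}(x)/P_{train}(x)$, which is rarely known. One common way to estimate it is to train a classifier to tell training rows from deployment rows and convert its odds into weights. The sketch below assumes a one-dimensional Gaussian feature whose mean moves from 0 to 1; all data and variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x_train = rng.normal(loc=0.0, scale=1.0, size=(2000, 1))
x_deploy = rng.normal(loc=1.0, scale=1.0, size=(2000, 1))  # assumed shifted feature

# Domain classifier: label 0 = training row, label 1 = deployment row
X_domain = np.vstack([x_train, x_deploy])
d = np.concatenate([np.zeros(len(x_train)), np.ones(len(x_deploy))])
domain_clf = LogisticRegression().fit(X_domain, d)

# Odds of "deployment" times n_train / n_deploy estimates P_deploy(x) / P_train(x)
p_deploy = domain_clf.predict_proba(x_train)[:, 1]
weights = p_deploy / (1.0 - p_deploy) * (len(x_train) / len(x_deploy))

unweighted_mean = float(x_train.mean())
weighted_mean = float(np.average(x_train[:, 0], weights=weights))
print(f"training mean, unweighted: {unweighted_mean:.3f}")
print(f"training mean, importance-weighted: {weighted_mean:.3f} (deployment mean is 1.0)")
```

The same weights can be passed as `sample_weight` when fitting the downstream model. Weights with a very wide spread are a warning sign that the two distributions barely overlap.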
import numpy as np
from scipy.stats import ks_2samp

np.random.seed(42)
X_train = np.random.randn(1000, 10)

# Covariate shift: features shift by 0.5 on average
X_deploy_shifted = np.random.randn(200, 10) + 0.5

print("KS Test for Distribution Shift Per Feature")
print(f"{'Feature':<12} {'KS stat':>10} {'p-value':>12} {'Shift?':>10}")
print('-' * 46)
# Two-sample KS test per feature; with many features tested at 0.05,
# consider a multiple-testing correction to limit false alarms
for i in range(X_train.shape[1]):
    stat, p = ks_2samp(X_train[:, i], X_deploy_shifted[:, i])
    shift = 'YES ⚠' if p < 0.05 else 'no'
    print(f"Feature {i:<4} {stat:>10.4f} {p:>12.4f} {shift:>10}")

print("\nFeatures with p < 0.05 show statistically significant distribution shift.")
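For the label-shift fix listed earlier (recalibrating posteriors with the new prior), the correction follows from Bayes' rule once two assumptions hold: $P(X|Y)$ is unchanged, and the model's scores are calibrated on the training distribution. The sketch below covers the binary case; the function name `adjust_for_label_shift` is a hypothetical helper, and the priors reuse the 5% and 0.1% fraud rates from the example.

```python
import numpy as np

def adjust_for_label_shift(p_pos, prior_train, prior_deploy):
    """Rescale calibrated positive-class probabilities to a new class prior.

    Assumes P(X|Y) is the same in training and deployment (pure label shift).
    """
    p_pos = np.asarray(p_pos, dtype=float)
    w_pos = prior_deploy / prior_train                    # reweight positive class
    w_neg = (1.0 - prior_deploy) / (1.0 - prior_train)    # reweight negative class
    numerator = p_pos * w_pos
    return numerator / (numerator + (1.0 - p_pos) * w_neg)

scores = np.array([0.05, 0.50, 0.90])
adjusted = adjust_for_label_shift(scores, prior_train=0.05, prior_deploy=0.001)
for before, after in zip(scores, adjusted):
    print(f"training-prior score={before:.2f} -> deployment-prior score={after:.4f}")
```

A score equal to the training prior maps to the deployment prior, which is a quick sanity check on the formula. Decision thresholds tuned under the old prior usually need revisiting after this adjustment.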

Monitoring for Drift in Production

from collections import deque
import numpy as np
from scipy.stats import ks_2samp

class DriftMonitor:
    """
    Sliding-window drift detector for production ML systems.
    Compares recent predictions to a baseline using KS test.
    """
    def __init__(self, window_size=500, alpha=0.05):
        self.window = deque(maxlen=window_size)  # oldest predictions fall out automatically
        self.baseline = None
        self.alpha = alpha

    def set_baseline(self, predictions):
        self.baseline = np.array(predictions)

    def log(self, prediction):
        self.window.append(prediction)

    def check(self):
        # Require a minimum window so the KS test has reasonable power
        if len(self.window) < 50 or self.baseline is None:
            return {'ready': False}
        stat, p = ks_2samp(self.baseline, list(self.window))
        return {
            'ready': True,
            'drift_detected': bool(p < self.alpha),
            'ks_stat': round(float(stat), 4),
            'p_value': round(float(p), 4),
            'baseline_mean': round(float(self.baseline.mean()), 4),
            'current_mean': round(float(np.mean(self.window)), 4),
        }

# Simulate stable period, then shift
np.random.seed(42)
monitor = DriftMonitor(window_size=300)
monitor.set_baseline(np.random.beta(2, 5, 1000))  # stable baseline

for _ in range(100):
    monitor.log(np.random.beta(2, 5))  # stable
print("After 100 stable predictions:")
print(monitor.check())

for _ in range(300):
    monitor.log(np.random.beta(5, 2))  # shifted; fills the whole 300-item window
print("\nAfter 300 shifted predictions:")
print(monitor.check())

Part 6 - Evaluation Pipelines That Catch Generalization Problems

Wrong evaluation:

  • All data → random 80/20 split
  • Works for i.i.d. data
  • FAILS for: time series (data leakage), user data (group leakage), geographic data, medical data (patient leakage)

Right evaluation:

  • Temporal split - Train: t < cutoff; Val: cutoff ≤ t < cutoff+Δ; Test: t ≥ cutoff+Δ
  • Group split - Train: users A–G; Test: users H–Z


| Split | Purpose | How to set size |
|---|---|---|
| Train | Fit parameters | 60–80% of data |
| Validation | Select hyperparameters | 10–20% |
| Test | Final estimate, used ONCE | 10–20%; minimum 1,000 examples for stable estimates |

Data leakage checklist (Lesson 10 covers in depth):

  • No future information in features for time-series tasks
  • No ID columns or proxy-ID columns as features
  • Scaling/normalization fit on train-only, applied to val/test
  • No rows from the same entity (user, patient) in both train and test
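The group and temporal splits described in this part can be sketched in a few lines. The example uses scikit-learn's `GroupShuffleSplit` for the entity-level split and plain boolean masks for the temporal one; the `user_id` and `timestamp` arrays, the cutoff, and the gap Δ are made-up values for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_rows = 1000
X = rng.normal(size=(n_rows, 5))
user_id = rng.integers(0, 100, size=n_rows)              # hypothetical entity column
timestamp = np.sort(rng.uniform(0.0, 365.0, size=n_rows))  # hypothetical event time in days

# Group split: every user lands entirely in train or entirely in test
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=user_id))
shared_users = set(user_id[train_idx]) & set(user_id[test_idx])
print(f"users appearing in both splits: {len(shared_users)}")

# Temporal split: train < cutoff <= val < cutoff + delta <= test
cutoff, delta = 250.0, 50.0
train_mask = timestamp < cutoff
val_mask = (timestamp >= cutoff) & (timestamp < cutoff + delta)
test_mask = timestamp >= cutoff + delta
print(f"temporal split sizes: train={train_mask.sum()}, val={val_mask.sum()}, test={test_mask.sum()}")
```

Any preprocessing (scaling, imputation, encoding) should then be fit on the training rows only and applied unchanged to the validation and test rows.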

:::tip Video Resources

  • StatQuest - Overfitting: Fitting the noise. Visual walkthrough of overfitting in decision trees and polynomial regression. (~7 min)
  • StatQuest - Regularization Part 1 - Ridge. The clearest explanation of L2 regularization and why it works. (~10 min)
  • StatQuest - Regularization Part 2 - Lasso. L1 sparsity geometry and the diamond constraint set. (~8 min)
  • Andrej Karpathy - Lecture: Training Neural Networks (Stanford CS231n). Covers dropout, batch norm, weight decay in practical depth. (~1h 20min)

:::

Interview Questions

Q1: What is the generalization gap and what determines its size?

The generalization gap is $R(h) - \hat{R}(h)$ - the difference between true expected error and empirical training error. PAC theory bounds it as $O(\sqrt{\ln|\mathcal{H}|/n})$ for finite classes, or using Rademacher complexity for infinite classes. Practically, the gap is larger when: (1) model complexity is high relative to training data, (2) labels are noisy (high $\sigma^2$), (3) training data is small, (4) the test set was used to make model decisions (contamination). It is reduced by: more training data, regularization, ensembling, and careful evaluation protocol.

Q2: Explain the difference between L1 and L2 regularization geometrically and practically.

Both add a constraint on parameters: L2 constrains $\|\theta\|_2^2 \leq B^2$ (a sphere); L1 constrains $\|\theta\|_1 \leq B$ (a hyperdiamond). The optimal solution is where the loss function's level sets touch the constraint set. The L2 sphere has no corners - solutions occur anywhere on the surface, giving small-but-nonzero weights. The L1 hyperdiamond has corners on the coordinate axes - the optimal often lies at a corner, giving exact zeros (sparsity). Use L1 when many features are irrelevant (automatic feature selection); use L2 when all features matter and you want smooth shrinkage. ElasticNet combines both: L1 for sparsity + L2 for grouping correlated features.

Q3: You have training accuracy 99%, test accuracy 68%. Walk through diagnosis and remediation.

Step 1: Check for data leakage - is future information in features? Is the test split representative? Is preprocessing fit on the full dataset including test? Step 2: Confirm distributions match - run KS tests on key features. Step 3: This is high variance (overfitting). The 31-point gap is the signature. Remediation in priority order: (1) More training data - most reliable; (2) Add regularization - L2 weight decay, dropout; (3) Reduce model complexity - shallower tree, fewer layers; (4) Ensembling - bagging reduces variance without sacrificing bias; (5) Feature selection - remove noisy features. Measure each fix using cross-validation, not the test set.

Q4: What is distribution shift and how do you detect it in production?

Distribution shift: $P_{train}(X, Y) \neq P_{deploy}(X, Y)$. Types: covariate shift ($P(X)$ changes, $P(Y|X)$ stable), label shift ($P(Y)$ changes), concept drift ($P(Y|X)$ changes). Detection: (1) Feature drift - KS test comparing training vs recent production feature distributions; (2) Prediction drift - monitor model output score distribution; (3) Performance monitoring - if delayed labels available, track accuracy/AUC over time windows; (4) Data quality checks - missing rates, cardinality changes, value range violations. Automatic alerting when KS statistic exceeds threshold or PSI (Population Stability Index) > 0.2 is a common production setup.
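Since the answer above mentions PSI without defining it, here is one common formulation: bin both samples using quantiles of the baseline, then sum $(a_i - e_i)\ln(a_i/e_i)$ over bins, where $e_i$ and $a_i$ are the baseline and current bin fractions. The sketch below is one reasonable implementation rather than a standard library call; the function name, the 10-bin default, and the `eps` floor for empty bins are assumptions.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a baseline sample and a current sample of one numeric feature."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior bin edges come from quantiles of the baseline sample
    edges = np.quantile(expected, np.linspace(0.0, 1.0, n_bins + 1))[1:-1]
    exp_counts = np.bincount(np.searchsorted(edges, expected, side="right"), minlength=n_bins)
    act_counts = np.bincount(np.searchsorted(edges, actual, side="right"), minlength=n_bins)
    # Floor the fractions so empty bins do not produce log(0)
    exp_frac = np.clip(exp_counts / len(expected), eps, None)
    act_frac = np.clip(act_counts / len(actual), eps, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, size=5000)
psi_same = population_stability_index(baseline, rng.normal(0.0, 1.0, size=5000))
psi_shifted = population_stability_index(baseline, rng.normal(1.0, 1.0, size=5000))
print(f"PSI, same distribution:  {psi_same:.4f}")
print(f"PSI, mean shifted by 1:  {psi_shifted:.4f}")
```

The 0.2 alert threshold quoted above is a rule of thumb, so it is worth checking what PSI values a stable period produces for each feature before wiring up alerts.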

Q5: What does "overfitting to the validation set" mean, and how does it happen in practice?

When hyperparameters are selected based on validation performance across many iterations (grid search, random search, Bayesian optimization), the selected configuration is partially "lucky" on that specific validation partition. After $k$ evaluations, the best-seen validation score overfits by $O(\sqrt{\ln k / n_{val}})$. This is why: (1) the test set must remain untouched until the very end; (2) nested cross-validation is needed when both model selection and performance estimation are required; (3) search methods that need fewer evaluations, such as Bayesian HPO, tend to overfit the validation set less than an exhaustive grid search. In Kaggle competitions, public leaderboard overfitting is rampant - teams submit many times until they get lucky on the public test shard, then drop on the private leaderboard.


© 2026 EngineersOfAI. All rights reserved.