Generalization, Overfitting, and Underfitting
Reading time: ~26 minutes | Level: ML Foundations | Role: MLE, ML Engineer, Data Scientist, Research Engineer
A healthcare ML team trained a sepsis prediction model on patient data from a single hospital system. Offline AUC: 0.94. They deployed to a partner hospital. AUC dropped to 0.71.
The model had learned to use the time since last lab test as a strong predictor of sepsis risk - but only because in the training hospital, very sick patients were tested more frequently. The partner hospital used different protocols. The model had not learned medicine; it had learned a hospital's administrative patterns.
This is not "overfitting" in the traditional sense: the training error was not dramatically low. It was distribution shift. The model generalized well on its training distribution and degraded sharply on the deployment distribution.
Understanding generalization deeply - not just as "train error vs test error" but as the full picture of when and why models fail - is one of the most important skills an ML engineer can develop.
What You Will Learn
- The formal definition of generalization gap and why it matters more than test accuracy alone
- Overfitting: causes, detection, and the most effective remedies
- Underfitting: causes, detection, and fixes
- The regularization toolkit: L1, L2, dropout, early stopping, data augmentation - and the theory behind each
- Distribution shift: covariate shift, label shift, concept drift - and detection strategies
- How to set up evaluation pipelines that actually measure generalization
- Five interview questions at senior ML engineer level
Part 1 - What Generalization Means, Precisely
The Generalization Gap
A model generalizes if it performs well on new, unseen data drawn from the same distribution as its training data.
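Formally, for a hypothesis $h$, a loss $\ell$, and $n$ training examples $(x_i, y_i)$ drawn i.i.d. from a distribution $D$:
- $\hat{R}(h) = \frac{1}{n}\sum_{i=1}^{n} \ell(h(x_i), y_i)$ - empirical risk (training error)
- $R(h) = \mathbb{E}_{(x,y)\sim D}\left[\ell(h(x), y)\right]$ - true risk (expected test error)
The generalization gap is $R(h) - \hat{R}(h)$.
From PAC theory, for a finite hypothesis class $\mathcal{H}$ and a loss bounded in $[0, 1]$, with probability at least $1 - \delta$ every $h \in \mathcal{H}$ satisfies:
$$R(h) \le \hat{R}(h) + \sqrt{\frac{\ln|\mathcal{H}| + \ln(1/\delta)}{2n}}$$
The gap shrinks as $n$ grows and as the hypothesis class complexity decreases. Regularization directly reduces the complexity term.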
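To get a feel for how such a bound scales, here is a minimal sketch. It assumes the standard Hoeffding-plus-union-bound form for a finite hypothesis class, $\sqrt{(\ln|\mathcal{H}| + \ln(1/\delta)) / (2n)}$, with a loss bounded in $[0, 1]$. The function name and the example values are illustrative and do not come from any library.

```python
import math

def finite_class_gap_bound(n, hypothesis_count, delta=0.05):
    """Upper bound on R(h) - R_hat(h) that holds with probability >= 1 - delta.

    Assumes a finite hypothesis class and a loss bounded in [0, 1]
    (Hoeffding's inequality plus a union bound over the class).
    """
    return math.sqrt((math.log(hypothesis_count) + math.log(1.0 / delta)) / (2.0 * n))

# More data tightens the bound; a larger hypothesis class loosens it.
for n in (100, 1_000, 10_000, 100_000):
    small = finite_class_gap_bound(n, hypothesis_count=10**3)
    large = finite_class_gap_bound(n, hypothesis_count=10**9)
    print(f"n={n:>7}  |H|=1e3: {small:.4f}   |H|=1e9: {large:.4f}")
```

Bounds of this kind are usually far too loose to predict the gap of a real model, but the direction of the dependence on $n$ and on class size is the useful part.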
Training distribution D Test distribution D' ≠ D
┌──────────┐ ┌──────────┐
│ ● ● │ │ ● ● │
│ ● ●│ │ ● ●│
│ ● │ │ ● │
└──────────┘ └──────────┘
Same D: the generalization gap tells you everything.
D ≠ D': distribution shift - the gap is misleading.
Why The Test Set Must Be Sacred
Your test set is the only honest estimate of generalization. Every time you evaluate on it and use the result to make a decision (pick the better model, tune a threshold), you contaminate it. After $k$ evaluations on the same test set of size $n$, your best-seen performance can overfit the test partition by roughly $O\left(\sqrt{\log k / n}\right)$.
Use the test set exactly once - at the very end - to report final performance.
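The effect of repeated peeking is easy to simulate. The sketch below is a toy setup, not a real workflow: every "model" guesses labels at random, so its true accuracy is 50%, and the only thing that varies is luck on one fixed test set. The sizes `n_test` and `k_models` are arbitrary choices, and $\sqrt{\log k / n}$ is printed only as a rough reference scale.

```python
import numpy as np

rng = np.random.default_rng(0)
n_test = 200       # size of the held-out test set
k_models = 1000    # number of times we "peek" at the test set

y_test = rng.integers(0, 2, size=n_test)

# Each "model" guesses at random, so its true accuracy is exactly 50%.
preds = rng.integers(0, 2, size=(k_models, n_test))
accs = (preds == y_test).mean(axis=1)

best_seen = accs.max()
rough_inflation = np.sqrt(np.log(k_models) / n_test)

print(f"mean accuracy over all models : {accs.mean():.3f}")
print(f"best accuracy seen on test set: {best_seen:.3f}")
print(f"sqrt(log k / n) for reference : {rough_inflation:.3f}")
```

The best-seen accuracy lands well above 50% even though no model has any skill, which is exactly the optimism that selecting on the test set bakes into a reported number.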
Part 2 - Overfitting: Causes and Diagnosis
What Overfitting Actually Means
Overfitting occurs when a model learns patterns specific to the training sample that don't generalize. The model has excess capacity relative to data volume and noise level.
True signal: y = sin(x) + noise
Underfitting (degree 1): a straight line that misses the curve.
Good fit (degree 3): a smooth fit.
Overfitting (degree 15): a wiggly curve that follows every noise point.
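The three regimes can be reproduced numerically. This is a minimal sketch under assumed settings (30 training points, noise standard deviation 0.2, inputs on $[0, 2\pi]$); none of these values come from the text. It uses `numpy.polynomial.Polynomial.fit`, which rescales the inputs internally so that the degree-15 fit stays numerically stable.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)

def sample(n):
    """Draw n points from y = sin(x) + Gaussian noise on [0, 2*pi]."""
    x = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return x, np.sin(x) + rng.normal(scale=0.2, size=n)

x_train, y_train = sample(30)
x_test, y_test = sample(500)

train_mse, test_mse = {}, {}
for degree in (1, 3, 15):
    # Least-squares polynomial fit of the given degree on the training sample.
    poly = Polynomial.fit(x_train, y_train, deg=degree)
    train_mse[degree] = float(np.mean((poly(x_train) - y_train) ** 2))
    test_mse[degree] = float(np.mean((poly(x_test) - y_test) ** 2))
    print(f"degree {degree:>2}: train MSE={train_mse[degree]:.4f}  "
          f"test MSE={test_mse[degree]:.4f}")
```

Training error can only go down as the degree increases, because each polynomial family contains the previous one. The test error is the number to watch.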
Causes of Overfitting
- Too little data relative to model complexity
- Too many parameters (more than the number of training examples $n$, for many model classes)
- Training too many iterations - SGD finds noise-fitting solutions
- Noisy labels - model memorizes labeling errors
- Leaky features - features that encode the label directly
- No regularization on an expressive model
Detecting Overfitting
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
np.random.seed(42)
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5,
    n_redundant=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
depths = range(1, 25)
train_accs, test_accs = [], []
# Fit one tree per depth and record accuracy on both splits
for depth in depths:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    train_accs.append(clf.score(X_train, y_train))
    test_accs.append(clf.score(X_test, y_test))
plt.figure(figsize=(10, 5))
plt.plot(depths, train_accs, 'b-o', lw=2, label='Train accuracy', ms=5)
plt.plot(depths, test_accs, 'r-s', lw=2, label='Test accuracy', ms=5)
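# Illustration only: in a real pipeline, pick the depth on a validation split
# (or with cross-validation) and keep the test set for the final report.
best_depth = depths[np.argmax(test_accs)]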
plt.axvline(best_depth, color='green', ls='--', lw=1.5, label=f'Best depth={best_depth}')
plt.xlabel('Max depth of decision tree')
plt.ylabel('Accuracy')
plt.title('Overfitting in Decision Trees: Train vs Test Accuracy')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('overfitting_depth.png', dpi=150)
print(f"At depth=1: train={train_accs[0]:.3f}, test={test_accs[0]:.3f} ← underfitting")
print(f"At depth={best_depth}: train={train_accs[best_depth-1]:.3f}, test={test_accs[best_depth-1]:.3f} ← optimal")
print(f"At depth=24: train={train_accs[-1]:.3f}, test={test_accs[-1]:.3f} ← overfitting")
Part 3 - The Regularization Toolkit
Regularization reduces the generalization gap by discouraging the model from fitting noise.
L2 Regularization (Ridge / Weight Decay)
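Adds a penalty $\lambda \lVert w \rVert_2^2$ to the training loss, which shrinks all weights toward zero. Equivalent to MAP estimation under a zero-mean Gaussian prior on the weights. Bounding the weight norm also bounds the Rademacher complexity of the model class, which scales with that norm bound - see Lesson 05.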
L1 Regularization (Lasso)
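Adds a penalty $\lambda \lVert w \rVert_1$ to the training loss, which drives some weights exactly to zero - automatic feature selection. Equivalent to MAP estimation under a Laplace prior. The geometry: the L1 constraint set (a hyperdiamond) has corners on the coordinate axes, so the optimal point often lies at a corner, where some coordinates are exactly zero (a sparse solution).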
| | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Effect | Sparse (some weights = 0) | Shrinks all weights proportionally; none exactly zero |
| Feature selection | Yes - automatic | No |
| Correlated features | Picks one arbitrarily | Averages them |
| When to use | Many irrelevant features | All features likely matter |
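The difference in the "Effect" row is easiest to see in a special case. The sketch below assumes an orthonormal design matrix, where both penalties have closed-form solutions in terms of the unregularized (OLS) coefficients: with objective $\frac{1}{2}\lVert y - Xw \rVert^2 + \frac{\lambda}{2}\lVert w \rVert_2^2$ ridge rescales every coefficient by $1/(1+\lambda)$, and with $\frac{1}{2}\lVert y - Xw \rVert^2 + \lambda\lVert w \rVert_1$ lasso soft-thresholds at $\lambda$. The coefficient values are made up for illustration.

```python
import numpy as np

def ridge_shrink(w_ols, lam):
    """Ridge solution under an orthonormal design: proportional rescaling."""
    return w_ols / (1.0 + lam)

def lasso_shrink(w_ols, lam):
    """Lasso solution under an orthonormal design: soft-thresholding at lam."""
    return np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam, 0.0)

w_ols = np.array([3.0, -1.5, 0.4, -0.2, 0.05])   # hypothetical OLS coefficients
lam = 0.5

w_ridge = ridge_shrink(w_ols, lam)
w_lasso = lasso_shrink(w_ols, lam)

print("OLS  :", w_ols)
print("Ridge:", np.round(w_ridge, 3))
print("Lasso:", np.round(w_lasso, 3))
print("Exact zeros -> ridge:", int(np.sum(w_ridge == 0)),
      " lasso:", int(np.sum(w_lasso == 0)))
```

Every coefficient whose magnitude is below the threshold is set to exactly zero by lasso, while ridge keeps all of them nonzero. With correlated, non-orthonormal features there is no such closed form, but the qualitative behavior carries over.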
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
np.random.seed(42)
n = 80
X_raw = np.random.randn(n, 1)
y = 0.5 * X_raw.ravel()**3 - 2*X_raw.ravel() + np.random.randn(n) * 0.5
def make_poly_model(degree, model):
    # Expand to polynomial features, standardize them, then fit the regularized model
    return Pipeline([
        ('poly', PolynomialFeatures(degree)),
        ('scale', StandardScaler()),
        ('reg', model)
    ])

print(f"{'Model':<45} {'CV R² mean':>12} {'CV R² std':>12}")
print('-' * 70)
for name, model in [
    ('Poly-10, no reg (alpha=0.0001)', Ridge(alpha=0.0001)),
    ('Poly-10, L2 weak (alpha=0.1)', Ridge(alpha=0.1)),
    ('Poly-10, L2 med (alpha=1.0)', Ridge(alpha=1.0)),
    ('Poly-10, L2 strong(alpha=100)', Ridge(alpha=100.0)),
    ('Poly-10, L1 Lasso (alpha=0.01)', Lasso(alpha=0.01, max_iter=5000)),
]:
    pipe = make_poly_model(10, model)
    scores = cross_val_score(pipe, X_raw, y, cv=10, scoring='r2')
    print(f"{name:<45} {scores.mean():>12.4f} {scores.std():>12.4f}")
Dropout
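During training, randomly zero out each unit with probability $p$:
$$\tilde{h}_i = m_i \, h_i, \qquad m_i \sim \text{Bernoulli}(1 - p)$$
At inference: scale outputs by $(1 - p)$ (or equivalently use inverted dropout, which divides by $(1 - p)$ at train time and leaves inference unchanged).
Why it works: Dropout trains an exponentially large ensemble of thinned sub-networks that share weights, and approximates their average at test time. It also prevents co-adaptation: neurons can't rely on specific other neurons always being present.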
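The masking and rescaling can be sketched in a few lines of NumPy. This is an illustration of the inverted-dropout variant, where each unit is dropped with probability `p` during training and the survivors are divided by `1 - p`, so that nothing needs to change at inference. The function name and the toy activations are assumptions made for the example.

```python
import numpy as np

def dropout_forward(h, p, training, rng):
    """Inverted dropout: drop each unit with probability p, rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return h  # inference: identity, no rescaling needed
    keep_mask = rng.random(h.shape) >= p   # True with probability 1 - p
    return h * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones((10_000, 64))   # toy activations, all equal to 1
p = 0.4

h_train = dropout_forward(h, p, training=True, rng=rng)
h_eval = dropout_forward(h, p, training=False, rng=rng)

print(f"fraction of units zeroed : {(h_train == 0).mean():.3f}")   # close to p
print(f"mean activation (train)  : {h_train.mean():.3f}")          # close to 1.0
print(f"mean activation (eval)   : {h_eval.mean():.3f}")
```

Because the rescaling keeps the expected activation unchanged, the layers after the dropout layer see inputs on the same scale in training and at inference.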
import torch
import torch.nn as nn
import torch.optim as optim
class MLPWithDropout(nn.Module):
    def __init__(self, dropout_rate=0.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(20, 256), nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(256, 1), nn.Sigmoid()
        )

    def forward(self, x):
        return self.net(x).squeeze()

def train_eval(dropout_rate, n_epochs=150, seed=42):
    torch.manual_seed(seed)
    # Small synthetic task: the label depends only on the first two features
    X = torch.randn(250, 20)
    y = (X[:, 0] + X[:, 1] > 0).float()
    X_tr, X_te = X[:200], X[200:]
    y_tr, y_te = y[:200], y[200:]
    model = MLPWithDropout(dropout_rate)
    opt = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
    for _ in range(n_epochs):
        model.train()   # dropout active
        opt.zero_grad()
        nn.BCELoss()(model(X_tr), y_tr).backward()
        opt.step()
    model.eval()        # dropout disabled for evaluation
    with torch.no_grad():
        train_acc = ((model(X_tr) > 0.5) == y_tr.bool()).float().mean().item()
        test_acc = ((model(X_te) > 0.5) == y_te.bool()).float().mean().item()
    return train_acc, test_acc

print(f"{'Dropout rate':<15} {'Train acc':>12} {'Test acc':>12} {'Gap':>10}")
print('-' * 52)
for rate in [0.0, 0.2, 0.4, 0.6]:
    tr, te = train_eval(rate)
    print(f"{rate:<15.1f} {tr:>12.4f} {te:>12.4f} {tr-te:>10.4f}")
Early Stopping
Stop training when validation error stops improving, and keep the checkpoint with the best validation score. Why this acts as a regularizer:
- For linear models trained with gradient descent: early stopping behaves approximately like L2 regularization, with fewer steps corresponding to a stronger penalty
- For neural networks: it keeps the weights close to their initialization, which limits the effective complexity of the solution
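In practice the stopping rule is usually implemented with a patience counter. The sketch below shows one common variant, written against a plain list of validation losses rather than a real training loop. The function name, the `patience` and `min_delta` parameters, and the loss values are all illustrative.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses)

val_losses = [0.90, 0.70, 0.55, 0.50, 0.48, 0.49, 0.50, 0.52, 0.55, 0.60]
best, stopped = early_stopping_epoch(val_losses, patience=3)
print(f"restore checkpoint from epoch {best}, training halted at epoch {stopped}")
```

The two return values differ on purpose: training runs `patience` epochs past the best point before giving up, so the weights to deploy are the ones saved at `best_epoch`, not the final ones.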
import numpy as np
import matplotlib.pyplot as plt
epochs = np.arange(1, 201)
train_loss = 0.8 * np.exp(-epochs / 40) + 0.02
val_loss = train_loss + 0.0003 * (epochs - 70).clip(0) ** 1.6
best_epoch = np.argmin(val_loss) + 1
plt.figure(figsize=(10, 4))
plt.plot(epochs, train_loss, 'b-', lw=2, label='Train loss')
plt.plot(epochs, val_loss, 'r-', lw=2, label='Val loss')
plt.axvline(best_epoch, color='green', ls='--', lw=2,
            label=f'Early stop at epoch {best_epoch}')
plt.fill_between(epochs[best_epoch:], train_loss[best_epoch:], val_loss[best_epoch:],
                 alpha=0.15, color='red', label='Overfitting zone')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Early Stopping: Optimal Training Checkpoint')
plt.legend(); plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('early_stopping.png', dpi=150)
print(f"Best checkpoint: epoch {best_epoch}")
Part 4 - Underfitting: Causes and Fixes
Underfitting: model too simple to capture the true signal. Both train and test error are high.
Signature on learning curve:
Train error
Val error
│
│ val ─────────────────────
│ train ────────────────────
│ (both plateau at HIGH error regardless of n)
└──────────────────────────→ n
Fix: More capacity, better features, less regularization.
| Cause | Fix |
|---|---|
| Model too simple | Increase capacity (depth, width, polynomial degree) |
| Insufficient training | Train longer; tune the learning rate (too high or too low can both stall progress) |
| Too strong regularization | Lower λ, reduce dropout rate |
| Wrong model family | Use CNN for images, LSTM/Transformer for sequences |
| Poor features | Feature engineering or representation learning |
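The train/validation signature described above can be turned into a rough triage helper. This is a sketch only: the function name, the `target_error` argument (the error level you consider acceptable for the task), and the `gap_tolerance` threshold are hypothetical and should be set per problem.

```python
def diagnose_fit(train_error, val_error, target_error, gap_tolerance=0.05):
    """Label a (train, val) error pair; thresholds are illustrative, not universal."""
    high_bias = train_error > target_error                      # cannot even fit the training set
    high_variance = (val_error - train_error) > gap_tolerance   # large train/val gap
    if high_bias and high_variance:
        return "underfitting and overfitting"
    if high_bias:
        return "underfitting"
    if high_variance:
        return "overfitting"
    return "good fit"

print(diagnose_fit(train_error=0.30, val_error=0.32, target_error=0.10))
print(diagnose_fit(train_error=0.01, val_error=0.32, target_error=0.10))
print(diagnose_fit(train_error=0.08, val_error=0.10, target_error=0.10))
```

When both errors are high and close together, the fixes in the table apply. When training error is low and the gap is large, the regularization toolkit from Part 3 is the better starting point.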
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_friedman1
np.random.seed(42)
X, y = make_friedman1(n_samples=500, n_features=10, noise=1.0, random_state=42)
models = {
    'Linear Ridge (underfits - Friedman is nonlinear)': Pipeline([
        ('s', StandardScaler()), ('m', Ridge(alpha=1.0))
    ]),
    'Gradient Boosting depth=2': GradientBoostingRegressor(max_depth=2, random_state=42),
    'Gradient Boosting depth=4': GradientBoostingRegressor(max_depth=4, random_state=42),
    'Gradient Boosting depth=6': GradientBoostingRegressor(max_depth=6, random_state=42),
}
print(f"{'Model':<48} {'CV MSE':>10} {'Std':>8}")
print('-' * 68)
for name, model in models.items():
    # scikit-learn reports negated MSE, so flip the sign back
    scores = -cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    print(f"{name:<48} {scores.mean():>10.3f} {scores.std():>8.3f}")
print("\nLinear model cannot capture Friedman's nonlinear interactions → underfitting")
Part 5 - Distribution Shift: The Production Generalization Problem
Classical generalization theory assumes that training and deployment data come from the same distribution. In production, this assumption is routinely violated.
Covariate shift - the feature distribution $P(x)$ changes while $P(y \mid x)$ stays the same:
- Train on summer data, deploy in winter; train on US, deploy in EU
- Fix: importance weighting - reweight training samples by $w(x) = p_{\text{deploy}}(x) / p_{\text{train}}(x)$
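A small simulation shows what importance weighting buys. The sketch assumes that both feature densities are known Gaussians, which is never true in practice (the density ratio is usually estimated, for example with a classifier that separates training from deployment samples). The toy loss and all numeric settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_pdf(x, mean, std=1.0):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Assumed for illustration: train features ~ N(0, 1), deploy features ~ N(0.5, 1).
x_train = rng.normal(loc=0.0, scale=1.0, size=20_000)
weights = gaussian_pdf(x_train, mean=0.5) / gaussian_pdf(x_train, mean=0.0)

# Any per-example quantity works here; this is a toy "loss" that depends on x.
loss = (x_train - 1.0) ** 2

unweighted = loss.mean()
weighted = np.sum(weights * loss) / np.sum(weights)   # self-normalized estimate

x_deploy = rng.normal(loc=0.5, scale=1.0, size=20_000)
deploy_loss = ((x_deploy - 1.0) ** 2).mean()

print(f"training-distribution loss    : {unweighted:.3f}")
print(f"importance-weighted estimate  : {weighted:.3f}")
print(f"loss measured on deploy sample: {deploy_loss:.3f}")
```

The weighted estimate tracks the deployment loss using only training samples. The same weights can be passed as per-sample weights during training, with the caveat that a few very large weights can make the estimate noisy when the two distributions barely overlap.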
Label shift - the class proportions $P(y)$ change while $P(x \mid y)$ stays the same:
- Fraud rate drops from 5% (training) to 0.1% (production)
- Fix: recalibrate posterior probabilities using new prior
Concept drift - $P(y \mid x)$ itself changes as the world evolves:
- User preferences change; fraud patterns adapt; language usage shifts
- Fix: periodic retraining, online learning, sliding-window models
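For the label-shift case, recalibrating to a new prior is a direct application of Bayes' rule. The sketch below handles the binary case and assumes pure label shift, where the class-conditional feature distributions are unchanged and only the class prior moves. The function name is hypothetical. The 5% and 0.1% rates are the ones from the fraud example above.

```python
def adjust_for_new_prior(p_train_posterior, train_prior, deploy_prior):
    """Rescale a binary posterior P(y=1|x) from the training prior to a new prior.

    Assumes pure label shift: P(x|y) is unchanged, only P(y) moves.
    """
    pos = p_train_posterior * deploy_prior / train_prior
    neg = (1.0 - p_train_posterior) * (1.0 - deploy_prior) / (1.0 - train_prior)
    return pos / (pos + neg)

# Fraud rate 5% at training time, 0.1% in production.
for score in (0.10, 0.50, 0.90):
    adjusted = adjust_for_new_prior(score, train_prior=0.05, deploy_prior=0.001)
    print(f"model score {score:.2f} -> adjusted probability {adjusted:.4f}")
```

The ranking of examples is unchanged by this correction, so AUC stays the same. What changes is calibration, and with it any decision threshold expressed as a probability.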
import numpy as np
from scipy.stats import ks_2samp
np.random.seed(42)
X_train = np.random.randn(1000, 10)
# Covariate shift: features shift by 0.5 on average
X_deploy_shifted = np.random.randn(200, 10) + 0.5
print("KS Test for Distribution Shift Per Feature")
print(f"{'Feature':<12} {'KS stat':>10} {'p-value':>12} {'Shift?':>10}")
print('-' * 46)
for i in range(X_train.shape[1]):
    # Two-sample KS test: training feature vs the same feature in deployment
    stat, p = ks_2samp(X_train[:, i], X_deploy_shifted[:, i])
    shift = 'YES ⚠' if p < 0.05 else 'no'
    print(f"Feature {i:<4} {stat:>10.4f} {p:>12.4f} {shift:>10}")
print("\nFeatures with p < 0.05 show statistically significant distribution shift.")
Monitoring for Drift in Production
from collections import deque
import numpy as np
from scipy.stats import ks_2samp
class DriftMonitor:
    """
    Sliding-window drift detector for production ML systems.
    Compares recent predictions to a baseline using KS test.
    """
    def __init__(self, window_size=500, alpha=0.05):
        self.window = deque(maxlen=window_size)  # keeps only the most recent predictions
        self.baseline = None
        self.alpha = alpha

    def set_baseline(self, predictions):
        self.baseline = np.array(predictions)

    def log(self, prediction):
        self.window.append(prediction)

    def check(self):
        # Require a baseline and at least 50 recent predictions before testing
        if len(self.window) < 50 or self.baseline is None:
            return {'ready': False}
        stat, p = ks_2samp(self.baseline, list(self.window))
        return {
            'ready': True,
            'drift_detected': p < self.alpha,
            'ks_stat': round(stat, 4),
            'p_value': round(p, 4),
            'baseline_mean': round(float(self.baseline.mean()), 4),
            'current_mean': round(float(np.mean(self.window)), 4),
        }
# Simulate stable period, then shift
monitor = DriftMonitor(window_size=300)
monitor.set_baseline(np.random.beta(2, 5, 1000)) # stable baseline
for _ in range(100): monitor.log(np.random.beta(2, 5)) # stable
print("After 100 stable predictions:")
print(monitor.check())
for _ in range(300): monitor.log(np.random.beta(5, 2)) # shifted
print("\nAfter 300 shifted predictions:")
print(monitor.check())
Part 6 - Evaluation Pipelines That Catch Generalization Problems
Wrong evaluation: all data → random 80/20 split
- Works for i.i.d. data
- FAILS for:
  - Time series (data leakage)
  - User data (group leakage)
  - Geographic data
  - Medical data (patient leakage)

Right evaluation:
- Temporal split:
  - Train: t < cutoff
  - Val: cutoff ≤ t < cutoff+Δ
  - Test: t ≥ cutoff+Δ
- Group split:
  - Train: users A–G
  - Test: users H–Z
| Split | Purpose | How to set size |
|---|---|---|
| Train | Fit parameters | 60–80% of data |
| Validation | Select hyperparameters | 10–20% |
| Test | Final estimate, used ONCE | 10–20%; minimum 1,000 examples for stable estimates |
Data leakage checklist (Lesson 10 covers in depth):
- No future information in features for time-series tasks
- No ID columns or proxy-ID columns as features
- Scaling/normalization fit on train-only, applied to val/test
- No rows from the same entity (user, patient) in both train and test
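Both split strategies are short to implement by hand. The sketch below uses only the standard library and works on a list of dictionaries. The record layout, the field names `patient_id` and `t`, and the hash-based assignment are assumptions made for the example (scikit-learn's group-aware splitters are an alternative when the data is already in arrays).

```python
import hashlib

def group_split(records, group_key, test_fraction=0.2):
    """Assign whole groups to train or test using a stable hash of the group ID."""
    train, test = [], []
    for rec in records:
        digest = hashlib.sha256(str(rec[group_key]).encode("utf-8")).hexdigest()
        bucket = (int(digest, 16) % 1000) / 1000.0   # deterministic value in [0, 1)
        (test if bucket < test_fraction else train).append(rec)
    return train, test

def temporal_split(records, time_key, cutoff, gap):
    """Train: t < cutoff; Val: cutoff <= t < cutoff + gap; Test: t >= cutoff + gap."""
    train = [r for r in records if r[time_key] < cutoff]
    val = [r for r in records if cutoff <= r[time_key] < cutoff + gap]
    test = [r for r in records if r[time_key] >= cutoff + gap]
    return train, val, test

# Hypothetical records: 50 patients, 4 visits each, with an integer timestamp.
records = [{"patient_id": f"p{i}", "t": visit * 10 + i % 7}
           for i in range(50) for visit in range(4)]

train, test = group_split(records, "patient_id")
overlap = {r["patient_id"] for r in train} & {r["patient_id"] for r in test}
print(f"group split: {len(train)} train rows, {len(test)} test rows, "
      f"shared patients: {len(overlap)}")

tr, va, te = temporal_split(records, "t", cutoff=20, gap=10)
print(f"temporal split: {len(tr)} train, {len(va)} val, {len(te)} test")
```

Hashing the group ID keeps the assignment stable across reruns and across newly arriving rows, so a patient never migrates from train to test when the dataset is refreshed. The realized test fraction is only approximately `test_fraction`.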
Recommended Resources
:::tip Video Resources
- StatQuest - Overfitting: Fitting the noise. Visual walkthrough of overfitting in decision trees and polynomial regression. (~7 min)
- StatQuest - Regularization Part 1 - Ridge. The clearest explanation of L2 regularization and why it works. (~10 min)
- StatQuest - Regularization Part 2 - Lasso. L1 sparsity geometry and the diamond constraint set. (~8 min)
- Andrej Karpathy - Lecture: Training Neural Networks (Stanford CS231n). Covers dropout, batch norm, weight decay in practical depth. (~1h 20min)
:::
Interview Questions
Q1: What is the generalization gap and what determines its size?
The generalization gap is $R(h) - \hat{R}(h)$ - the difference between true expected error and empirical training error. PAC theory bounds it as $O\left(\sqrt{(\ln|\mathcal{H}| + \ln(1/\delta)) / n}\right)$ for finite classes, or using Rademacher complexity for infinite classes. Practically, the gap is larger when: (1) model complexity is high relative to training data, (2) labels are noisy (high noise variance $\sigma^2$), (3) training data is small, (4) the test set was used to make model decisions (contamination). It is reduced by: more training data, regularization, ensembling, and careful evaluation protocol.
Q2: Explain the difference between L1 and L2 regularization geometrically and practically.
Both add a constraint on parameters: L2 constrains $\lVert w \rVert_2 \le c$ (a sphere); L1 constrains $\lVert w \rVert_1 \le c$ (a hyperdiamond). The optimal solution is where the loss function's level sets touch the constraint set. The L2 sphere has no corners - solutions occur anywhere on the surface, giving small-but-nonzero weights. The L1 hyperdiamond has corners on the coordinate axes - the optimum often lies at a corner, giving exact zeros (sparsity). Use L1 when many features are irrelevant (automatic feature selection); use L2 when all features matter and you want smooth shrinkage. ElasticNet combines both: L1 for sparsity + L2 for grouping correlated features.
Q3: You have training accuracy 99%, test accuracy 68%. Walk through diagnosis and remediation.
Step 1: Check for data leakage - is future information in features? Is the test split representative? Is preprocessing fit on the full dataset including test? Step 2: Confirm distributions match - run KS tests on key features. Step 3: If both checks pass, this is high variance (overfitting). The 31-point gap is the signature. Remediation in priority order: (1) More training data - most reliable; (2) Add regularization - L2 weight decay, dropout; (3) Reduce model complexity - shallower tree, fewer layers; (4) Ensembling - bagging reduces variance without materially increasing bias; (5) Feature selection - remove noisy features. Measure each fix using cross-validation, not the test set.
Q4: What is distribution shift and how do you detect it in production?
Distribution shift: $P_{\text{train}}(x, y) \neq P_{\text{deploy}}(x, y)$. Types: covariate shift ($P(x)$ changes, $P(y \mid x)$ stable), label shift ($P(y)$ changes), concept drift ($P(y \mid x)$ changes). Detection: (1) Feature drift - KS test comparing training vs recent production feature distributions; (2) Prediction drift - monitor model output score distribution; (3) Performance monitoring - if delayed labels are available, track accuracy/AUC over time windows; (4) Data quality checks - missing rates, cardinality changes, value range violations. Automatic alerting when the KS statistic exceeds a threshold or PSI (Population Stability Index) > 0.2 is a common production setup.
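PSI is simple enough to compute by hand. The sketch below uses the usual definition, $\sum_i (c_i - b_i)\ln(c_i / b_i)$ over bins, where $b_i$ and $c_i$ are the baseline and current fractions in bin $i$. The choice of ten quantile bins taken from the baseline and the small `eps` floor for empty bins are common conventions assumed here, not requirements.

```python
import numpy as np

def population_stability_index(baseline, current, n_bins=10, eps=1e-6):
    """PSI between two 1-D samples, using quantile bins derived from the baseline."""
    # Interior bin edges; values beyond them fall into the first or last bin.
    interior = np.quantile(baseline, np.linspace(0.0, 1.0, n_bins + 1))[1:-1]
    base_counts = np.bincount(np.searchsorted(interior, baseline), minlength=n_bins)
    curr_counts = np.bincount(np.searchsorted(interior, current), minlength=n_bins)
    base_frac = np.clip(base_counts / len(baseline), eps, None)   # avoid log(0)
    curr_frac = np.clip(curr_counts / len(current), eps, None)
    return float(np.sum((curr_frac - base_frac) * np.log(curr_frac / base_frac)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, size=5000)
same = rng.normal(0.0, 1.0, size=5000)       # no shift
shifted = rng.normal(1.0, 1.0, size=5000)    # mean moved by one standard deviation

psi_same = population_stability_index(baseline, same)
psi_shifted = population_stability_index(baseline, shifted)
print(f"PSI, same distribution   : {psi_same:.4f}")
print(f"PSI, shifted distribution: {psi_shifted:.4f}")
```

Unlike a KS p-value, PSI does not shrink toward "significant" just because the sample is large, which is one reason it is popular for alert thresholds on high-volume features.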
Q5: What does "overfitting to the validation set" mean, and how does it happen in practice?
When hyperparameters are selected based on validation performance across many iterations (grid search, random search, Bayesian optimization), the selected configuration is partially "lucky" on that specific validation partition. After $k$ evaluations on a validation set of size $n$, the best-seen validation score can overfit by roughly $O\left(\sqrt{\log k / n}\right)$. This is why: (1) the test set must remain untouched until the very end; (2) nested cross-validation is needed when both model selection and performance estimation are required; (3) search methods that need fewer evaluations, such as Bayesian HPO, tend to overfit the validation set less than an exhaustive grid search. In Kaggle competitions, public leaderboard overfitting is rampant - teams submit many times until lucky on the public test shard, then fail on the private leaderboard.
