Cross-Validation
Reading time: ~38 minutes | Level: ML Foundations | Role: MLE, Data Scientist, MLOps, Research Engineer
You present your credit fraud detector to the team. Test accuracy: 98.7%. Everyone is impressed. Then someone asks: "How did you split the data?" You say: "One random 80/20 split." They pause.
The dataset has 500,000 transactions from 50,000 customers - an average of 10 transactions per customer. With a random split, most customers have transactions in both training and test, so the model can memorise customer-level patterns that leak from training into evaluation. The accuracy on genuinely new customers may be closer to 93% - still above the baseline, but the gap matters enormously for fraud.
This is the cross-validation problem. Not just "use k-Fold instead of one split" - it's about understanding what question your evaluation is answering, and whether that question matches your deployment reality. This lesson makes you fluent in every CV variant, its assumptions, its failure modes, and when each one applies.
What You Will Learn
- Why single train/test splits give high-variance, potentially biased estimates
- The mathematical variance analysis of k-Fold vs LOOCV
- Stratified, repeated, group, and leave-p-out CV variants
- Time-series cross-validation with temporal integrity - walk-forward and blocked CV
- Nested CV for simultaneous hyperparameter tuning and unbiased evaluation
- Data leakage: how it enters, how to detect it, how to prevent it in pipelines
- Computational trade-offs and the right CV choice for each data type
- Eight interview Q&As at senior MLE/research engineer level
Part 1 - The Problem: Why One Split Isn't Enough
Imagine you're estimating the accuracy of a coin flip.
You flip it 5 times and get 3 heads → "60% heads"
That's one sample from a distribution.
Flip 5 more times, you might get 40%, 80%, 60%, ...
A single 80/20 train/test split is the same thing.
You got one estimate from one particular random partition.
If you reshuffle the data, you get a different estimate.
The true generalisation error could be anywhere in a wide CI.
Empirical demonstration of split variance:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
X, y = make_classification(
n_samples=500, n_features=20, n_informative=10,
n_redundant=5, random_state=None # no fixed seed - show true variance
)
# Run 50 different random splits
scores_80_20 = []
scores_60_40 = []
for seed in range(50):
    model = LogisticRegression(max_iter=500, random_state=seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model.fit(X_train, y_train)
    scores_80_20.append(accuracy_score(y_test, model.predict(X_test)))
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=seed)
    model.fit(X_train, y_train)
    scores_60_40.append(accuracy_score(y_test, model.predict(X_test)))
print("Single split variance (50 random seeds):")
print(f" 80/20 split: mean={np.mean(scores_80_20):.4f}, std={np.std(scores_80_20):.4f}")
print(f" range=[{np.min(scores_80_20):.4f}, {np.max(scores_80_20):.4f}]")
print(f" 60/40 split: mean={np.mean(scores_60_40):.4f}, std={np.std(scores_60_40):.4f}")
print(f" range=[{np.min(scores_60_40):.4f}, {np.max(scores_60_40):.4f}]")
# Larger test set reduces variance but increases bias (less training data)
# CV solves both by averaging over non-overlapping test sets
The fundamental tension in a single split:
Small test set (e.g., 10%):
✓ Model trained on more data → lower bias
✗ Small test set → high variance in the estimate
Large test set (e.g., 40%):
✓ Large test set → lower variance in the estimate
✗ Model trained on less data → higher bias
Cross-validation resolves this by using ALL data for both training and testing
(in different folds), then averaging to reduce variance.
Part 2 - k-Fold Cross-Validation
The Core Algorithm
DATA: [●●●●●●●●●●●●●●●●●●●●] (n samples, shuffled)
Divide into k equal-sized folds:
[ F1 | F2 | F3 | F4 | F5 ]
Iteration 1: Train on F2+F3+F4+F5, Evaluate on F1 → score_1
Iteration 2: Train on F1+F3+F4+F5, Evaluate on F2 → score_2
Iteration 3: Train on F1+F2+F4+F5, Evaluate on F3 → score_3
Iteration 4: Train on F1+F2+F3+F5, Evaluate on F4 → score_4
Iteration 5: Train on F1+F2+F3+F4, Evaluate on F5 → score_5
Final estimate: μ = mean(score_1...5)
Uncertainty: σ = std(score_1...5)
Approx. 95% band: μ ± 2σ (heuristic: fold scores are correlated, so this is not an exact CI)
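The loop in the diagram can be written out by hand. The following is a minimal sketch, assuming a synthetic dataset from make_classification and a logistic regression as a stand-in model; the names X_demo, y_demo and fold_scores are illustrative and not part of the fraud example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Illustrative data and model (assumptions, chosen only for this demo)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
tested = []  # collect every index that was used as test data
for train_idx, test_idx in kf.split(X_demo):
    clf = LogisticRegression(max_iter=500)
    clf.fit(X_demo[train_idx], y_demo[train_idx])                      # train on k-1 folds
    fold_scores.append(clf.score(X_demo[test_idx], y_demo[test_idx]))  # evaluate on the held-out fold
    tested.extend(test_idx.tolist())

fold_scores = np.array(fold_scores)
print(f"mean={fold_scores.mean():.4f}, std={fold_scores.std():.4f}")
print(f"every sample tested exactly once: {sorted(tested) == list(range(len(X_demo)))}")
```

Each sample lands in exactly one test fold, which is what lets k-Fold use all of the data for evaluation without ever scoring a model on a sample it was trained on.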
Mathematical Bias-Variance of k-Fold
Each fold uses a fraction $(k-1)/k$ of the data for training. As $k$ increases:
- Bias decreases: each training set is closer to size $n$ → closer to the performance of the final model trained on all data
- Variance increases: each test fold is smaller ($n/k$ samples) → noisier individual estimates
The variance of the k-Fold estimator is approximately:
$$\mathrm{Var}\left(\hat{\mu}_{CV}\right) = \frac{1}{k^2}\left[\sum_{i=1}^{k}\mathrm{Var}(s_i) + \sum_{i \neq j}\mathrm{Cov}(s_i, s_j)\right]$$
where $s_i$ is the score on fold $i$. The covariance terms are positive (train sets overlap) and prevent variance from decreasing as fast as independent samples would. This is why k=10 doesn't give 2× better estimates than k=5.
from sklearn.model_selection import cross_val_score, KFold, cross_validate
from sklearn.svm import SVC
from sklearn.datasets import make_classification
import numpy as np
X, y = make_classification(n_samples=500, n_features=20,
n_informative=10, random_state=42)
model = SVC(kernel='rbf', C=1.0)
# Compare different k values
print(f"{'k':>5} {'Mean':>10} {'Std':>10} {'95% CI Width':>14}")
print("-" * 45)
for k in [3, 5, 10, 20]:
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
    ci_width = 4 * scores.std()  # ±2σ → full width
    print(f"{k:>5} {scores.mean():>10.4f} {scores.std():>10.4f} {ci_width:>14.4f}")
Getting Full Diagnostic Information
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Always use Pipelines to prevent data leakage
pipe = Pipeline([
('scaler', StandardScaler()),
('clf', SVC(kernel='rbf', C=1.0))
])
cv = KFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(
pipe, X, y,
cv=cv,
scoring={
'accuracy': 'accuracy',
'f1': 'f1_weighted',
'roc_auc': 'roc_auc',
'precision': 'precision_weighted',
'recall': 'recall_weighted',
},
return_train_score=True, # detect overfitting
return_estimator=True, # access fitted pipelines
n_jobs=-1 # parallel execution
)
print("\n5-Fold CV Results (train vs test - diagnose overfitting):")
for metric in ['accuracy', 'f1', 'roc_auc']:
    train_m = results[f'train_{metric}']
    test_m = results[f'test_{metric}']
    gap = train_m.mean() - test_m.mean()
    print(f"\n{metric}:")
    print(f"  Train: {train_m.mean():.4f} ± {train_m.std():.4f}")
    print(f"  Test:  {test_m.mean():.4f} ± {test_m.std():.4f}")
    print(f"  Gap:   {gap:.4f} {'← possible overfit' if gap > 0.05 else '← healthy'}")
# Access the fitted estimators from each fold (e.g., for feature importance)
for i, estimator in enumerate(results['estimator']):
    svm = estimator.named_steps['clf']
    print(f"Fold {i+1} support vectors: {svm.n_support_}")
Part 3 - Stratified k-Fold
Random k-Fold can produce folds with very unequal class distributions. With class imbalance (e.g., 95% negative, 5% positive), some folds might contain zero positive examples - making the fold's score meaningless.
Stratified k-Fold ensures each fold preserves the original class ratio.
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.datasets import make_classification
import numpy as np
# Imbalanced dataset: 90% class 0, 10% class 1
X, y = make_classification(n_samples=500, n_features=20,
weights=[0.9, 0.1], random_state=42)
print(f"Overall positive rate: {y.mean():.3f}")
print()
# Compare random vs stratified fold composition
for cv_name, cv in [('KFold', KFold(n_splits=5, shuffle=True, random_state=42)),
                    ('Stratified', StratifiedKFold(n_splits=5, shuffle=True, random_state=42))]:
    print(f"{cv_name} fold compositions:")
    for i, (train_idx, test_idx) in enumerate(cv.split(X, y)):
        train_pos = y[train_idx].mean()
        test_pos = y[test_idx].mean()
        print(f"  Fold {i+1}: train_pos={train_pos:.3f}, test_pos={test_pos:.3f}")
    print()
# StratifiedKFold → each fold should show ~0.10 positive rate
# KFold → positive rate varies - some folds may have 0.05 or 0.18
Rule of thumb: For classification, always use StratifiedKFold. For regression, use KFold. Scikit-learn's cross_val_score uses StratifiedKFold automatically when cv is an integer (or None), the estimator is a classifier, and y is binary or multiclass - but that default does not shuffle, so be explicit when you need a shuffled or seeded split.
from sklearn.model_selection import cross_val_score
# sklearn auto-selects Stratified for classifiers
scores = cross_val_score(
SVC(), X, y,
cv=5, # integer → auto StratifiedKFold for classifiers
scoring='f1_weighted'
)
print(f"Auto-stratified scores: {scores}")
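The automatic choice can be inspected directly. The sketch below uses sklearn.model_selection.check_cv, which to my understanding is the helper scikit-learn relies on to turn an integer cv into a splitter; the dataset and variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

# Small illustrative binary dataset (assumption, only needed to give check_cv a target)
X_chk, y_chk = make_classification(n_samples=100, n_features=5, random_state=0)

# classifier=True mimics what happens when the estimator is a classifier
cv_for_classifier = check_cv(5, y_chk, classifier=True)
# classifier=False mimics the regressor case
cv_for_regressor = check_cv(5, y_chk, classifier=False)

print(type(cv_for_classifier).__name__, "shuffle =", cv_for_classifier.shuffle)
print(type(cv_for_regressor).__name__, "shuffle =", cv_for_regressor.shuffle)
```

Neither default shuffles the data, which is one more reason to pass an explicit splitter with shuffle=True and a fixed random_state when the row order of the dataset is not random.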
Part 4 - Repeated k-Fold
A single run of k-Fold has variance from the particular shuffle chosen. Running k-Fold $r$ times with different seeds averages over this shuffle-induced variance:
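from sklearn.model_selection import (RepeatedStratifiedKFold, RepeatedKFold,
                                     StratifiedKFold, cross_val_score)
from sklearn.svm import SVC
import numpy as np
# 5 folds × 10 repeats = 50 total evaluations
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(SVC(), X, y, cv=rskf, scoring='accuracy', n_jobs=-1)
print(f"Repeated 5×10 CV: {scores.mean():.4f} ± {scores.std():.4f}")
print(f"Approx. ±2σ band of fold scores: ({scores.mean() - 2*scores.std():.4f}, {scores.mean() + 2*scores.std():.4f})")
print(f"n_evaluations: {len(scores)}")
# Splits are generated repeat by repeat, so each row below is one complete 5-fold run
repeat_means = scores.reshape(10, 5).mean(axis=1)
print(f"Std of the 10 per-repeat means (shuffle-induced): {repeat_means.std():.4f}")
# Compare: a single 5-fold run is just one draw from that distribution of means
kf_scores = cross_val_score(SVC(), X, y,
                            cv=StratifiedKFold(5, shuffle=True, random_state=42),
                            scoring='accuracy')
print(f"\nSingle 5-Fold mean: {kf_scores.mean():.4f} (fold std {kf_scores.std():.4f})")
print(f"Repeated 5×10 mean: {scores.mean():.4f} (fold std {scores.std():.4f})")
# The fold-level std is similar in both cases; repeating stabilises the MEAN estimate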
When to use Repeated CV: When n < 1000 and you need a reliable estimate for a critical model selection decision. Overkill for large datasets - the estimate is already stable.
Part 5 - Leave-One-Out CV (LOOCV)
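Extreme case of k-Fold where $k = n$. Each sample is its own test set of size 1.
$$\mathrm{CV}_{LOO} = \frac{1}{n}\sum_{i=1}^{n} L\left(y_i, \hat{f}^{(-i)}(x_i)\right)$$
where $\hat{f}^{(-i)}$ is trained on all samples except $(x_i, y_i)$.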
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression
import numpy as np
X_small, y_small = X[:100], y[:100] # LOO only practical for small n
loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(max_iter=300), X_small, y_small,
cv=loo, scoring='accuracy')
print(f"LOOCV: mean={scores.mean():.4f}, std={scores.std():.4f}")
print(f"n_fits: {loo.get_n_splits(X_small)}") # = n_samples
LOOCV: Properties and When to Use
Bias: Minimal - each model trains on n-1 samples
Variance: High - each test set has exactly 1 sample;
the score (0 or 1) is binary → high variance
(this is paradoxically worse than 5-fold for small n)
Cost: n model fits - prohibitive for n > 1000
When to use:
✓ n < 50-100 samples
✓ You need the lowest-bias estimate
✓ Model training is fast (linear models)
✗ Never for neural networks or ensemble models with large n
Shortcut for linear models: LOOCV has an efficient closed-form!
LOOCV Shortcut for Linear Regression
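For linear regression, LOOCV can be computed without re-fitting models:
$$\mathrm{CV}_{LOO} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{e_i}{1 - h_{ii}}\right)^2$$
where $e_i = y_i - \hat{y}_i$ is the residual from the full model, and $h_{ii}$ is the $i$-th diagonal of the hat matrix $H = X(X^\top X)^{-1}X^\top$ (with $X$ including a column of ones when the model fits an intercept).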
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
from sklearn.model_selection import LeaveOneOut, cross_val_score
X_lin, y_lin = make_regression(n_samples=100, n_features=5, noise=10, random_state=42)
# Method 1: Brute-force LOO (n fits)
loo_scores = cross_val_score(LinearRegression(), X_lin, y_lin,
                             cv=LeaveOneOut(), scoring='neg_mean_squared_error')
loocv_mse_brute = -loo_scores.mean()
# Method 2: Efficient formula (1 fit)
model = LinearRegression().fit(X_lin, y_lin)
y_pred = model.predict(X_lin)
residuals = y_lin - y_pred
# Hat matrix diagonal. LinearRegression fits an intercept, so the design
# matrix needs a column of ones for the shortcut to match the brute-force result.
X_design = np.column_stack([np.ones(len(X_lin)), X_lin])
H = X_design @ np.linalg.pinv(X_design.T @ X_design) @ X_design.T
h_diag = np.diag(H)
loocv_mse_fast = np.mean((residuals / (1 - h_diag))**2)
print(f"Brute-force LOOCV MSE: {loocv_mse_brute:.4f}")
print(f"Efficient LOOCV MSE: {loocv_mse_fast:.4f}")
# The two agree up to floating-point error, but the fast version requires only 1 fit
Part 6 - Group k-Fold
When samples are correlated within groups (patient visits, user sessions, photos of the same object), splitting randomly leaks group-level information.
Scenario: 50 patients, each with 10 medical measurements
WRONG (random KFold):
Train: Patient 3 visit 1,3,5,7,9; Patient 5 visit 1,2,3,4,5 ...
Test: Patient 3 visit 2,4,6,8,10 ← patient-specific features leak!
Your model learns "this is Patient 3's baseline" from training
and generalises that to test. That's not real generalisation - it's
patient-specific memorisation.
RIGHT (GroupKFold):
Fold 1 test: ALL data from Patients 11-20
Fold 1 train: ALL data from Patients 1-10, 21-50
from sklearn.model_selection import GroupKFold, GroupShuffleSplit
import numpy as np
from sklearn.datasets import make_classification
# Simulate patient medical data
n_patients = 50
visits_per_patient = 10
n_samples = n_patients * visits_per_patient
X = np.random.randn(n_samples, 10)
y = np.random.randint(0, 2, n_samples)
groups = np.repeat(np.arange(n_patients), visits_per_patient)
gkf = GroupKFold(n_splits=5)
print("GroupKFold - verifying no group leakage:")
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups)):
    train_groups = set(groups[train_idx])
    test_groups = set(groups[test_idx])
    overlap = train_groups & test_groups
    print(f"  Fold {fold+1}: {len(test_groups)} test patients, "
          f"{len(train_groups)} train patients, "
          f"overlap={len(overlap)} {'← BUG!' if overlap else '← OK'}")
# GroupShuffleSplit for a single random group-aware split (it does not stratify by class)
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups))
print(f"\nGroupShuffleSplit: {len(set(groups[train_idx]))} train patients, "
f"{len(set(groups[test_idx]))} test patients")
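The check above shows that group membership is separated, but not what the leak does to the score. The sketch below makes the effect visible under deliberately extreme assumptions: each hypothetical patient has a distinctive feature "fingerprint" and a label assigned at random per patient, so nothing learned from one patient can transfer to another. The 1-nearest-neighbour model and all variable names are illustrative choices, not part of the lesson's pipeline.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_pat, n_visits = 50, 10
pat_ids = np.repeat(np.arange(n_pat), n_visits)

# Patient-specific feature centre plus small visit-level noise
centers = rng.normal(0, 3, size=(n_pat, 10))
X_grp = centers[pat_ids] + rng.normal(0, 0.5, size=(n_pat * n_visits, 10))
# Label is constant within a patient and random across patients
pat_labels = rng.integers(0, 2, size=n_pat)
y_grp = pat_labels[pat_ids]

knn = KNeighborsClassifier(n_neighbors=1)
random_scores = cross_val_score(
    knn, X_grp, y_grp, cv=KFold(n_splits=5, shuffle=True, random_state=0))
group_scores = cross_val_score(
    knn, X_grp, y_grp, cv=GroupKFold(n_splits=5), groups=pat_ids)

print(f"Random KFold accuracy: {random_scores.mean():.3f}")
print(f"GroupKFold accuracy:   {group_scores.mean():.3f}")
```

Under random KFold the nearest neighbour of a test visit is almost always another visit of the same patient, so the score looks excellent; under GroupKFold the same model has nothing to exploit and falls back towards chance. Real data is rarely this extreme, but the direction of the bias is the same.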
Real-World Group CV Scenarios
Data Type │ Group By │ CV Method
─────────────────────────────┼───────────────────────┼──────────────────
Medical records │ Patient ID │ GroupKFold
User clickstream │ User ID / Session ID │ GroupKFold
NLP: sentence classification │ Document ID │ GroupKFold
Audio: speaker recognition │ Speaker ID │ GroupKFold
Geospatial prediction │ Spatial block │ Spatial CV
Drug discovery (molecules) │ Scaffold cluster │ GroupKFold
Fraud detection │ Account ID │ GroupKFold
Part 7 - Time-Series Cross-Validation
Time series data has a fundamental constraint: you cannot use future information to predict the past. Random k-Fold violates this constraint by placing future timestamps in the training set.
Random KFold on time series - THE WRONG WAY:
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
Fold 2 train: t1 t3 t4 t6 t7 t9
Fold 2 test: t2 t5 t8 t10
↑ The model trains on t3 and predicts t2 → future leaks to past!
Walk-Forward Validation (Expanding Window)
from sklearn.model_selection import TimeSeriesSplit
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
np.random.seed(42)
n = 300
t = np.arange(n)
# Synthetic time series: trend + seasonality + noise
y = 0.05 * t + 10 * np.sin(2 * np.pi * t / 30) + np.random.normal(0, 2, n)
X = np.column_stack([
t,
np.sin(2 * np.pi * t / 30),
np.cos(2 * np.pi * t / 30),
t**2,
])
# TimeSeriesSplit - always train before test
tscv = TimeSeriesSplit(n_splits=5, gap=5) # gap=5 prevents leakage from lag features
fold_scores = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    model = Ridge(alpha=1.0)
    model.fit(X[train_idx], y[train_idx])
    y_pred_fold = model.predict(X[test_idx])
    score = mean_absolute_error(y[test_idx], y_pred_fold)
    fold_scores.append(score)
    print(f"Fold {fold+1}: train=[{train_idx[0]},{train_idx[-1]}], "
          f"test=[{test_idx[0]},{test_idx[-1]}], MAE={score:.3f}")
print(f"\nCV MAE: {np.mean(fold_scores):.4f} ± {np.std(fold_scores):.4f}")
Visualising Time Series CV Splits
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
tscv = TimeSeriesSplit(n_splits=5)
fig, ax = plt.subplots(figsize=(14, 6))
colors = {'train': 'steelblue', 'test': 'tomato', 'gap': 'lightgray'}
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    y_pos = fold + 1
    # Train bar
    ax.barh(y_pos, len(train_idx), left=train_idx[0],
            color=colors['train'], alpha=0.7, height=0.5)
    # Test bar
    ax.barh(y_pos, len(test_idx), left=test_idx[0],
            color=colors['test'], alpha=0.9, height=0.5)
    ax.text(train_idx[-1] + 1, y_pos, f' Fold {fold+1}', va='center', fontsize=9)
ax.set_xlabel('Sample Index (time →)')
ax.set_ylabel('CV Fold')
ax.set_title('TimeSeriesSplit: No Future Leakage')
train_p = mpatches.Patch(color='steelblue', alpha=0.7, label='Train')
test_p = mpatches.Patch(color='tomato', alpha=0.9, label='Test')
ax.legend(handles=[train_p, test_p], loc='lower right')
plt.tight_layout()
Sliding Window (Fixed Training Size)
When older data becomes stale (e.g., user preferences shift rapidly), a fixed training window may outperform the expanding window:
class SlidingWindowCV:
    """
    Time series CV with fixed-size training window.
    Useful when older data is stale or distribution shifts over time.
    """
    def __init__(self, n_splits=5, train_size=100, test_size=20, gap=0):
        self.n_splits = n_splits
        self.train_size = train_size
        self.test_size = test_size
        self.gap = gap

    def split(self, X, y=None, groups=None):
        n = len(X)
        # Spread window starts evenly; max(..., 1) avoids division by zero when n_splits=1
        step = (n - self.train_size - self.test_size - self.gap) // max(self.n_splits - 1, 1)
        for i in range(self.n_splits):
            start = i * step
            train_end = start + self.train_size
            test_start = train_end + self.gap
            test_end = test_start + self.test_size
            if test_end > n:
                break
            train_idx = np.arange(start, train_end)
            test_idx = np.arange(test_start, test_end)
            yield train_idx, test_idx

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits
# Usage
sliding_cv = SlidingWindowCV(n_splits=5, train_size=150, test_size=30, gap=5)
for fold, (train_idx, test_idx) in enumerate(sliding_cv.split(X)):
    print(f"Fold {fold+1}: train [{train_idx[0]}-{train_idx[-1]}] "
          f"→ test [{test_idx[0]}-{test_idx[-1]}] "
          f"(train_size={len(train_idx)})")
Gap Parameter - Why It Matters
# Gap prevents leakage when features use lag windows
# Example: features use last 7 days of data
# Without gap: test point t+1 could have training point t in both
# the train set AND as a feature input for t+1
# With gap=7: no overlap between train set and feature window of test
tscv_no_gap = TimeSeriesSplit(n_splits=5, gap=0)
tscv_with_gap = TimeSeriesSplit(n_splits=5, gap=7)
# With lag features, always set gap >= max_lag
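The rule gap >= max_lag can be checked mechanically. The sketch below assumes a hypothetical feature window of 7 previous observations and counts, for each fold, whether the lag window of the first test point touches any training index; lag_window_overlaps and the other names are illustrative.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

max_lag = 7   # hypothetical: features are built from the previous 7 observations
n_obs = 200
X_ts = np.arange(n_obs).reshape(-1, 1)  # the values are irrelevant, only the indices matter

def lag_window_overlaps(gap):
    """Count folds whose first test point's lag window contains a training index."""
    leaks = 0
    for train_idx, test_idx in TimeSeriesSplit(n_splits=5, gap=gap).split(X_ts):
        first_test = test_idx[0]
        window = set(range(first_test - max_lag, first_test))  # indices t-7 ... t-1
        if window & set(train_idx.tolist()):
            leaks += 1
    return leaks

print("folds with overlap, gap=0:", lag_window_overlaps(0))
print("folds with overlap, gap=7:", lag_window_overlaps(max_lag))
```

With gap=0 the last training index sits directly before the first test index, so every fold overlaps; with gap equal to the lag length the window and the training set are disjoint in every fold.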
Part 8 - Nested Cross-Validation
The Hyperparameter Tuning Bias Problem
Scenario: You tune C={0.1, 1, 10, 100} for SVM using 5-fold CV.
Best C selected: C=10, CV score: 0.93
Now you report 0.93 as your model's generalisation accuracy.
PROBLEM: You used the test folds to SELECT C.
The folds that chose C=10 had some randomness in their favour.
C=10 looks best partially because it got lucky test splits.
The reported 0.93 is optimistically biased.
How biased? Depends on the problem, but 1-5% over-estimation
is common, and can be much larger for small datasets.
Nested CV Architecture
OUTER LOOP - estimates true generalisation error
│
│ Outer fold 1: outer_train = 80%, outer_test = 20%
│ │
│ │ INNER LOOP - selects best hyperparameters on outer_train
│ │ Inner fold 1,2,3: 3-fold CV on outer_train
│ │ → best_C = 10
│ │
│ └─ Retrain with best_C on all outer_train
│ Evaluate on outer_test → score_1
│
│ Outer fold 2, 3, 4, 5 → score_2, 3, 4, 5
│
└─ Final estimate: mean(score_1...5) ← UNBIASED
The outer test sets were NEVER used to select hyperparameters
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.svm import SVC
from sklearn.datasets import make_classification
import numpy as np
X, y = make_classification(n_samples=300, n_features=15,
n_informative=8, random_state=42)
# NESTED CV - gives unbiased generalisation estimate
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)
param_grid = {
'C': [0.01, 0.1, 1.0, 10.0, 100.0],
'gamma': ['scale', 'auto', 0.001, 0.01],
'kernel': ['rbf', 'linear']
}
# GridSearchCV with inner CV is the inner loop
grid_search = GridSearchCV(SVC(), param_grid, cv=inner_cv,
scoring='accuracy', n_jobs=-1)
# cross_val_score with GridSearchCV is the outer loop
nested_scores = cross_val_score(
grid_search, X, y,
cv=outer_cv, scoring='accuracy', n_jobs=-1
)
# BIASED comparison: just report the inner CV best score
grid_search.fit(X, y)
non_nested_score = grid_search.best_score_
print(f"Nested CV estimate (unbiased): {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")
print(f"Inner CV best score (biased): {non_nested_score:.4f}")
print(f"Optimism bias: {non_nested_score - nested_scores.mean():+.4f}")
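The nested architecture can also be unrolled into an explicit loop, which makes it easy to see which hyperparameters each outer fold selected. This is a minimal sketch assuming a reduced grid over C only (to keep it fast); small_grid, outer_scores and chosen_C are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

X_n, y_n = make_classification(n_samples=300, n_features=15,
                               n_informative=8, random_state=42)
outer = KFold(n_splits=5, shuffle=True, random_state=42)
inner = KFold(n_splits=3, shuffle=True, random_state=42)
small_grid = {'C': [0.1, 1.0, 10.0, 100.0]}  # reduced grid, an assumption for speed

outer_scores, chosen_C = [], []
for train_idx, test_idx in outer.split(X_n):
    # Inner loop: the search only ever sees the outer training portion
    search = GridSearchCV(SVC(kernel='rbf'), small_grid, cv=inner, scoring='accuracy')
    search.fit(X_n[train_idx], y_n[train_idx])
    # refit=True (the default) retrains the best configuration on all of outer_train,
    # so score() evaluates that refitted model on the untouched outer test fold
    outer_scores.append(search.score(X_n[test_idx], y_n[test_idx]))
    chosen_C.append(search.best_params_['C'])

print(f"Selected C per outer fold: {chosen_C}")
print(f"Nested CV accuracy: {np.mean(outer_scores):.4f} ± {np.std(outer_scores):.4f}")
```

If the selected C differs across outer folds, that is not a bug: nested CV evaluates the whole tuning procedure, not one fixed hyperparameter setting.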
Nested CV with RandomizedSearchCV (More Practical)
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform, uniform
param_distributions = {
'C': loguniform(1e-3, 1e3),
'gamma': loguniform(1e-4, 1e0),
}
random_search = RandomizedSearchCV(
SVC(kernel='rbf'), param_distributions,
n_iter=20, # 20 random combinations instead of full grid
cv=inner_cv,
scoring='accuracy',
random_state=42,
n_jobs=-1
)
nested_random_scores = cross_val_score(
random_search, X, y,
cv=outer_cv, scoring='accuracy', n_jobs=-1
)
print(f"\nNested RandomizedSearchCV: {nested_random_scores.mean():.4f} ± {nested_random_scores.std():.4f}")
Part 9 - Data Leakage in Cross-Validation
Data leakage causes CV scores to be optimistically biased. It comes in three main forms:
Form 1: Preprocessing Leakage
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
# WRONG: fit scaler on ALL data before CV
scaler_wrong = StandardScaler()
X_scaled_wrong = scaler_wrong.fit_transform(X) # uses test fold statistics!
wrong_scores = cross_val_score(
SVC(), X_scaled_wrong, y,
cv=KFold(5, shuffle=True, random_state=42)
)
# RIGHT: scaler inside Pipeline - fits only on train fold each time
pipe_right = Pipeline([
('scaler', StandardScaler()),
('svm', SVC())
])
right_scores = cross_val_score(
pipe_right, X, y,
cv=KFold(5, shuffle=True, random_state=42)
)
print(f"Wrong (leaky scaler): {wrong_scores.mean():.4f} ± {wrong_scores.std():.4f}")
print(f"Right (Pipeline): {right_scores.mean():.4f} ± {right_scores.std():.4f}")
# On clean data the difference is small - but on high-dimensional, small datasets
# the difference can be massive
Form 2: Feature Selection Leakage
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
# WRONG: select features using all data before CV
selector_wrong = SelectKBest(f_classif, k=10)
X_selected_wrong = selector_wrong.fit_transform(X, y) # uses test labels!
wrong_fs_scores = cross_val_score(
SVC(), X_selected_wrong, y,
cv=KFold(5, shuffle=True, random_state=42)
)
# RIGHT: feature selection inside Pipeline
pipe_fs = Pipeline([
('selector', SelectKBest(f_classif, k=10)),
('scaler', StandardScaler()),
('svm', SVC())
])
right_fs_scores = cross_val_score(
pipe_fs, X, y,
cv=KFold(5, shuffle=True, random_state=42)
)
print(f"Wrong (leaky selection): {wrong_fs_scores.mean():.4f}")
print(f"Right (Pipeline): {right_fs_scores.mean():.4f}")
# On noisy datasets this gap can be 5-20%!
Form 3: Target Leakage
import pandas as pd
import numpy as np
# Scenario: predicting customer churn
# A feature "days_until_churn" leaks the target directly
np.random.seed(42)
n = 1000
churned = np.random.randint(0, 2, n)
# Legitimate features
df = pd.DataFrame({
'tenure_months': np.random.exponential(24, n),
'monthly_charge': np.random.normal(50, 15, n),
'support_calls': np.random.poisson(2, n),
# LEAKY FEATURE: computed after churn decision
'days_since_last_login': np.where(churned, np.random.exponential(60, n),
np.random.exponential(5, n)),
})
# "days_since_last_login" looks like a feature but it's derived FROM churn
# After churn: customers stop logging in → high days_since_login
# → Including it gives falsely high accuracy
print("Target leakage is the hardest to detect.")
print("Always ask: 'Is this feature available at the time of prediction?'")
print("If it's only known AFTER the target occurs → leaky feature.")
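The effect of the leaky column can be measured by scoring the same model with and without it. The sketch below rebuilds a dataset with the same distributions as the scenario above (using a NumPy Generator rather than the global seed) and assumes a scaled logistic regression as the classifier; both choices are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_cust = 1000
churn = rng.integers(0, 2, n_cust)
demo = pd.DataFrame({
    'tenure_months': rng.exponential(24, n_cust),
    'monthly_charge': rng.normal(50, 15, n_cust),
    'support_calls': rng.poisson(2, n_cust),
    # Leaky: only takes large values once the customer has already churned
    'days_since_last_login': np.where(churn == 1,
                                      rng.exponential(60, n_cust),
                                      rng.exponential(5, n_cust)),
})

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_leak = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

acc_with = cross_val_score(clf, demo, churn, cv=cv_leak).mean()
acc_without = cross_val_score(
    clf, demo.drop(columns='days_since_last_login'), churn, cv=cv_leak).mean()

print(f"Accuracy with leaky feature:    {acc_with:.3f}")
print(f"Accuracy without leaky feature: {acc_without:.3f}")
```

In this synthetic setup the three legitimate features carry no signal at all, so the model without the leaky column sits near chance, while the leaky column alone produces a score that would look impressive in a report and would not survive deployment.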
Leakage Detection Checklist
PREPROCESSING LEAKAGE:
□ Are scalers, imputers, encoders fit on ALL data before CV?
→ Fix: put everything inside sklearn Pipeline
FEATURE LEAKAGE:
□ Is feature selection done on all data before CV?
→ Fix: SelectKBest, RFE inside Pipeline
□ Are derived statistics (mean encoding, target encoding) computed globally?
→ Fix: use TargetEncoder inside Pipeline
TARGET LEAKAGE:
□ Is any feature derived from or correlated with future target values?
□ Do timestamps reveal target (e.g., columns filled in after event)?
□ Are there near-duplicate records that span train/test?
→ Fix: check correlation of each feature with target directly
GROUP LEAKAGE:
□ Are samples from the same entity in both train and test?
□ Do features encode entity-level statistics computed globally?
→ Fix: GroupKFold, or entity-level train/test split
TIME SERIES LEAKAGE:
□ Is future data appearing in training (random split on time series)?
□ Are lag features computed across the split boundary?
→ Fix: TimeSeriesSplit with gap parameter
Part 10 - Choosing the Right CV Strategy
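One way to condense the variants from the earlier parts into a decision rule is a small helper that maps the structure of the data to a splitter. The function below is a hypothetical sketch: choose_cv, its parameters, and the priority order (temporal order first, then groups, then class labels) are my assumptions, not a scikit-learn API.

```python
from sklearn.model_selection import (GroupKFold, KFold, StratifiedKFold,
                                     TimeSeriesSplit)

def choose_cv(is_time_ordered=False, has_groups=False, is_classification=True,
              n_splits=5, max_lag=0, random_state=42):
    """Hypothetical helper: pick a splitter from the data's structure."""
    if is_time_ordered:
        # Never train on the future; leave a gap at least as long as the longest lag
        return TimeSeriesSplit(n_splits=n_splits, gap=max_lag)
    if has_groups:
        # Keep every entity (patient, user, account) entirely in train or in test
        return GroupKFold(n_splits=n_splits)
    if is_classification:
        # Preserve the class ratio in every fold
        return StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    return KFold(n_splits=n_splits, shuffle=True, random_state=random_state)

print(type(choose_cv(is_time_ordered=True, max_lag=7)).__name__)
print(type(choose_cv(has_groups=True)).__name__)
print(type(choose_cv()).__name__)
print(type(choose_cv(is_classification=False)).__name__)
```

The helper deliberately ignores cases that combine constraints, such as grouped time series or grouped and imbalanced classification; those need a custom splitter or a more specialised one, and the right choice depends on which question the evaluation has to answer.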
Part 11 - Computational Cost Analysis
import time
from sklearn.model_selection import (
cross_val_score, KFold, StratifiedKFold,
RepeatedStratifiedKFold, LeaveOneOut
)
from sklearn.svm import SVC
from sklearn.datasets import make_classification
import numpy as np
X_bench, y_bench = make_classification(n_samples=300, n_features=20, random_state=42)
model = SVC(kernel='rbf')
strategies = {
'KFold(3)': KFold(n_splits=3, shuffle=True, random_state=42),
'KFold(5)': KFold(n_splits=5, shuffle=True, random_state=42),
'KFold(10)': KFold(n_splits=10, shuffle=True, random_state=42),
'Repeated 5×5': RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42),
'Repeated 5×10': RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42),
'LOOCV': LeaveOneOut(),
}
print(f"\n{'Strategy':<20} {'n_fits':>8} {'Mean':>8} {'Std':>8} {'Time(s)':>10}")
print("-" * 60)
for name, cv in strategies.items():
    n_fits = cv.get_n_splits(X_bench)
    start = time.perf_counter()
    scores = cross_val_score(model, X_bench, y_bench, cv=cv,
                             scoring='accuracy', n_jobs=1)
    elapsed = time.perf_counter() - start
    print(f"{name:<20} {n_fits:>8} {scores.mean():>8.4f} {scores.std():>8.4f} {elapsed:>10.2f}")
Practical recommendations:
| Dataset Size | Recommended CV | Rationale |
|---|---|---|
| n < 50 | LOOCV | Minimise bias; computation still feasible |
| 50 ≤ n < 500 | Repeated 5×10 | Stable estimate; reduce variance |
| 500 ≤ n < 5,000 | StratifiedKFold(10) | Good bias-variance; manageable compute |
| 5,000 ≤ n < 50,000 | StratifiedKFold(5) | Sufficient data per fold |
| n > 50,000 | KFold(3) or single split | Large test fold already stable |
Part 12 - Advanced: Statistical Testing Across CV Folds
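When comparing two models with CV, you should test whether the performance difference is statistically significant. The corrected resampled t-test (Nadeau and Bengio, 2003) accounts for fold correlation by inflating the variance term; Dietterich (1998) is the related 5×2cv paired t-test, a different procedure: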
from scipy.stats import t
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=500, n_features=20,
n_informative=10, random_state=42)
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores_lr = cross_val_score(LogisticRegression(max_iter=500), X, y, cv=cv)
scores_svm = cross_val_score(SVC(kernel='rbf'), X, y, cv=cv)
def corrected_resampled_ttest(scores_a, scores_b, n_train_frac=0.9):
    """
    Corrected resampled t-test for comparing CV scores.
    Accounts for the fact that CV folds are not independent.
    (Nadeau and Bengio, 2003)
    """
    differences = scores_a - scores_b
    n = len(differences)
    mean_diff = differences.mean()
    var_diff = differences.var(ddof=1)
    n_test = 1 - n_train_frac
    # Correction factor for correlated folds: 1/n plus the test/train size ratio
    correction = (1 / n) + (n_test / n_train_frac)
    t_stat = mean_diff / np.sqrt(correction * var_diff)
    p_value = 2 * t.sf(abs(t_stat), df=n - 1)
    return t_stat, p_value
t_stat, p_val = corrected_resampled_ttest(scores_svm, scores_lr)
print(f"LR: {scores_lr.mean():.4f} ± {scores_lr.std():.4f}")
print(f"SVM: {scores_svm.mean():.4f} ± {scores_svm.std():.4f}")
print(f"Corrected t-test: t={t_stat:.3f}, p={p_val:.4f}")
print(f"Difference is {'statistically significant (p<0.05)' if p_val < 0.05 else 'NOT significant (p≥0.05)'}")
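Written out, the statistic computed by the function above is the following, where $d_j$ is the score difference on fold $j$, $k$ is the number of folds, and $n_{\text{test}}/n_{\text{train}}$ is the test-to-train size ratio (0.1/0.9 for 10-fold). The notation is introduced here for illustration and follows the code's variables.

```latex
t = \frac{\bar{d}}{\sqrt{\left(\dfrac{1}{k} + \dfrac{n_{\text{test}}}{n_{\text{train}}}\right)\hat{\sigma}_d^{2}}},
\qquad
\bar{d} = \frac{1}{k}\sum_{j=1}^{k} d_j,
\qquad
\hat{\sigma}_d^{2} = \frac{1}{k-1}\sum_{j=1}^{k}\left(d_j - \bar{d}\right)^{2}
```

The statistic is compared against a Student t distribution with $k-1$ degrees of freedom. Without the $n_{\text{test}}/n_{\text{train}}$ term this reduces to the ordinary paired t-test, which treats the folds as independent and therefore tends to report significance too readily.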
YouTube Resources
| Video | Channel | Focus |
|---|---|---|
| Cross Validation explained | StatQuest | k-Fold intuition, bias-variance |
| Nested Cross-Validation | StatQuest | Hyperparameter selection bias |
| Time Series Cross-Validation | ritvikmath | Walk-forward validation |
| Data Leakage in ML | Krish Naik | Leakage detection and prevention |
| sklearn Pipelines and CV | Data School | Correct pipeline-based CV |
Interview Questions
Q1: Why is a single 80/20 train-test split unreliable, and what does CV actually fix?
A single split is one sample from the distribution of all possible splits. Reshuffle the data and you get a different accuracy - the variance of a single split estimate is high. k-Fold reduces this variance by averaging over non-overlapping test evaluations, each covering different samples. The estimator variance decreases roughly as $1/k$ (ignoring positive correlations between folds). Additionally, k-Fold uses all data as test data at some point, making it especially valuable for small datasets where reserving 20% permanently for testing is wasteful. The trade-off is that k model fits are required, and fold scores are positively correlated (shared training data), so the variance reduction is less than the $1/k$ that independent estimates would give.
Q2: What's the difference between model evaluation and model selection, and why does mixing them create a problem?
Model evaluation asks: "How well does this model generalise to new data?" It requires test data that was never used to make any model decisions. Model selection asks: "Which hyperparameters/architecture gives the best generalisation?" It requires held-out validation data to compare alternatives. Mixing them - selecting hyperparameters based on CV scores and then reporting those same CV scores as your generalisation estimate - creates optimism bias. The selected hyperparameters were chosen because they happened to perform well on those specific folds, including some luck. Nested CV separates the concerns: the inner loop selects hyperparameters, the outer loop evaluates the selected model on data never seen in the inner loop. The outer loop estimate is unbiased.
Q3: Explain exactly how data leakage occurs when you scale features before cross-validation, and how to prevent it.
Standard scaling computes $\mu$ and $\sigma$ from the data, then applies $z = (x - \mu)/\sigma$. If you call scaler.fit_transform(X) on the entire dataset before doing CV, the $\mu$ and $\sigma$ include statistics from the test folds. When the model then sees the scaled test fold, it's been normalized using its own statistics - it's like the model has already "peeked" at the test set. Concretely: if the test fold contains extreme values, those shift $\mu$ and $\sigma$, making the test fold's scaled features artificially well-behaved. The fix is always wrapping preprocessing in sklearn.pipeline.Pipeline. Inside cross_val_score, the pipeline calls fit_transform only on the training fold and transform on the test fold - correct train-test isolation.
Q4: How is time-series cross-validation different from standard k-Fold, and what is the gap parameter?
Standard k-Fold randomly shuffles samples across folds, which violates temporal causality in time series - the model can train on future data and predict the past. TimeSeriesSplit enforces that test data always comes after training data in time, simulating the production scenario of predicting the future from historical data. The gap parameter drops samples between the end of training and the start of testing. This prevents leakage when features use lag windows: if your model uses the last 7 days as features, a test point at time $t$ has its feature vector computed from $t-7, \dots, t-1$. Without a gap, some of $t-7, \dots, t-1$ may overlap with training samples. Setting gap=7 ensures the closest training sample is at $t-8$ or earlier.
Q5: What is LOOCV optimism, and when should you use LOOCV vs 5-Fold?
Counterintuitively, LOOCV can have higher variance than 5-Fold despite training on $n-1$ samples. Each test set has a single binary outcome (correct or incorrect prediction), and the 0/1 scores are highly variable. The $n$ training sets also overlap almost completely, so the per-fold errors are strongly correlated and averaging them reduces variance less than averaging independent estimates would. The resulting estimator variance can be larger than k-Fold's, even though LOOCV has lower bias. Empirically, for classification with small $n$, 5-Fold or 10-Fold often gives a better bias-variance trade-off than LOOCV. Use LOOCV when $n$ is small, training is fast (linear models), and you absolutely need to minimise bias. For linear regression, use the LOOCV shortcut formula (requires only 1 fit). For most practical purposes ($n > 100$), repeated 10-Fold CV outperforms LOOCV in the bias-variance trade-off.
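For ordinary least squares, the single-fit shortcut is the standard identity $e_i^{\text{LOO}} = e_i / (1 - h_{ii})$, where $e_i$ is the ordinary residual and $h_{ii}$ is the $i$-th diagonal entry of the hat matrix $H = X (X^\top X)^{-1} X^\top$. The sketch below checks it numerically against the brute-force loop of $n$ fits; the simulated data and coefficient values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 3

# Design matrix with an explicit intercept column (simulated data, assumption).
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Shortcut: one least-squares fit plus the hat-matrix diagonal.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat
H = X @ np.linalg.pinv(X)          # equals X (X^T X)^{-1} X^T for full-rank X
h = np.diag(H)
loo_mse_shortcut = np.mean((residuals / (1.0 - h)) ** 2)

# Brute force: refit n times, each time leaving one row out.
errors = []
for i in range(n):
    keep = np.arange(n) != i
    b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    errors.append(y[i] - X[i] @ b)
loo_mse_brute = np.mean(np.square(errors))

print(f"shortcut LOOCV MSE:    {loo_mse_shortcut:.6f}")
print(f"brute-force LOOCV MSE: {loo_mse_brute:.6f}")
```

The identity is exact for unregularised linear regression; it does not carry over unchanged to classifiers or to models with data-dependent preprocessing.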
Q6: Describe when you would use GroupKFold and how it changes the evaluation semantics.
GroupKFold is required whenever your test question is "does the model generalise to new entities?" - where "entity" means a patient, user, customer, audio speaker, geographic region, etc. Random k-Fold evaluates "does the model generalise to new samples from the same entities it trained on?" - a fundamentally different and usually easier question. Concretely: if a model trains on visits 1-9 of Patient 7 and tests on visit 10 of Patient 7, it can exploit patient-specific patterns (individual baseline, typical values) that won't exist for new patients in production. GroupKFold ensures entire patients are withheld, matching the production deployment where you'll encounter new patients. The resulting CV score is typically lower but more honest.
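The difference in semantics can be checked directly by counting how many entities appear on both sides of each split. In this sketch the 20 patients with 10 visits each are made up, and the feature and target arrays are placeholders because only the indices matter.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# 20 hypothetical patients, 10 visits each; groups[i] is the patient ID of row i.
n_patients, visits_per_patient = 20, 10
groups = np.repeat(np.arange(n_patients), visits_per_patient)
X = np.zeros((len(groups), 1))   # placeholder features
y = np.zeros(len(groups))        # placeholder targets


def shared_patients(cv, **split_kwargs):
    """Number of patients present in both train and test, per fold."""
    counts = []
    for train_idx, test_idx in cv.split(X, y, **split_kwargs):
        counts.append(len(set(groups[train_idx]) & set(groups[test_idx])))
    return counts


random_overlap = shared_patients(KFold(n_splits=5, shuffle=True, random_state=0))
group_overlap = shared_patients(GroupKFold(n_splits=5), groups=groups)

print("patients shared per fold, random KFold:", random_overlap)
print("patients shared per fold, GroupKFold:  ", group_overlap)
```

With a real estimator, the same groups array is passed through cross_val_score(..., groups=groups, cv=GroupKFold(n_splits=5)).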
Q7: How would you implement cross-validation for a custom model that doesn't follow the sklearn API?
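I'd implement it manually with one of scikit-learn's splitters (StratifiedKFold here; KFold works the same way for regression), whose split() method yields train/test index arrays:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def custom_cross_validate(model_class, model_kwargs, X, y, n_splits=5):
    """Manual stratified k-fold CV for any model exposing fit() and score()."""
    X, y = np.asarray(X), np.asarray(y)  # positional indexing below needs arrays
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for train_idx, test_idx in cv.split(X, y):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        model = model_class(**model_kwargs)  # fresh model per fold, no carried-over state
        model.fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))  # swap in your own metric if needed
    return np.array(scores)
```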
For models with custom preprocessing, I'd apply the preprocessing inside the loop, fitting only on train_idx data.
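A sketch of that pattern with per-fold standardisation is shown here. CentroidModel and its train()/predict_labels() methods are hypothetical stand-ins for whatever non-sklearn interface your model exposes, and the two-cluster data is simulated.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


class CentroidModel:
    """Hypothetical non-sklearn model: nearest class centroid."""

    def train(self, X, y):
        self.labels_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.labels_])

    def predict_labels(self, X):
        # Squared distance from every row to every centroid, then pick the closest.
        d = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        return self.labels_[d.argmin(axis=1)]


# Two simulated Gaussian clusters (assumption for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (60, 4)), rng.normal(2, 1, (60, 4))])
y = np.array([0] * 60 + [1] * 60)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in cv.split(X, y):
    # Preprocessing statistics come from the training fold only.
    mu = X[train_idx].mean(axis=0)
    sigma = X[train_idx].std(axis=0) + 1e-12   # guard against zero variance
    X_train = (X[train_idx] - mu) / sigma
    X_test = (X[test_idx] - mu) / sigma        # reuse training statistics

    model = CentroidModel()
    model.train(X_train, y[train_idx])
    scores.append(np.mean(model.predict_labels(X_test) == y[test_idx]))

scores = np.array(scores)
print("per-fold accuracy:", scores.round(3))
```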
Q8: What's the correct way to perform cross-validation when you also need to calibrate probabilities?
Probability calibration (Platt scaling, isotonic regression) belongs inside the CV loop, not as a post-processing step on the full dataset. The calibrator must be fit on data held out from the training portion (never on the test fold), so it needs its own validation split within each outer fold. sklearn.calibration.CalibratedClassifierCV supports this in two ways: cv=k splits the training data internally so that the base model and the calibrator are fit on disjoint parts, while cv='prefit' calibrates a model that was already fit on separate data (recent scikit-learn releases deprecate 'prefit' in favour of wrapping the fitted model in FrozenEstimator). When using nested CV, the calibration wrapper goes inside the inner loop, so it is refit on every training split. The key principle: any component that learns from the data (scaler, feature selector, hyperparameter search, calibrator) must be fit only on training data and applied to test data - and this must be enforced at every level of the CV hierarchy.
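One way to keep calibration inside the loop is to treat the calibrated model as a single estimator and hand it to cross_val_score. The sketch below assumes a Gaussian naive Bayes base model, Platt scaling (method="sigmoid"), and the Brier score as the metric; all three are illustrative choices.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic data (assumption for illustration).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# cv=3: inside each outer training fold, the base model and the calibrator
# are fit on disjoint internal splits.
calibrated = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=3)

# The outer test fold is used only for scoring, never for fitting the calibrator.
brier = -cross_val_score(calibrated, X, y, cv=outer_cv, scoring="neg_brier_score")

print("per-fold Brier score (lower is better):", brier.round(4))
```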
:::tip Role-Specific Angles
- MLE Interview: Nested CV, data leakage (preprocessing, feature selection, target leakage), Pipeline design, statistical testing of CV results
- Research Engineer: LOOCV optimism, corrected resampled t-test, bias-variance of CV estimators, LOOCV closed-form for linear models
- MLOps Interview: Time-series CV, walk-forward validation, production evaluation protocol design
- Data Scientist: Stratified CV for imbalanced classes, repeated CV for small datasets, GroupKFold for entity-level data
:::
:::tip 🎮 Interactive Playground
Visualize this concept: Try the K-Fold Cross-Validation demo on the EngineersOfAI Playground - no code required.
:::
