Train / Validation / Test Split Strategy
Reading time: ~35 minutes | Level: ML Foundations | Role: MLE, Data Scientist, MLOps, AI Engineer
A team builds a document classification model. Test accuracy is 96%. They deploy. Accuracy in production: 72%. The post-mortem reveals the problem: the dataset was made of paragraphs, and many of them came from the same source documents. Random splitting put paragraphs from the same document into both training and test. The model memorised document-level patterns and generalised poorly to the real distribution of new documents.
This failure is completely preventable - if you understand data splitting properly. The split strategy is one of the most consequential decisions in any ML project, yet it's often dismissed as an afterthought. A wrong split doesn't just give you an optimistic metric - it can make an actually bad model look excellent, leading to costly deployment decisions.
What You Will Learn
- Why the three-way split (train / val / test) exists and what each partition does
- How to choose split ratios based on dataset size and model complexity
- The taxonomy of data leakage - preprocessing, feature engineering, target, group, temporal, duplicate
- Temporal splits for time-series and event-driven data
- Group splits for correlated, hierarchical, and entity-based datasets
- Stratified splits for class-imbalanced and rare-label distributions
- Anti-leakage architecture: the correct ordering of all operations
- Production evaluation design - when to update the test set and how
- Nine interview Q&As at senior MLE level
Part 1 - The Three-Way Split and Why It Exists
The fundamental problem:
1. You need data to train the model → training set
2. You need data to tune hyperparameters → validation set
3. You need uncontaminated data for final performance estimation → test set
Why can't you reuse the training set for validation?
→ The model's parameters were optimized on training set
so it will always look great there.
Why can't you reuse the validation set for the test?
→ You used the validation set to make MODEL DECISIONS
(which architecture, which C value, which learning rate)
The model is implicitly "fitted" to the val set through your choices.
Report val score as final → optimism bias.
DATA FLOW:
Training Set ──────────────────────────────────────
→ Model parameters learned here

Validation Set ───────────────────────────────────
→ Hyperparameter selection
→ Architecture decisions
→ Early stopping signals
→ Feature engineering decisions
ALL of these decisions implicitly use the val set

Test Set ──────────────────────────────────────────
→ Touched EXACTLY ONCE
→ Only after all decisions are final
→ Gives unbiased generalisation estimate
→ If you look at it, retune, and look again, it is now a second validation set
The golden rule of the test set: you get ONE look. If you use that look to make any decision, that data is no longer a test set.
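The optimism bias described above can be seen in a small simulation. The sketch below is illustrative and not part of the original material: it assumes 200 hypothetical candidate "models" that guess labels at random (so each has a true accuracy of 50%), selects the one with the best validation score, and then scores that same candidate on a separate test set. All sizes and names are made up for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_val, n_test, n_candidates = 200, 200, 200

# Ground-truth labels for the two held-out partitions
y_val = rng.integers(0, 2, n_val)
y_test = rng.integers(0, 2, n_test)

# Each candidate "model" guesses at random, so its true accuracy is 0.5
val_preds = rng.integers(0, 2, (n_candidates, n_val))
test_preds = rng.integers(0, 2, (n_candidates, n_test))

val_acc = (val_preds == y_val).mean(axis=1)
test_acc = (test_preds == y_test).mean(axis=1)

# Model selection: keep the candidate with the best validation score
best = int(val_acc.argmax())

print(f"Best validation accuracy:      {val_acc[best]:.3f}")
print(f"Test accuracy of that model:   {test_acc[best]:.3f}")
print(f"Mean test accuracy (all):      {test_acc.mean():.3f}")
```

The selected candidate's validation score sits well above 50% purely because it was the maximum of many noisy scores; its test score is not subject to that selection effect, which is why only the test number is a fair estimate.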
Part 2 - Choosing Split Ratios
There is no universal optimal ratio. It depends on dataset size and model complexity.
The Classical Ratios
Traditional (pre-deep-learning, small datasets):
Training: 60%
Validation: 20%
Test: 20%
Common modern ratio:
Training: 70-80%
Validation: 10-15%
Test: 10-15%
The Right Framework: Minimum Samples Per Partition
The question is not "what percentage?" but "do I have enough samples?"
Validation set must be large enough to:
✓ Give reliable metric estimates (low variance)
✓ Represent the true distribution
✓ Detect meaningful performance differences between models
Test set must be large enough to:
✓ Give a tight confidence interval on the final metric
✓ Have statistical power to detect real differences from baseline
import numpy as np
from scipy import stats
def minimum_test_size_for_accuracy(target_margin: float, confidence: float = 0.95,
baseline_acc: float = 0.5) -> int:
"""
How many test samples do you need to estimate accuracy within ±margin
at a given confidence level?
Uses the normal-approximation (Wald) interval for a proportion.
Example: target_margin=0.02, confidence=0.95, baseline_acc=0.85
→ an accuracy near 85% is then pinned down to roughly 83% to 87%
"""
z = stats.norm.ppf((1 + confidence) / 2) # z-score for CI
p = baseline_acc
# Normal approximation: margin = z * sqrt(p*(1-p)/n) → n = z² * p*(1-p) / margin²
n = (z**2 * p * (1 - p)) / (target_margin**2)
return int(np.ceil(n))
print("Minimum test set size for accuracy estimation:")
print(f"{'Desired margin':>20} {'Baseline acc':>15} {'Min test size':>15}")
print("-" * 55)
for margin in [0.01, 0.02, 0.05, 0.10]:
for acc in [0.7, 0.9, 0.95]:
n = minimum_test_size_for_accuracy(margin, baseline_acc=acc)
print(f"{f'±{margin*100:.0f}%':>20} {f'{acc*100:.0f}%':>15} {n:>15,}")
print()
Split Ratios vs Dataset Size
Dataset Size | Train | Val | Test | Notes
───────────────┼───────┼─────┼───────┼──────────────────────────────
< 1,000 | 60% | 20% | 20% | Consider k-fold instead
1K – 10K | 70% | 15% | 15% | Balance between compute and stat power
10K – 100K | 80% | 10% | 10% | Test set > 1K samples: good CI
100K – 1M | 90% | 5% | 5% | 5% = 5K-50K samples - very reliable
> 1M | 98% | 1% | 1% | 1% = 10K+ samples - excellent CI
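To check the "good CI" notes in the table, the sketch below computes the half-width of a 95% confidence interval for a few illustrative test-set sizes. It uses the normal approximation for a proportion; the helper name `accuracy_ci_halfwidth`, the example dataset sizes, and the assumed accuracy of 90% are choices made for this illustration.

```python
import numpy as np
from scipy import stats

def accuracy_ci_halfwidth(n_test: int, acc: float = 0.5, confidence: float = 0.95) -> float:
    """Half-width of the normal-approximation CI for accuracy measured on n_test samples."""
    z = stats.norm.ppf((1 + confidence) / 2)
    return float(z * np.sqrt(acc * (1 - acc) / n_test))

# (example dataset size, test fraction) pairs, one per table row
tiers = [(1_000, 0.20), (10_000, 0.15), (100_000, 0.10), (1_000_000, 0.05), (10_000_000, 0.01)]

print(f"{'Dataset':>12} {'Test n':>10} {'95% CI half-width @ 90% acc':>30}")
for n_total, frac in tiers:
    n_test = round(n_total * frac)
    hw = accuracy_ci_halfwidth(n_test, acc=0.9)
    print(f"{n_total:>12,} {n_test:>10,} {f'±{hw * 100:.2f}%':>30}")
```

The half-width shrinks with the square root of the test-set size, which is why a small percentage of a very large dataset can still give a tight interval.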
from sklearn.model_selection import train_test_split
import numpy as np
def smart_split(X, y, val_size=0.15, test_size=0.15, random_state=42, stratify=None):
"""
Three-way split with optional stratification.
Handles the sklearn train_test_split's two-step process correctly.
"""
n = len(X)
relative_val_size = val_size / (1 - test_size) # val fraction of (train+val)
# Step 1: split off test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
X, y,
test_size=test_size,
random_state=random_state,
stratify=stratify # preserves class balance in test
)
# Step 2: split remaining into train and val
stratify_inner = y_trainval if stratify is not None else None
X_train, X_val, y_train, y_val = train_test_split(
X_trainval, y_trainval,
test_size=relative_val_size,
random_state=random_state + 1,
stratify=stratify_inner
)
print(f"Dataset: n={n}")
print(f" Train: n={len(X_train)} ({len(X_train)/n*100:.1f}%)")
print(f" Val: n={len(X_val)} ({len(X_val)/n*100:.1f}%)")
print(f" Test: n={len(X_test)} ({len(X_test)/n*100:.1f}%)")
if stratify is not None:
print(f" Target distribution - Train: {np.bincount(y_train) / len(y_train)}")
print(f" Target distribution - Val: {np.bincount(y_val) / len(y_val)}")
print(f" Target distribution - Test: {np.bincount(y_test) / len(y_test)}")
return X_train, X_val, X_test, y_train, y_val, y_test
# Demo
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=5000, n_features=20,
weights=[0.8, 0.2], random_state=42)
X_train, X_val, X_test, y_train, y_val, y_test = smart_split(
X, y, stratify=y
)
Part 3 - The Complete Taxonomy of Data Leakage
Data leakage is when information from outside the training set illegitimately reaches the model during training, causing the model to appear better than it is. There are six distinct forms:
Leakage Type 1: Preprocessing Leakage
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer
X = np.random.randn(1000, 10)
X[np.random.choice(1000, 50, replace=False), :] = np.nan # inject missing values
X_train, X_val = X[:800], X[800:]
y_train, y_val = np.random.randint(0, 2, 800), np.random.randint(0, 2, 200)
# WRONG: fit imputer and scaler on all data
imputer_wrong = SimpleImputer(strategy='mean')
scaler_wrong = StandardScaler()
X_all_imputed = imputer_wrong.fit_transform(X) # uses val distribution!
X_all_scaled = scaler_wrong.fit_transform(X_all_imputed) # uses val stats!
# RIGHT: fit imputer and scaler only on training data
imputer_right = SimpleImputer(strategy='mean')
scaler_right = StandardScaler()
X_train_clean = scaler_right.fit_transform(imputer_right.fit_transform(X_train))
X_val_clean = scaler_right.transform(imputer_right.transform(X_val))
# Note: transform (not fit_transform) on val - uses training statistics only
print("Wrong approach: val statistics contaminate the scaler/imputer")
print(f" Imputer mean (wrong): computed on all 1000 samples")
print(f" Imputer mean (right): computed on train 800 samples only")
# Check: train and val should not have the same mean after CORRECT scaling
print(f"\nCorrect - val scaled mean != 0 (scaler fit on train only):")
print(f" X_val_clean col 0 mean: {X_val_clean[:, 0].mean():.4f} (expected: ≠ 0)")
Leakage Type 2: Feature Engineering Leakage
import pandas as pd
import numpy as np
# Scenario: customer features with some computed from the full dataset
n_customers = 10000
df = pd.DataFrame({
'customer_id': range(n_customers),
'age': np.random.randint(18, 70, n_customers),
'monthly_spend': np.random.exponential(200, n_customers),
'category': np.random.choice(['A', 'B', 'C'], n_customers),
'churned': np.random.randint(0, 2, n_customers),
})
# WRONG: compute global target-based statistics before splitting
# "mean_spend_by_churn" encodes target information globally
df['mean_spend_by_churn'] = df.groupby('churned')['monthly_spend'].transform('mean')
# ↑ This uses the target (churned) to compute a feature - direct target leakage
# WRONG: global group aggregate computed before splitting
df['category_mean_spend'] = df.groupby('category')['monthly_spend'].transform('mean')
# ↑ Computed on all rows - val customers' spending contributes to their own feature values
# RIGHT: compute everything within the training split only
train_df = df[:8000].copy()
val_df = df[8000:].copy()
# Compute mean spend per category from training data only
category_means = train_df.groupby('category')['monthly_spend'].mean()
train_df['cat_mean_spend'] = train_df['category'].map(category_means)
val_df['cat_mean_spend'] = val_df['category'].map(category_means) # apply train stats
Leakage Type 3: Target Leakage (Most Dangerous)
Target leakage occurs when a feature is causally downstream of the target - it's only known after the target is determined.
import pandas as pd
import numpy as np
"""
Classic target leakage examples:
Problem: Predict loan default (target: default=1)
LEAKY features:
- "was_collection_called" - only happens AFTER default
- "days_overdue" - only defined if payment is missed (→ default)
- "settlement_amount" - only recorded if loan was defaulted
SAFE features:
- credit score at application time
- income at application time
- loan-to-value ratio
Problem: Predict customer churn (target: churned=1)
LEAKY features:
- "days_since_last_login" - long gap BECAUSE they churned
- "support_tickets_before_cancel" - only happen in the churn process
- "account_status = CANCELLED" - IS the churn event!
SAFE features:
- login frequency in months 1-6
- number of support tickets in the first 90 days
- plan tier, payment method
Key diagnostic question: "Is this feature available at prediction time
in production, BEFORE the target event occurs?"
"""
def detect_target_leakage(df: pd.DataFrame, target: str, threshold: float = 0.9):
"""
Quick heuristic: features with very high correlation to target
are suspicious - may indicate target leakage.
(Not conclusive - needs domain knowledge)
"""
print(f"\nTarget leakage suspicion scan (target='{target}'):")
print(f"{'Feature':<30} {'Correlation':>15} {'Suspicious?':>12}")
print("-" * 60)
for col in df.columns:
if col == target:
continue
if pd.api.types.is_numeric_dtype(df[col]):  # covers int32/float32 etc., not only 64-bit types
corr = abs(df[col].corr(df[target]))
suspicious = corr > threshold
flag = " ← INVESTIGATE!" if suspicious else ""
print(f"{col:<30} {corr:>15.4f} {str(suspicious):>12}{flag}")
# Example
np.random.seed(42)
n = 1000
df_churn = pd.DataFrame({
'login_freq_6mo': np.random.normal(10, 3, n),
'support_calls_6mo': np.random.poisson(2, n),
'plan_tier': np.random.randint(1, 4, n),
'churned': np.random.randint(0, 2, n),
})
# Inject leaky feature: "days_since_login" = high if churned
df_churn['days_since_login'] = np.where(
df_churn['churned'] == 1,
np.random.exponential(60, n),
np.random.exponential(5, n)
)
detect_target_leakage(df_churn, 'churned', threshold=0.5)
Leakage Type 4: Group Leakage
"""
Group leakage: samples from the same "entity" appear in both train and test.
Examples:
- Same patient's records in train and test → model learns patient baseline
- Same user's sessions → model learns user preferences, not general patterns
- Same document's sentences → model learns document style
- Same product's reviews → model learns product-level sentiment
The model learns entity-specific features that don't generalise to new entities.
In production you'll encounter NEW patients, NEW users, NEW documents.
"""
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
np.random.seed(42)
n_entities = 100
samples_per_entity = 20
n = n_entities * samples_per_entity
# Strong entity-level effect (simulates group leakage risk)
entity_ids = np.repeat(np.arange(n_entities), samples_per_entity)
entity_effects = np.random.randn(n_entities) # entity-specific signal
X_common = np.random.randn(n, 5) # shared features
# Entity-specific feature (would leak in random split)
entity_feature = entity_effects[entity_ids].reshape(-1, 1)
X = np.hstack([X_common, entity_feature])
# Target: a mix of common features and entity effect
y = (X_common[:, 0] + entity_effects[entity_ids] + np.random.randn(n) * 0.3 > 0).astype(int)
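# Wrong: random row-level split - almost every entity ends up in both train and test
# (slicing X[:1600] would NOT show this: rows are ordered by entity, so it is already a group split)
rng = np.random.default_rng(42)
perm = rng.permutation(n)
wrong_train_idx, wrong_test_idx = perm[:1600], perm[1600:]
shared = np.intersect1d(entity_ids[wrong_train_idx], entity_ids[wrong_test_idx])
print(f"Entities in both partitions (random split): {len(shared)} of {n_entities}")
# Right: group-based split - each entity lands in exactly one partition
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=entity_ids))
model = LogisticRegression(max_iter=500)
# Wrong split: test rows come from entities the model has already seen
model.fit(X[wrong_train_idx], y[wrong_train_idx])
acc_wrong = accuracy_score(y[wrong_test_idx], model.predict(X[wrong_test_idx]))
# Right split: test rows come only from held-out entities
model.fit(X[train_idx], y[train_idx])
acc_right = accuracy_score(y[test_idx], model.predict(X[test_idx]))
print(f"Accuracy (entity in both train/test): {acc_wrong:.4f}")
print(f"Accuracy (entities held out from test): {acc_right:.4f}")
# Note: here the entity effect is itself an observed feature, so a linear model can carry it
# over to new entities and the two scores may be close. The gap widens when the model can
# memorise entity identity (e.g. an ID-like feature or a high-capacity model).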
Leakage Type 5: Temporal Leakage
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
"""
Temporal leakage: the model uses information from the future
to predict the past. This happens with random splits on temporal data.
Examples:
- Stock price prediction: train on future prices to predict past
- Demand forecasting: use next month's demand as a feature
- Fraud detection: use account closure date (post-fraud event) as feature
- User engagement: use features computed from events after the label date
"""
# Simulate a time series dataset
dates = pd.date_range('2023-01-01', '2023-12-31', freq='D')
df_ts = pd.DataFrame({
'date': dates,
'sales': 1000 + 0.5 * np.arange(len(dates)) + \
50 * np.sin(2 * np.pi * np.arange(len(dates)) / 30) + \
np.random.normal(0, 30, len(dates)),
})
# WRONG: random split on time series
np.random.seed(42)
random_idx = np.random.permutation(len(df_ts))
train_wrong = df_ts.iloc[random_idx[:260]]
test_wrong = df_ts.iloc[random_idx[260:]]
print(f"Random split test dates: {test_wrong['date'].min().date()} to {test_wrong['date'].max().date()}")
print("Mixed past and future in both sets - leakage!")
# RIGHT: temporal split - test is always strictly after train
split_date = pd.Timestamp('2023-10-01')
train_right = df_ts[df_ts['date'] < split_date]
test_right = df_ts[df_ts['date'] >= split_date]
print(f"\nTemporal split:")
print(f" Train: {train_right['date'].min().date()} to {train_right['date'].max().date()} (n={len(train_right)})")
print(f" Test: {test_right['date'].min().date()} to {test_right['date'].max().date()} (n={len(test_right)})")
Leakage Type 6: Duplicate / Near-Duplicate Leakage
"""
Duplicate leakage: near-identical samples appear in both train and test.
The model memorises specific records rather than learning general patterns.
Common causes:
- Web-scraped data with duplicated pages
- Database with repeated rows (natural duplicates)
- Data augmentation applied before splitting
- Sensor data with very high sampling rate (autocorrelated neighbors)
"""
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
def detect_near_duplicates(X_train: np.ndarray, X_test: np.ndarray,
similarity_threshold: float = 0.99) -> dict:
"""
Find test samples that are near-duplicates of training samples.
High-dimensional data: use cosine similarity.
"""
from sklearn.preprocessing import normalize
X_train_norm = normalize(X_train)
X_test_norm = normalize(X_test)
leaky_test_indices = []
for i, test_row in enumerate(X_test_norm):
sims = X_train_norm @ test_row # cosine similarity
if sims.max() >= similarity_threshold:
leaky_test_indices.append(i)
return {
'n_leaky': len(leaky_test_indices),
'fraction': len(leaky_test_indices) / len(X_test),
'indices': leaky_test_indices
}
# Demo
X_base = np.random.randn(100, 50)
X_train_demo = X_base[:80]
# Inject 5 near-duplicates into test
X_test_demo = np.vstack([
X_base[80:95], # 15 fresh samples
X_base[:5] + np.random.randn(5, 50) * 0.01 # 5 near-duplicates from train
])
result = detect_near_duplicates(X_train_demo, X_test_demo, similarity_threshold=0.99)
print(f"Near-duplicates in test: {result['n_leaky']} ({result['fraction']*100:.1f}%)")
print(f"Indices: {result['indices']}")
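Detection after the fact is one option; another is to keep duplicates together before splitting. The sketch below is a minimal illustration under stated assumptions: a hypothetical toy text corpus, a very simple normalisation (lowercase plus whitespace collapsing), and `GroupShuffleSplit` with one group per distinct normalised text, so that copies of the same record cannot straddle the split.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy corpus: rows 0/1 and 3/4 differ only in case and whitespace
df = pd.DataFrame({
    "text": [
        "Free shipping on all orders",
        "free  shipping on ALL orders ",
        "Your invoice is attached",
        "Meeting moved to 3pm",
        "meeting moved to 3pm",
        "Password reset requested",
        "Quarterly report is ready",
        "Lunch on Friday?",
    ],
    "label": [1, 1, 0, 0, 0, 0, 0, 0],
})

# Normalise (lowercase, collapse whitespace), then give every distinct
# normalised text its own group id
norm = df["text"].str.lower().str.split().str.join(" ")
groups = pd.factorize(norm)[0]

# Split by group so all copies of a text land in the same partition
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(gss.split(df, groups=groups))

overlap = set(norm.iloc[train_idx]) & set(norm.iloc[test_idx])
print(f"train rows: {len(train_idx)}, test rows: {len(test_idx)}, shared texts: {len(overlap)}")
```

This only catches records that become identical after normalisation. Fuzzier near-duplicates still need a similarity check such as the one above, and augmentation should be applied after the split for the same reason.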
Part 4 - Temporal Splits in Practice
When your data has a natural time ordering, the split strategy must respect chronological order.
Deployment reality:
Training data: all events up to date T
Production: predict events after date T
↳ Your evaluation must mirror this: test = events after T
Common mistake: splitting by row (which may be ordered by customer_id, not time)
Safe approach: always split by timestamp
import pandas as pd
import numpy as np
from typing import Tuple
def temporal_split(
df: pd.DataFrame,
timestamp_col: str,
val_start: str,
test_start: str,
gap_days: int = 0,
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
"""
Three-way temporal split with an optional gap between train/val/test.
The gap drops samples near the split boundary to avoid leakage
from lag-based features (e.g., 7-day rolling averages).
Example:
train: Jan–Aug 2023
gap: 7 days
val: Sep–Oct 2023
test: Nov–Dec 2023
"""
df = df.copy()
df[timestamp_col] = pd.to_datetime(df[timestamp_col])
val_start_ts = pd.Timestamp(val_start)
test_start_ts = pd.Timestamp(test_start)
gap = pd.Timedelta(days=gap_days)
train = df[df[timestamp_col] < val_start_ts - gap]
val = df[(df[timestamp_col] >= val_start_ts) &
(df[timestamp_col] < test_start_ts - gap)]
test = df[df[timestamp_col] >= test_start_ts]
print(f"Temporal split:")
print(f" Train: {train[timestamp_col].min().date()} → {train[timestamp_col].max().date()} (n={len(train):,})")
print(f" Val: {val[timestamp_col].min().date()} → {val[timestamp_col].max().date()} (n={len(val):,})")
print(f" Test: {test[timestamp_col].min().date()} → {test[timestamp_col].max().date()} (n={len(test):,})")
print(f" Gap: {gap_days} days between train/val and val/test")
return train, val, test
# Simulate event data
np.random.seed(42)
n_events = 10000
event_dates = pd.date_range('2023-01-01', '2023-12-31', periods=n_events)
df_events = pd.DataFrame({
'timestamp': np.sort(event_dates),
'feature_1': np.random.randn(n_events),
'feature_2': np.random.exponential(2, n_events),
'target': np.random.randint(0, 2, n_events),
})
train_df, val_df, test_df = temporal_split(
df_events, 'timestamp',
val_start='2023-10-01',
test_start='2023-11-15',
gap_days=7
)
Multiple Test Periods (Walk-Forward Evaluation)
def walk_forward_evaluation(
df: pd.DataFrame,
timestamp_col: str,
train_start: str,
test_periods: list, # list of (test_start, test_end) tuples
model_factory, # callable returning a fresh model
feature_cols: list,
target_col: str,
min_train_samples: int = 1000
):
"""
Evaluate a model across multiple forward-expanding training windows.
Simulates how the model would perform if deployed at each test period.
"""
import pandas as pd
from sklearn.metrics import mean_absolute_error, accuracy_score
df = df.copy()  # avoid mutating the caller's DataFrame
df[timestamp_col] = pd.to_datetime(df[timestamp_col])
results = []
for test_start, test_end in test_periods:
test_start_ts = pd.Timestamp(test_start)
test_end_ts = pd.Timestamp(test_end)
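# Expanding window: everything from train_start up to (not including) the test period
train = df[(df[timestamp_col] >= pd.Timestamp(train_start)) &
(df[timestamp_col] < test_start_ts)]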
test = df[(df[timestamp_col] >= test_start_ts) &
(df[timestamp_col] < test_end_ts)]
if len(train) < min_train_samples or len(test) == 0:
continue
model = model_factory()
model.fit(train[feature_cols], train[target_col])
preds = model.predict(test[feature_cols])
# Use appropriate metric
try:
score = accuracy_score(test[target_col], preds)
metric = 'accuracy'
except Exception:
score = mean_absolute_error(test[target_col], preds)
metric = 'mae'
results.append({
'test_start': test_start,
'test_end': test_end,
'n_train': len(train),
'n_test': len(test),
metric: score,
})
print(f"Period {test_start} → {test_end}: n_train={len(train):,}, "
f"n_test={len(test):,}, {metric}={score:.4f}")
return pd.DataFrame(results)
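For index-based data that is already sorted by time, scikit-learn's `TimeSeriesSplit` offers a comparable expanding-window scheme, including a `gap` argument. The sketch below is a minimal illustration, assuming one row per day so that a gap of 7 rows corresponds to 7 days; the gap is counted in rows, not in calendar time, so irregularly spaced events need the timestamp-based approach above instead.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_rows = 365   # assumption: one row per day, already sorted by timestamp
gap = 7        # rows dropped between the end of train and the start of test

X = np.arange(n_rows).reshape(-1, 1)  # placeholder features

tscv = TimeSeriesSplit(n_splits=4, test_size=30, gap=gap)
folds = list(tscv.split(X))

for i, (train_idx, test_idx) in enumerate(folds):
    print(f"Fold {i}: train rows 0..{train_idx.max()}, "
          f"test rows {test_idx.min()}..{test_idx.max()}, "
          f"rows skipped in between: {test_idx.min() - train_idx.max() - 1}")
```

Each fold trains on a longer prefix of the series and tests on the next 30 rows, which mirrors the walk-forward idea with less custom code.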
Part 5 - Stratified Splits
For classification problems with class imbalance, random splitting can result in very different class ratios across partitions. Stratified splitting preserves the original class distribution.
from sklearn.model_selection import train_test_split
import numpy as np
np.random.seed(42)
n = 10000
# 95% class 0, 4% class 1, 1% class 2 (rare disease detection)
y_imbalanced = np.random.choice([0, 1, 2], size=n, p=[0.95, 0.04, 0.01])
X_imbalanced = np.random.randn(n, 20)
print(f"Overall distribution: {np.bincount(y_imbalanced) / n}")
# Non-stratified split
X_tr, X_te, y_tr, y_te = train_test_split(X_imbalanced, y_imbalanced,
test_size=0.2, random_state=42)
print(f"\nRandom split:")
print(f" Train distribution: {np.bincount(y_tr) / len(y_tr)}")
print(f" Test distribution: {np.bincount(y_te) / len(y_te)}")
print(f" Test rare class count: {(y_te==2).sum()}")
# Stratified split
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
X_imbalanced, y_imbalanced,
test_size=0.2, random_state=42, stratify=y_imbalanced
)
print(f"\nStratified split:")
print(f" Train distribution: {np.bincount(y_tr_s) / len(y_tr_s)}")
print(f" Test distribution: {np.bincount(y_te_s) / len(y_te_s)}")
print(f" Test rare class count: {(y_te_s==2).sum()}")
Multi-Label Stratification
"""
Standard StratifiedKFold only stratifies by a single label.
For multi-label problems, use iterative stratification.
Install: pip install iterative-stratification
"""
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
import numpy as np
n = 1000
# 5 binary labels with different prevalences
y_multilabel = np.column_stack([
np.random.binomial(1, p, n)
for p in [0.7, 0.3, 0.1, 0.05, 0.02]
])
X_ml = np.random.randn(n, 20)
msss = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(msss.split(X_ml, y_multilabel))
print("Multi-label stratification (each column = separate label):")
print(f"{'Label':<10} {'Overall':>10} {'Train':>10} {'Test':>10}")
for i in range(5):
print(f"Label {i:<4} {y_multilabel[:, i].mean():>10.3f} "
f"{y_multilabel[train_idx, i].mean():>10.3f} "
f"{y_multilabel[test_idx, i].mean():>10.3f}")
Part 6 - Anti-Leakage Operation Ordering
The single most important rule: train → val → test must be treated as completely separate datasets from the moment you define them.
CORRECT OPERATION ORDERING:
──────────────────────────────────────────────────────────────────
1. Split data (stratified/temporal/group as appropriate)
→ train_df, val_df, test_df
2. Exploratory Data Analysis
→ EDA on TRAIN ONLY
→ Never look at val/test distributions for feature engineering decisions
→ (Knowing val distribution lets you cheat)
3. Preprocessing
→ Fit imputers, scalers, encoders on TRAIN only
→ Apply (transform) to val and test using train statistics
4. Feature engineering
→ Compute all statistics (mean encoding, interaction terms) on TRAIN only
→ Apply to val and test
5. Model training
→ Fit model on train
6. Validation
→ Evaluate on val
→ Use val score to select hyperparameters, architecture
→ Iterate: modify model → retrain → re-evaluate on val
→ This is fine - val is for development decisions
7. Freeze everything
→ Once final model is selected, NO more changes
8. Final evaluation
→ ONE evaluation on test
→ Report this number
→ If you're unhappy and modify the model → test is now contaminated
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
"""
The sklearn Pipeline is the primary anti-leakage tool.
Everything that involves fitting goes into the pipeline.
The pipeline ensures: fit on train, transform on val/test.
"""
def build_evaluation_pipeline():
"""
Production-grade pipeline that prevents preprocessing leakage.
"""
return Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()),
('model', LogisticRegression(max_iter=500, random_state=42))
])
# Correct workflow
X_train, X_val, X_test, y_train, y_val, y_test = (
np.random.randn(800, 10), np.random.randn(100, 10), np.random.randn(100, 10),
np.random.randint(0, 2, 800), np.random.randint(0, 2, 100), np.random.randint(0, 2, 100)
)
pipe = build_evaluation_pipeline()
# Training: fit on train
pipe.fit(X_train, y_train)
# Validation: only transform (no fit)
val_acc = pipe.score(X_val, y_val)
print(f"Validation accuracy: {val_acc:.4f}")
# Hyperparameter tuning (repeat above with different model params)
# After deciding on final model...
# Test: ONE final evaluation
test_acc = pipe.score(X_test, y_test)
print(f"Test accuracy (FINAL - never touched before): {test_acc:.4f}")
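The "repeat above with different model params" step can be made explicit. The sketch below is one possible shape for it, not a prescribed recipe: it assumes synthetic data from `make_classification`, a hypothetical helper `make_pipe`, and a small grid of `C` values. Every candidate is fitted on train and compared on val; the test set is scored a single time after the choice is frozen.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption for the demo), split 70 / 15 / 15
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           n_clusters_per_class=1, random_state=0)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15 / 0.85, stratify=y_trainval, random_state=1)

def make_pipe(C: float) -> Pipeline:
    """Fresh pipeline per candidate so no fitted state is shared between runs."""
    return Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(C=C, max_iter=500)),
    ])

# Development loop: fit on train, compare on val only
candidates = [0.01, 0.1, 1.0, 10.0]
val_scores = {C: make_pipe(C).fit(X_train, y_train).score(X_val, y_val) for C in candidates}
best_C = max(val_scores, key=val_scores.get)
print("Validation scores:", {C: round(s, 4) for C, s in val_scores.items()})

# Freeze the choice, then evaluate on test exactly once
final_pipe = make_pipe(best_C).fit(X_train, y_train)
test_acc = final_pipe.score(X_test, y_test)
print(f"Chosen C={best_C}, test accuracy (single look): {test_acc:.4f}")
```

Refitting the final pipeline on train plus val before the test evaluation is a common variant; either way, the test score is computed after `best_C` is fixed.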
Part 7 - Production Evaluation Protocol
In production, data accumulates over time. The test set you used last year is now "old news" - the model has been retrained on it. How do you maintain an honest evaluation protocol?
PRODUCTION EVALUATION LIFECYCLE:
T=0: Model v1 deployed
Train: Jan-Sep 2023
Val: Oct 2023
Test: Nov-Dec 2023 ← hold-out at deployment time
T=6months: Retraining model v2
Data available: Jan 2023 - Jun 2024
OLD test set (Nov-Dec 2023) is now in retraining scope
→ Retrain on Jan 2023 - Apr 2024
→ Create NEW hold-out: May-Jun 2024
→ Evaluate v2 on new hold-out
RULES:
1. The hold-out window slides forward with retraining
2. Past test sets can be included in training when model is updated
3. Report metrics on the CURRENT hold-out, not historical ones
4. Always maintain a shadow test set from the most recent period
class ProductionEvaluationProtocol:
"""
Manages the test set lifecycle in a production ML system.
Ensures honest evaluation as the model is retrained over time.
"""
def __init__(self, holdout_period_days: int = 60):
self.holdout_period_days = holdout_period_days
self.evaluation_history = []
def get_split_dates(self, current_date: str, training_cutoff: str):
"""
Given current date and training cutoff, compute holdout window.
Example:
current_date: 2024-06-01
training_cutoff: 2024-04-01
holdout_window: 2024-04-01 to 2024-06-01 (61 days)
"""
import pandas as pd
cutoff = pd.Timestamp(training_cutoff)
now = pd.Timestamp(current_date)
return {
'train_end': cutoff,
'test_start': cutoff,
'test_end': now,
'test_days': (now - cutoff).days,
}
def record_evaluation(self, version: str, metric: str, score: float,
test_start: str, test_end: str, n_test: int):
self.evaluation_history.append({
'version': version, 'metric': metric, 'score': score,
'test_start': test_start, 'test_end': test_end, 'n_test': n_test
})
def summarize(self):
import pandas as pd
if not self.evaluation_history:
return "No evaluations recorded."
df = pd.DataFrame(self.evaluation_history)
print("\nProduction Evaluation History:")
print(df.to_string(index=False))
return df
# Usage
protocol = ProductionEvaluationProtocol(holdout_period_days=60)
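# Each version gets its own training cutoff so the hold-out window slides forward
for v, cutoff, date, score in [
    ('v1', '2023-11-01', '2024-01-01', 0.89),
    ('v2', '2024-01-01', '2024-03-01', 0.91),
    ('v3', '2024-03-01', '2024-05-01', 0.88),
]:
    info = protocol.get_split_dates(date, cutoff)
    protocol.record_evaluation(
        version=v, metric='accuracy', score=score,  # illustrative scores
        test_start=str(info['test_start'].date()),
        test_end=str(info['test_end'].date()),  # was train_end, which made every window empty
        n_test=int(np.random.randint(1000, 5000))  # placeholder test-set size
    )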
protocol.summarize()
Part 8 - Split Strategy Decision Flowchart
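The flowchart itself is not reproduced in this text. As a rough stand-in, the sketch below encodes the decision order suggested by Parts 2 to 5: dataset size first, then time ordering, then entity grouping, then class imbalance. The function name, its arguments, and the 1,000-sample threshold (borrowed from the ratio table) are assumptions made for illustration, not a reconstruction of the original figure.

```python
from typing import List

def recommend_split(has_timestamp: bool, has_groups: bool,
                    is_imbalanced: bool, n_samples: int) -> List[str]:
    """Return the split considerations that apply, in the order to address them."""
    steps: List[str] = []
    if n_samples < 1000:
        steps.append("small dataset: prefer (repeated) k-fold over a single three-way split")
    if has_timestamp:
        steps.append("temporal split: train before val before test, with a gap for lag features")
    if has_groups:
        steps.append("group split: keep every entity in exactly one partition")
    if is_imbalanced and not has_timestamp:
        steps.append("stratified split: preserve class ratios in every partition")
    if not steps:
        steps.append("plain random three-way split")
    return steps

# Example: multi-transaction fraud data with timestamps and a rare positive class
for step in recommend_split(has_timestamp=True, has_groups=True,
                            is_imbalanced=True, n_samples=250_000):
    print("-", step)
```

Stratification is skipped when a timestamp is present because a strict chronological cut cannot also guarantee class ratios; in that case the class balance of each window should be checked rather than enforced.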
YouTube Resources
| Video | Channel | Focus |
|---|---|---|
| Train/Test Split in Python | Sentdex | sklearn train_test_split |
| Data Leakage in Machine Learning | Krish Naik | Leakage types and detection |
| How to Avoid Data Leakage | ritvikmath | Preprocessing pipeline design |
| Time Series Train-Test Split | StatQuest | Temporal split best practices |
| Cross Validation vs Train-Test Split | Data School | When to use each |
Interview Questions
Q1: What is the difference between a validation set and a test set?
The validation set is used during model development for hyperparameter tuning, architecture selection, feature engineering decisions, and early stopping. Every time you use the validation score to make a modelling decision, you're implicitly optimising for that partition - so it becomes "contaminated" in the sense that your hyperparameters and design choices have been selected to perform well on it, even though the model's weights were never fitted to it directly. The test set is touched exactly once - after all decisions are frozen - to get an unbiased estimate of generalisation. If you look at the test score, dislike it, and modify the model, the test set is now a second validation set and your final metric is no longer unbiased.
Q2: Describe six types of data leakage you've encountered or know about.
(1) Preprocessing leakage: fitting scalers/imputers on the full dataset including test data, so test statistics influence preprocessing. (2) Feature selection leakage: selecting features using all data's correlation with the target, so test-set label information informs which features are kept. (3) Target leakage: a feature that is causally downstream of the target - it's only observed after the target event occurs (e.g., "days_overdue" for loan default prediction). (4) Group leakage: samples from the same entity (patient, user, document) appear in both train and test; the model learns entity-specific patterns that don't generalise. (5) Temporal leakage: future data appears in training (random splits on time-series), or features computed using future aggregates. (6) Near-duplicate leakage: near-identical records in both train and test, causing memorisation rather than generalisation.
Q3: Your model achieves 98% accuracy on the validation set. On deployment, performance drops to 72%. List the possible causes in order of likelihood.
(1) Data leakage - the most common cause of this magnitude of drop; validation set shares information with training through preprocessing, group membership, or target leakage. (2) Distribution shift - the deployment data comes from a different distribution than training (different time period, different user population, different geography). (3) Overfitting to the validation set - hyperparameters were tuned excessively on the validation set; the model doesn't generalise beyond the validation distribution. (4) Train-test mismatch - the data collection process differs between training and production (e.g., training data was cleaned/filtered, production data is raw). (5) Concept drift - the relationship between features and target has changed since data collection. I'd investigate in this order: check for leakage first (highest impact and usually the easiest to detect), then compare feature distributions between training and production data.
Q4: How would you split data for a fraud detection model where each user can have multiple transactions?
User-level splitting using GroupShuffleSplit with user ID as the group key. The concern is that random transaction-level splitting leaks user-level patterns: if 9 of a user's 10 transactions are in training and 1 in test, the model can learn that user's baseline spending pattern, identity features, and behaviour history - then trivially recognise it in the test transaction. In production, you'll encounter completely new users. User-level splitting ensures the validation and test sets contain only users not seen during training. If the fraud patterns are also time-dependent, I'd add a temporal constraint on top (train on earlier transactions, evaluate on later ones) using a group-aware time-series splitter. scikit-learn does not ship a GroupTimeSeriesSplit, so this means a third-party or custom implementation.
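A minimal sketch of the user-level split from Q4, assuming synthetic transactions and a hypothetical helper `group_three_way_split` that applies `GroupShuffleSplit` twice (first to carve off test users, then to separate validation users from the remainder). Column names and sizes are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_tx = 5000
tx = pd.DataFrame({
    "user_id": rng.integers(0, 500, n_tx),          # ~10 transactions per user
    "amount": rng.exponential(50.0, n_tx),
    "is_fraud": (rng.random(n_tx) < 0.02).astype(int),
})

def group_three_way_split(df, group_col, val_size=0.15, test_size=0.15, seed=0):
    """Split rows into train/val/test so that no group value spans two partitions."""
    groups = df[group_col].to_numpy()
    # Step 1: hold out test groups
    outer = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    trainval_pos, test_pos = next(outer.split(df, groups=groups))
    # Step 2: split the remaining groups into train and val
    inner = GroupShuffleSplit(n_splits=1, test_size=val_size / (1 - test_size),
                              random_state=seed + 1)
    train_rel, val_rel = next(inner.split(trainval_pos, groups=groups[trainval_pos]))
    return (df.iloc[trainval_pos[train_rel]],
            df.iloc[trainval_pos[val_rel]],
            df.iloc[test_pos])

train_tx, val_tx, test_tx = group_three_way_split(tx, "user_id")

users = [set(part["user_id"]) for part in (train_tx, val_tx, test_tx)]
print(f"rows:  train={len(train_tx)}, val={len(val_tx)}, test={len(test_tx)}")
print(f"users: train={len(users[0])}, val={len(users[1])}, test={len(users[2])}")
print(f"shared users: {len(users[0] & users[1]) + len(users[0] & users[2]) + len(users[1] & users[2])}")
```

The fractions apply to users, not rows, so row counts per partition vary with how active the sampled users are. With a fraud rate this low it is also worth checking that each partition contains enough positive cases.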
Q5: What is temporal leakage and how do you detect it?
Temporal leakage occurs when information from future timestamps is used to predict past events. With random splits on time-series data, this happens automatically: the model may train on t=100 and then be evaluated on t=50, so the future is used to predict the past. Detection: (1) check whether a random split was applied to data that has a timestamp column; (2) look for suspiciously high test accuracy - temporal leakage often inflates it by 10-30%; (3) check for features that include forward-looking information (next-week sales as a feature, the closing price of a trade as a feature). Prevention: always split time-series data by timestamp cutoff, not random row shuffling. Add a gap between training cutoff and test start to prevent leakage from lag-window features.
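The cutoff-plus-gap prevention rule can be sketched in a few lines of standard-library Python. The function name `temporal_split`, the `ts` key, and the 7-day gap are assumptions for illustration; the gap should be at least as long as the longest lag window used in feature engineering.

```python
from datetime import datetime, timedelta

def temporal_split(rows, cutoff, gap=timedelta(days=7), ts_key="ts"):
    """Train on rows strictly before `cutoff`; test on rows at or after
    `cutoff + gap`. Rows that fall inside the gap are dropped."""
    train = [r for r in rows if r[ts_key] < cutoff]
    test = [r for r in rows if r[ts_key] >= cutoff + gap]
    return train, test

# Hypothetical daily records covering 60 days from 2024-01-01.
start = datetime(2024, 1, 1)
rows = [{"ts": start + timedelta(days=d), "value": d} for d in range(60)]

cutoff = datetime(2024, 2, 1)
train, test = temporal_split(rows, cutoff)

print("train rows:", len(train), "latest:", max(r["ts"] for r in train).date())
print("test rows: ", len(test), "earliest:", min(r["ts"] for r in test).date())
```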
Q6: How do you handle the split when you have a multi-level hierarchy in your data (e.g., patients > hospital visits > lab tests)?
The split must be done at the highest level of the hierarchy that matches your deployment scenario. If in production you'll encounter new patients (most common in healthcare), split at the patient level - all visits and lab tests for a patient are in one partition. If instead you'll always predict on existing patients but new visits, you can split at the visit level while keeping the patient's first N visits in training. The danger is splitting at the wrong level: splitting by lab test (lowest level) while patients span partitions causes the model to learn patient-specific baselines that generalise trivially to new lab tests from the same patient but fail on new patients entirely. Always ask: "What is the unit of deployment? What will be new at inference time?"
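One way to guarantee that every visit and lab test follows its patient into the same partition is to derive the partition from a hash of the patient ID. This hashing scheme is an assumption chosen for illustration (the answer above does not prescribe it), and `partition_for`, the record fields, and the 20% fraction are hypothetical.

```python
import hashlib

def partition_for(patient_id, test_fraction=0.2):
    """Deterministically map a patient ID to 'train' or 'test'.
    The realised test fraction is only approximately `test_fraction`."""
    digest = hashlib.sha256(str(patient_id).encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # pseudo-uniform value in [0, 1)
    return "test" if bucket < test_fraction else "train"

# Hypothetical hierarchy: 50 patients x 3 visits x 2 lab tests.
lab_tests = [
    {"patient_id": f"P{p}", "visit_id": f"P{p}-V{v}", "test_id": f"P{p}-V{v}-T{t}"}
    for p in range(50) for v in range(3) for t in range(2)
]

# The partition is decided at the PATIENT level, the top of the hierarchy.
split = {"train": [], "test": []}
for row in lab_tests:
    split[partition_for(row["patient_id"])].append(row)

train_patients = {r["patient_id"] for r in split["train"]}
test_patients = {r["patient_id"] for r in split["test"]}
print("patients in train:", len(train_patients))
print("patients in test: ", len(test_patients))
```

A hash-based assignment stays stable when new patients arrive, which a freshly shuffled split would not.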
Q7: You're building a model on 500 samples with 50 features. How do you split the data?
With n=500 and p=50 (decent ratio), I wouldn't do a three-way split - the resulting partitions would be too small for reliable estimates. A test set of 20% = 100 samples gives a 95% CI for accuracy of up to roughly ±10% (the worst case, at 50% accuracy; about ±6% at 90% accuracy), which is very wide. Instead, I'd use repeated stratified k-Fold (e.g., 5 folds × 10 repeats = 50 evaluations) for all model development. For the final model evaluation, I'd either: (a) use the entire 500 samples for nested CV to get a reliable generalisation estimate, or (b) if a single held-out test set is required (e.g., for regulatory purposes), hold out exactly 100 samples before any model development and use k-Fold on the remaining 400 for development. The held-out 100 gives a CI of up to ~±10%, which should be documented alongside the reported accuracy.
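The confidence-interval figure for a 100-sample test set can be checked with the normal-approximation (Wald) interval for a proportion. Using that approximation is an assumption of this sketch; it is rough for small n or accuracies near 0 or 1, and the function name is hypothetical.

```python
import math

def accuracy_ci_half_width(accuracy, n, z=1.96):
    """Half-width of the normal-approximation 95% CI for an accuracy
    measured on n independent test samples."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)

for acc in (0.5, 0.8, 0.9):
    hw = accuracy_ci_half_width(acc, n=100)
    print(f"accuracy={acc:.2f}, n=100 -> +/-{hw:.3f}")

# Same accuracy, larger test set: the interval shrinks with sqrt(n).
print(f"accuracy=0.90, n=1000 -> +/-{accuracy_ci_half_width(0.9, 1000):.3f}")
```

The worst case sits at 50% accuracy (about ±0.098 for n=100); higher accuracies give a narrower interval on the same test set.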
Q8: How does the test set "expire" in a production ML system, and how do you handle it?
When a model is retrained on new data, the old test set is typically included in the new training set - it's no longer held out. The evaluation must shift to a new holdout window from after the retraining cutoff. This means your reported test accuracy changes not because the model changed, but because the holdout period changed. Best practice: maintain a rolling holdout window (e.g., always evaluate on the most recent 60 days of data), document which periods were used as holdout for each model version, and compare model versions fairly by evaluating them on the same holdout window (even if one model didn't train on that period). Shadow testing - running old and new models in parallel on new production traffic - is the gold standard for production comparison.
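The "same holdout window for every model version" practice can be sketched as follows. Everything here is hypothetical: the record layout, the two stand-in models (plain functions), and the 60-day window are assumptions used only to show the mechanics.

```python
from datetime import date, timedelta

def holdout_window(as_of, days=60):
    """Return the [start, end) dates of the most recent `days` days."""
    return as_of - timedelta(days=days), as_of

def accuracy_on_window(predict, records, start, end):
    """Accuracy of `predict` on records whose day falls in [start, end)."""
    window = [r for r in records if start <= r["day"] < end]
    if not window:
        return None
    correct = sum(predict(r["x"]) == r["y"] for r in window)
    return correct / len(window)

# Hypothetical data: 120 daily records; the label flips to 1 from day 90 on.
first_day = date(2024, 1, 1)
records = [
    {"day": first_day + timedelta(days=d), "x": d, "y": int(d >= 90)}
    for d in range(120)
]

# Stand-in model versions.
def model_v1(x):
    return 0                 # older model: never predicts the new pattern

def model_v2(x):
    return int(x >= 90)      # newer model: has learned the new pattern

start, end = holdout_window(as_of=first_day + timedelta(days=120), days=60)
scores = {
    name: accuracy_on_window(model, records, start, end)
    for name, model in {"v1": model_v1, "v2": model_v2}.items()
}
print(scores)
```

Recording `start` and `end` next to each score is what makes later comparisons between versions reproducible.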
Q9: What is distribution shift and how does it affect your choice of split strategy?
Distribution shift occurs when the distribution of production data differs from that of the training data: $P_{\text{prod}}(X, Y) \neq P_{\text{train}}(X, Y)$. Types: (1) covariate shift - $P(X)$ changes but $P(Y \mid X)$ stays the same (user demographics shift, but given demographics, churn rate is stable); (2) concept drift - $P(Y \mid X)$ changes (the relationship between features and target evolves over time); (3) label shift - $P(Y)$ changes (prevalence of a condition increases). For choosing split strategy: if covariate shift is expected (a known new user segment), include representative samples in the test set; if concept drift is expected (seasonal patterns), use temporal splitting so the test set represents the most recent period, matching deployment conditions. When evaluating, weight recent data more heavily than old data. Monitor production metrics separately from offline test metrics.
:::tip Role-Specific Angles
- MLE Interview: Data leakage taxonomy, anti-leakage Pipeline design, temporal vs group splits, nested CV for evaluation
- Data Scientist: Stratified splits for imbalanced classes, validation vs test purpose, production evaluation lifecycle
- MLOps Interview: Production evaluation protocol, test set expiration, rolling holdout windows, shadow testing
- AI Engineer: Embedding-based near-duplicate detection, distribution shift detection, feature leakage in feature stores
:::
:::tip 🎮 Interactive Playground
Visualize this concept: Try the K-Fold Cross-Validation demo on the EngineersOfAI Playground - no code required.
:::
