Handling Imbalanced Data - When 99% Accuracy Means Nothing
Reading time: ~28 min | Interview relevance: Critical | Roles: MLE, Data Scientist, AI Engineer, Applied Scientist
The Real Interview Moment
The interviewer slides a classification report across the table: "Your fraud detection model has 99.7% accuracy. The product team is thrilled. But I'm not. Tell me why."
You glance at the report. Of 100,000 transactions, only 300 are fraudulent (0.3%). The model predicts "not fraud" for every single transaction. It achieves 99.7% accuracy by doing absolutely nothing useful. This is the class imbalance trap, and it catches an alarming number of candidates who focus on accuracy without asking about the data distribution first.
The interviewer is testing whether you understand that evaluation metrics, resampling strategies, and loss functions all need to be rethought when classes are imbalanced - and whether you can reason about the business cost of different types of errors.
What You Will Master
- Why accuracy is meaningless for imbalanced data and what to use instead
- Resampling strategies: SMOTE, ADASYN, random oversampling/undersampling
- Algorithmic approaches: class weights, focal loss, cost-sensitive learning
- Threshold tuning and operating point selection
- When resampling helps vs. hurts performance
- Production considerations: calibration after resampling, monitoring in deployment
- How to structure a complete imbalance-handling answer in interviews
Self-Assessment: Where Are You Now?
| Level | Description | Target |
|---|---|---|
| Beginner | "I know you can oversample or undersample" | Read all parts carefully |
| Intermediate | "I know SMOTE and class weights but unsure when each is best" | Focus on Parts 2-3 and the decision flowchart |
| Advanced | "I can handle imbalance but want to nail the production and calibration details" | Jump to Part 3, practice problems, and cheat sheet |
Part 1 - Understanding the Problem
What Makes Data Imbalanced?
Class imbalance occurs when one class significantly outnumbers others:
| Imbalance Ratio | Example | Severity |
|---|---|---|
| 1:10 | Customer churn prediction | Mild |
| 1:100 | Fraud detection | Moderate |
| 1:1,000 | Manufacturing defect detection | Severe |
| 1:10,000 | Rare disease diagnosis | Extreme |
| 1:100,000 | Network intrusion detection | Extreme |
"Imbalanced data is when one class vastly outnumbers the other. The core problem isn't the imbalance itself - it's that standard algorithms and metrics assume balanced classes. Accuracy becomes meaningless because predicting the majority class always achieves a high score. You need to fix three things: the evaluation metric (use PR-AUC, not accuracy), the training signal (class weights or resampling), and the decision threshold (don't default to 0.5). The right approach depends on the imbalance ratio, dataset size, and the business cost of each error type."
Why Standard Algorithms Fail
Most ML algorithms minimize an average loss over all training samples, which in practice rewards overall accuracy. With 99:1 imbalance:
- Loss contribution: Every sample counts equally, so the majority class as a whole supplies roughly 99x more gradient signal than the minority class
- Decision boundary: The model learns to place the boundary to minimize total errors, which means classifying everything as majority
- Probability estimates: The model learns the base rate (0.01) as the default prediction, making it hard to distinguish minority samples
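The base-rate effect can be reproduced with a minimal sketch: a logistic model that has only a bias term, fit by gradient descent on 9,900 negatives and 100 positives. The function name, learning rate, and step count are illustrative assumptions, not part of any library.

```python
import math

def bias_only_fit(n_neg, n_pos, lr=5.0, steps=5000):
    """Fit a bias-only logistic model by gradient descent on the mean log loss."""
    b = 0.0
    n = n_neg + n_pos
    for _ in range(steps):
        p = 1.0 / (1.0 + math.exp(-b))
        # Gradient of the mean log loss w.r.t. the bias is mean(p - y)
        grad = (n_neg * p + n_pos * (p - 1.0)) / n
        b -= lr * grad
    return 1.0 / (1.0 + math.exp(-b))

p_hat = bias_only_fit(n_neg=9900, n_pos=100)
# At convergence the summed gradient from each class cancels out
pull_neg = 9900 * p_hat
pull_pos = 100 * (p_hat - 1.0)
print(f"learned default probability: {p_hat:.4f}")
print(f"summed gradient: negatives {pull_neg:+.2f}, positives {pull_pos:+.2f}")
```

With no features to separate the classes, the only stable point is the one where 9,900 small pushes from the negatives cancel 100 large pushes from the positives, which is the base rate.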
Why Accuracy Fails: The Metrics Problem
Confusion Matrix for "predict all negative" on 1:99 imbalanced data:
| | Predicted Neg | Predicted Pos |
|---|---|---|
| Actual Neg | 9,900 | 0 |
| Actual Pos | 100 | 0 |
Accuracy: 99.0% ← looks great!
Precision: 0/0 ← undefined (no positive predictions)
Recall: 0/100 ← 0% (missed every fraud case)
F1: 0 ← reveals the failure
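The numbers above can be checked with a small helper. This is a sketch with a hypothetical function name; it returns `None` where a metric is undefined and uses the count form of F1, which is well defined (and zero) for this matrix.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from raw confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision is undefined when the model makes no positive predictions
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    # Count form of F1: 2TP / (2TP + FP + FN)
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

all_negative = confusion_metrics(tp=0, fp=0, fn=100, tn=9900)
print(all_negative)
```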
Metrics that work for imbalanced data:
| Metric | Formula | When to Use |
|---|---|---|
| Precision | TP / (TP + FP) | When false positives are costly (spam filter) |
| Recall (Sensitivity) | TP / (TP + FN) | When false negatives are costly (cancer screening) |
| F1 Score | 2 * P * R / (P + R) | When you need to balance precision and recall |
| F-beta | (1 + beta^2) * P * R / (beta^2 * P + R) | When you want to weight recall higher (beta > 1) |
| PR-AUC | Area under Precision-Recall curve | Threshold-independent; sensitive to minority class |
| ROC-AUC | Area under ROC curve | Threshold-independent; can be misleading for severe imbalance |
| Average Precision | Weighted mean of precisions at each threshold | Similar to PR-AUC, often preferred |
| Matthews Correlation Coefficient | (TP * TN - FP * FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)) | Single number that accounts for all four quadrants |
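Two of the less familiar rows, F-beta and MCC, are easy to compute from counts. The sketch below uses hypothetical function names, the count form of F-beta (algebraically the same as the precision/recall form in the table), and the standard MCC definition; returning 0.0 for an empty denominator is a convention assumed here.

```python
import math

def f_beta(tp, fp, fn, beta=1.0):
    """F-beta from counts: (1 + b^2) TP / ((1 + b^2) TP + b^2 FN + FP)."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; 0.0 when any marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Illustrative counts: 80 TP, 200 FP, 20 FN, 9,800 TN
print(f"F1  = {f_beta(80, 200, 20, beta=1):.3f}")
print(f"F2  = {f_beta(80, 200, 20, beta=2):.3f}")  # beta > 1 rewards the high recall
print(f"MCC = {mcc(80, 200, 20, 9800):.3f}")
```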
"I'd use ROC-AUC for imbalanced data." ROC-AUC can be misleadingly high with severe imbalance because the false positive rate (FP/N) stays small when N is huge. A model with 1000 false positives out of 99,000 negatives has FPR = 0.01, which looks great on the ROC curve but is disastrous if you're flagging 1000 legitimate transactions. PR-AUC is more informative because precision directly penalizes false positives relative to true positives.
PR-AUC vs ROC-AUC - When they disagree:
Scenario: 10,000 negatives, 100 positives
Model A: TP=80, FP=200, FN=20, TN=9800
ROC-AUC: 0.96 (looks great - FPR is only 2%)
PR-AUC: 0.35 (reality check - only 80/(80+200) = 28.6% precision)
Model B: TP=70, FP=30, FN=30, TN=9970
ROC-AUC: 0.94 (slightly worse)
PR-AUC: 0.68 (much better - 70/(70+30) = 70% precision)
Model B is clearly better for production, but ROC-AUC would mislead you.
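The operating-point rates behind this comparison can be recomputed directly from the four counts. Note the hedge: a single confusion matrix gives one point on each curve, so this sketch (with a hypothetical helper name) reproduces the FPR and precision figures, not the AUC values.

```python
def rates(tp, fp, fn, tn):
    """True positive rate, false positive rate and precision at one operating point."""
    return {
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "precision": tp / (tp + fp),
    }

model_a = rates(tp=80, fp=200, fn=20, tn=9800)
model_b = rates(tp=70, fp=30, fn=30, tn=9970)
for name, r in (("A", model_a), ("B", model_b)):
    print(f"Model {name}: TPR={r['tpr']:.2f}  FPR={r['fpr']:.3f}  precision={r['precision']:.3f}")
```

Both FPR values look tiny on a ROC axis (0.020 vs 0.003), while precision separates the two models clearly (0.286 vs 0.700).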
Part 2 - Techniques for Handling Imbalance
The Imbalance Handling Decision Flowchart
Resampling Strategies
Random Oversampling
Duplicate minority class samples randomly.
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)
Pros: Simple, no information loss, works with any model
Cons: Exact duplicates can cause overfitting; the model may memorize minority samples instead of learning patterns
Random Undersampling
Remove majority class samples randomly.
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
Pros: Reduces training time and memory use on large datasets
Cons: Throws away potentially valuable majority class data; high variance if dataset is small
SMOTE (Synthetic Minority Oversampling TEchnique)
Instead of duplicating, SMOTE creates synthetic minority samples by interpolating between existing ones.
Algorithm:
- For each minority sample, find its k nearest minority neighbors (default k=5)
- Randomly choose one neighbor
- Create a new sample at a random point on the line segment between the sample and its neighbor
Original minority sample: x = [2.0, 3.0]
Nearest neighbor: n = [4.0, 5.0]
Random factor (0-1): lambda = 0.4
Synthetic sample: x_new = x + lambda * (n - x)
= [2.0, 3.0] + 0.4 * [2.0, 2.0]
= [2.8, 3.8]
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42, k_neighbors=5)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
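To make the interpolation step concrete, here is a dependency-free sketch of the idea. It is not the imbalanced-learn implementation: the function names are hypothetical, and it picks base samples at random rather than iterating over every minority point.

```python
import math
import random

def interpolate(x, n, lam):
    """Point at fraction lam along the segment from x to n."""
    return [xi + lam * (ni - xi) for xi, ni in zip(x, n)]

def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating toward minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest minority neighbors of x (excluding x itself), Euclidean distance
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        neighbor = rng.choice(neighbors)
        synthetic.append(interpolate(x, neighbor, rng.random()))
    return synthetic

# The worked example above: x = [2, 3], n = [4, 5], lambda = 0.4
print(interpolate([2.0, 3.0], [4.0, 5.0], 0.4))

minority = [[2.0, 3.0], [4.0, 5.0], [3.0, 3.5], [2.5, 4.0]]
new_points = smote_sketch(minority, n_new=20, k=2)
```

Every synthetic point lies on a segment between two real minority points, which is exactly why the method struggles when majority samples sit between them.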
When a candidate mentions SMOTE, I immediately probe: "What are the assumptions SMOTE makes about the data?" The answer I'm looking for: SMOTE assumes that the region between minority samples belongs to the minority class. This fails when classes overlap or when the minority class has multiple disconnected clusters. Candidates who know this show genuine understanding, not just API familiarity.
SMOTE variants:
| Variant | How It Works | When to Use |
|---|---|---|
| SMOTE | Interpolates between any minority neighbors | General use |
| Borderline-SMOTE | Only oversamples minority points near the decision boundary | When boundary samples matter most |
| SVM-SMOTE | Uses SVM support vectors to guide oversampling | When you want to reinforce the decision boundary |
| SMOTE-Tomek | SMOTE + removes Tomek links (overlapping boundary samples) | When you want cleaner boundaries |
| SMOTE-ENN | SMOTE + Edited Nearest Neighbors cleanup | More aggressive noise removal than Tomek |
ADASYN (Adaptive Synthetic Sampling)
Like SMOTE, but generates more synthetic samples for minority instances that are harder to learn (near the decision boundary) and fewer for those in dense, easy-to-classify regions.
from imblearn.over_sampling import ADASYN
adasyn = ADASYN(random_state=42)
X_resampled, y_resampled = adasyn.fit_resample(X_train, y_train)
Key difference from SMOTE: ADASYN adapts the density of synthetic samples based on difficulty. In regions where minority samples are surrounded by majority samples (hard to classify), it generates more synthetic data. This focuses the model on the hardest examples.
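The adaptive part can be sketched as an allocation step: each minority point gets a share of the synthetic budget proportional to the fraction of majority points among its k nearest neighbors. This is a simplified illustration with a hypothetical function name and toy data, not the imbalanced-learn implementation.

```python
import math

def adasyn_allocation(X, y, n_new, k=5, minority_label=1):
    """Return {minority index: number of synthetic samples to generate from it}."""
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    ratios = []
    for i in minority_idx:
        # k nearest neighbors among ALL other points, regardless of class
        order = sorted((j for j in range(len(X)) if j != i),
                       key=lambda j: math.dist(X[i], X[j]))
        majority_neighbors = sum(1 for j in order[:k] if y[j] != minority_label)
        ratios.append(majority_neighbors / k)
    total = sum(ratios)
    if total == 0:
        # No minority point has majority neighbors: nothing is "hard"
        return {i: 0 for i in minority_idx}
    return {i: round(n_new * r / total) for i, r in zip(minority_idx, ratios)}

# Toy data: four majority points, a tight minority cluster far away,
# and one minority point (index 8) sitting in the middle of the majority
X = [[0, 0], [0, 1], [1, 0], [1, 1],
     [10, 10], [10, 11], [11, 10], [11, 11],
     [0.5, 0.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]
allocation = adasyn_allocation(X, y, n_new=10, k=3)
print(allocation)
```

In this toy setup the whole budget goes to the borderline point, while the easy cluster receives none; plain SMOTE would spread samples across all five minority points.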
"I always use SMOTE because it's better than random oversampling." SMOTE isn't universally better. It can create noisy synthetic samples in overlapping class regions, degrading performance. For high-dimensional sparse data (text, genomics), SMOTE's nearest-neighbor assumption often fails because distances become unreliable in high dimensions. Always compare against simpler baselines.
When Resampling Helps vs. Hurts
Summary table:
| Scenario | Resampling | Why |
|---|---|---|
| Clear class separation, low dimensions | Helps | Synthetic samples fill in sensible regions |
| Overlapping classes | Hurts | Synthetic samples add noise in ambiguous regions |
| High-dimensional sparse data | Usually hurts | Distance metrics unreliable; neighbors may be irrelevant |
| Very few minority samples (<20) | Risky | Not enough neighbors for meaningful interpolation |
| Tree-based models | Often unnecessary | Trees handle imbalance naturally with class weights |
| Deep learning | Varies | Focal loss or class weights usually better than resampling |
Algorithmic Approaches
Class Weights
Scale the loss function so that minority class errors are penalized more heavily.
# Scikit-learn: automatic weight computation
from sklearn.ensemble import RandomForestClassifier
# weight_i = n_samples / (n_classes * n_samples_i)
# For 1:99 ratio: class 0 gets weight ~0.5, class 1 gets weight ~50
model = RandomForestClassifier(class_weight='balanced')
# Manual weights
model = RandomForestClassifier(class_weight={0: 1, 1: 99})
How it works mathematically:
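Standard loss:
$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$$
Weighted loss:
$$L_w = -\frac{1}{N} \sum_{i=1}^{N} \left[ w_1 \, y_i \log p_i + w_0 \, (1 - y_i) \log (1 - p_i) \right]$$
Where $w_1 = N / (2 N_1)$ and $w_0 = N / (2 N_0)$ for binary classification with balanced weights ($N_1$ and $N_0$ are the numbers of positive and negative samples).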
Effect: The gradient from each minority sample is amplified by the weight ratio, giving minority samples proportionally more influence on the model update.
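A short sketch of the arithmetic, using the `balanced` formula from the code comment above (n_samples / (n_classes * n_samples_i)). The function names are illustrative, and the predictions are a hypothetical model that outputs the base rate for everything.

```python
import math

def balanced_weights(n_neg, n_pos):
    """Binary 'balanced' weights: n_samples / (2 * n_samples_in_class)."""
    n = n_neg + n_pos
    return {0: n / (2 * n_neg), 1: n / (2 * n_pos)}

def weighted_log_loss(y_true, p_pred, weights):
    """Mean log loss with a per-class multiplier on each sample's term."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        term = math.log(p) if y == 1 else math.log(1.0 - p)
        total += -weights[y] * term
    return total / len(y_true)

w = balanced_weights(n_neg=99, n_pos=1)
y_true = [0] * 99 + [1]
p_pred = [0.01] * 100  # model that predicts the base rate everywhere

plain = weighted_log_loss(y_true, p_pred, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(y_true, p_pred, w)
print(f"weights: {w}")
print(f"unweighted loss: {plain:.3f}, weighted loss: {weighted:.3f}")
```

The total weight of each class comes out equal (99 * 0.505 ≈ 1 * 50), so the single missed positive now dominates the loss instead of being averaged away.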
"Class weights vs. oversampling - what's the difference in practice?" The mathematical effect is similar for linear models (weighting the loss is equivalent to duplicating samples). For non-linear models (trees, neural nets), they can differ because oversampling changes the data distribution that the model sees during training (affecting splits, batch composition), while class weights only change the gradient magnitude.
Focal Loss
Introduced in the RetinaNet paper for object detection. Focal loss down-weights easy examples and focuses training on hard, misclassified examples.
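Standard cross-entropy:
$$\mathrm{CE}(p_t) = -\log(p_t)$$
Focal loss:
$$\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$$
Where $p_t$ is the predicted probability of the true class and $\gamma$ is the focusing parameter (typically 2).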
When p_t = 0.9 (easy example, correctly classified):
CE loss: -log(0.9) = 0.105
Focal loss: -(0.1)^2 * log(0.9) = 0.001 ← 100x smaller!
When p_t = 0.1 (hard example, misclassified):
CE loss: -log(0.1) = 2.303
Focal loss: -(0.9)^2 * log(0.1) = 1.865 ← similar magnitude
import torch
import torch.nn.functional as F
def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """
    alpha: weight for the positive class (negatives get 1 - alpha);
           0.25 is the RetinaNet default tuned together with gamma=2,
           values above 0.5 up-weight the positive class
    gamma: focusing parameter (0 = standard CE, 2 = typical)
    """
    bce_loss = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    probs = torch.sigmoid(logits)
    # p_t is the predicted probability of the true class
    p_t = probs * targets + (1 - probs) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    focal_weight = alpha_t * (1 - p_t) ** gamma
    loss = focal_weight * bce_loss
    return loss.mean()
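The easy-versus-hard comparison can be checked without a deep learning framework. This is a minimal sketch of the unweighted focal term only (no alpha), with hypothetical function names.

```python
import math

def cross_entropy(p_t):
    """Cross-entropy for a sample whose true class got probability p_t."""
    return -math.log(p_t)

def focal(p_t, gamma=2.0):
    """Focal loss without the alpha term: (1 - p_t)^gamma scales the CE loss."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

for p_t in (0.9, 0.1):
    ce, fl = cross_entropy(p_t), focal(p_t)
    print(f"p_t={p_t}: CE={ce:.3f}  focal={fl:.4f}  reduction={ce / fl:.1f}x")
```

The reduction factor is simply 1 / (1 - p_t)^gamma: about 100x for the easy example and about 1.2x for the hard one, and setting gamma to 0 recovers plain cross-entropy.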
When to use focal loss:
- Deep learning models with severe imbalance
- Object detection (many background patches, few objects)
- When the model quickly achieves high accuracy on the majority class but struggles with the minority
Cost-Sensitive Learning
Generalize beyond class weights to define a cost matrix that captures the business impact of each error type.
| | Predicted Legit | Predicted Fraud |
|---|---|---|
| Actual Legit | $0 | $10 (investigation cost) |
| Actual Fraud | $1,000 (loss) | $0 |

The cost of missing fraud ($1,000) is 100x the cost of a false alarm ($10). This matrix directly informs:
- Class weights: weight_fraud = 100 * weight_legit
- Threshold: lower the threshold to catch more fraud, accepting more false alarms
- Business decisions: how many analysts to staff for reviewing flags
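If the model's probabilities are well calibrated and correct decisions cost nothing (both assumptions, as in the matrix above), the cost matrix implies a break-even threshold directly: flag a transaction when the expected loss from passing it exceeds the expected cost of investigating it. A minimal sketch with hypothetical function names:

```python
def break_even_threshold(cost_fp, cost_fn):
    """Probability above which flagging is cheaper in expectation than passing."""
    # Flag when p * cost_fn > (1 - p) * cost_fp  =>  p > cost_fp / (cost_fp + cost_fn)
    return cost_fp / (cost_fp + cost_fn)

def expected_costs(p_fraud, cost_fp, cost_fn):
    """Expected cost of each action for one transaction with fraud probability p_fraud."""
    return {"flag": (1 - p_fraud) * cost_fp, "pass": p_fraud * cost_fn}

threshold = break_even_threshold(cost_fp=10, cost_fn=1000)
print(f"break-even threshold: {threshold:.4f}")
print(expected_costs(0.02, cost_fp=10, cost_fn=1000))   # flagging is cheaper
print(expected_costs(0.005, cost_fp=10, cost_fn=1000))  # passing is cheaper
```

With these costs the threshold is about 0.0099, far below 0.5. When probabilities are not calibrated, an empirical sweep over thresholds on held-out data is the safer route.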
# Cost-sensitive threshold tuning (tune on validation data, never on the test set)
from sklearn.metrics import precision_recall_curve
y_probs = model.predict_proba(X_val)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_val, y_probs)
# Find threshold that minimizes expected cost
cost_fn = 1000  # Cost of false negative (missed fraud)
cost_fp = 10    # Cost of false positive (investigation)
best_threshold = 0.5
min_cost = float('inf')
for threshold in thresholds:
    y_pred = (y_probs >= threshold).astype(int)
    fn = ((y_pred == 0) & (y_val == 1)).sum()
    fp = ((y_pred == 1) & (y_val == 0)).sum()
    total_cost = fn * cost_fn + fp * cost_fp
    if total_cost < min_cost:
        min_cost = total_cost
        best_threshold = threshold
print(f"Optimal threshold: {best_threshold:.3f}")
print(f"Minimum expected cost: ${min_cost:,.0f}")
Threshold Tuning
Most classifiers default to a threshold of 0.5 for binary classification. With imbalanced data, this is almost always suboptimal.
How to tune the threshold:
- Train your model and get probability predictions on the validation set
- Sweep thresholds from 0 to 1
- Compute your target metric (F1, F-beta, cost) at each threshold
- Select the threshold that optimizes your metric
- Apply this threshold to the test set
from sklearn.metrics import f1_score
import numpy as np
y_probs = model.predict_proba(X_val)[:, 1]
thresholds = np.arange(0.01, 1.0, 0.01)
f1_scores = []
for t in thresholds:
    y_pred = (y_probs >= t).astype(int)
    f1_scores.append(f1_score(y_val, y_pred))
optimal_threshold = thresholds[np.argmax(f1_scores)]
print(f"Optimal threshold: {optimal_threshold:.2f}")
print(f"Best F1: {max(f1_scores):.3f}")
# Apply to test set
y_test_pred = (model.predict_proba(X_test)[:, 1] >= optimal_threshold).astype(int)
"I always use 0.5 as the classification threshold." For imbalanced data, 0.5 almost always produces a model that predicts the majority class for most samples. The optimal threshold is often much lower (e.g., 0.1 or 0.05 for fraud detection). Candidates who don't mention threshold tuning show a fundamental gap in understanding classification systems.
Ensemble Approaches for Imbalance
BalancedBaggingClassifier
Train each base estimator on a bootstrap sample that balances classes through undersampling.
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bbc = BalancedBaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    sampling_strategy='auto',
    random_state=42
)
bbc.fit(X_train, y_train)
EasyEnsemble
Train multiple AdaBoost classifiers, each on a different undersampled subset. This gets the benefit of undersampling (balanced training) without losing majority class data (different subsets cover different majority samples).
from imblearn.ensemble import EasyEnsembleClassifier
eec = EasyEnsembleClassifier(n_estimators=10, random_state=42)
eec.fit(X_train, y_train)
Part 3 - Production Considerations
Calibration After Resampling
Resampling changes the class distribution during training. If you train on a 50:50 resampled dataset but deploy to a 1:99 real distribution, your predicted probabilities will be systematically wrong - the model will overestimate the probability of the minority class. You must calibrate after resampling.
Why calibration breaks: A model trained on balanced data learns P(fraud) ~ 0.5 as the base rate. In production, P(fraud) ~ 0.01. The model's probability estimates are shifted upward.
Two calibration approaches:
- Platt scaling: Fit a logistic regression on the model's raw predictions using a validation set with the original class distribution.
- Isotonic regression: Non-parametric calibration - more flexible but requires more data.
from sklearn.calibration import CalibratedClassifierCV
# Train model on resampled data
model.fit(X_resampled, y_resampled)
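# Calibrate on original-distribution validation data
# (cv='prefit' is deprecated as of scikit-learn 1.6; newer versions wrap the fitted model in sklearn.frozen.FrozenEstimator instead)
calibrated_model = CalibratedClassifierCV(model, cv='prefit', method='sigmoid')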
calibrated_model.fit(X_val, y_val) # X_val, y_val have original distribution
# Now predicted probabilities are calibrated
y_probs = calibrated_model.predict_proba(X_test)[:, 1]
Checking calibration - reliability diagram:
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt
prob_true, prob_pred = calibration_curve(y_test, y_probs, n_bins=10)
plt.plot(prob_pred, prob_true, 's-', label='Model')
plt.plot([0, 1], [0, 1], '--', label='Perfectly calibrated')
plt.xlabel('Mean predicted probability')
plt.ylabel('Fraction of positives')
plt.title('Calibration Curve')
plt.legend()
A well-calibrated model: when it says "10% chance of fraud," approximately 10% of those cases are actually fraud.
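What the reliability diagram plots can be sketched in a few lines: bucket predictions by predicted probability, then compare each bucket's mean prediction with its observed positive rate. The helper names and the toy data below are illustrative assumptions; the summary number is the usual expected calibration error (bin-size-weighted gap).

```python
def reliability_bins(y_true, y_prob, n_bins=10):
    """Return (mean predicted prob, fraction of positives, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    rows = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for _, p in members) / len(members)
        frac_pos = sum(y for y, _ in members) / len(members)
        rows.append((mean_pred, frac_pos, len(members)))
    return rows

def expected_calibration_error(rows):
    """Bin-size-weighted average gap between predicted and observed rates."""
    n = sum(count for _, _, count in rows)
    return sum(count / n * abs(mp - fp) for mp, fp, count in rows)

# Toy data: 10 cases scored 0.1 (1 positive), 10 cases scored 0.8 (8 positives)
y_true = [1] + [0] * 9 + [1] * 8 + [0] * 2
y_prob = [0.1] * 10 + [0.8] * 10
rows = reliability_bins(y_true, y_prob)
print(rows)
print(f"ECE: {expected_calibration_error(rows):.4f}")
```

On heavily imbalanced data most predictions fall into the lowest bins, so equal-width bins can leave the upper bins nearly empty; that is worth keeping in mind when reading the plot.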
Monitoring Imbalanced Models in Production
| What to Monitor | Why | How |
|---|---|---|
| Class distribution drift | If fraud rate changes from 0.3% to 1%, model assumptions shift | Track label distribution over time |
| Prediction distribution | If model starts predicting more/fewer positives | Monitor positive prediction rate daily |
| Precision at operating point | False alarm rate directly impacts analyst workload | Track precision weekly |
| Recall (via delayed labels) | Missing fraud cases has huge business cost | Check against confirmed labels with lag |
| Calibration drift | Probability estimates can degrade over time | Re-run calibration curve monthly |
The Complete Production Pipeline
"I applied SMOTE to the entire dataset, then split into train and test." This is data leakage. Synthetic samples in the test set are generated from training data. You're testing on data that was derived from the training set. SMOTE (or any resampling) must happen after splitting, on the training set only.
Resampling Inside Cross-Validation
When using cross-validation with resampling, you must resample inside each fold:
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold
# imbalanced-learn Pipeline handles resampling inside CV
# Scale before SMOTE so nearest-neighbor distances are not dominated by large-scale features
imb_pipe = ImbPipeline([
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=42)),
    ('clf', LogisticRegression())
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(imb_pipe, X, y, cv=cv, scoring='average_precision')
Note the use of imblearn.pipeline.Pipeline (not sklearn's), which supports resampling steps that only apply during fit, not during predict.
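The same split-first rule applies to a plain hold-out split. The sketch below uses random oversampling on index lists so the ordering is easy to see; the function name and toy labels are hypothetical, and the split is not stratified, which a real pipeline with rare positives would usually want.

```python
import random

def split_then_oversample(labels, test_fraction=0.2, seed=42):
    """Split indices first, then oversample the minority class in the training part only."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    n_test = int(len(idx) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    # Duplicate minority indices (with replacement) until the training classes match
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))] if pos else []
    return train_idx + extra, test_idx

labels = [1 if i % 20 == 0 else 0 for i in range(1000)]  # 5% positives
train_idx, test_idx = split_then_oversample(labels)

train_pos = sum(labels[i] for i in train_idx)
print(f"train: {len(train_idx)} rows, {train_pos} positive")
print(f"test:  {len(test_idx)} rows, {sum(labels[i] for i in test_idx)} positive")
```

The test set keeps the original class ratio and shares no rows with the resampled training set, which is what makes its metrics trustworthy.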
Part 4 - Advanced Techniques
Multi-Class Imbalance
With multiple classes, some may be underrepresented while others are overrepresented:
Class A: 10,000 samples
Class B: 5,000 samples
Class C: 100 samples
Class D: 50 samples
Strategies:
- Per-class weights: Set weights inversely proportional to frequency
- Hierarchical approach: First classify common vs. rare, then distinguish among rare classes
- Per-class threshold: Each class gets its own decision threshold
from sklearn.utils.class_weight import compute_class_weight
import numpy as np
class_weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
weight_dict = dict(zip(np.unique(y), class_weights))
# With the class counts above (labels 0-3 for classes A-D):
# {0: ~0.38, 1: ~0.76, 2: ~37.9, 3: ~75.8}
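The `balanced` heuristic is simple enough to verify by hand for the four class counts listed above. A minimal sketch with a hypothetical helper name, applying n_samples / (n_classes * n_samples_in_class):

```python
def balanced_class_weights(counts):
    """Per-class weight: n_samples / (n_classes * n_samples_in_class)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}

counts = {"A": 10_000, "B": 5_000, "C": 100, "D": 50}
weights = balanced_class_weights(counts)
for label, weight in weights.items():
    # count * weight is the same for every class: each class gets equal total weight
    print(f"Class {label}: weight={weight:.3f}, total weight={counts[label] * weight:.1f}")
```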
Anomaly Detection as an Alternative
For extreme imbalance (1:10,000+), reframe the problem:
Instead of classification (learn both classes), treat it as anomaly detection (learn the majority class, flag deviations).
| Method | Approach | When to Use |
|---|---|---|
| Isolation Forest | Randomly partition data; anomalies are isolated quickly | Tabular data, moderate dimensions |
| One-Class SVM | Learn a boundary around normal data | Small to medium datasets |
| Autoencoder | Train to reconstruct normal data; high reconstruction error = anomaly | High-dimensional data, sequences |
| Local Outlier Factor | Compare local density of a point to its neighbors | When anomalies are in sparse regions |
from sklearn.ensemble import IsolationForest
# Train only on majority class
X_normal = X_train[y_train == 0]
iso_forest = IsolationForest(contamination=0.01, random_state=42)
iso_forest.fit(X_normal)
# Predict: -1 = anomaly, 1 = normal
predictions = iso_forest.predict(X_test)
Google/Meta: Often deal with extreme imbalance in ad fraud and content moderation. They typically use class weights + threshold tuning rather than resampling, combined with cascading classifiers (cheap model filters 99%, expensive model classifies the rest).
Amazon: Fraud detection systems often use cost-sensitive learning with explicit dollar values for each error type, directly optimizing expected cost.
Healthcare/Biotech: Often use SMOTE or ADASYN because datasets are small and each minority sample is precious. Calibration is critical for clinical decision-making.
Data-Level Strategies Beyond Resampling
| Strategy | Description | When It Helps |
|---|---|---|
| Collect more minority data | Active learning, targeted data collection | When feasible and cost-effective |
| Data augmentation | Domain-specific transformations (rotate images, paraphrase text) | When you can create meaningful variations |
| Transfer learning | Pre-train on related balanced task, fine-tune on imbalanced | When a related task exists |
| Semi-supervised learning | Use unlabeled data to improve minority class representation | When unlabeled data is abundant |
| Synthetic data generation | Train a generative model (GAN, VAE) on minority class | When you need diverse minority samples |
Practice Problems
Problem 1: Fraud Detection Design (Mid-Level)
Scenario: You're building a credit card fraud detection system. Your dataset has 1 million transactions, 2,000 of which are fraudulent (0.2%). You have 50 features. The business wants to catch at least 90% of fraud while keeping false positive rate manageable for the review team (20 analysts).
Question: Design the complete ML pipeline, including resampling strategy, model choice, evaluation metrics, and threshold selection.
Hint 1 - Direction
Start with the business constraints: 90% recall minimum and analyst capacity. Work backward from these to determine the threshold, then think about training strategy.
Hint 2 - Insight
With 1M transactions and 0.2% fraud rate, SMOTE may not be necessary - there are still 2,000 positive samples, which is enough for most models. Class weights might be sufficient and simpler. The key constraint is the 90% recall requirement with manageable FP volume.
Hint 3 - Full Solution
Pipeline Design:
- Splitting: Time-based split (fraud patterns evolve). Train on months 1-9, validate on month 10, test on months 11-12.
- Resampling decision: With roughly 2,000 fraud cases in the data, SMOTE is likely unnecessary. Use class weights instead - simpler and no synthetic data issues.
- Model: XGBoost with scale_pos_weight = 500 (approximately the ratio of negatives to positives, 998,000 / 2,000 = 499).
- Evaluation: PR-AUC as primary metric (not ROC-AUC). Also track recall@precision curves.
- Threshold tuning:
  - Fix recall >= 0.90 on validation set
  - Find the highest threshold that maintains 90% recall
  - Calculate the resulting FP volume: if the threshold gives 1% FPR, that's about 10,000 false alarms per million transactions
  - Check capacity: if the 20-analyst team clears ~500 cases/day in total, 10,000 alerts take 20 working days - feasible only if the million transactions arrive over at least that long
- Calibration: Apply Platt scaling on validation set (original distribution).
- Production: Two-stage system
  - Stage 1: Fast model (logistic regression) filters obvious non-fraud (95% of traffic)
  - Stage 2: Complex model (XGBoost) evaluates remaining 5%
  - Reduces latency and cost
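The capacity check in the threshold-tuning step is worth doing explicitly before any modeling. The sketch below uses the scenario's numbers; the function names are hypothetical, and the 500 cases/day team throughput is the illustrative figure from the solution, not a measured value.

```python
def alert_volume(n_transactions, fraud_rate, recall, fpr):
    """Expected true and false alerts at a given operating point."""
    n_fraud = n_transactions * fraud_rate
    n_legit = n_transactions - n_fraud
    return recall * n_fraud, fpr * n_legit

def review_days(total_alerts, cases_per_day):
    """Working days the team needs to clear the alert queue."""
    return total_alerts / cases_per_day

true_alerts, false_alerts = alert_volume(1_000_000, 0.002, recall=0.90, fpr=0.01)
total = true_alerts + false_alerts
print(f"true alerts: {true_alerts:.0f}, false alerts: {false_alerts:.0f}")
print(f"precision at this operating point: {true_alerts / total:.1%}")
print(f"days to review at 500 cases/day: {review_days(total, 500):.1f}")
```

At 1% FPR roughly 85% of alerts are false alarms, so the analyst constraint, not the recall target, is likely to decide the operating point.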
import xgboost as xgb
from sklearn.calibration import CalibratedClassifierCV
# Class-weighted XGBoost
model = xgb.XGBClassifier(
    scale_pos_weight=500,
    max_depth=6,
    n_estimators=300,
    learning_rate=0.1,
    eval_metric='aucpr',
    early_stopping_rounds=20,  # constructor argument; no longer accepted by fit() in XGBoost 2.0+
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])
# Calibrate with Platt scaling, as in the pipeline design
cal_model = CalibratedClassifierCV(model, cv='prefit', method='sigmoid')
cal_model.fit(X_val, y_val)
# Find threshold for 90% recall
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(
    y_val, cal_model.predict_proba(X_val)[:, 1]
)
# Find threshold where recall >= 0.90
valid_idx = recalls[:-1] >= 0.90
optimal_threshold = thresholds[valid_idx][-1] # Highest threshold with 90%+ recall
Scoring Rubric:
- Strong Hire: Time-based split, class weights over SMOTE (justified), threshold tuning with business constraints, calibration, two-stage system consideration
- Lean Hire: Correct high-level approach but misses calibration or threshold tuning
- No Hire: Random split, accuracy as metric, or SMOTE on entire dataset before splitting
Problem 2: SMOTE Gone Wrong (Mid-Level)
Scenario: A colleague applied SMOTE to their text classification dataset (TF-IDF features, 10,000 dimensions) and performance decreased compared to the baseline without resampling. Why?
Question: Explain why SMOTE failed and propose alternatives.
Hint 1 - Direction
Think about what SMOTE does geometrically and why that might not work in high-dimensional sparse spaces.
Hint 2 - Insight
In 10,000-dimensional TF-IDF space, nearest neighbors are unreliable (curse of dimensionality). Interpolating between two sparse document vectors creates dense synthetic vectors that don't look like any real document.
Hint 3 - Full Solution
Why SMOTE failed:
- Curse of dimensionality: In 10,000-dimensional sparse TF-IDF space, all points are roughly equidistant. "Nearest neighbors" are not truly similar documents.
- Sparse to dense artifact: TF-IDF vectors are sparse (most entries are 0). Interpolating between two sparse vectors produces a denser vector with small non-zero values on every term that appears in either parent. This synthetic "document" has words from both parent documents but resembles neither.
- Semantic incoherence: A synthetic sample between "bank fraud investigation" and "insurance claim denial" carries partial weight on the words of both documents, creating a nonsensical feature vector.
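The sparse-to-dense effect shows up even with two tiny vectors. In this sketch the documents are stored as term-to-weight dicts; the helper name and the TF-IDF weights are made up for illustration.

```python
def interpolate_sparse(a, b, lam):
    """Interpolate two sparse vectors stored as {term: weight} dicts."""
    out = {}
    for term in set(a) | set(b):
        value = a.get(term, 0.0) + lam * (b.get(term, 0.0) - a.get(term, 0.0))
        if value != 0.0:
            out[term] = value
    return out

# Hypothetical TF-IDF weights for two minority-class documents
doc_a = {"bank": 0.6, "fraud": 0.5, "investigation": 0.4}
doc_b = {"insurance": 0.7, "claim": 0.5, "denial": 0.3}

synthetic = interpolate_sparse(doc_a, doc_b, lam=0.5)
print(f"non-zero terms: {len(doc_a)} and {len(doc_b)} -> {len(synthetic)}")
print(sorted(synthetic.items()))
```

The synthetic vector has twice as many non-zero terms as either parent, each at reduced weight. Across a 10,000-term vocabulary and many synthetic samples, the minority class drifts toward a region where no real document lives.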
Alternatives:
- Class weights: Simple and effective - just weight the loss function
- Random oversampling: Duplicates real documents, preserving sparsity
- Text augmentation: Back-translation, synonym replacement, paraphrasing (creates semantically valid new samples)
- Dimensionality reduction first: Apply PCA/SVD to reduce dimensions, then SMOTE on the dense lower-dimensional representation
- Pre-trained embeddings: Use BERT/sentence-transformer embeddings (dense, 768-dim), then SMOTE works better
Scoring Rubric:
- Strong Hire: Explains the curse of dimensionality AND the sparse-to-dense problem, proposes multiple alternatives with tradeoffs
- Lean Hire: Identifies high dimensionality as the issue but can't articulate the mechanism
- No Hire: Doesn't understand why SMOTE would fail or suggests "just use more SMOTE"
Problem 3: Threshold vs. Resampling (Senior-Level)
Scenario: Your model achieves PR-AUC of 0.65 on imbalanced data. You try three approaches: (A) SMOTE, (B) class weights, (C) threshold tuning. All three improve F1 on the validation set. How do you decide which to use in production?
Hint 1 - Direction
PR-AUC measures ranking quality and is threshold-independent. If PR-AUC is 0.65 without resampling, think about what resampling can actually improve vs. what threshold tuning can improve.
Hint 2 - Insight
Threshold tuning doesn't change the model - it just picks a better operating point on the existing PR curve. SMOTE and class weights actually change the model and might improve (or degrade) PR-AUC itself. The question is: do you need a better model or just a better operating point?
Hint 3 - Full Solution
Analysis framework:
- Compare PR-AUC across methods (not just F1):
  - If SMOTE or class weights improve PR-AUC, the model is better - use them
  - If they don't improve PR-AUC but F1 improves, they're just shifting the implicit threshold - threshold tuning achieves the same effect more cleanly
- Production complexity:
  - Threshold tuning: simplest - no change to training pipeline, just adjust decision boundary
  - Class weights: moderate - requires retraining with weights but no data pipeline changes
  - SMOTE: most complex - requires resampling step, increases training data size, needs calibration afterward
- Calibration impact:
  - Threshold tuning: preserves calibrated probabilities
  - Class weights: shift probabilities toward the up-weighted class (recalibrate before using scores as probabilities)
  - SMOTE: shifts probabilities toward the resampled prior and adds synthetic-sample artifacts (mandatory recalibration)
- Recommended decision process:
  - Start with threshold tuning (free improvement, no retraining)
  - If PR-AUC is insufficient, try class weights (simple, effective)
  - If still insufficient, try SMOTE (but check if it actually improves PR-AUC)
  - Always calibrate and validate on original-distribution data
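The ranking-versus-operating-point distinction can be demonstrated on toy data. In this sketch, halving every score (a monotone change that stands in for a shifted implicit threshold) leaves average precision untouched while F1 at a fixed 0.5 cutoff changes. Function names and scores are illustrative, and the AP helper ignores tied scores.

```python
def average_precision(y_true, scores):
    """Mean of precision-at-rank over the positives (no tie handling)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            total += tp / rank
    return total / tp if tp else 0.0

def f1_at(y_true, scores, threshold):
    """F1 when predicting positive for scores >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

y_true = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.45, 0.4, 0.3, 0.2, 0.15, 0.35, 0.1, 0.05]
halved = [s / 2 for s in scores]  # same ranking, different scale

print(f"AP original: {average_precision(y_true, scores):.3f}")
print(f"AP halved:   {average_precision(y_true, halved):.3f}")
print(f"F1 @0.50: {f1_at(y_true, scores, 0.5):.2f} vs {f1_at(y_true, halved, 0.5):.2f}")
print(f"F1 @0.35 on original scores: {f1_at(y_true, scores, 0.35):.2f}")
```

If a training-time change moves F1 at 0.5 but leaves AP flat, it has only done what moving the threshold would do.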
Scoring Rubric:
- Strong Hire: Distinguishes between ranking improvement (PR-AUC) and threshold improvement (F1), considers production complexity, discusses calibration
- Lean Hire: Proposes a reasonable comparison but doesn't articulate the PR-AUC vs. F1 distinction
- No Hire: Picks whichever gives the highest F1 without considering production implications
Problem 4: Multi-Stage Imbalance (Staff-Level)
Scenario: You're building a content moderation system for a social platform. Content types include: clean (95%), mildly inappropriate (3%), policy violation (1.5%), illegal content (0.5%). The cost of missing illegal content is orders of magnitude higher than any other error. Latency requirement: <100ms per item.
Question: Design the complete classification system, including model architecture, handling of the multi-class imbalance, and error cost management.
Hint 1 - Direction
Think about cascading classifiers: can you quickly filter out obviously clean content, then spend more compute on borderline cases? Also think about the asymmetric costs across the four classes.
Hint 2 - Insight
A single model optimizing a single loss can't capture the 1000x cost differential between missing illegal content vs. misclassifying clean content. Consider a hierarchical approach with different thresholds per class and a separate high-recall model specifically for illegal content.
Hint 3 - Full Solution
System design:
Stage 1: Fast Binary Filter (<10ms)
- Lightweight model (logistic regression or distilled BERT)
- Binary: "definitely clean" vs. "needs review"
- Set threshold for 99.9% recall on non-clean content
- ~90% of content passes through as clean, 10% goes to Stage 2
Stage 2: Multi-Class Classifier (remaining 10%, <50ms)
- Full model (fine-tuned transformer)
- Four-class output with per-class cost-weighted loss:
class_costs = {
    'clean': 1,
    'mildly_inappropriate': 5,
    'policy_violation': 50,
    'illegal': 5000
}
Stage 3: Illegal Content Safety Net (parallel, <50ms)
- Separate high-recall binary classifier: "illegal" vs. "not illegal"
- Trained specifically on illegal content detection
- Threshold set for 99.99% recall (accept high FPR)
- Runs in parallel with Stage 2 on flagged content
- Any disagreement between Stage 2 and Stage 3 escalates to human review
Per-class threshold tuning:
- Each class gets its own threshold based on cost
- Illegal: very low threshold (flag even slight suspicion)
- Clean: high threshold (need high confidence to pass)
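One way to read "per-class thresholds based on cost" is as a minimum-expected-cost decision. The sketch below reuses the class costs from Stage 2 as the cost of missing each true class, with correct decisions costing nothing; this simplified cost structure, the function name, and the example probabilities are all assumptions for illustration.

```python
CLASSES = ["clean", "mildly_inappropriate", "policy_violation", "illegal"]
# Assumed cost of assigning any other label to an item whose true class is the key
MISS_COST = {"clean": 1, "mildly_inappropriate": 5, "policy_violation": 50, "illegal": 5000}

def min_cost_decision(probs):
    """Pick the label with the lowest expected cost under the predicted probabilities."""
    def expected_cost(pred):
        return sum(p * MISS_COST[true] for true, p in probs.items() if true != pred)
    return min(CLASSES, key=expected_cost)

suspicious = {"clean": 0.97, "mildly_inappropriate": 0.02,
              "policy_violation": 0.009, "illegal": 0.001}
benign = {"clean": 0.9709, "mildly_inappropriate": 0.02,
          "policy_violation": 0.009, "illegal": 0.0001}

print(min_cost_decision(suspicious))  # escalated despite 97% "clean"
print(min_cost_decision(benign))
```

With a 5000x miss cost, a 0.1% chance of illegal content is already enough to override a 97% "clean" score, which is the behavior a very low per-class threshold is meant to produce.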
Training strategy:
- Collect and augment illegal content samples (even 0.5% of a large platform is substantial in absolute terms)
- Use focal loss with alpha inversely proportional to class frequency AND adjusted by cost
- Curriculum learning: start training on balanced batches, gradually introduce natural distribution
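The focal-loss weighting described above could be sketched as follows. The helper names `focal_alphas` and `focal_loss` are hypothetical, and combining cost and inverse frequency by a plain product is only one assumption about what "adjusted by cost" means; the resulting weights are extremely skewed, so in practice they may need clipping or tempering.

```python
import math

# Class frequencies and costs taken from the scenario and Stage 2 above.
class_freq = {"clean": 0.95, "mildly_inappropriate": 0.03,
              "policy_violation": 0.015, "illegal": 0.005}
class_costs = {"clean": 1, "mildly_inappropriate": 5,
               "policy_violation": 50, "illegal": 5000}

def focal_alphas(freq, costs):
    """Per-class alpha proportional to cost / frequency, normalized to sum to 1."""
    raw = {c: costs[c] / freq[c] for c in freq}
    total = sum(raw.values())
    return {c: value / total for c, value in raw.items()}

def focal_loss(p_true, alpha, gamma=2.0):
    """Focal loss for one example, given the probability assigned to its true class."""
    p_true = min(max(p_true, 1e-12), 1.0)
    # (1 - p_true) ** gamma shrinks the loss on examples the model already gets right.
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

alphas = focal_alphas(class_freq, class_costs)
for name, alpha in alphas.items():
    print(name, alpha)

# An easy example (p = 0.9) contributes far less than a hard one (p = 0.1).
print(focal_loss(0.9, alphas["illegal"]), focal_loss(0.1, alphas["illegal"]))
```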
Scoring Rubric:
- Strong Hire: Multi-stage architecture with latency awareness, separate safety net for illegal content, per-class thresholds, cost-weighted loss, discusses human-in-the-loop escalation
- Lean Hire: Reasonable multi-class approach but misses the cascading architecture or the separate safety net
- No Hire: Single model with SMOTE, accuracy as metric, no consideration of differential costs
Problem 5: Calibration Crisis (Senior-Level)
Scenario: Your medical diagnosis model was trained with SMOTE (1:1 balanced) on a dataset where disease prevalence is 2%. The model outputs P(disease) = 0.45 for a patient. What's the actual probability? How do you fix this for clinical use?
Hint 1 - Direction
The model learned on balanced data where the base rate was 50%. In reality, the base rate is 2%. Apply Bayes' theorem to correct the probability.
Hint 2 - Insight
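You can analytically adjust the probability using the odds correction formula. If the model outputs probability p_s (trained on balanced data), the corrected probability p_c for the real prior is obtained by converting p_s to odds, multiplying those odds by the ratio of the real prior odds to the training prior odds, and converting the result back to a probability.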
Hint 3 - Full Solution
Analytical correction using odds adjustment:
The model learned odds based on a 50:50 prior. The real prior is 2:98.
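$$\text{odds}_{\text{corrected}} = \text{odds}_{\text{model}} \times \frac{\pi_{\text{real}}}{\pi_{\text{train}}} \times \frac{1 - \pi_{\text{train}}}{1 - \pi_{\text{real}}}$$
Where $\pi_{\text{real}} = 0.02$ (true prevalence) and $\pi_{\text{train}} = 0.50$ (SMOTE-balanced).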
import numpy as np
p_model = 0.45
pi_real = 0.02
pi_train = 0.50
# Convert to odds
odds_model = p_model / (1 - p_model) # 0.818
# Correction factor
correction = (pi_real / pi_train) * ((1 - pi_train) / (1 - pi_real))
# = (0.02 / 0.50) * (0.50 / 0.98) = 0.04 * 0.5102 = 0.0204
odds_corrected = odds_model * correction # 0.0167
p_corrected = odds_corrected / (1 + odds_corrected) # 0.0164
print(f"Model output: {p_model:.2f}")
print(f"Corrected probability: {p_corrected:.4f}")
# Actual probability: ~1.64%, NOT 45%!
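The same correction can be wrapped in a small reusable function, which also makes a few sanity checks easy. The function name `correct_for_prior_shift` is an assumption for illustration. Keep in mind that the formula treats SMOTE as a pure change in class prior; because SMOTE also adds synthetic points, the corrected value is best read as an approximation.

```python
def correct_for_prior_shift(p_model, pi_train, pi_real, eps=1e-12):
    """Map a probability learned under prior pi_train to the real prior pi_real."""
    # Clip so that p_model = 0 or 1 does not cause a division by zero.
    p = min(max(p_model, eps), 1.0 - eps)
    odds_model = p / (1.0 - p)
    correction = (pi_real / pi_train) * ((1.0 - pi_train) / (1.0 - pi_real))
    odds_corrected = odds_model * correction
    return odds_corrected / (1.0 + odds_corrected)

# How a range of raw model scores maps to the 2% prevalence setting.
for p in (0.10, 0.45, 0.50, 0.90, 0.98):
    print(p, round(correct_for_prior_shift(p, pi_train=0.50, pi_real=0.02), 4))
```

Two checks are worth noting: a raw score equal to the training base rate (0.50) maps back to the real base rate (0.02), and the model has to output about 0.98 before the corrected probability reaches 50%.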
Better approach for production: Use Platt scaling or isotonic regression calibration on a held-out set with the original class distribution:
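from sklearn.calibration import CalibratedClassifierCV
# `model` must already be fitted; cv='prefit' tells scikit-learn not to refit it.
# Note: newer scikit-learn releases (1.6+) deprecate cv='prefit' in favor of
# wrapping the fitted model in sklearn.frozen.FrozenEstimator.
# method='isotonic' is isotonic regression; method='sigmoid' gives Platt scaling.
calibrated = CalibratedClassifierCV(model, cv='prefit', method='isotonic')
# Fit the calibrator on held-out data with the real (2%) prevalence, never on resampled data.
calibrated.fit(X_val_original_distribution, y_val_original_distribution)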
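Whichever calibration route is used, it helps to check the result on data with the original distribution. A simple option is a reliability table that compares the mean predicted probability with the observed positive rate per bin. The helper below is a minimal stdlib sketch; the name `reliability_table` and the toy scores are assumptions for illustration, not output from a real model.

```python
def reliability_table(probs, labels, n_bins=10):
    """Return (count, mean predicted probability, observed positive rate) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Equal-width bins on [0, 1]; p = 1.0 falls into the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((p, y))

    rows = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        rows.append((len(members), mean_pred, observed))
    return rows

# Made-up held-out scores and labels, only to show the output shape.
toy_probs = [0.01, 0.02, 0.02, 0.03, 0.05, 0.40, 0.45, 0.90]
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1]
for count, mean_pred, observed in reliability_table(toy_probs, toy_labels, n_bins=5):
    print(count, round(mean_pred, 3), round(observed, 3))
```

For a well-calibrated model the second and third columns should be close in every bin that has enough samples; with 2% prevalence the high-probability bins are often sparse, so their observed rates are noisy.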
Why this matters clinically:
- A doctor seeing P(disease) = 0.45 would order invasive tests
- The real probability of ~1.6% suggests watchful waiting
- Miscalibrated probabilities in medicine can lead to unnecessary procedures or missed diagnoses
Scoring Rubric:
- Strong Hire: Derives the odds correction, computes the corrected probability, explains clinical impact, recommends calibration for production
- Lean Hire: Knows the probability is wrong and suggests calibration but can't derive the correction
- No Hire: Thinks 0.45 is the real probability or suggests "just lower the threshold"
Interview Cheat Sheet
| Question | Key Points |
|---|---|
| Why does accuracy fail? | Majority class prediction achieves high accuracy; use PR-AUC, F1 instead |
| SMOTE vs. random oversampling? | SMOTE creates synthetic points by interpolation; avoids exact duplicates; fails in high dimensions |
| When does SMOTE hurt? | High-dimensional sparse data, overlapping classes, very few minority samples |
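| Class weights vs. resampling? | Mathematically similar for linear models; weights are simpler in production (no synthetic rows to generate); weighting still shifts predicted probabilities, so calibrate if you need true probabilities |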
| What is focal loss? | Down-weights easy examples: $FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$; gamma=2 typical; for deep learning |
| How to tune threshold? | Sweep thresholds on validation set; optimize for target metric (F1, cost, recall@precision) |
| Why calibrate after resampling? | Resampling shifts the learned base rate; probabilities are wrong for original distribution |
| PR-AUC vs. ROC-AUC? | PR-AUC better for imbalanced data; ROC-AUC inflated by large true negative count |
| Cost-sensitive learning? | Define cost matrix per error type; weight loss by business cost; tune threshold accordingly |
| Anomaly detection alternative? | For extreme imbalance (1:10K+); learn normal class only; flag deviations |
| Resampling + CV? | Resample INSIDE each CV fold; use imblearn Pipeline; never resample before splitting |
| Production monitoring? | Track prediction distribution, precision@threshold, recall with delayed labels, calibration drift |
Spaced Repetition Checkpoints
Day 0 - Initial Learning
- Explain why 99% accuracy can be meaningless
- List 3 resampling techniques and describe how SMOTE works
- Explain the difference between class weights and oversampling
- Define PR-AUC and explain why it's better than ROC-AUC for imbalanced data
Day 3 - Recall
- Draw the SMOTE algorithm step by step
- Explain when SMOTE fails (at least 3 scenarios)
- Write the focal loss formula and explain the gamma parameter
- Explain why you must resample inside CV folds, not before
Day 7 - Application
- Given a fraud detection scenario, design the full pipeline
- Explain calibration after resampling and the odds correction formula
- Compare 5 approaches to imbalance and recommend one for a given scenario
- Solve Practice Problem 1 without hints
Day 14 - Integration
- Design a multi-stage classifier with differential error costs
- Explain the connection between threshold tuning and PR curve operating points
- Derive the analytical probability correction after SMOTE training
- Solve Practice Problem 4 (multi-stage) in under 15 minutes
Day 21 - Mastery
- Teach imbalanced data handling end-to-end to someone else
- Critique a flawed imbalanced data pipeline and fix all issues
- Design evaluation and monitoring for an imbalanced model in production
- Confidently answer: "When would you NOT handle imbalance at all?"
