Handling Imbalanced Data - When 99% Accuracy Means Nothing

Reading time: ~28 min | Interview relevance: Critical | Roles: MLE, Data Scientist, AI Engineer, Applied Scientist

The Real Interview Moment

The interviewer slides a classification report across the table: "Your fraud detection model has 99.7% accuracy. The product team is thrilled. But I'm not. Tell me why."

You glance at the report. Of 100,000 transactions, only 300 are fraudulent (0.3%). The model predicts "not fraud" for every single transaction. It achieves 99.7% accuracy by doing absolutely nothing useful. This is the class imbalance trap, and it catches an alarming number of candidates who focus on accuracy without asking about the data distribution first.

The interviewer is testing whether you understand that evaluation metrics, resampling strategies, and loss functions all need to be rethought when classes are imbalanced - and whether you can reason about the business cost of different types of errors.

What You Will Master

Why accuracy is meaningless for imbalanced data and what to use instead
Resampling strategies: SMOTE, ADASYN, random oversampling/undersampling
Algorithmic approaches: class weights, focal loss, cost-sensitive learning
Threshold tuning and operating point selection
When resampling helps vs. hurts performance
Production considerations: calibration after resampling, monitoring in deployment
How to structure a complete imbalance-handling answer in interviews

Self-Assessment: Where Are You Now?

Level	Description	Target
Beginner	"I know you can oversample or undersample"	Read all parts carefully
Intermediate	"I know SMOTE and class weights but unsure when each is best"	Focus on Parts 2-3 and the decision flowchart
Advanced	"I can handle imbalance but want to nail the production and calibration details"	Jump to Part 3, practice problems, and cheat sheet

Part 1 - Understanding the Problem

What Makes Data Imbalanced?

Class imbalance occurs when one class significantly outnumbers others:

Imbalance Ratio	Example	Severity
1:10	Customer churn prediction	Mild
1:100	Fraud detection	Moderate
1:1,000	Manufacturing defect detection	Severe
1:10,000	Rare disease diagnosis	Extreme
1:100,000	Network intrusion detection	Extreme

60-Second Answer

"Imbalanced data is when one class vastly outnumbers the other. The core problem isn't the imbalance itself - it's that standard algorithms and metrics assume balanced classes. Accuracy becomes meaningless because predicting the majority class always achieves a high score. You need to fix three things: the evaluation metric (use PR-AUC, not accuracy), the training signal (class weights or resampling), and the decision threshold (don't default to 0.5). The right approach depends on the imbalance ratio, dataset size, and the business cost of each error type."

Why Standard Algorithms Fail

Most ML algorithms minimize average loss over all training samples, which rewards overall accuracy. With 99:1 imbalance:

Loss contribution: The majority class contributes roughly 99x more total loss and gradient signal than the minority class, simply because it has 99x more samples
Decision boundary: The model places the boundary to minimize total errors, which can mean classifying nearly everything as majority
Probability estimates: The model learns the base rate (0.01) as the default prediction, making it hard to distinguish minority samples

Why Accuracy Fails: The Metrics Problem

Confusion Matrix for "predict all negative" on 1:99 imbalanced data:

                 Predicted Neg    Predicted Pos
Actual Neg          9,900              0
Actual Pos            100              0

Accuracy: 99.0%     ← looks great!
Precision: 0/0      ← undefined (no positive predictions)
Recall: 0/100       ← 0% (missed every fraud case)
F1: 0               ← reveals the failure (reported as 0 by convention, since precision is undefined)

Metrics that work for imbalanced data:

Metric	Formula	When to Use
Precision	TP / (TP + FP)	When false positives are costly (spam filter)
Recall (Sensitivity)	TP / (TP + FN)	When false negatives are costly (cancer screening)
F1 Score	2 * P * R / (P + R)	When you need to balance precision and recall
F-beta	(1 + beta^2) * P * R / (beta^2 * P + R)	When you want to weight recall higher (beta > 1)
PR-AUC	Area under Precision-Recall curve	Threshold-independent; sensitive to minority class
ROC-AUC	Area under ROC curve	Threshold-independent; can be misleading for severe imbalance
Average Precision	Mean of precisions at each threshold, weighted by the increase in recall	Similar to PR-AUC, often preferred
Matthews Correlation Coefficient	See formula	Single number that accounts for all four quadrants
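
The MCC formula is not spelled out in the table. In terms of the confusion-matrix counts it is usually written as:

$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$

It ranges from -1 to +1, with 0 for random predictions. For a predict-all-negative model the denominator is zero, and MCC is conventionally reported as 0.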

Common Trap

"I'd use ROC-AUC for imbalanced data." ROC-AUC can be misleadingly high with severe imbalance because the false positive rate (FP/N) stays small when N is huge. A model with 1000 false positives out of 99,000 negatives has FPR = 0.01, which looks great on the ROC curve but is disastrous if you're flagging 1000 legitimate transactions. PR-AUC is more informative because precision directly penalizes false positives relative to true positives.

PR-AUC vs ROC-AUC - When they disagree:

Scenario: 10,000 negatives, 100 positives

Model A: TP=80, FP=200, FN=20, TN=9800
  ROC-AUC: 0.96 (looks great - FPR is only 2%)
  PR-AUC:  0.35 (reality check - only 80/(80+200) = 28.6% precision)

Model B: TP=70, FP=30, FN=30, TN=9970
  ROC-AUC: 0.94 (slightly worse)
  PR-AUC:  0.68 (much better - 70/(70+30) = 70% precision)

Model B is clearly better for production, but ROC-AUC would mislead you.
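
As a rough check on the point metrics quoted above, the sketch below recomputes precision, recall, FPR, F1, and MCC from the two confusion matrices. The function name `summarize` is an illustrative choice, and the AUC figures cannot be derived this way because they need the full score distribution.

```python
from math import sqrt

def summarize(tp, fp, fn, tn):
    """Point metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"precision": precision, "recall": recall, "fpr": fpr, "f1": f1, "mcc": mcc}

model_a = summarize(tp=80, fp=200, fn=20, tn=9800)
model_b = summarize(tp=70, fp=30, fn=30, tn=9970)

for name, m in (("A", model_a), ("B", model_b)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Both models have a small FPR, but Model B's precision, F1, and MCC are clearly higher, which is the gap that ROC-based comparisons tend to hide.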

Part 2 - Techniques for Handling Imbalance

The Imbalance Handling Decision Flowchart

[Figure: Imbalanced Dataset Handling Flowchart]

Resampling Strategies

Random Oversampling

Duplicate minority class samples randomly.

from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

Pros: Simple, no information loss, works with any model
Cons: Exact duplicates can cause overfitting; the model memorizes minority samples instead of learning patterns

Random Undersampling

Remove majority class samples randomly.

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)

Pros: Reduces training time, which matters most for large datasets
Cons: Throws away potentially valuable majority class data; high variance if the dataset is small

SMOTE (Synthetic Minority Oversampling TEchnique)

Instead of duplicating, SMOTE creates synthetic minority samples by interpolating between existing ones.

Algorithm:

For each minority sample, find its k nearest minority neighbors (default k=5)
Randomly choose one neighbor
Create a new sample at a random point on the line segment between the sample and its neighbor

Original minority sample: x = [2.0, 3.0]
Nearest neighbor:         n = [4.0, 5.0]
Random factor (0-1):      lambda = 0.4

Synthetic sample: x_new = x + lambda * (n - x)
                       = [2.0, 3.0] + 0.4 * [2.0, 2.0]
                       = [2.8, 3.8]

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42, k_neighbors=5)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
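
To make the three algorithm steps concrete, here is a minimal from-scratch sketch in plain Python. It is not how imbalanced-learn implements SMOTE; the helper names (`smote_point`, `k_nearest`, `smote_sketch`) and the toy points are assumptions for illustration.

```python
import random

def smote_point(x, neighbor, lam):
    """Interpolate: x_new = x + lam * (neighbor - x)."""
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

def k_nearest(x, points, k):
    """k nearest points to x by squared Euclidean distance, excluding x itself."""
    others = [p for p in points if p is not x]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
    return others[:k]

def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority points by neighbor interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)                           # pick a minority sample
        neighbor = rng.choice(k_nearest(x, minority, k))   # pick one of its k neighbors
        synthetic.append(smote_point(x, neighbor, rng.random()))
    return synthetic

# Reproduces the worked example above (lambda = 0.4)
worked = smote_point([2.0, 3.0], [4.0, 5.0], 0.4)

# Toy minority set (hypothetical values)
minority = [[2.0, 3.0], [4.0, 5.0], [3.0, 3.5], [2.5, 4.0]]
new_points = smote_sketch(minority, n_new=6, k=2)
```

Every synthetic point lies on a segment between two real minority points, which is exactly the assumption that breaks down when classes overlap.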

Interviewer's Perspective

When a candidate mentions SMOTE, I immediately probe: "What are the assumptions SMOTE makes about the data?" The answer I'm looking for: SMOTE assumes that the region between minority samples belongs to the minority class. This fails when classes overlap or when the minority class has multiple disconnected clusters. Candidates who know this show genuine understanding, not just API familiarity.

SMOTE variants:

Variant	How It Works	When to Use
SMOTE	Interpolates between any minority neighbors	General use
Borderline-SMOTE	Only oversamples minority points near the decision boundary	When boundary samples matter most
SVM-SMOTE	Uses SVM support vectors to guide oversampling	When you want to reinforce the decision boundary
SMOTE-Tomek	SMOTE + removes Tomek links (overlapping boundary samples)	When you want cleaner boundaries
SMOTE-ENN	SMOTE + Edited Nearest Neighbors cleanup	More aggressive noise removal than Tomek

ADASYN (Adaptive Synthetic Sampling)

Like SMOTE, but generates more synthetic samples for minority instances that are harder to learn (near the decision boundary) and fewer for those in dense, easy-to-classify regions.

from imblearn.over_sampling import ADASYN

adasyn = ADASYN(random_state=42)
X_resampled, y_resampled = adasyn.fit_resample(X_train, y_train)

Key difference from SMOTE: ADASYN adapts the density of synthetic samples based on difficulty. In regions where minority samples are surrounded by majority samples (hard to classify), it generates more synthetic data. This focuses the model on the hardest examples.
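
The adaptive allocation can be sketched as follows: each minority point gets a hardness score equal to the fraction of majority samples among its k nearest neighbors, and the synthetic budget is split in proportion to those scores. This is a simplified plain-Python illustration, not the imbalanced-learn implementation; `adasyn_allocation` and the toy coordinates are assumptions.

```python
def adasyn_allocation(minority, majority, n_new, k=5):
    """Return how many synthetic samples to generate for each minority point."""
    labeled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    hardness = []
    for x in minority:
        # k nearest neighbors among all samples, excluding x itself
        others = [(p, y) for p, y in labeled if p is not x]
        others.sort(key=lambda item: sum((a - b) ** 2 for a, b in zip(x, item[0])))
        n_majority = sum(1 for _, y in others[:k] if y == 0)
        hardness.append(n_majority / k)
    total = sum(hardness)
    if total == 0:  # no minority point has majority neighbors
        return [0] * len(minority)
    # Rounding means the counts may not sum exactly to n_new
    return [round(n_new * h / total) for h in hardness]

# Three minority points in a clean cluster, one surrounded by majority points
minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
majority = [(5.0, 5.2), (5.2, 5.0), (4.8, 5.0), (9.0, 9.0)]
counts = adasyn_allocation(minority, majority, n_new=10, k=2)
print(counts)
```

In this toy setup the whole budget goes to the one minority point that sits among majority samples, while SMOTE would spread it evenly.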

Common Trap

"I always use SMOTE because it's better than random oversampling." SMOTE isn't universally better. It can create noisy synthetic samples in overlapping class regions, degrading performance. For high-dimensional sparse data (text, genomics), SMOTE's nearest-neighbor assumption often fails because distances become unreliable in high dimensions. Always compare against simpler baselines.

When Resampling Helps vs. Hurts

Summary table:

Scenario	Resampling	Why
Clear class separation, low dimensions	Helps	Synthetic samples fill in sensible regions
Overlapping classes	Hurts	Synthetic samples add noise in ambiguous regions
High-dimensional sparse data	Usually hurts	Distance metrics unreliable; neighbors may be irrelevant
Very few minority samples (<20)	Risky	Not enough neighbors for meaningful interpolation
Tree-based models	Often unnecessary	Class weights (e.g., class_weight or scale_pos_weight) are usually enough
Deep learning	Varies	Focal loss or class weights usually better than resampling

Algorithmic Approaches

Class Weights

Scale the loss function so that minority class errors are penalized more heavily.

# Scikit-learn: automatic weight computation
from sklearn.ensemble import RandomForestClassifier

# weight_i = n_samples / (n_classes * n_samples_i)
# For 1:99 ratio: class 0 gets weight ~0.5, class 1 gets weight ~50
model = RandomForestClassifier(class_weight='balanced')

# Manual weights
model = RandomForestClassifier(class_weight={0: 1, 1: 99})

How it works mathematically:

Standard loss: $L = -\sum_{i=1}^{n} \log(p(y_i | x_i))$

Weighted loss: $L_w = -\sum_{i=1}^{n} w_{y_i} \cdot \log(p(y_i | x_i))$

Where $w_1 = \frac{n}{2 \cdot n_1}$ and $w_0 = \frac{n}{2 \cdot n_0}$ for binary classification with balanced weights.

Effect: The gradient from each minority sample is amplified by the weight ratio, giving minority samples proportionally more influence on the model update.
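
A small plain-Python sketch of the balanced-weight formula and the weighted loss above. The labels and predicted probabilities are made-up toy values, and the function names are illustrative.

```python
from math import log

def balanced_weights(counts):
    """weight_c = n_samples / (n_classes * n_c), as in the formula above."""
    n = sum(counts.values())
    return {c: n / (len(counts) * n_c) for c, n_c in counts.items()}

def weighted_log_loss(y_true, p_pos, weights):
    """Sum of -w_y * log(p(y | x)) for binary labels and predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_pos):
        p_true_class = p if y == 1 else 1 - p
        total += -weights[y] * log(p_true_class)
    return total

weights = balanced_weights({0: 9900, 1: 100})

# Toy batch: three negatives and one badly scored positive
y_true = [0, 0, 0, 1]
p_pos = [0.1, 0.2, 0.05, 0.3]
plain = weighted_log_loss(y_true, p_pos, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(y_true, p_pos, weights)
print(weights, round(plain, 3), round(weighted, 3))
```

With balanced weights, the single mis-scored positive dominates the loss, so the optimizer can no longer ignore it.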

Interviewer's Perspective

"Class weights vs. oversampling - what's the difference in practice?" The mathematical effect is similar for linear models (weighting the loss is equivalent to duplicating samples). For non-linear models (trees, neural nets), they can differ because oversampling changes the data distribution that the model sees during training (affecting splits, batch composition), while class weights only change the gradient magnitude.

Focal Loss

Introduced in the RetinaNet paper for object detection. Focal loss down-weights easy examples and focuses training on hard, misclassified examples.

Standard cross-entropy: $L_{CE} = -\log(p_t)$

Focal loss: $L_{FL} = -(1 - p_t)^\gamma \cdot \log(p_t)$

Where $p_t$ is the predicted probability of the true class and $\gamma$ is the focusing parameter (typically 2).

When p_t = 0.9 (easy example, correctly classified):
  CE loss:    -log(0.9) = 0.105
  Focal loss: -(0.1)^2 * log(0.9) = 0.001  ← 100x smaller!

When p_t = 0.1 (hard example, misclassified):
  CE loss:    -log(0.1) = 2.303
  Focal loss: -(0.9)^2 * log(0.1) = 1.865  ← similar magnitude

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """
    alpha: weight for the positive class (RetinaNet used 0.25 with gamma=2; tune jointly with gamma)
    gamma: focusing parameter (0 = standard CE, 2 = typical)
    """
    bce_loss = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    probs = torch.sigmoid(logits)
    p_t = probs * targets + (1 - probs) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    focal_weight = alpha_t * (1 - p_t) ** gamma
    loss = focal_weight * bce_loss
    return loss.mean()
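
The down-weighting factor can be checked without PyTorch. The sketch below evaluates cross-entropy and the unweighted focal loss (no alpha term) for a few values of p_t; the helper names are illustrative.

```python
from math import log

def cross_entropy(p_t):
    """Standard cross-entropy for the true-class probability p_t."""
    return -log(p_t)

def focal(p_t, gamma=2.0):
    """Focal loss without the alpha term: (1 - p_t)^gamma * CE."""
    return (1 - p_t) ** gamma * cross_entropy(p_t)

for p_t in (0.9, 0.5, 0.1):
    ce, fl = cross_entropy(p_t), focal(p_t)
    print(f"p_t={p_t}: CE={ce:.3f}  focal={fl:.3f}  CE/focal={ce / fl:.1f}x")
```

The ratio CE/focal is simply 1 / (1 - p_t)^gamma, so raising gamma suppresses easy examples more aggressively, and gamma = 0 recovers plain cross-entropy.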

When to use focal loss:

Deep learning models with severe imbalance
Object detection (many background patches, few objects)
When the model quickly achieves high accuracy on the majority class but struggles with the minority

Cost-Sensitive Learning

Generalize beyond class weights to define a cost matrix that captures the business impact of each error type.

                    Predicted Legit    Predicted Fraud
Actual Legit             $0               $10 (investigation cost)
Actual Fraud           $1,000 (loss)        $0

The cost of missing fraud ($1,000) is 100x the cost of a false alarm ($10). This matrix directly informs:

Class weights: weight_fraud = 100 * weight_legit
Threshold: lower the threshold to catch more fraud, accepting more false alarms
Business decisions: how many analysts to staff for reviewing flags

# Cost-sensitive threshold tuning (on validation data, so the test set stays untouched)
import numpy as np
from sklearn.metrics import precision_recall_curve

y_val_arr = np.asarray(y_val)
y_probs = model.predict_proba(X_val)[:, 1]
# precision_recall_curve is used here only for its candidate thresholds
_, _, thresholds = precision_recall_curve(y_val_arr, y_probs)

# Find threshold that minimizes total cost
cost_fn = 1000  # Cost of false negative (missed fraud)
cost_fp = 10    # Cost of false positive (investigation)

best_threshold = 0.5
min_cost = float('inf')
for threshold in thresholds:
    y_pred = (y_probs >= threshold).astype(int)
    fn = ((y_pred == 0) & (y_val_arr == 1)).sum()
    fp = ((y_pred == 1) & (y_val_arr == 0)).sum()
    total_cost = fn * cost_fn + fp * cost_fp
    if total_cost < min_cost:
        min_cost = total_cost
        best_threshold = threshold

print(f"Optimal threshold: {best_threshold:.3f}")
print(f"Minimum validation cost: ${min_cost:,.0f}")
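
When the predicted probabilities are well calibrated, the sweep has a closed-form counterpart: flag a transaction when the expected cost of ignoring it, p * cost_fn, exceeds the expected cost of investigating it, (1 - p) * cost_fp. The sketch below assumes correct decisions cost nothing, as in the cost matrix above; the function names are illustrative.

```python
def cost_optimal_threshold(cost_fp, cost_fn):
    """Threshold on a calibrated probability: p >= cost_fp / (cost_fp + cost_fn)."""
    return cost_fp / (cost_fp + cost_fn)

def expected_cost(p_fraud, flag, cost_fp, cost_fn):
    """Expected cost of one decision given a calibrated fraud probability."""
    return (1 - p_fraud) * cost_fp if flag else p_fraud * cost_fn

threshold = cost_optimal_threshold(cost_fp=10, cost_fn=1000)
print(f"Flag when P(fraud) >= {threshold:.4f}")
```

With a 100:1 cost ratio the threshold lands near 0.01, far below 0.5. If the empirical sweep gives a very different answer, that is often a hint that the probabilities are not calibrated.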

Threshold Tuning

Most classifiers default to a threshold of 0.5 for binary classification. With imbalanced data, this is almost always suboptimal.

How to tune the threshold:

Train your model and get probability predictions on the validation set
Sweep thresholds from 0 to 1
Compute your target metric (F1, F-beta, cost) at each threshold
Select the threshold that optimizes your metric
Apply this threshold to the test set

from sklearn.metrics import f1_score
import numpy as np

y_probs = model.predict_proba(X_val)[:, 1]

thresholds = np.arange(0.01, 1.0, 0.01)
f1_scores = []

for t in thresholds:
    y_pred = (y_probs >= t).astype(int)
    f1_scores.append(f1_score(y_val, y_pred))

optimal_threshold = thresholds[np.argmax(f1_scores)]
print(f"Optimal threshold: {optimal_threshold:.2f}")
print(f"Best F1: {max(f1_scores):.3f}")

# Apply to test set
y_test_pred = (model.predict_proba(X_test)[:, 1] >= optimal_threshold).astype(int)

Instant Rejection

"I always use 0.5 as the classification threshold." For imbalanced data, 0.5 almost always produces a model that predicts the majority class for most samples. The optimal threshold is often much lower (e.g., 0.1 or 0.05 for fraud detection). Candidates who don't mention threshold tuning show a fundamental gap in understanding classification systems.

Ensemble Approaches for Imbalance

BalancedBaggingClassifier

Train each base estimator on a bootstrap sample that balances classes through undersampling.

from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bbc = BalancedBaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    sampling_strategy='auto',
    random_state=42
)
bbc.fit(X_train, y_train)

EasyEnsemble

Train multiple AdaBoost classifiers, each on a different undersampled subset. This gets the benefit of undersampling (balanced training) without losing majority class data (different subsets cover different majority samples).

from imblearn.ensemble import EasyEnsembleClassifier

eec = EasyEnsembleClassifier(n_estimators=10, random_state=42)
eec.fit(X_train, y_train)

Part 3 - Production Considerations

Calibration After Resampling

Common Trap

Resampling changes the class distribution during training. If you train on a 50:50 resampled dataset but deploy to a 1:99 real distribution, your predicted probabilities will be systematically wrong - the model will overestimate the probability of the minority class. You must calibrate after resampling.

Why calibration breaks: A model trained on balanced data learns P(fraud) ~ 0.5 as the base rate. In production, P(fraud) ~ 0.01. The model's probability estimates are shifted upward.

Two calibration approaches:

Platt scaling: Fit a logistic regression on the model's raw predictions using a validation set with the original class distribution.
Isotonic regression: Non-parametric calibration - more flexible but requires more data.

from sklearn.calibration import CalibratedClassifierCV

# Train model on resampled data
model.fit(X_resampled, y_resampled)

# Calibrate on original-distribution validation data
calibrated_model = CalibratedClassifierCV(model, cv='prefit', method='sigmoid')
calibrated_model.fit(X_val, y_val)  # X_val, y_val have original distribution

# Now predicted probabilities are calibrated
y_probs = calibrated_model.predict_proba(X_test)[:, 1]

Checking calibration - reliability diagram:

from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

prob_true, prob_pred = calibration_curve(y_test, y_probs, n_bins=10)

plt.plot(prob_pred, prob_true, 's-', label='Model')
plt.plot([0, 1], [0, 1], '--', label='Perfectly calibrated')
plt.xlabel('Mean predicted probability')
plt.ylabel('Fraction of positives')
plt.title('Calibration Curve')
plt.legend()

A well-calibrated model: when it says "10% chance of fraud," approximately 10% of those cases are actually fraud.
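
The reliability diagram can also be condensed into a single number. One common summary is the expected calibration error (ECE): the count-weighted average gap between mean predicted probability and observed positive rate across bins. The sketch below is a minimal plain-Python version with equal-width bins; the toy labels and scores are made up.

```python
def reliability_bins(y_true, y_prob, n_bins=10):
    """Per-bin (mean predicted prob, observed positive rate, count); empty bins skipped."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    rows = []
    for members in bins:
        if members:
            mean_pred = sum(p for _, p in members) / len(members)
            frac_pos = sum(y for y, _ in members) / len(members)
            rows.append((mean_pred, frac_pos, len(members)))
    return rows

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Count-weighted mean of |mean predicted - observed rate| over bins."""
    rows = reliability_bins(y_true, y_prob, n_bins)
    n = sum(count for _, _, count in rows)
    return sum(count / n * abs(mean_pred - frac_pos) for mean_pred, frac_pos, count in rows)

# Ten cases scored 0.1 with one positive (calibrated),
# ten cases scored 0.8 with only two positives (overconfident).
y_true = [1] + [0] * 9 + [1, 1] + [0] * 8
y_prob = [0.1] * 10 + [0.8] * 10
ece = expected_calibration_error(y_true, y_prob)
print(f"ECE: {ece:.2f}")
```

With heavy imbalance most predictions fall in the lowest bins, so it is worth inspecting the per-bin rows rather than relying on the summary alone.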

Monitoring Imbalanced Models in Production

What to Monitor	Why	How
Class distribution drift	If fraud rate changes from 0.3% to 1%, model assumptions shift	Track label distribution over time
Prediction distribution	If model starts predicting more/fewer positives	Monitor positive prediction rate daily
Precision at operating point	False alarm rate directly impacts analyst workload	Track precision weekly
Recall (via delayed labels)	Missing fraud cases has huge business cost	Check against confirmed labels with lag
Calibration drift	Probability estimates can degrade over time	Re-run calibration curve monthly

The Complete Production Pipeline

[Figure: Imbalance Handling Production Pipeline]

Instant Rejection

"I applied SMOTE to the entire dataset, then split into train and test." This is data leakage. Synthetic samples are interpolated from real samples across the whole dataset, so points that land in the test set are derived from, and often nearly identical to, points in the training set. The test set also no longer reflects the real class distribution. SMOTE (or any resampling) must happen after splitting, on the training set only.

Resampling Inside Cross-Validation

When using cross-validation with resampling, you must resample inside each fold:

from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold

# imbalanced-learn Pipeline handles resampling inside CV.
# Scale before SMOTE so its nearest-neighbor distances are computed on comparable features.
imb_pipe = ImbPipeline([
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=42)),
    ('clf', LogisticRegression())
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(imb_pipe, X, y, cv=cv, scoring='average_precision')

Note the use of imblearn.pipeline.Pipeline (not sklearn's), which supports resampling steps that only apply during fit, not during predict.
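
The leakage that resampling before the split introduces can be seen in a toy example. The sketch below uses random duplication as a stand-in for SMOTE (an assumption made so it runs in plain Python) and counts minority rows whose copies end up on both sides of the split; all names and sizes are illustrative.

```python
import random

def oversample(rows, target_minority, rng):
    """Random oversampling: duplicate minority rows up to target_minority."""
    minority = [r for r in rows if r["label"] == 1]
    extra = [dict(rng.choice(minority)) for _ in range(target_minority - len(minority))]
    return rows + extra

def split(rows, test_fraction, rng):
    """Shuffle and split into (train, test)."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def leaked_ids(train, test):
    """Minority row ids that appear in both train and test."""
    train_ids = {r["id"] for r in train if r["label"] == 1}
    return {r["id"] for r in test if r["label"] == 1} & train_ids

rng = random.Random(0)
data = [{"id": i, "label": 1 if i < 10 else 0} for i in range(1000)]

# Wrong order: resample first, then split
train_bad, test_bad = split(oversample(data, 500, rng), 0.2, rng)

# Right order: split first, then resample the training portion only
train_raw, test_ok = split(data, 0.2, rng)
train_ok = oversample(train_raw, 500, rng)

print("leaked (resample then split):", len(leaked_ids(train_bad, test_bad)))
print("leaked (split then resample):", len(leaked_ids(train_ok, test_ok)))
```

With SMOTE the copies are interpolated rather than exact, but the effect on the test score is the same kind of optimistic bias.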

Part 4 - Advanced Techniques

Multi-Class Imbalance

With multiple classes, some may be underrepresented while others are overrepresented:

Class A: 10,000 samples
Class B: 5,000 samples
Class C: 100 samples
Class D: 50 samples

Strategies:

Per-class weights: Set weights inversely proportional to frequency
Hierarchical approach: First classify common vs. rare, then distinguish among rare classes
Per-class threshold: Each class gets its own decision threshold

from sklearn.utils.class_weight import compute_class_weight
import numpy as np

class_weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
weight_dict = dict(zip(np.unique(y), class_weights))
# For the class counts above: {0: ~0.38, 1: ~0.76, 2: ~37.9, 3: ~75.8}
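
Per-class thresholds can be applied in several ways. One simple option, sketched here as an assumption rather than a standard API, is to pick the class whose probability most exceeds its own threshold (largest p_c / t_c). The threshold and probability values are hypothetical and would be tuned on validation data.

```python
def predict_with_thresholds(probs, thresholds):
    """Pick the class with the largest ratio of probability to its own threshold."""
    return max(probs, key=lambda c: probs[c] / thresholds[c])

# Hypothetical thresholds: rare classes C and D are flagged at much lower probability
thresholds = {"A": 0.5, "B": 0.5, "C": 0.1, "D": 0.05}
probs = {"A": 0.55, "B": 0.25, "C": 0.05, "D": 0.15}

plain_argmax = max(probs, key=probs.get)
adjusted = predict_with_thresholds(probs, thresholds)
print(plain_argmax, adjusted)
```

Plain argmax picks the common class, while the threshold-adjusted rule surfaces the rare class whose probability is three times its threshold.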

Anomaly Detection as an Alternative

For extreme imbalance (1:10,000+), reframe the problem:

Instead of classification (learn both classes), treat it as anomaly detection (learn the majority class, flag deviations).

Method	Approach	When to Use
Isolation Forest	Randomly partition data; anomalies are isolated quickly	Tabular data, moderate dimensions
One-Class SVM	Learn a boundary around normal data	Small to medium datasets
Autoencoder	Train to reconstruct normal data; high reconstruction error = anomaly	High-dimensional data, sequences
Local Outlier Factor	Compare local density of a point to its neighbors	When anomalies are in sparse regions

from sklearn.ensemble import IsolationForest

# Train only on majority class
X_normal = X_train[y_train == 0]
iso_forest = IsolationForest(contamination=0.01, random_state=42)
iso_forest.fit(X_normal)

# Predict: -1 = anomaly, 1 = normal
predictions = iso_forest.predict(X_test)

Company Variation

Google/Meta: Often deal with extreme imbalance in ad fraud and content moderation. They typically use class weights + threshold tuning rather than resampling, combined with cascading classifiers (cheap model filters 99%, expensive model classifies the rest).

Amazon: Fraud detection systems often use cost-sensitive learning with explicit dollar values for each error type, directly optimizing expected cost.

Healthcare/Biotech: Often use SMOTE or ADASYN because datasets are small and each minority sample is precious. Calibration is critical for clinical decision-making.

Data-Level Strategies Beyond Resampling

Strategy	Description	When It Helps
Collect more minority data	Active learning, targeted data collection	When feasible and cost-effective
Data augmentation	Domain-specific transformations (rotate images, paraphrase text)	When you can create meaningful variations
Transfer learning	Pre-train on related balanced task, fine-tune on imbalanced	When a related task exists
Semi-supervised learning	Use unlabeled data to improve minority class representation	When unlabeled data is abundant
Synthetic data generation	Train a generative model (GAN, VAE) on minority class	When you need diverse minority samples

Practice Problems

Problem 1: Fraud Detection Design (Mid-Level)

Scenario: You're building a credit card fraud detection system. Your dataset has 1 million transactions, 2,000 of which are fraudulent (0.2%). You have 50 features. The business wants to catch at least 90% of fraud while keeping false positive rate manageable for the review team (20 analysts).

Question: Design the complete ML pipeline, including resampling strategy, model choice, evaluation metrics, and threshold selection.

Hint 1 - Direction

Start with the business constraints: 90% recall minimum and analyst capacity. Work backward from these to determine the threshold, then think about training strategy.

Hint 2 - Insight

With 1M transactions and 0.2% fraud rate, SMOTE may not be necessary - there are still 2,000 positive samples, which is enough for most models. Class weights might be sufficient and simpler. The key constraint is the 90% recall requirement with manageable FP volume.

Hint 3 - Full Solution

Pipeline Design:

Splitting: Time-based split (fraud patterns evolve). Train on months 1-9, validate on month 10, test on months 11-12.
Resampling decision: With 2,000 fraud cases in training, SMOTE is likely unnecessary. Use class weights instead - simpler and no synthetic data issues.
Model: XGBoost with scale_pos_weight = 500 (ratio of negatives to positives).
Evaluation: PR-AUC as primary metric (not ROC-AUC). Also track recall@precision curves.
Threshold tuning:
- Fix recall >= 0.90 on validation set
- Find the highest threshold that maintains 90% recall
- Calculate the resulting FP volume: if threshold gives 1% FPR, that's 10,000 false alarms per million transactions
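- If the 20 analysts together clear ~500 cases/day, 10,000 alerts take 20 working days. Whether that is feasible depends on the period over which the 1M transactions arrive.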
Calibration: Apply Platt scaling on validation set (original distribution).
Production: Two-stage system
- Stage 1: Fast model (logistic regression) filters obvious non-fraud (95% of traffic)
- Stage 2: Complex model (XGBoost) evaluates remaining 5%
- Reduces latency and cost

import xgboost as xgb
from sklearn.calibration import CalibratedClassifierCV

# Class-weighted XGBoost
# Note: recent XGBoost releases take early_stopping_rounds in the constructor, not in fit()
model = xgb.XGBClassifier(
    scale_pos_weight=500,
    max_depth=6,
    n_estimators=300,
    learning_rate=0.1,
    eval_metric='aucpr',
    early_stopping_rounds=20
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])

# Calibrate with Platt scaling (method='sigmoid'), matching the pipeline design above
cal_model = CalibratedClassifierCV(model, cv='prefit', method='sigmoid')
cal_model.fit(X_val, y_val)

# Find threshold for 90% recall
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(
    y_val, cal_model.predict_proba(X_val)[:, 1]
)
# Find threshold where recall >= 0.90
valid_idx = recalls[:-1] >= 0.90
optimal_threshold = thresholds[valid_idx][-1]  # Highest threshold with 90%+ recall
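
The analyst-capacity check from the threshold step can be written out as arithmetic. The sketch below uses the scenario's numbers (1M transactions, 0.2% fraud, 90% recall, 1% FPR) and assumes that ~500 cases/day is the throughput of the whole team; the function and variable names are illustrative.

```python
def alert_volume(n_transactions, fraud_rate, recall, fpr):
    """Expected alerts, split into true and false positives."""
    n_fraud = n_transactions * fraud_rate
    n_legit = n_transactions - n_fraud
    return recall * n_fraud, fpr * n_legit

true_alerts, false_alerts = alert_volume(1_000_000, 0.002, recall=0.90, fpr=0.01)
total_alerts = true_alerts + false_alerts
precision = true_alerts / total_alerts

team_capacity_per_day = 500  # assumption: throughput of all 20 analysts combined
days_to_clear = total_alerts / team_capacity_per_day

print(f"alerts: {total_alerts:,.0f}, precision: {precision:.1%}, days to clear: {days_to_clear:.1f}")
```

At these settings only about 15% of alerts are real fraud, which is the number the review team will feel most directly.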

Scoring Rubric:

Strong Hire: Time-based split, class weights over SMOTE (justified), threshold tuning with business constraints, calibration, two-stage system consideration
Lean Hire: Correct high-level approach but misses calibration or threshold tuning
No Hire: Random split, accuracy as metric, or SMOTE on entire dataset before splitting

Problem 2: SMOTE Gone Wrong (Mid-Level)

Scenario: A colleague applied SMOTE to their text classification dataset (TF-IDF features, 10,000 dimensions) and performance decreased compared to the baseline without resampling. Why?

Question: Explain why SMOTE failed and propose alternatives.

Hint 1 - Direction

Think about what SMOTE does geometrically and why that might not work in high-dimensional sparse spaces.

Hint 2 - Insight

In 10,000-dimensional TF-IDF space, nearest neighbors are unreliable (curse of dimensionality). Interpolating between two sparse document vectors creates dense synthetic vectors that don't look like any real document.

Hint 3 - Full Solution

Why SMOTE failed:

Curse of dimensionality: In 10,000-dimensional sparse TF-IDF space, all points are roughly equidistant. "Nearest neighbors" are not truly similar documents.
Sparse to dense artifact: TF-IDF vectors are sparse (most entries are 0). Interpolating between two sparse vectors produces a vector that is non-zero wherever either parent is non-zero, so it is denser than any real document. This synthetic "document" has words from both parent documents but resembles neither.
Semantic incoherence: A synthetic sample between "bank fraud investigation" and "insurance claim denial" carries scaled-down weight on the words from both documents, creating a nonsensical feature vector.
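
The sparse-to-dense effect is easy to reproduce with two toy documents stored as {term: weight} dictionaries. The TF-IDF weights below are made up for illustration, and `interpolate_sparse` is a hypothetical helper that applies the SMOTE interpolation formula.

```python
def interpolate_sparse(doc_a, doc_b, lam):
    """x_new = a + lam * (b - a) for sparse vectors stored as {term: weight}."""
    terms = set(doc_a) | set(doc_b)
    mixed = {
        t: doc_a.get(t, 0.0) + lam * (doc_b.get(t, 0.0) - doc_a.get(t, 0.0))
        for t in terms
    }
    return {t: w for t, w in mixed.items() if w != 0.0}

# Hypothetical TF-IDF weights for two minority-class documents
doc_a = {"bank": 0.6, "fraud": 0.5, "investigation": 0.6}
doc_b = {"insurance": 0.7, "claim": 0.5, "denial": 0.5}

synthetic = interpolate_sparse(doc_a, doc_b, lam=0.4)
print(len(doc_a), len(doc_b), len(synthetic))
print(synthetic)
```

Each parent has three non-zero terms, while the synthetic vector has six, all with reduced weights. Repeated over a 10,000-term vocabulary, this yields vectors unlike any real document.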

Alternatives:

Class weights: Simple and effective - just weight the loss function
Random oversampling: Duplicates real documents, preserving sparsity
Text augmentation: Back-translation, synonym replacement, paraphrasing (creates semantically valid new samples)
Dimensionality reduction first: Apply PCA/SVD to reduce dimensions, then SMOTE on the dense lower-dimensional representation
Pre-trained embeddings: Use BERT/sentence-transformer embeddings (dense, 768-dim), then SMOTE works better

Scoring Rubric:

Strong Hire: Explains the curse of dimensionality AND the sparse-to-dense problem, proposes multiple alternatives with tradeoffs
Lean Hire: Identifies high dimensionality as the issue but can't articulate the mechanism
No Hire: Doesn't understand why SMOTE would fail or suggests "just use more SMOTE"

Problem 3: Threshold vs. Resampling (Senior-Level)

Scenario: Your model achieves PR-AUC of 0.65 on imbalanced data. You try three approaches: (A) SMOTE, (B) class weights, (C) threshold tuning. All three improve F1 on the validation set. How do you decide which to use in production?

Hint 1 - Direction

PR-AUC measures ranking quality and is threshold-independent. If PR-AUC is 0.65 without resampling, think about what resampling can actually improve vs. what threshold tuning can improve.

Hint 2 - Insight

Threshold tuning doesn't change the model - it just picks a better operating point on the existing PR curve. SMOTE and class weights actually change the model and might improve (or degrade) PR-AUC itself. The question is: do you need a better model or just a better operating point?

Hint 3 - Full Solution

Analysis framework:

Compare PR-AUC across methods (not just F1):
- If SMOTE or class weights improve PR-AUC, the model is better - use them
- If they don't improve PR-AUC but F1 improves, they're just shifting the implicit threshold - threshold tuning achieves the same effect more cleanly
Production complexity:
- Threshold tuning: simplest - no change to training pipeline, just adjust decision boundary
- Class weights: moderate - requires retraining with weights but no data pipeline changes
- SMOTE: most complex - requires resampling step, increases training data size, needs calibration afterward
Calibration impact:
- Threshold tuning: preserves calibrated probabilities
- Class weights: slightly distorts probabilities (easy to recalibrate)
- SMOTE: significantly distorts probabilities (mandatory recalibration)
Recommended decision process:
- Start with threshold tuning (free improvement, no retraining)
- If PR-AUC is insufficient, try class weights (simple, effective)
- If still insufficient, try SMOTE (but check if it actually improves PR-AUC)
- Always calibrate and validate on original-distribution data

Scoring Rubric:

Strong Hire: Distinguishes between ranking improvement (PR-AUC) and threshold improvement (F1), considers production complexity, discusses calibration
Lean Hire: Proposes a reasonable comparison but doesn't articulate the PR-AUC vs. F1 distinction
No Hire: Picks whichever gives the highest F1 without considering production implications

Problem 4: Multi-Stage Imbalance (Staff-Level)

Scenario: You're building a content moderation system for a social platform. Content types include: clean (95%), mildly inappropriate (3%), policy violation (1.5%), illegal content (0.5%). The cost of missing illegal content is orders of magnitude higher than any other error. Latency requirement: <100ms per item.

Question: Design the complete classification system, including model architecture, handling of the multi-class imbalance, and error cost management.

Hint 1 - Direction

Think about cascading classifiers: can you quickly filter out obviously clean content, then spend more compute on borderline cases? Also think about the asymmetric costs across the four classes.

Hint 2 - Insight

A single model optimizing a single loss struggles to capture the orders-of-magnitude cost differential (1,000x or more) between missing illegal content and misclassifying clean content. Consider a hierarchical approach with different thresholds per class and a separate high-recall model specifically for illegal content.

Hint 3 - Full Solution

System design:

Stage 1: Fast Binary Filter (<10ms)

Lightweight model (logistic regression or distilled BERT)
Binary: "definitely clean" vs. "needs review"
Set threshold for 99.9% recall on non-clean content
~90% of content passes through as clean, 10% goes to Stage 2

Stage 2: Multi-Class Classifier (remaining 10%, <50ms)

Full model (fine-tuned transformer)
Four-class output with per-class cost-weighted loss:

class_costs = {
    'clean': 1,
    'mildly_inappropriate': 5,
    'policy_violation': 50,
    'illegal': 5000
}

Stage 3: Illegal Content Safety Net (parallel, <50ms)

Separate high-recall binary classifier: "illegal" vs. "not illegal"
Trained specifically on illegal content detection
Threshold set for 99.99% recall (accept high FPR)
Runs in parallel with Stage 2 on flagged content
Any disagreement between Stage 2 and Stage 3 escalates to human review
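
The routing logic across the three stages can be sketched as a small function. This is a minimal illustration, not the system itself: `route`, its threshold defaults, and the `'human_review'` label are assumptions, and the stage outputs are passed in as precomputed values, whereas a real service would only invoke Stages 2 and 3 for flagged items.

```python
def route(stage1_score, stage2_label, stage3_score,
          stage1_threshold=0.02, stage3_threshold=0.01):
    """Return the final decision for one item.

    stage1_score: P(non-clean) from the fast binary filter
    stage2_label: class predicted by the multi-class model
    stage3_score: P(illegal) from the safety-net model
    Threshold defaults are hypothetical; in practice they are tuned on
    validation data to hit the recall targets described above.
    """
    # Stage 1: only items below a deliberately low threshold pass as clean.
    if stage1_score < stage1_threshold:
        return 'clean'

    # Stages 2 and 3 both score the flagged item; compare their verdicts on "illegal".
    stage2_says_illegal = (stage2_label == 'illegal')
    stage3_says_illegal = (stage3_score >= stage3_threshold)

    # Any disagreement escalates to a human reviewer.
    if stage2_says_illegal != stage3_says_illegal:
        return 'human_review'
    return stage2_label


print(route(0.001, 'clean', 0.0))            # passes the fast filter
print(route(0.50, 'policy_violation', 0.30))  # stages disagree
print(route(0.50, 'illegal', 0.90))           # stages agree
```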

Per-class threshold tuning:

Each class gets its own threshold based on cost
Illegal: very low threshold (flag even slight suspicion)
Clean: high threshold (need high confidence to pass)
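
As a rough sketch of per-class thresholds, the decision rule below checks classes from most to least severe and only passes an item as clean when the model is highly confident. All threshold values, the `decide` helper, and the `'human_review'` fallback are hypothetical choices for illustration.

```python
# Hypothetical thresholds, ordered from most to least severe class.
# Lower threshold = flag on weaker evidence.
thresholds = [
    ('illegal', 0.01),
    ('policy_violation', 0.20),
    ('mildly_inappropriate', 0.40),
]
CLEAN_MIN_CONFIDENCE = 0.90  # assumed value: clean must be confidently clean

def decide(probs, thresholds=thresholds, clean_min=CLEAN_MIN_CONFIDENCE):
    """Map class probabilities to a decision using per-class thresholds."""
    # Check the costliest classes first so they win ties.
    for label, threshold in thresholds:
        if probs[label] >= threshold:
            return label
    # No harmful class fired: pass as clean only with high confidence.
    if probs['clean'] >= clean_min:
        return 'clean'
    return 'human_review'


example = {'clean': 0.90, 'mildly_inappropriate': 0.05,
           'policy_violation': 0.03, 'illegal': 0.02}
print(decide(example))  # 2% illegal probability is already enough to flag
```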

Training strategy:

Collect and augment illegal content samples (even 0.5% of a large platform is substantial in absolute terms)
Use focal loss with alpha inversely proportional to class frequency AND adjusted by cost
Curriculum learning: start training on balanced batches, gradually introduce natural distribution
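
To make the alpha choice concrete, here is a minimal NumPy sketch that sets alpha proportional to cost divided by class frequency and plugs it into a multi-class focal loss. The normalization, the `focal_loss` helper, and `gamma=2.0` are assumptions for illustration; the raw cost/frequency ratio spans about six orders of magnitude here, so a real training run would likely temper it (for example with a square root or clipping).

```python
import numpy as np

classes = ['clean', 'mildly_inappropriate', 'policy_violation', 'illegal']
freq = np.array([0.95, 0.03, 0.015, 0.005])   # class shares from the scenario
cost = np.array([1.0, 5.0, 50.0, 5000.0])     # class_costs from Stage 2

# Alpha: inverse class frequency adjusted by cost, normalized to sum to 1.
alpha = cost / freq
alpha = alpha / alpha.sum()

def focal_loss(probs, y, alpha=alpha, gamma=2.0, eps=1e-12):
    """Mean multi-class focal loss.

    probs: (n, k) array of predicted class probabilities
    y:     (n,) array of integer class labels
    """
    p_t = probs[np.arange(len(y)), y]            # probability of the true class
    modulating = (1.0 - p_t) ** gamma            # down-weights easy examples
    return float(np.mean(-alpha[y] * modulating * np.log(p_t + eps)))


# Same true class (illegal), one easy and one hard prediction.
easy = np.array([[0.05, 0.05, 0.05, 0.85]])
hard = np.array([[0.55, 0.05, 0.05, 0.35]])
label = np.array([3])
print(focal_loss(easy, label), focal_loss(hard, label))
```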

Scoring Rubric:

Strong Hire: Multi-stage architecture with latency awareness, separate safety net for illegal content, per-class thresholds, cost-weighted loss, discusses human-in-the-loop escalation
Lean Hire: Reasonable multi-class approach but misses the cascading architecture or the separate safety net
No Hire: Single model with SMOTE, accuracy as metric, no consideration of differential costs

Problem 5: Calibration Crisis (Senior-Level)

Scenario: Your medical diagnosis model was trained with SMOTE (1:1 balanced) on a dataset where disease prevalence is 2%. The model outputs P(disease) = 0.45 for a patient. What's the actual probability? How do you fix this for clinical use?

Hint 1 - Direction

The model learned on balanced data where the base rate was 50%. In reality, the base rate is 2%. Apply Bayes' theorem to correct the probability.

Hint 2 - Insight

You can adjust the probability analytically with the odds correction formula. If the model, trained on balanced data, outputs probability p_s, convert it to odds, multiply by the ratio of the real prior odds to the training prior odds, and convert back to get the corrected probability p_c.

Hint 3 - Full Solution

Analytical correction using odds adjustment:

The model learned odds based on a 50:50 prior. The real prior is 2:98.

$\text{odds}_{corrected} = \text{odds}_{model} \times \frac{\pi_{real}}{\pi_{train}} \times \frac{1 - \pi_{train}}{1 - \pi_{real}}$

Where $\pi_{real} = 0.02$ (true prevalence) and $\pi_{train} = 0.50$ (SMOTE-balanced).

import numpy as np

p_model = 0.45
pi_real = 0.02
pi_train = 0.50

# Convert to odds
odds_model = p_model / (1 - p_model)  # 0.818

# Correction factor
correction = (pi_real / pi_train) * ((1 - pi_train) / (1 - pi_real))
# = (0.02 / 0.50) * (0.50 / 0.98) = 0.04 * 0.5102 = 0.0204

odds_corrected = odds_model * correction  # 0.0167
p_corrected = odds_corrected / (1 + odds_corrected)  # 0.0164

print(f"Model output: {p_model:.2f}")
print(f"Corrected probability: {p_corrected:.4f}")
# Actual probability: ~1.64%, NOT 45%!
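
The same correction can be wrapped in a reusable function and inverted to see which raw model score corresponds to a given corrected probability. This is a sketch that assumes the SMOTE-trained model behaves like a pure prior shift, which is only approximately true because SMOTE synthesizes new points; the function names are hypothetical.

```python
def correct_for_prior_shift(p_model, pi_train, pi_real):
    """Rescale a probability learned under prior pi_train to prior pi_real."""
    odds_model = p_model / (1.0 - p_model)
    prior_ratio = (pi_real / (1.0 - pi_real)) / (pi_train / (1.0 - pi_train))
    odds_corrected = odds_model * prior_ratio
    return odds_corrected / (1.0 + odds_corrected)

def raw_score_for(p_target, pi_train, pi_real):
    """Raw model output that maps to the corrected probability p_target."""
    # Inverting the correction is the same formula with the priors swapped.
    return correct_for_prior_shift(p_target, pi_real, pi_train)


# Corrected probabilities for a few raw outputs (50% training prior, 2% real prior).
corrected = {p: correct_for_prior_shift(p, 0.50, 0.02)
             for p in (0.10, 0.45, 0.90, 0.99)}
for p, p_c in corrected.items():
    print(f"raw {p:.2f} -> corrected {p_c:.4f}")

# Raw score at which the corrected probability of disease reaches 50%.
raw_at_half = raw_score_for(0.50, 0.50, 0.02)
print(f"corrected 0.50 corresponds to raw {raw_at_half:.2f}")
```

Under these assumptions the model has to output about 0.98 before the patient is more likely than not to have the disease, which shows how far the raw scores sit from real-world probabilities.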

Better approach for production: Use Platt scaling or isotonic regression calibration on a held-out set with the original class distribution:

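from sklearn.calibration import CalibratedClassifierCV

# `model` is the already-fitted classifier; cv='prefit' fits only the calibrator on top of it.
# Note: scikit-learn 1.6+ deprecates cv='prefit' in favor of wrapping the model in sklearn.frozen.FrozenEstimator.
# The validation set must keep the original ~2% prevalence (no SMOTE applied to it).
calibrated = CalibratedClassifierCV(model, cv='prefit', method='isotonic')
calibrated.fit(X_val_original_distribution, y_val_original_distribution)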

Why this matters clinically:

A doctor seeing P(disease) = 0.45 would order invasive tests
The real probability of ~1.6% suggests watchful waiting
Miscalibrated probabilities in medicine can lead to unnecessary procedures or missed diagnoses

Scoring Rubric:

Strong Hire: Derives the odds correction, computes the corrected probability, explains clinical impact, recommends calibration for production
Lean Hire: Knows the probability is wrong and suggests calibration but can't derive the correction
No Hire: Thinks 0.45 is the real probability or suggests "just lower the threshold"

Interview Cheat Sheet

Question	Key Points
Why does accuracy fail?	Majority class prediction achieves high accuracy; use PR-AUC, F1 instead
SMOTE vs. random oversampling?	SMOTE creates synthetic points by interpolation; avoids exact duplicates; fails in high dimensions
When does SMOTE hurt?	High-dimensional sparse data, overlapping classes, very few minority samples
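Class weights vs. resampling?	Mathematically similar for linear models; weights are simpler in production (no resampling step); both skew predicted probabilities toward the minority class, so calibrate if you need true probabilities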
What is focal loss?	Down-weights easy examples: $-(1-p_t)^\gamma \log(p_t)$; $\gamma=2$ is typical; used mainly in deep learning
How to tune threshold?	Sweep thresholds on validation set; optimize for target metric (F1, cost, recall@precision)
Why calibrate after resampling?	Resampling shifts the learned base rate; probabilities are wrong for original distribution
PR-AUC vs. ROC-AUC?	PR-AUC better for imbalanced data; ROC-AUC inflated by large true negative count
Cost-sensitive learning?	Define cost matrix per error type; weight loss by business cost; tune threshold accordingly
Anomaly detection alternative?	For extreme imbalance (1:10K+); learn normal class only; flag deviations
Resampling + CV?	Resample INSIDE each CV fold; use imblearn Pipeline; never resample before splitting
Production monitoring?	Track prediction distribution, precision@threshold, recall with delayed labels, calibration drift
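
For the threshold-tuning row, a cost-based sweep over a validation set can be sketched in a few lines. The scores, labels, and the 50:1 cost ratio below are made-up illustrative values, and `best_threshold` is a hypothetical helper rather than a library function.

```python
def best_threshold(scores, labels, cost_fp=1.0, cost_fn=50.0):
    """Return (threshold, total_cost) minimizing cost on a validation set.

    An item is predicted positive when its score is >= threshold.
    """
    best = None
    # Every distinct score is a candidate operating point.
    for t in sorted(set(scores)):
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        total = cost_fp * fp + cost_fn * fn
        if best is None or total < best[1]:
            best = (t, total)
    return best


# Tiny illustrative validation set: model scores and true labels.
scores = [0.02, 0.05, 0.10, 0.20, 0.30, 0.40, 0.70, 0.90]
labels = [0, 0, 0, 1, 0, 0, 1, 1]

threshold, total_cost = best_threshold(scores, labels)
print(threshold, total_cost)
```

With missed positives costed at 50x a false alarm, the sweep settles on a low threshold that catches every positive; setting both costs equal moves the chosen threshold up.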

Spaced Repetition Checkpoints

Day 0 - Initial Learning

Explain why 99% accuracy can be meaningless
List 3 resampling techniques and describe how SMOTE works
Explain the difference between class weights and oversampling
Define PR-AUC and explain why it's better than ROC-AUC for imbalanced data

Day 3 - Recall

Draw the SMOTE algorithm step by step
Explain when SMOTE fails (at least 3 scenarios)
Write the focal loss formula and explain the gamma parameter
Explain why you must resample inside CV folds, not before

Day 7 - Application

Given a fraud detection scenario, design the full pipeline
Explain calibration after resampling and the odds correction formula
Compare 5 approaches to imbalance and recommend one for a given scenario
Solve Practice Problem 1 without hints

Day 14 - Integration

Design a multi-stage classifier with differential error costs
Explain the connection between threshold tuning and PR curve operating points
Derive the analytical probability correction after SMOTE training
Solve Practice Problem 4 (multi-stage) in under 15 minutes

Day 21 - Mastery

Teach imbalanced data handling end-to-end to someone else
Critique a flawed imbalanced data pipeline and fix all issues
Design evaluation and monitoring for an imbalanced model in production
Confidently answer: "When would you NOT handle imbalance at all?"
