Design: Fraud Detection - Real-Time Classification Under Extreme Imbalance
Reading time: ~25 min | Interview relevance: Critical | Roles: MLE
The Real Interview Moment
"Design a fraud detection system for a payment platform processing 10,000 transactions per second." You start describing a random forest classifier. The interviewer asks: "What's your positive rate?" You say "maybe 1%?" The interviewer responds: "In reality, it's 0.1%. With 10K TPS, that's 10 fraud cases per second and 9,990 legitimate ones. If your model has 99% accuracy, it still misses 10 frauds per second and falsely blocks 100 legitimate transactions per second. How do you handle this?"
Fraud detection is the interview question that tests whether you understand the real-world consequences of ML decisions - every false negative costs the company money, every false positive costs a customer their purchase.
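The arithmetic behind that exchange can be made concrete. The sketch below is illustrative only: the function name and the two example operating points are assumptions, and it takes the 10K TPS and 0.1% fraud rate from the scenario above.

```python
def per_second_outcomes(tps, fraud_rate, recall, false_positive_rate):
    """Expected per-second counts for a classifier at the given rates."""
    fraud = tps * fraud_rate
    legit = tps - fraud
    missed_fraud = fraud * (1 - recall)          # false negatives per second
    false_blocks = legit * false_positive_rate   # false positives per second
    accuracy = 1 - (missed_fraud + false_blocks) / tps
    return {
        "fraud": fraud,
        "legit": legit,
        "missed_fraud": missed_fraud,
        "false_blocks": false_blocks,
        "accuracy": accuracy,
    }

# A "model" that approves everything: catches no fraud, blocks nobody.
approve_all = per_second_outcomes(10_000, 0.001, recall=0.0, false_positive_rate=0.0)

# A model that blocks 1% of legitimate traffic and still catches no fraud.
noisy = per_second_outcomes(10_000, 0.001, recall=0.0, false_positive_rate=0.01)

print(approve_all["accuracy"], approve_all["missed_fraud"])
print(noisy["accuracy"], noisy["false_blocks"])
```

Approving everything scores about 99.9% accuracy while missing every fraud, which is why accuracy is not a useful headline metric here.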
What You Will Master
- Handling extreme class imbalance (0.1% positive rate)
- Real-time feature engineering for streaming transactions
- Precision-recall trade-offs with business impact analysis
- Adversarial model evolution (fraudsters adapt)
- Multi-layer defense architecture
- Online model updates to catch new fraud patterns
- Rule-based + ML hybrid systems
The Complete Design
Step 1: Requirements (5 min)
Functional requirements:
- Score every transaction in real-time: approve, decline, or send to manual review
- Process 10K transactions per second (TPS)
- Detect card-not-present (CNP) fraud, account takeover, and promo abuse
Non-functional requirements:
- Latency: <50ms per transaction (synchronous in payment flow)
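- Precision: >95% at operating threshold
- False positive rate: <0.5% of legitimate transactions (a separate and much looser constraint at a 0.1% fraud rate, not a restatement of the precision target)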
- Recall: >80% (catch 80%+ of fraud)
- Dollar-weighted recall: >90% (high-value fraud matters more)
The candidate who says "I'd optimize for F1 score" gets a Lean Hire at best. The candidate who says "I'd optimize for dollar-weighted recall at a fixed false positive rate of 0.5%, because blocking a legitimate $50 transaction costs $50 in lost revenue plus customer lifetime value, while missing a $500 fraud costs $500 directly" gets a Strong Hire. Translate ML metrics to business impact.
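To see why dollar-weighted recall can tell a different story from count-based recall, here is a minimal sketch. The function names, the toy transactions, and the `clv_penalty` parameter (a stand-in for lost customer lifetime value) are all assumptions made for illustration.

```python
def count_recall(labels, preds):
    """Fraction of fraud cases that were blocked."""
    caught = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    total = sum(1 for y in labels if y == 1)
    return caught / total if total else 0.0

def dollar_weighted_recall(amounts, labels, preds):
    """Fraction of fraud dollars that were blocked."""
    caught = sum(a for a, y, p in zip(amounts, labels, preds) if y == 1 and p == 1)
    total = sum(a for a, y in zip(amounts, labels) if y == 1)
    return caught / total if total else 0.0

def total_cost(amounts, labels, preds, clv_penalty=0.0):
    """Missed fraud costs its amount; a false block costs the sale plus a penalty."""
    cost = 0.0
    for a, y, p in zip(amounts, labels, preds):
        if y == 1 and p == 0:
            cost += a                 # false negative: fraud loss
        elif y == 0 and p == 1:
            cost += a + clv_penalty   # false positive: lost sale + goodwill
    return cost

amounts = [500.0, 20.0, 30.0, 50.0, 1200.0]
labels = [1, 1, 1, 0, 0]  # 1 = fraud
preds = [1, 0, 0, 1, 0]   # 1 = blocked

print(count_recall(labels, preds))
print(dollar_weighted_recall(amounts, labels, preds))
print(total_cost(amounts, labels, preds, clv_penalty=25.0))
```

In this toy data the model catches only one of three frauds by count but most of the fraud dollars, because the one it catches is the large one.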
Step 2: Problem Formulation (5 min)
| Business Goal | ML Objective | Primary Metric | Guardrails |
|---|---|---|---|
| Minimize fraud losses while maintaining customer experience | Binary classification: fraud vs. legitimate | Precision@Recall=0.8, Dollar-saved rate | False positive rate <0.5%, latency <50ms |
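Three-tier decision system: approve low-risk scores, decline high-risk scores, and send the uncertain middle band to manual review.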
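A three-tier decision can be sketched as a pair of thresholds on the model score. The cutoffs of 0.3 and 0.7 below are illustrative assumptions (borrowed from the manual-review band mentioned later in this design), not tuned values, and the `Decision` and `decide` names are hypothetical.

```python
from enum import Enum

class Decision(Enum):
    APPROVE = "approve"
    REVIEW = "manual_review"
    DECLINE = "decline"

def decide(score, review_at=0.3, decline_at=0.7):
    """Map a fraud score in [0, 1] to one of three actions."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= decline_at:
        return Decision.DECLINE
    if score >= review_at:
        return Decision.REVIEW
    return Decision.APPROVE

for s in (0.05, 0.45, 0.85):
    print(s, decide(s).value)
```

Keeping the thresholds as parameters separates the model from the business decision: the same scores can be re-cut when review capacity or fraud costs change.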
Step 3: Features & Data (8 min)
Feature Categories
| Category | Examples | Computation |
|---|---|---|
| Transaction | Amount, currency, merchant category, card type, time of day | Available instantly |
| Velocity | Transactions in last 1h/24h/7d, unique merchants in 24h, amount spent in 24h | Streaming aggregation |
| Behavioral | Deviation from user's normal spending pattern, unusual merchant category, time-of-day anomaly | Requires user profile |
| Device/Network | IP geolocation, device fingerprint, proxy/VPN detection, distance from last transaction | Real-time lookup |
| Graph | Shared device with known fraudster, merchant fraud rate, card-merchant pair frequency | Batch + real-time |
The Most Predictive Features (from industry experience)
- Velocity features: Number and total amount of transactions in the last hour
- Deviation features: How different is this transaction from the user's normal pattern?
- Network features: Is this IP/device associated with previous fraud?
- Merchant risk score: Historical fraud rate at this merchant
Many candidates list features without thinking about computation feasibility. "Average transaction amount over the last 30 days" is easy in batch but requires a streaming aggregation pipeline at 10K TPS. Always specify: Can this feature be computed within 50ms? For each feature, state whether it's pre-computed (batch), streamed (near-real-time), or available instantly.
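As an illustration of what "streamed" means for a velocity feature, the sketch below keeps a sliding window for a single card in memory. It is a simplification under stated assumptions: timestamps arrive in non-decreasing order, and the `VelocityWindow` class is hypothetical. A production system would hold this state in a stream processor and feature store rather than in a Python object.

```python
from collections import deque

class VelocityWindow:
    """Running count and sum of one card's transactions in a sliding time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, amount), oldest first
        self._total = 0.0

    def _evict(self, now):
        # Drop events that have aged out of the window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, old_amount = self._events.popleft()
            self._total -= old_amount

    def add(self, timestamp, amount):
        self._evict(timestamp)
        self._events.append((timestamp, amount))
        self._total += amount

    def features(self, now):
        self._evict(now)
        return {"txn_count": len(self._events), "txn_amount": self._total}

window = VelocityWindow(window_seconds=3600)  # "last 1h" velocity
window.add(0, 10.0)
window.add(1800, 20.0)
window.add(3700, 5.0)  # the event at t=0 has aged out by now
print(window.features(now=3700))
```

Each update and lookup touches only the events entering or leaving the window, which is what makes this kind of feature feasible inside a tight latency budget.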
Training Data
- Labels: Chargebacks (30-90 day delay), manual review decisions (1-24 hour delay)
- Challenge: Label delay - you train on yesterday's labels but serve on today's transactions
- Imbalance: 0.1% positive rate → use SMOTE, class weights, or focal loss
- Adversarial drift: Fraud patterns change weekly as fraudsters adapt
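One common way to cope with label delay is to train only on transactions old enough for their labels to have settled. The sketch below assumes a 90-day maturity window (the upper end of the chargeback delay above); the `mature_training_rows` helper and the row fields are hypothetical.

```python
from datetime import datetime, timedelta

def mature_training_rows(rows, as_of, maturity_days=90):
    """Keep only rows old enough that a chargeback would already have arrived."""
    cutoff = as_of - timedelta(days=maturity_days)
    return [row for row in rows if row["timestamp"] <= cutoff]

rows = [
    {"id": "t1", "timestamp": datetime(2024, 1, 1), "is_fraud": 0},
    {"id": "t2", "timestamp": datetime(2024, 2, 15), "is_fraud": 1},
    {"id": "t3", "timestamp": datetime(2024, 5, 15), "is_fraud": 0},  # label not settled yet
]

train = mature_training_rows(rows, as_of=datetime(2024, 6, 1))
print([row["id"] for row in train])
```

The trade-off is freshness: a long maturity window gives cleaner labels but a staler model, which is one reason faster signals such as manual review decisions are worth folding in.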
Step 4: Model (8 min)
The Progression
Why XGBoost is the production standard for fraud detection:
- Handles tabular data with mixed feature types
- Robust to missing values
- Fast inference (<5ms per transaction)
- Interpretable feature importance (needed for regulatory compliance)
- Works well with class imbalance via scale_pos_weight
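A common starting point for scale_pos_weight is the ratio of negative to positive examples. The sketch below only computes that ratio and builds a parameter dictionary; it does not import or train XGBoost, and the helper name and toy label counts are assumptions.

```python
def scale_pos_weight(labels):
    """Ratio of negative to positive examples, a common starting value."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive examples in labels")
    return negatives / positives

# Toy label vector at a 0.1% positive rate.
labels = [1] * 10 + [0] * 9_990

params = {
    "objective": "binary:logistic",
    "eval_metric": "aucpr",  # PR-based metric suits heavy imbalance
    "scale_pos_weight": scale_pos_weight(labels),
}
print(params)
```

Heavy reweighting tends to distort the predicted probabilities, so if the scores feed cost calculations, treat the ratio as a value to tune and consider recalibrating the output.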
Why NOT deep learning (initially):
- Tabular data - tree models typically outperform neural networks
- Latency constraint - deep models are slower
- Interpretability requirement - regulators require explainable decisions
- Data volume - 0.1% positive rate means limited positive examples
Handling Class Imbalance
| Technique | How It Works | When to Use |
|---|---|---|
| Class weights | Upweight positive class in loss function | Always - simplest approach |
| SMOTE | Generate synthetic positive examples | When positive examples are very few |
| Focal loss | Down-weight easy negatives | Neural network models |
| Threshold tuning | Adjust decision threshold post-training | Always - separate model from business decision |
| Cost-sensitive learning | Weight by transaction amount | When dollar impact matters more than count |
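To make "down-weight easy negatives" concrete, here is a per-example binary focal loss in plain Python. The defaults gamma=2.0 and alpha=0.25 are commonly cited values, used here as assumptions rather than recommendations for fraud data.

```python
import math

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one example: -alpha_t * (1 - p_t)**gamma * log(p_t).

    p is the predicted probability of fraud, y is the true label (0 or 1).
    """
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)            # avoid log(0)
    p_t = p if y == 1 else 1 - p             # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1 - alpha
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)

easy_negative = binary_focal_loss(p=0.01, y=0)  # confident and correct
hard_positive = binary_focal_loss(p=0.10, y=1)  # fraud the model nearly missed
print(easy_negative, hard_positive)
```

The (1 - p_t)**gamma factor shrinks the loss of well-classified examples toward zero, so the many easy legitimate transactions stop dominating the gradient. With gamma=0 the expression reduces to alpha-weighted cross-entropy.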
Step 5: Serving (8 min)
Key Architecture Decisions
| Component | Decision | Rationale |
|---|---|---|
| Rules engine first | Block known fraud patterns before ML | Deterministic, fast, catches known attacks |
| Feature store | Redis with streaming updates (Kafka + Flink) | Sub-10ms feature lookups for velocity features |
| Model serving | XGBoost in C++ (treelite) | <5ms inference, no GPU needed |
| Fallback | Rules-only mode if ML is down | Higher false positive rate but still catches obvious fraud |
| Model updates | Retrain daily, deploy with shadow scoring | Fraud patterns evolve - stale models miss new attacks |
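The rules-first and fallback decisions in the table can be sketched as a single scoring function. Everything named here is hypothetical: the blocklist lookup, the 0.7 decline threshold, and the amount-based fallback rule are placeholders, and a real system would also enforce a timeout on the model call.

```python
def score_transaction(txn, blocklist, model_score, decline_at=0.7,
                      fallback_review_amount=1_000.0):
    """Return (decision, source): rules run first, ML second, rules-only on failure."""
    # Layer 1: deterministic rules catch known-bad cards before any ML work.
    if txn["card_id"] in blocklist:
        return "decline", "rules"
    # Layer 2: the ML model. Any failure degrades to rules-only mode.
    try:
        score = model_score(txn)
    except Exception:
        if txn["amount"] >= fallback_review_amount:
            return "review", "rules_fallback"
        return "approve", "rules_fallback"
    if score >= decline_at:
        return "decline", "ml"
    return "approve", "ml"

def broken_model(txn):
    raise TimeoutError("model unavailable")

blocklist = {"card_bad"}
print(score_transaction({"card_id": "card_bad", "amount": 20.0}, blocklist, lambda t: 0.1))
print(score_transaction({"card_id": "card_ok", "amount": 20.0}, blocklist, lambda t: 0.9))
print(score_transaction({"card_id": "card_ok", "amount": 5000.0}, blocklist, broken_model))
```

Returning the source alongside the decision makes it easy to monitor how often the system runs in fallback mode.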
Why Rules + ML (Not Just ML)
| Layer | Catches | Example |
|---|---|---|
| Rules engine | Known fraud patterns, sanctions, blacklists | Card on blocklist → instant decline |
| ML model | Complex, subtle patterns | Unusual velocity + new device + high amount → likely fraud |
| Manual review | Edge cases ML is uncertain about | Score 0.3-0.7, high-value transaction |
Step 6: Evaluation & Iteration (8 min)
Offline Evaluation
| Metric | Definition | Target |
|---|---|---|
| Precision @ FPR=0.5% | How precise when we block 0.5% of legitimate traffic | > 80% |
| Recall | % of fraud caught | > 80% |
| Dollar recall | % of fraud dollars caught | > 90% |
| AUC-PR | Area under precision-recall curve | > 0.7 |
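Why not AUC-ROC? With a 0.1% positive rate, the false positive rate is computed over a huge legitimate class, so AUC-ROC can look excellent while most blocked transactions are still legitimate. AUC-PR is much more informative for imbalanced problems.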
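The link between false positive rate, recall, and precision at a given base rate is simple arithmetic, and it is worth running before committing to metric targets. The sketch below assumes the 0.1% fraud rate used throughout; the function names are illustrative.

```python
def implied_precision(prevalence, recall, false_positive_rate):
    """Precision that follows from recall and FPR at a given fraud prevalence."""
    true_pos = prevalence * recall
    false_pos = (1 - prevalence) * false_positive_rate
    total = true_pos + false_pos
    return true_pos / total if total else 0.0

def max_fpr_for_precision(prevalence, recall, precision):
    """Largest FPR that still achieves the target precision at this recall."""
    true_pos = prevalence * recall
    return true_pos * (1 - precision) / (precision * (1 - prevalence))

# A point that looks excellent on a ROC curve: 80% recall at 0.5% FPR.
print(implied_precision(0.001, recall=0.8, false_positive_rate=0.005))

# FPR needed to reach 95% precision at 80% recall.
print(max_fpr_for_precision(0.001, recall=0.8, precision=0.95))
```

At this base rate, 80% recall at a 0.5% false positive rate works out to roughly 14% precision, and reaching 95% precision needs a false positive rate of a few thousandths of a percent. Precision and FPR targets should therefore be checked against each other using the real base rate.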
Adversarial Evolution
Fraudsters adapt to whatever the current model blocks, and the model must adapt in turn. This arms race means:
- Retrain frequently: Daily or weekly, not monthly
- Monitor for drift: Track fraud rate by segment
- A/B test carefully: Don't expose treatment group to higher fraud risk
- Shadow scoring: Score with new model but use old model's decisions, compare
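Shadow scoring can be sketched in a few lines: the current (champion) model makes the decision, while the candidate (challenger) model's verdict is only logged for comparison. The function names, the 0.7 threshold, and the in-memory log are assumptions for illustration.

```python
def shadow_score(txn, champion, challenger, log, threshold=0.7):
    """Decide with the champion; record what the challenger would have done."""
    champion_declines = champion(txn) >= threshold
    challenger_declines = challenger(txn) >= threshold
    log.append({
        "txn_id": txn["id"],
        "champion": champion_declines,
        "challenger": challenger_declines,
    })
    # Only the champion's verdict affects the customer.
    return "decline" if champion_declines else "approve"

def disagreement_rate(log):
    """Share of transactions where the two models would have decided differently."""
    if not log:
        return 0.0
    return sum(1 for r in log if r["champion"] != r["challenger"]) / len(log)

log = []
old_model = lambda txn: 0.2
new_model = lambda txn: 0.9 if txn["amount"] > 1000 else 0.2

decisions = [
    shadow_score({"id": i, "amount": amount}, old_model, new_model, log)
    for i, amount in enumerate([50, 2000, 75, 3000])
]
print(decisions, disagreement_rate(log))
```

Once labels arrive, the logged disagreements show whether the challenger's extra declines were real fraud or new false positives, without exposing any customer to the untested model.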
Practice Problems
Problem 1: Account Takeover Detection
Direction
A fraudster gains access to a legitimate user's account and makes transactions. The transactions look normal for that user. How do you detect this?
Key Insight
Account takeover (ATO) is harder than card fraud because the transactions match the user's profile. Key signals: login from new device/IP, password change followed by purchase, session behavior anomaly (navigation speed, click patterns), geographic impossibility (login from NYC then London in 1 hour). Build a separate ATO model that focuses on session and device features rather than transaction features.
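The geographic-impossibility signal can be sketched as an implied-speed check between consecutive logins. The 1,000 km/h speed cap and the event fields below are illustrative assumptions; IP geolocation is noisy in practice, so this would be one feature among many rather than a hard rule.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def is_impossible_travel(prev, curr, max_speed_kmh=1000.0):
    """True if getting from prev to curr would require exceeding max_speed_kmh."""
    distance_km = haversine_km(prev["lat"], prev["lon"], curr["lat"], curr["lon"])
    hours = max((curr["ts"] - prev["ts"]) / 3600.0, 0.0)
    return distance_km > max_speed_kmh * hours

nyc = {"lat": 40.7128, "lon": -74.0060, "ts": 0}
london_1h = {"lat": 51.5074, "lon": -0.1278, "ts": 3600}
london_8h = {"lat": 51.5074, "lon": -0.1278, "ts": 8 * 3600}

print(is_impossible_travel(nyc, london_1h))
print(is_impossible_travel(nyc, london_8h))
```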
Problem 2: Explain a Fraud Decision
Direction
A customer calls complaining their transaction was blocked. Your XGBoost model scored it as 0.85 (fraud). How do you explain the decision?
Key Insight
Use SHAP values to explain individual predictions: "This transaction was flagged because: (1) It was 5x your typical transaction amount (+0.15), (2) It came from a new device we haven't seen before (+0.12), (3) It was at a merchant category you've never used (+0.08)." This is not just a nice-to-have - regulations such as the EU AI Act and ECOA can require explanations for automated decisions, depending on the decision type and jurisdiction.
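Turning per-feature contributions into customer-facing reasons can be sketched as follows. The code assumes the contributions have already been computed elsewhere (for example by a SHAP explainer) and does not call any SHAP API; the feature names and descriptions are hypothetical, and the values mirror the example above.

```python
def top_reasons(contributions, descriptions, k=3):
    """Return readable reasons for the k largest fraud-increasing contributions."""
    positive = [(name, value) for name, value in contributions.items() if value > 0]
    positive.sort(key=lambda item: item[1], reverse=True)
    return [
        f"{descriptions.get(name, name)} ({value:+.2f})"
        for name, value in positive[:k]
    ]

contributions = {
    "amount_vs_typical": 0.15,
    "new_device": 0.12,
    "new_merchant_category": 0.08,
    "account_age": -0.05,  # pushed the score toward legitimate, so not a reason
}
descriptions = {
    "amount_vs_typical": "Amount was 5x your typical transaction",
    "new_device": "Came from a device we haven't seen before",
    "new_merchant_category": "Merchant category you've never used",
}

for reason in top_reasons(contributions, descriptions):
    print(reason)
```

Contributions for tree models are often expressed in log-odds rather than probability, so the raw numbers are better suited to internal analysts than to customers; the ranked descriptions are the part worth surfacing.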
Interview Cheat Sheet
| Question Pattern | Framework | Key Phrases |
|---|---|---|
| "Design fraud detection" | Rules + ML + manual review three-tier | "Rules catch known patterns, ML catches subtle ones, manual review handles uncertainty" |
| "How do you handle imbalance?" | Multiple techniques | "Class weights, SMOTE for training, threshold tuning for deployment, evaluate with AUC-PR not AUC-ROC" |
| "How do you handle adversarial evolution?" | Continuous retraining | "Daily retraining, drift monitoring by segment, shadow scoring before deployment" |
| "Precision vs. recall?" | Business impact analysis | "Each false positive costs $X in lost revenue, each false negative costs $Y in fraud - optimize the total cost" |
Spaced Repetition Checkpoints
- Day 0: Draw the three-tier architecture (rules → ML → manual review). Explain why each layer exists.
- Day 3: Explain 5 techniques for handling class imbalance. When would you use each?
- Day 7: Design fraud detection for a ride-sharing platform in 45 minutes.
- Day 14: Explain adversarial model drift and your retraining strategy.
- Day 21: Mock interview with follow-ups on explainability, regulatory requirements, and real-time feature engineering.
What's Next
- Ad Click Prediction - Another real-time classification problem with calibration requirements
- Anomaly Detection - Unsupervised approach to detecting unusual patterns
