Design: Fraud Detection - Real-Time Classification Under Extreme Imbalance

Reading time: ~25 min | Interview relevance: Critical | Roles: MLE

The Real Interview Moment

"Design a fraud detection system for a payment platform processing 10,000 transactions per second." You start describing a random forest classifier. The interviewer asks: "What's your positive rate?" You say "maybe 1%?" The interviewer responds: "In reality, it's 0.1%. With 10K TPS, that's 10 fraud cases per second and 9,990 legitimate ones. A model with 99% accuracy can still miss all 10 frauds per second while falsely blocking about 90 legitimate transactions per second; approving everything already scores 99.9%. How do you handle this?"

Fraud detection is the interview question that tests whether you understand the real-world consequences of ML decisions - every false negative costs the company money, every false positive costs a customer their purchase.
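
The interviewer's arithmetic can be checked with a few lines of Python. This is a minimal sketch; the function name `per_second_outcomes` and the two operating points are illustrative assumptions, not part of any particular system.

```python
def per_second_outcomes(tps, fraud_rate, recall, false_positive_rate):
    """Expected per-second confusion counts and accuracy at one operating point."""
    fraud = tps * fraud_rate
    legit = tps - fraud
    caught = fraud * recall                # true positives
    missed = fraud - caught                # false negatives
    blocked = legit * false_positive_rate  # false positives
    accuracy = (caught + legit - blocked) / tps
    return {"missed": missed, "blocked": blocked, "accuracy": accuracy}


# Approve everything: no fraud caught, no legitimate transaction blocked.
approve_all = per_second_outcomes(10_000, 0.001, recall=0.0, false_positive_rate=0.0)

# Catch 80% of fraud while blocking 0.5% of legitimate traffic.
tuned = per_second_outcomes(10_000, 0.001, recall=0.8, false_positive_rate=0.005)

print(approve_all)
print(tuned)
```

The model that approves everything has the higher accuracy of the two even though it catches no fraud, which is why accuracy is the wrong headline metric here.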

What You Will Master

  • Handling extreme class imbalance (0.1% positive rate)
  • Real-time feature engineering for streaming transactions
  • Precision-recall trade-offs with business impact analysis
  • Adversarial model evolution (fraudsters adapt)
  • Multi-layer defense architecture
  • Online model updates to catch new fraud patterns
  • Rule-based + ML hybrid systems

The Complete Design

Step 1: Requirements (5 min)

Functional requirements:

  • Score every transaction in real-time: approve, decline, or send to manual review
  • Process 10K transactions per second (TPS)
  • Detect card-not-present (CNP) fraud, account takeover, and promo abuse

Non-functional requirements:

  • Latency: <50ms per transaction (synchronous in payment flow)
  • Precision: >95% at operating threshold, with false positive rate <0.5% as a separate guardrail
  • Recall: >80% (catch 80%+ of fraud)
  • Dollar-weighted recall: >90% (high-value fraud matters more)
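
Precision, recall, and false positive rate are tied together by the base rate, so it helps to check whether a set of targets is mutually consistent. The sketch below uses only Bayes-style counting; the function names are hypothetical.

```python
def precision_from_rates(base_rate, recall, false_positive_rate):
    """Precision implied by a fraud base rate, recall, and FPR."""
    true_pos = base_rate * recall
    false_pos = (1.0 - base_rate) * false_positive_rate
    return true_pos / (true_pos + false_pos)


def fpr_for_precision(base_rate, recall, precision):
    """FPR needed to reach a target precision at a given recall."""
    true_pos = base_rate * recall
    false_pos = true_pos * (1.0 - precision) / precision
    return false_pos / (1.0 - base_rate)


# At a 0.1% base rate and 80% recall:
p = precision_from_rates(0.001, 0.8, 0.005)  # precision at FPR = 0.5%
f = fpr_for_precision(0.001, 0.8, 0.95)      # FPR needed for 95% precision
print(f"precision at 0.5% FPR: {p:.1%}")
print(f"FPR for 95% precision: {f:.4%}")
```

At this base rate a 0.5% FPR corresponds to roughly 14% precision, while 95% precision needs an FPR near 0.004%, so the precision target is by far the tighter of the two constraints.
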

Interviewer's Perspective

The candidate who says "I'd optimize for F1 score" gets a Lean Hire at best. The candidate who says "I'd optimize for dollar-weighted recall at a fixed false positive rate of 0.5%, because blocking a legitimate $500 purchase costs us $50 in lost revenue plus customer lifetime value, while missing a $500 fraud costs us $500 directly" gets a Strong Hire. Translate ML metrics to business impact.
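
A small sketch of how dollar-weighted recall and total cost might be computed at a given threshold. The 10% false-positive cost fraction is an assumption derived from the $50-on-$500 example above, and the sample transactions are made up for illustration.

```python
def dollar_weighted_recall(transactions, threshold):
    """transactions: list of (score, amount, is_fraud) tuples."""
    fraud_dollars = sum(a for _, a, f in transactions if f)
    caught_dollars = sum(a for s, a, f in transactions if f and s >= threshold)
    return caught_dollars / fraud_dollars if fraud_dollars else 0.0


def total_cost(transactions, threshold, fp_cost_fraction=0.10):
    """Missed fraud costs the full amount; a false block costs a fraction of it."""
    cost = 0.0
    for score, amount, is_fraud in transactions:
        blocked = score >= threshold
        if is_fraud and not blocked:
            cost += amount                     # false negative
        elif not is_fraud and blocked:
            cost += amount * fp_cost_fraction  # false positive (assumed 10%)
    return cost


# Hypothetical scored transactions: (model score, amount in dollars, is_fraud)
txns = [
    (0.95, 2000.0, True),
    (0.40, 50.0, True),
    (0.80, 500.0, False),
    (0.05, 30.0, False),
    (0.10, 120.0, False),
]

print(dollar_weighted_recall(txns, 0.7))  # fraction of fraud dollars caught
print(total_cost(txns, 0.7))              # dollars lost to FN + FP
```

In this toy sample only one of the two frauds is caught (50% count recall), yet almost all fraud dollars are caught, which is the gap that dollar-weighted recall is meant to expose.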

Step 2: Problem Formulation (5 min)

| Business Goal | ML Objective | Primary Metric | Guardrails |
| --- | --- | --- | --- |
| Minimize fraud losses while maintaining customer experience | Binary classification: fraud vs. legitimate | Precision@Recall=0.8, Dollar-saved rate | False positive rate <0.5%, latency <50ms |

Three-tier decision system:

Three-Tier Fraud Decision System - Auto-Approve (score < 0.1), Manual Review (0.1–0.7), Auto-Decline (≥ 0.7)
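
The three tiers map directly onto a small decision function. This is a sketch using the thresholds from the diagram; the constant and function names are illustrative.

```python
APPROVE_BELOW = 0.1        # score < 0.1 -> auto-approve
DECLINE_AT_OR_ABOVE = 0.7  # score >= 0.7 -> auto-decline


def decide(score):
    """Map a fraud score in [0, 1] to approve / review / decline."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score < APPROVE_BELOW:
        return "approve"
    if score >= DECLINE_AT_OR_ABOVE:
        return "decline"
    return "review"


for s in (0.05, 0.1, 0.45, 0.7, 0.93):
    print(s, decide(s))
```

Keeping the thresholds as named constants, separate from the model, lets the business retune them without retraining.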

Step 3: Features & Data (8 min)

Feature Categories

| Category | Examples | Computation |
| --- | --- | --- |
| Transaction | Amount, currency, merchant category, card type, time of day | Available instantly |
| Velocity | Transactions in last 1h/24h/7d, unique merchants in 24h, amount spent in 24h | Streaming aggregation |
| Behavioral | Deviation from user's normal spending pattern, unusual merchant category, time-of-day anomaly | Requires user profile |
| Device/Network | IP geolocation, device fingerprint, proxy/VPN detection, distance from last transaction | Real-time lookup |
| Graph | Shared device with known fraudster, merchant fraud rate, card-merchant pair frequency | Batch + real-time |

The Most Predictive Features (from industry experience)

  1. Velocity features: Number and total amount of transactions in the last hour
  2. Deviation features: How different is this transaction from the user's normal pattern?
  3. Network features: Is this IP/device associated with previous fraud?
  4. Merchant risk score: Historical fraud rate at this merchant

Common Trap

Many candidates list features without thinking about computation feasibility. "Average transaction amount over the last 30 days" is easy in batch but requires a streaming aggregation pipeline at 10K TPS. Always specify: Can this feature be computed within 50ms? For each feature, state whether it's pre-computed (batch), streamed (near-real-time), or available instantly.
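
To make the streaming case concrete, here is a minimal in-memory sketch of a one-hour velocity feature. It assumes events arrive in timestamp order per card; in production this state would live in a stream processor and a low-latency store rather than a Python dict. The class and feature names are hypothetical.

```python
from collections import defaultdict, deque


class VelocityTracker:
    """Sliding-window count and sum of transaction amounts per card."""

    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.events = defaultdict(deque)  # card_id -> deque of (ts, amount)
        self.totals = defaultdict(float)  # card_id -> running sum in window

    def _evict(self, card_id, now):
        # Drop events that have aged out of the window.
        queue = self.events[card_id]
        while queue and queue[0][0] <= now - self.window:
            _, amount = queue.popleft()
            self.totals[card_id] -= amount

    def record(self, card_id, ts, amount):
        self._evict(card_id, ts)
        self.events[card_id].append((ts, amount))
        self.totals[card_id] += amount

    def features(self, card_id, now):
        self._evict(card_id, now)
        return {
            "txn_count_1h": len(self.events[card_id]),
            "amount_sum_1h": self.totals[card_id],
        }


tracker = VelocityTracker()
tracker.record("card_1", ts=0, amount=20.0)
tracker.record("card_1", ts=1800, amount=30.0)
print(tracker.features("card_1", now=3000))  # both events still in window
print(tracker.features("card_1", now=3700))  # first event has expired
```

Each lookup does only amortized constant work, which is what keeps a velocity feature inside a tight latency budget.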

Training Data

  • Labels: Chargebacks (30-90 day delay), manual review decisions (1-24 hour delay)
  • Challenge: Label delay - chargebacks confirm fraud weeks after the transaction, so you train on labels that lag the traffic you serve, and recent data looks cleaner than it is
  • Imbalance: 0.1% positive rate → use SMOTE, class weights, or focal loss
  • Adversarial drift: Fraud patterns change weekly as fraudsters adapt

Step 4: Model (8 min)

The Progression

Fraud Detection Model Progression - Rules Engine → XGBoost → Ensemble → Online Learning with Graph

Why XGBoost is the production standard for fraud detection:

  • Handles tabular data with mixed feature types
  • Robust to missing values
  • Fast inference (<5ms per transaction)
  • Interpretable feature importance (needed for regulatory compliance)
  • Works well with class imbalance via scale_pos_weight

Why NOT deep learning (initially):

  • Tabular data - tree models typically outperform neural networks
  • Latency constraint - deep models are slower
  • Interpretability requirement - regulators require explainable decisions
  • Data volume - 0.1% positive rate means limited positive examples

Handling Class Imbalance

| Technique | How It Works | When to Use |
| --- | --- | --- |
| Class weights | Upweight positive class in loss function | Always - simplest approach |
| SMOTE | Generate synthetic positive examples | When positive examples are very few |
| Focal loss | Down-weight easy negatives | Neural network models |
| Threshold tuning | Adjust decision threshold post-training | Always - separate model from business decision |
| Cost-sensitive learning | Weight by transaction amount | When dollar impact matters more than count |
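
Two of these techniques are easy to show on a single example. The sketch below computes a positive-class weight as the negative-to-positive ratio (the value commonly suggested for XGBoost's `scale_pos_weight`) and compares a class-weighted log loss with the standard focal loss. The defaults `gamma=2.0` and `alpha=0.25` are the commonly cited ones and are assumptions here.

```python
import math

EPS = 1e-12  # keeps log() away from zero


def pos_weight(n_negative, n_positive):
    """Ratio of negatives to positives, used to upweight the positive class."""
    return n_negative / n_positive


def weighted_log_loss(y, p, w_pos):
    """Binary log loss where positive examples count w_pos times."""
    p = min(max(p, EPS), 1.0 - EPS)
    return -(w_pos * y * math.log(p) + (1 - y) * math.log(1.0 - p))


def focal_loss(y, p, gamma=2.0, alpha=0.25):
    """Focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p = min(max(p, EPS), 1.0 - EPS)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


w = pos_weight(999_000, 1_000)         # 0.1% positive rate
print(w)
print(weighted_log_loss(1, 0.1, w))    # a missed fraud is penalized heavily
print(focal_loss(0, 0.01))             # easy negative: near-zero loss
print(focal_loss(1, 0.1))              # hard positive: much larger loss
```

Both approaches shift gradient mass toward the rare class; threshold tuning is still needed afterwards because reweighting distorts the calibration of the output scores.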

Step 5: Serving (8 min)

Fraud Detection Real-Time Serving - Transaction → Rules Engine → Feature Store → ML Scoring → Decision

Key Architecture Decisions

| Component | Decision | Rationale |
| --- | --- | --- |
| Rules engine first | Block known fraud patterns before ML | Deterministic, fast, catches known attacks |
| Feature store | Redis with streaming updates (Kafka + Flink) | Sub-10ms feature lookups for velocity features |
| Model serving | XGBoost in C++ (treelite) | <5ms inference, no GPU needed |
| Fallback | Rules-only mode if ML is down | Higher false positive rate but still catches obvious fraud |
| Model updates | Retrain daily, deploy with shadow scoring | Fraud patterns evolve - stale models miss new attacks |
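
The serving path and its fallback can be sketched as one function. Everything here is a stand-in: the rules, the feature fetcher, and the model are hypothetical stubs, and a real system would also enforce timeouts against the latency budget.

```python
BLOCKLIST = {"card_blocked_1"}  # hypothetical blocklisted card id


def blocklist_rule(txn):
    return "decline" if txn["card_id"] in BLOCKLIST else None


def large_amount_rule(txn):
    # Hypothetical stricter rule, applied only when ML is unavailable.
    return "review" if txn["amount"] >= 1000 else None


def tier(score):
    if score < 0.1:
        return "approve"
    return "decline" if score >= 0.7 else "review"


def score_transaction(txn, rules, fallback_rules, fetch_features, model_score):
    """Rules first, then ML; drop to rules-only mode if ML scoring fails."""
    for rule in rules:
        verdict = rule(txn)
        if verdict is not None:
            return verdict, "rules"
    try:
        score = model_score(fetch_features(txn))
    except Exception:
        for rule in fallback_rules:
            verdict = rule(txn)
            if verdict is not None:
                return verdict, "fallback"
        return "approve", "fallback"
    return tier(score), "ml"


def fetch_features(txn):
    return {"amount": txn["amount"]}


def healthy_model(features):
    return 0.85 if features["amount"] > 5000 else 0.02


def broken_model(features):
    raise TimeoutError("model unavailable")


txn = {"card_id": "card_42", "amount": 1500.0}
print(score_transaction(txn, [blocklist_rule], [large_amount_rule], fetch_features, healthy_model))
print(score_transaction(txn, [blocklist_rule], [large_amount_rule], fetch_features, broken_model))
```

The fallback rules are deliberately stricter than the normal rule set, which is where the higher false positive rate in rules-only mode comes from.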

Why Rules + ML (Not Just ML)

| Layer | Catches | Example |
| --- | --- | --- |
| Rules engine | Known fraud patterns, sanctions, blacklists | Card on blocklist → instant decline |
| ML model | Complex, subtle patterns | Unusual velocity + new device + high amount → likely fraud |
| Manual review | Edge cases ML is uncertain about | Score 0.3-0.7, high-value transaction |

Step 6: Evaluation & Iteration (8 min)

Offline Evaluation

| Metric | Definition | Target |
| --- | --- | --- |
| Precision @ FPR=0.5% | How precise when we block 0.5% of legitimate traffic | > 80% |
| Recall | % of fraud caught | > 80% |
| Dollar recall | % of fraud dollars caught | > 90% |
| AUC-PR | Area under precision-recall curve | > 0.7 |

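Why not AUC-ROC? With a 0.1% positive rate, AUC-ROC can look excellent while the model is unusable: the false positive rate is measured against the huge legitimate class, so even a tiny FPR can leave false positives far outnumbering true positives. AUC-PR measures precision directly and is much more informative for imbalanced problems.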
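
The gap between the two metrics shows up even on a tiny synthetic example. The sketch below builds 9,990 negatives and 10 positives with hand-picked integer scores (no ties) and computes both metrics from scratch; the data is fabricated purely for illustration, and average precision is used as the usual estimate of AUC-PR.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of precision values at the rank of each positive (assumes no ties)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Negatives get even scores 0, 2, ..., 19978; positives get odd scores near the top.
neg_scores = [2 * i for i in range(9990)]
pos_scores = [19601 + 20 * k for k in range(10)]
scores = neg_scores + pos_scores
labels = [0] * len(neg_scores) + [1] * len(pos_scores)

auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(f"AUC-ROC: {auc:.3f}")            # approximately 0.986
print(f"Average precision: {ap:.3f}")   # approximately 0.034
```

Every positive outranks more than 98% of negatives, yet roughly a hundred or more legitimate transactions sit above each fraud case, so precision at any useful recall is only a few percent.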

Adversarial Evolution

The Fraudster–Model Arms Race - Pattern A → Model V1 catches → Pattern B → Model V2 catches → Pattern C...

This arms race means:

  • Retrain frequently: Daily or weekly, not monthly
  • Monitor for drift: Track fraud rate by segment
  • A/B test carefully: Don't expose treatment group to higher fraud risk
  • Shadow scoring: Score with new model but use old model's decisions, compare
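
Shadow scoring boils down to logging both models' scores for each transaction and comparing them once labels mature. A minimal sketch, assuming a shared decline threshold of 0.7 and hypothetical record fields `prod_score`, `shadow_score`, and `is_fraud`:

```python
def shadow_report(records, threshold=0.7):
    """Summarize where the shadow model would have decided differently."""
    report = {
        "disagreements": 0,
        "fraud_caught_only_by_shadow": 0,
        "fraud_caught_only_by_prod": 0,
        "legit_blocked_only_by_shadow": 0,
        "legit_blocked_only_by_prod": 0,
    }
    for r in records:
        prod_block = r["prod_score"] >= threshold
        shadow_block = r["shadow_score"] >= threshold
        if prod_block == shadow_block:
            continue
        report["disagreements"] += 1
        winner = "shadow" if shadow_block else "prod"
        kind = "fraud_caught" if r["is_fraud"] else "legit_blocked"
        report[f"{kind}_only_by_{winner}"] += 1
    return report


records = [
    {"prod_score": 0.20, "shadow_score": 0.90, "is_fraud": True},
    {"prod_score": 0.80, "shadow_score": 0.85, "is_fraud": True},
    {"prod_score": 0.10, "shadow_score": 0.75, "is_fraud": False},
    {"prod_score": 0.05, "shadow_score": 0.03, "is_fraud": False},
]
print(shadow_report(records))
```

Because only the production model's decisions are enforced, no customer is exposed to the candidate model's mistakes while this comparison accumulates.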

Practice Problems

Problem 1: Account Takeover Detection

Direction

A fraudster gains access to a legitimate user's account and makes transactions. The transactions look normal for that user. How do you detect this?

Key Insight

Account takeover (ATO) is harder than card fraud because the transactions match the user's profile. Key signals: login from new device/IP, password change followed by purchase, session behavior anomaly (navigation speed, click patterns), geographic impossibility (login from NYC then London in 1 hour). Build a separate ATO model that focuses on session and device features rather than transaction features.
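
The geographic-impossibility signal can be sketched with the haversine formula. The 900 km/h speed limit (roughly airliner cruise speed) and the 50 km tolerance for simultaneous events are assumptions, as are the function names.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_impossible_travel(prev, curr, max_speed_kmh=900.0, min_distance_km=50.0):
    """prev and curr are (lat, lon, unix_seconds) tuples for consecutive logins."""
    distance = haversine_km(prev[0], prev[1], curr[0], curr[1])
    hours = (curr[2] - prev[2]) / 3600.0
    if hours <= 0:
        # Simultaneous events: flag only if clearly not geolocation noise.
        return distance > min_distance_km
    return distance / hours > max_speed_kmh


nyc_login = (40.7128, -74.0060, 0)
london_login = (51.5074, -0.1278, 3600)  # one hour later
print(haversine_km(*nyc_login[:2], *london_login[:2]))
print(is_impossible_travel(nyc_login, london_login))
```

IP geolocation is noisy and VPNs are common, so in practice this works better as one feature feeding the ATO model than as a hard rule.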

Problem 2: Explain a Fraud Decision

Direction

A customer calls complaining their transaction was blocked. Your XGBoost model scored it as 0.85 (fraud). How do you explain the decision?

Key Insight

Use SHAP values to explain individual predictions: "This transaction was flagged because: (1) It was 5x your typical transaction amount (+0.15), (2) It came from a new device we haven't seen before (+0.12), (3) It was at a merchant category you've never used (+0.08)." This is not just a nice-to-have - regulations such as ECOA (adverse-action reasons for credit decisions) and, depending on the use case, the EU AI Act can require explanations for automated decisions.
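
Turning per-feature contributions into customer-facing reasons is mostly a formatting step. The sketch below assumes the contributions have already been computed (for example by a SHAP explainer) and are passed in as a plain dict; the feature names and description strings are hypothetical.

```python
def top_reasons(contributions, descriptions, k=3):
    """Return the k largest fraud-increasing contributions as readable strings."""
    positive = [(name, value) for name, value in contributions.items() if value > 0]
    positive.sort(key=lambda item: item[1], reverse=True)
    return [
        f"{descriptions.get(name, name)} ({value:+.2f})"
        for name, value in positive[:k]
    ]


# Hypothetical per-feature contributions for the blocked transaction.
contributions = {
    "amount_vs_typical": 0.15,
    "new_device": 0.12,
    "new_merchant_category": 0.08,
    "account_age": -0.05,  # pushes the score down, so it is not a reason
}
descriptions = {
    "amount_vs_typical": "Amount was 5x your typical transaction",
    "new_device": "New device we haven't seen before",
    "new_merchant_category": "Merchant category you've never used",
}

for reason in top_reasons(contributions, descriptions):
    print(reason)
```

Only contributions that raise the score are reported, since features that lowered it do not explain why the transaction was blocked.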

Interview Cheat Sheet

| Question Pattern | Framework | Key Phrases |
| --- | --- | --- |
| "Design fraud detection" | Rules + ML + manual review three-tier | "Rules catch known patterns, ML catches subtle ones, manual review handles uncertainty" |
| "How do you handle imbalance?" | Multiple techniques | "Class weights, SMOTE for training, threshold tuning for deployment, evaluate with AUC-PR not AUC-ROC" |
| "How do you handle adversarial evolution?" | Continuous retraining | "Daily retraining, drift monitoring by segment, shadow scoring before deployment" |
| "Precision vs. recall?" | Business impact analysis | "Each false positive costs $X in lost revenue, each false negative costs $Y in fraud - optimize the total cost" |

Spaced Repetition Checkpoints

  • Day 0: Draw the three-tier architecture (rules → ML → manual review). Explain why each layer exists.
  • Day 3: Explain 5 techniques for handling class imbalance. When would you use each?
  • Day 7: Design fraud detection for a ride-sharing platform in 45 minutes.
  • Day 14: Explain adversarial model drift and your retraining strategy.
  • Day 21: Mock interview with follow-ups on explainability, regulatory requirements, and real-time feature engineering.

© 2026 EngineersOfAI. All rights reserved.