What is fraud detection system design?

End-to-end design of a real-time fraud detection system - covering feature engineering, imbalanced learning, streaming scoring, delayed labels, and graph-based fraud ring detection.

Designing a Fraud Detection System at Scale

The Adversarial Landscape

The fraud team at a payments company receives a call from a large merchant at 11 PM on a Friday. Over the past six hours, they have been hit by a wave of fraudulent transactions - cards stolen in a data breach, systematically tested and then used for large purchases. The attack pattern is new: the fraudsters are using residential proxies with IP addresses that perfectly mimic legitimate user traffic. The transaction amounts are carefully chosen to stay just below the merchant's manual review threshold. The cards are being used at exactly the frequency of real users, spaced with realistic time intervals.

The rule-based fraud system has missed all of them. The rules were designed for known attack patterns, and this one is novel. The ML model has missed most of them too - it has never seen this combination of features before. By 11 PM, $2.3 million in fraudulent transactions has cleared. By the time the fraud team identifies the pattern and updates the rules, $4.1 million in total has been lost.

This is the fundamental challenge of fraud detection: it is an arms race against an adversary that constantly adapts. The fraudsters reverse-engineer your detection system by probing it with small transactions, observe what gets flagged, and adjust. Every detection improvement is eventually circumvented. The system must detect novel patterns before they cause significant damage, operate in real time under the transaction processing latency constraint, maintain an extremely low false positive rate (incorrectly flagging legitimate transactions causes merchant revenue loss and user frustration), and handle extreme class imbalance (fraud rate is 0.01-0.1% of transactions in typical payment systems).

Requirements

Functional requirements:

Score every transaction for fraud probability before authorization
Support manual review queue for high-risk transactions
Detect organized fraud rings (multiple accounts, coordinated attacks)
Provide explainable decisions for chargebacks and merchant disputes

Non-functional requirements:

Latency: scoring decision in under 100ms (must not delay payment authorization)
Throughput: 10,000 transactions per second at peak (major payment network scale)
False positive rate: under 0.1% (1 in 1,000 legitimate transactions incorrectly blocked)
Recall: catch at least 85% of fraud by dollar value

System Architecture

Feature Engineering

Feature engineering is usually the most impactful component of a fraud detection system. On tabular transaction data, a gradient boosted tree model trained on 200 well-designed features will typically outperform a deep neural network trained on 20 features.

Transaction Features

import numpy as np
from datetime import datetime, timezone
import hashlib


class FraudFeatureExtractor:
    """
    Extract fraud detection features for a single transaction.
    Combines transaction-level, behavioral, and device features.
    """

    def extract_transaction_features(self, txn: dict) -> dict:
        """
        Static features from the transaction itself.
        Available immediately at scoring time.
        """
        amount = txn["amount_usd"]
        ts = datetime.fromtimestamp(txn["timestamp"], tz=timezone.utc)

        return {
            "amount_usd": amount,
            "amount_log": np.log1p(amount),
            "amount_cents": amount * 100,
            "is_round_amount": float(amount % 1 == 0),  # flags round-dollar amounts
            "hour_of_day": ts.hour,
            "day_of_week": ts.weekday(),
            "is_weekend": float(ts.weekday() >= 5),
            "is_night": float(ts.hour < 6 or ts.hour > 22),
            "is_international": float(
                txn.get("card_country") != txn.get("merchant_country")
            ),
            "merchant_category": txn.get("merchant_category_code", 0),
            # Stable bucket via hashlib: the built-in hash() of a str is salted
            # per process, so it would differ between training and serving.
            "currency_code": int(
                hashlib.md5(txn.get("currency", "USD").encode()).hexdigest(), 16
            ) % 100,
        }

    def extract_velocity_features(
        self,
        user_id: str,
        card_id: str,
        amount: float,
        redis_client,
    ) -> dict:
        """
        Behavioral velocity features from Redis.
        Pre-computed by the streaming pipeline; fetched in under 2ms.
        """
        pipe = redis_client.pipeline()
        # User-level velocity
        pipe.get(f"user:{user_id}:txn_count_1h")
        pipe.get(f"user:{user_id}:txn_count_24h")
        pipe.get(f"user:{user_id}:txn_amount_24h")
        pipe.get(f"user:{user_id}:distinct_merchants_24h")
        # Card-level velocity
        pipe.get(f"card:{card_id}:txn_count_1h")
        pipe.get(f"card:{card_id}:txn_count_24h")
        pipe.get(f"card:{card_id}:amount_1h")
        # Card-level declines (signal for card testing attacks: many small attempts)
        pipe.get(f"card:{card_id}:declined_count_24h")

        results = pipe.execute()

        def safe_float(val, default=0.0):
            return float(val) if val else default

        txn_count_1h = safe_float(results[0])
        txn_count_24h = safe_float(results[1])
        txn_amount_24h = safe_float(results[2])
        distinct_merchants = safe_float(results[3])
        card_txn_1h = safe_float(results[4])
        card_txn_24h = safe_float(results[5])
        card_amount_1h = safe_float(results[6])
        card_declined_24h = safe_float(results[7])

        return {
            "user_txn_count_1h": txn_count_1h,
            "user_txn_count_24h": txn_count_24h,
            "user_txn_amount_24h": txn_amount_24h,
            "user_distinct_merchants_24h": distinct_merchants,
            "user_amount_velocity": txn_amount_24h / max(txn_count_24h, 1),
            "card_txn_count_1h": card_txn_1h,
            "card_txn_count_24h": card_txn_24h,
            "card_amount_1h": card_amount_1h,
            "card_declined_24h": card_declined_24h,
            # Velocity ratios (current transaction vs history)
            "amount_to_24h_avg_ratio": (
                amount / max(txn_amount_24h / max(txn_count_24h, 1), 1)
            ),
            "is_velocity_burst": float(card_txn_1h > 10),
        }

    def extract_device_features(self, txn: dict) -> dict:
        """
        Device fingerprinting features.
        Used to detect account takeover and device spoofing.
        """
        device_id = txn.get("device_id", "")
        ip = txn.get("ip_address", "")
        user_agent = txn.get("user_agent", "")

        return {
            "is_new_device": float(txn.get("is_new_device", True)),
            "is_vpn_ip": float(txn.get("is_vpn", False)),
            "is_tor_ip": float(txn.get("is_tor", False)),
            "is_datacenter_ip": float(txn.get("is_datacenter", False)),
            "device_risk_score": float(txn.get("device_risk_score", 0.5)),
            "ip_country_mismatch": float(
                txn.get("ip_country") != txn.get("card_country")
            ),
            "shipping_billing_mismatch": float(
                txn.get("shipping_address") != txn.get("billing_address")
            ),
        }
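The velocity features above are described as pre-computed by a streaming pipeline, which the lesson does not show. As a rough illustration of the idea, the sketch below keeps a trailing-window count and amount for a single key in memory. The class name `SlidingWindowCounter` and its methods are hypothetical, and the sketch assumes events arrive in timestamp order; a production pipeline would write the results to Redis keys such as `card:{card_id}:txn_count_1h` instead.

```python
from collections import deque


class SlidingWindowCounter:
    """Count and sum events for one key over a trailing time window.

    Hypothetical in-memory stand-in for a streaming aggregation job.
    Assumes events are added in non-decreasing timestamp order.
    """

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, amount), oldest first
        self.total_amount = 0.0

    def _evict(self, now: float) -> None:
        # Drop events that are at least one full window old.
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_amount = self.events.popleft()
            self.total_amount -= old_amount

    def add(self, timestamp: float, amount: float) -> None:
        self._evict(timestamp)
        self.events.append((timestamp, amount))
        self.total_amount += amount

    def snapshot(self, now: float) -> dict:
        self._evict(now)
        return {"count": len(self.events), "amount": self.total_amount}


card_1h = SlidingWindowCounter(window_seconds=3600)
card_1h.add(timestamp=0, amount=20.0)
card_1h.add(timestamp=1800, amount=35.0)
card_1h.add(timestamp=4000, amount=5.0)  # the event at t=0 has expired by now
print(card_1h.snapshot(now=4000))  # {'count': 2, 'amount': 40.0}
```

Keeping the raw events in a deque makes eviction exact at the cost of memory per key; bucketed counters are a common cheaper approximation.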

Handling Extreme Class Imbalance

Fraud rates are typically 0.01-0.1%, so a naive model that always predicts "not fraud" achieves at least 99.9% accuracy while catching zero fraud. The problem: without reweighting, the training loss of a gradient boosted tree model is dominated by the majority class, so the model can minimize error by scoring almost everything as legitimate.

import lightgbm as lgb
import numpy as np
from sklearn.metrics import (
    precision_recall_curve, average_precision_score,
    classification_report
)


class ImbalancedFraudModel:
    """
    Fraud detection model with imbalanced dataset handling.
    Uses scale_pos_weight and PR-AUC for evaluation (not accuracy).
    """

    def __init__(self, fraud_rate: float = 0.001):
        # scale_pos_weight: ratio of negative to positive samples
        # For 0.1% fraud: scale_pos_weight = 999
        self.scale_pos_weight = (1 - fraud_rate) / fraud_rate

        self.model = lgb.LGBMClassifier(
            objective="binary",
            metric="average_precision",  # PR-AUC, not ROC-AUC
            n_estimators=1000,
            learning_rate=0.05,
            num_leaves=63,
            max_depth=7,
            scale_pos_weight=self.scale_pos_weight,
            subsample=0.8,
            subsample_freq=1,  # LightGBM only applies row subsampling when this is non-zero
            colsample_bytree=0.7,
            min_child_samples=50,
            reg_alpha=0.1,
            reg_lambda=0.1,
        )

    def train(
        self,
        X_train: np.ndarray,
        y_train: np.ndarray,
        X_val: np.ndarray,
        y_val: np.ndarray,
    ) -> None:
        self.model.fit(
            X_train, y_train,
            eval_set=[(X_val, y_val)],
            callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)],
        )

    def evaluate(self, X: np.ndarray, y: np.ndarray) -> dict:
        """
        Evaluate using metrics appropriate for fraud detection.
        NOT accuracy (meaningless for imbalanced). NOT ROC-AUC (misleading).
        Use: PR-AUC, precision/recall at operating threshold.
        """
        proba = self.model.predict_proba(X)[:, 1]
        pr_auc = average_precision_score(y, proba)

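        # Find the lowest threshold whose precision is at least 99.9%.
        # Precision is the share of flagged transactions that are fraud; it is
        # not the same quantity as the false positive rate.
        precisions, recalls, thresholds = precision_recall_curve(y, proba)
        target_precision = 0.999
        # precisions has one more entry than thresholds; drop the final point
        valid = precisions[:-1] >= target_precision
        if valid.any():
            threshold = thresholds[valid][0]  # lowest qualifying threshold, highest recall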
            preds_at_threshold = (proba >= threshold).astype(int)
            report = classification_report(y, preds_at_threshold)
        else:
            threshold = 0.5
            report = "Cannot meet 99.9% precision requirement"

        return {
            "pr_auc": float(pr_auc),
            "operating_threshold": float(threshold),
            "classification_report": report,
        }

    def predict_score(self, X: np.ndarray) -> np.ndarray:
        """Return fraud probability scores for scoring service."""
        return self.model.predict_proba(X)[:, 1]

Why PR-AUC, Not ROC-AUC

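For fraud detection with a 0.1% fraud rate, ROC-AUC is misleading. It is built from the true positive rate and the false positive rate, and neither reflects how rare fraud is. A model with ROC-AUC of 0.98 can still have terrible precision: at 90% recall and a 1% false positive rate, only about 8% of flagged transactions are actually fraud, because legitimate transactions outnumber fraudulent ones by roughly 1,000 to 1. PR-AUC (Precision-Recall AUC) measures the trade-off between catching fraud (recall) and the share of flagged transactions that really are fraud (precision) - the metrics that actually matter for fraud operations.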
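To make the base-rate effect concrete, the following sketch computes precision from the fraud rate, recall, and false positive rate. The function name `precision_at` and the operating point (90% recall, 1% false positive rate) are illustrative assumptions, not figures from a real system.

```python
def precision_at(base_rate: float, recall: float, fpr: float) -> float:
    """Precision implied by a base rate, a recall, and a false positive rate."""
    true_positives = base_rate * recall          # fraud that is flagged
    false_positives = (1.0 - base_rate) * fpr    # legitimate traffic that is flagged
    return true_positives / (true_positives + false_positives)


# Assumed operating point: 0.1% fraud, 90% recall, 1% false positive rate.
p = precision_at(base_rate=0.001, recall=0.90, fpr=0.01)
print(f"{p:.1%}")  # 8.3%
```

The same recall and false positive rate on a balanced dataset would give precision near 99%, which is why ROC-style numbers alone say little about the review queue a fraud team will actually see.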

At the operating threshold, a good fraud detection system achieves: recall 80-90% (catching 80-90% of fraud by transaction count), precision 99%+ (less than 1% of flagged transactions are legitimate).

The Feedback Loop Problem: Delayed Labels

Fraud labels arrive days to weeks after the transaction. A credit card chargeback (the definitive fraud signal) takes 30-90 days to process. This creates a fundamental training data problem: if you use transactions from the last 30 days as training data, most of those transactions have no label yet - you do not know if they are fraud.

from datetime import datetime, timezone, timedelta
import pandas as pd


class FraudLabelJoiner:
    """
    Join transaction events with delayed fraud labels.
    Handles the temporal nature of fraud signals correctly.
    """

    LABEL_OBSERVATION_WINDOW_DAYS = 30

    def build_training_dataset(
        self,
        transactions_df: pd.DataFrame,   # all transactions with timestamps
        chargebacks_df: pd.DataFrame,     # chargeback events (delayed signal)
        as_of_date: datetime,            # training cutoff date
    ) -> pd.DataFrame:
        """
        Build a training dataset with correct point-in-time labels.

        Key: only include transactions that occurred at least
        OBSERVATION_WINDOW days before as_of_date.
        This ensures enough time has passed for fraud labels to arrive.
        """
        label_cutoff = as_of_date - timedelta(days=self.LABEL_OBSERVATION_WINDOW_DAYS)

        # Only include transactions old enough to have received labels
        eligible_txns = transactions_df[
            transactions_df["transaction_date"] <= label_cutoff
        ].copy()

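        # Join with chargebacks known by as_of_date. Keep one row per
        # transaction so repeated chargeback events cannot duplicate rows.
        chargebacks_joined = chargebacks_df[
            chargebacks_df["chargeback_date"] <= as_of_date
        ][["transaction_id", "chargeback_date"]].drop_duplicates(
            subset="transaction_id"
        )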

        eligible_txns = eligible_txns.merge(
            chargebacks_joined,
            on="transaction_id",
            how="left",
        )

        # Label: 1 if a chargeback had been received by as_of_date, else 0
        eligible_txns["label"] = (
            eligible_txns["chargeback_date"].notna() &
            (eligible_txns["chargeback_date"] <= as_of_date)
        ).astype(int)

        eligible_txns.drop(columns=["chargeback_date"], inplace=True)

        fraud_rate = eligible_txns["label"].mean()
        print(
            f"[Training Data] {len(eligible_txns):,} transactions. "
            f"Fraud rate: {fraud_rate:.4%}"
        )
        return eligible_txns

Graph-Based Fraud Ring Detection

Individual transaction scoring misses coordinated fraud rings - networks of accounts that share device IDs, IP addresses, phone numbers, or merchant relationships in ways that indicate coordination.

import networkx as nx
from typing import Optional


class FraudGraphAnalyzer:
    """
    Build a bipartite graph of accounts and shared attributes.
    Detect fraud rings by finding connected components with
    suspicious connectivity patterns.

    This runs asynchronously - not on the critical serving path.
    Results feed into a review queue.
    """

    def __init__(self):
        self.graph = nx.Graph()

    def add_transaction(self, txn: dict) -> None:
        """
        Add transaction to the fraud graph.
        Connect account to device, IP, and merchant via edges.
        """
        account_node = f"account:{txn['account_id']}"
        device_node = f"device:{txn['device_id']}"
        ip_node = f"ip:{txn['ip_address']}"
        merchant_node = f"merchant:{txn['merchant_id']}"

        self.graph.add_node(account_node, type="account")
        self.graph.add_node(device_node, type="device")
        self.graph.add_node(ip_node, type="ip")
        self.graph.add_node(merchant_node, type="merchant")

        self.graph.add_edge(account_node, device_node, weight=1)
        self.graph.add_edge(account_node, ip_node, weight=1)
        self.graph.add_edge(account_node, merchant_node,
                           amount=txn.get("amount_usd", 0))

    def detect_fraud_rings(
        self,
        min_accounts_in_ring: int = 3,
        max_accounts_in_ring: int = 200,
    ) -> list:
        """
        Find connected components that look like fraud rings.
        A fraud ring: many accounts sharing the same device or IP.
        """
        suspicious_rings = []

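        # Merchant nodes are left out when forming components: a popular
        # merchant would otherwise link thousands of unrelated accounts.
        identity_graph = self.graph.subgraph(
            [n for n in self.graph if not n.startswith("merchant:")]
        )
        for component in nx.connected_components(identity_graph):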

            account_nodes = [
                n for n in component if n.startswith("account:")
            ]
            device_nodes = [
                n for n in component if n.startswith("device:")
            ]

            if len(account_nodes) < min_accounts_in_ring:
                continue
            if len(account_nodes) > max_accounts_in_ring:
                continue

            # Suspicious: many accounts sharing few devices (high account/device ratio)
            account_to_device_ratio = len(account_nodes) / max(len(device_nodes), 1)

            if account_to_device_ratio > 5:
                suspicious_rings.append({
                    "account_ids": [n.replace("account:", "") for n in account_nodes],
                    "device_ids": [n.replace("device:", "") for n in device_nodes],
                    "account_count": len(account_nodes),
                    "device_count": len(device_nodes),
                    "account_to_device_ratio": account_to_device_ratio,
                    "risk_score": min(account_to_device_ratio / 20, 1.0),
                })

        return sorted(suspicious_rings, key=lambda r: -r["risk_score"])

    def get_account_ring_risk(self, account_id: str) -> float:
        """
        Get the fraud ring risk score for a specific account.
        Used to enrich real-time scoring with graph signals.
        """
        account_node = f"account:{account_id}"
        if account_node not in self.graph:
            return 0.0

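        # Ignore merchant nodes so a shared merchant does not merge unrelated accounts
        identity_graph = self.graph.subgraph(
            [n for n in self.graph if not n.startswith("merchant:")]
        )
        component = nx.node_connected_component(identity_graph, account_node)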
        account_count = sum(1 for n in component if n.startswith("account:"))
        device_count = max(
            sum(1 for n in component if n.startswith("device:")), 1
        )
        ratio = account_count / device_count
        return min(ratio / 20, 1.0)

Real-Time Scoring Service

import time
from dataclasses import dataclass
from typing import Optional
import redis
import numpy as np


@dataclass
class FraudDecision:
    transaction_id: str
    fraud_score: float
    decision: str           # "allow", "review", "block"
    latency_ms: float
    rule_triggered: Optional[str]
    model_score: float
    graph_risk: float


class RealTimeFraudScorer:
    """
    Real-time fraud scoring service.
    Target: under 100ms for the full scoring decision.
    """

    BLOCK_THRESHOLD = 0.8
    REVIEW_THRESHOLD = 0.4

    def __init__(
        self,
        model,
        feature_extractor: FraudFeatureExtractor,
        redis_client: redis.Redis,
        graph_analyzer: FraudGraphAnalyzer,
    ):
        self.model = model
        self.extractor = feature_extractor
        self.redis = redis_client
        self.graph = graph_analyzer

    def score(self, txn: dict) -> FraudDecision:
        start = time.monotonic()

        # 1. Rule engine - fast, known patterns (1ms)
        rule_result = self._check_rules(txn)
        if rule_result:
            return FraudDecision(
                transaction_id=txn["transaction_id"],
                fraud_score=1.0,
                decision="block",
                latency_ms=(time.monotonic() - start) * 1000,
                rule_triggered=rule_result,
                model_score=1.0,
                graph_risk=1.0,
            )

        # 2. Feature extraction (5ms)
        txn_features = self.extractor.extract_transaction_features(txn)
        velocity_features = self.extractor.extract_velocity_features(
            txn["account_id"], txn["card_id"], txn["amount_usd"], self.redis
        )
        device_features = self.extractor.extract_device_features(txn)

        all_features = {**txn_features, **velocity_features, **device_features}
        feature_vector = np.array(
            list(all_features.values()), dtype=np.float32
        ).reshape(1, -1)

        # 3. ML model scoring (10ms)
        model_score = float(self.model.predict_score(feature_vector)[0])

        # 4. Graph risk signal: one Redis GET of a precomputed score.
        # The graph analysis itself runs asynchronously, off the serving path.
        graph_risk_key = f"account_graph_risk:{txn['account_id']}"
        graph_risk_raw = self.redis.get(graph_risk_key)
        graph_risk = float(graph_risk_raw) if graph_risk_raw else 0.0

        # 5. Ensemble: weighted combination
        final_score = 0.7 * model_score + 0.3 * graph_risk

        decision = (
            "block" if final_score >= self.BLOCK_THRESHOLD
            else "review" if final_score >= self.REVIEW_THRESHOLD
            else "allow"
        )

        latency_ms = (time.monotonic() - start) * 1000
        return FraudDecision(
            transaction_id=txn["transaction_id"],
            fraud_score=final_score,
            decision=decision,
            latency_ms=latency_ms,
            rule_triggered=None,
            model_score=model_score,
            graph_risk=graph_risk,
        )

    def _check_rules(self, txn: dict) -> Optional[str]:
        """
        Hard rules for known fraud patterns.
        Returns rule name if triggered, None otherwise.
        """
        amount = txn.get("amount_usd", 0)

        # Block transactions from known fraudulent BINs
        card_bin = txn.get("card_number", "")[:6]
        if card_bin in self._blocked_bins():
            return "blocked_bin"

        # Block international transactions over $5000 on new accounts
        is_international = txn.get("card_country") != txn.get("merchant_country")
        account_age_days = txn.get("account_age_days", 365)
        if is_international and amount > 5000 and account_age_days < 30:
            return "new_account_high_value_international"

        return None

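    def _blocked_bins(self) -> set:
        """Load blocked BINs from Redis (one SMEMBERS round trip per call)."""
        blocked = self.redis.smembers("blocked_bins")
        # redis-py returns bytes unless the client uses decode_responses=True
        return {b.decode() if isinstance(b, bytes) else b for b in blocked}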

Concept Drift and Model Retraining

Fraud patterns evolve continuously. The model trained 6 months ago will have degraded recall because fraudsters have adapted. Monitor for concept drift by tracking the gap between predicted fraud rate and actual fraud rate (measured with a 30-day lag as chargebacks arrive).

from prometheus_client import Gauge

FRAUD_PREDICTED_RATE = Gauge(
    "fraud_predicted_rate",
    "Rolling average predicted fraud probability for all transactions",
)

FRAUD_ACTUAL_RATE = Gauge(
    "fraud_actual_rate",
    "Actual fraud rate from confirmed chargebacks (30-day lag)",
)

RECALL_AT_THRESHOLD = Gauge(
    "fraud_recall_at_threshold",
    "Fraction of confirmed fraud that was scored above the block threshold",
)

PRECISION_AT_THRESHOLD = Gauge(
    "fraud_precision_at_threshold",
    "Fraction of blocked transactions that were actually fraudulent",
)

Retrain the model at minimum weekly, or immediately when:

Recall drops below 80% on the last 7 days of confirmed fraud labels
A new fraud pattern is identified by the fraud operations team
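The predicted vs actual fraud rate gap exceeds 0.5 percentage points (a loose bound when the base rate is 0.01-0.1%; a limit on the predicted/actual ratio reacts much earlier)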
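The retraining triggers listed above could be checked by a small scheduled job. The sketch below is one possible shape for that check; the function name `retrain_reasons`, its parameters, and the reason strings are hypothetical, and the rates are assumed to be passed as fractions (0.001 for 0.1%).

```python
def retrain_reasons(
    recall_last_7d: float,
    predicted_rate: float,
    actual_rate: float,
    new_pattern_reported: bool,
    min_recall: float = 0.80,
    max_gap_pp: float = 0.5,
) -> list:
    """Return the list of triggers that currently call for a retrain."""
    reasons = []
    if recall_last_7d < min_recall:
        reasons.append("recall_below_floor")
    if new_pattern_reported:
        reasons.append("new_fraud_pattern")
    # Rates are fractions, so multiply by 100 to compare in percentage points.
    gap_pp = abs(predicted_rate - actual_rate) * 100
    if gap_pp > max_gap_pp:
        reasons.append("predicted_actual_gap")
    return reasons


print(retrain_reasons(0.76, predicted_rate=0.0011, actual_rate=0.0010,
                      new_pattern_reported=False))
# ['recall_below_floor']
```

Returning the reasons rather than a bare boolean keeps the alert message and the retraining ticket self-explanatory.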

:::danger False Positive Rate Must Come First

In fraud detection, the temptation is to maximize recall (catch all fraud). But a system that blocks 10% of legitimate transactions will cost the business far more in merchant revenue than the fraud it prevents. A false positive on a legitimate user creates a terrible experience and reduces trust in the payment system.

Always set the operating threshold by precision first, recall second. Your business requirement is: precision over 99.9% (fewer than 1 in 1,000 blocked transactions is legitimate). Find the threshold that meets this precision target, then measure what recall you achieve. If recall is unacceptably low, improve the model - do not lower the precision threshold. :::

:::warning Training Data Leakage from Future Fraud Signals

The most common training data mistake in fraud: using features that include information from after the transaction time. For example, "account was later confirmed as compromised" is a perfect predictor of fraud, but you cannot know this at transaction time. Similarly, any feature derived from events that occur after the transaction timestamp must be excluded from training.

This includes: chargebacks on other transactions by the same account (which may have arrived after the training transaction), merchant risk scores that incorporate future fraud data, and IP address risk scores that were updated after the transaction occurred. Always compute features using only data available before or at the transaction timestamp. :::
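A point-in-time feature can be sketched as a filter on when each signal became known. In the example below, the field name `received_at` and the function `prior_chargebacks_known` are assumptions made for illustration; the point is only that the feature counts chargebacks received at or before the transaction timestamp, not every chargeback that exists when the training set is built.

```python
from datetime import datetime


def prior_chargebacks_known(chargebacks: list, txn_time: datetime) -> int:
    """Count chargebacks on the account that were known at transaction time."""
    return sum(1 for cb in chargebacks if cb["received_at"] <= txn_time)


# Hypothetical history for one account.
history = [
    {"received_at": datetime(2024, 1, 10)},
    {"received_at": datetime(2024, 3, 2)},   # arrives after the transaction
]
txn_time = datetime(2024, 2, 1)

leaky_feature = len(history)                              # uses future data
safe_feature = prior_chargebacks_known(history, txn_time)
print(leaky_feature, safe_feature)  # 2 1
```

The leaky version looks stronger in offline evaluation and then underperforms in production, because the serving path can only ever see the second value.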

Interview Q&A

Q1: How do you handle the extreme class imbalance in fraud detection (0.01% fraud rate)?

First, choose the right loss function and evaluation metric. Use scale_pos_weight in LightGBM (ratio of negatives to positives = 9999 for 0.01% fraud rate). Evaluate with PR-AUC and precision/recall at the operating threshold, never accuracy and never ROC-AUC alone (both are misleading for extreme imbalance).

For training data, undersample the majority class: randomly drop legitimate transactions to achieve a more balanced training set (e.g., 10:1 negative to positive ratio). The model trains faster and sees a stronger signal from the minority class. At inference time, recalibrate the predicted scores back to the true base rate, for example with Platt scaling fitted on a holdout set at the natural class ratio, since the model was trained on an artificially balanced distribution.
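As an alternative to fitting a calibrator, scores from an undersampled model can be mapped back with a closed-form prior correction. This is a different technique from Platt scaling, shown here because it makes the effect of undersampling explicit. The sketch assumes only negatives were dropped, uniformly at random, that a fraction `beta` of them was kept, and that no class weights were applied during training; the function name is hypothetical.

```python
def correct_for_undersampling(p_sampled: float, beta: float) -> float:
    """Map a score from a negatively undersampled model to the true base rate.

    beta is the fraction of negatives kept during training (0 < beta <= 1).
    Undersampling multiplies the odds by 1 / beta, so the correction
    multiplies them by beta and converts back to a probability.
    """
    return beta * p_sampled / (beta * p_sampled - p_sampled + 1.0)


# Assumed setup: roughly 1 in 1,000 negatives kept.
beta = 0.001
for score in (0.1, 0.5, 0.9):
    print(score, round(correct_for_undersampling(score, beta), 6))
```

A score of 0.5 on the balanced training distribution corresponds to roughly 0.1% on live traffic in this setup, which is one reason a fixed 0.5 threshold is not meaningful.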

For the serving threshold: never use 0.5. At extreme imbalance, the score distribution is heavily skewed. Use precision/recall analysis on a holdout set to find the threshold that achieves your target precision (99.9%) and measure the recall at that threshold. Recompute this threshold monthly as the model and fraud patterns evolve.

Q2: How do you handle delayed labels in fraud detection?

Fraud labels (chargebacks, disputes) arrive 30-90 days after the transaction. The label observation window must be respected in training data preparation: only include transactions that occurred at least N days before the training run, where N is your label observation window. If you include recent transactions without labels, they will appear as negatives (not fraud) even though many are fraud - this poisons your training data.

The practical workflow: when preparing training data as of date D, include all transactions from before (D - 30 days) that have received labels (chargebacks or confirmation of no chargeback). This means your model is always trained on data that is at least 30 days old. To handle novel fraud patterns that appear in the last 30 days (before labels arrive), augment training with semi-supervised techniques: use the model's own high-confidence fraud predictions as pseudo-labels for the unlabeled recent transactions.
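The pseudo-labeling step can be sketched as a filter over recent, still-unlabeled transactions. The function name, the 0.95 cutoff, and the reduced `sample_weight` below are all assumptions chosen for illustration; down-weighting pseudo-labels is a common precaution because they inherit the current model's mistakes.

```python
def select_pseudo_labels(
    scored_txns: list,
    min_score: float = 0.95,
    sample_weight: float = 0.5,
) -> list:
    """Turn high-confidence fraud predictions into weighted pseudo-labels."""
    pseudo = []
    for txn in scored_txns:
        if txn["score"] >= min_score:
            pseudo.append({
                "transaction_id": txn["transaction_id"],
                "label": 1,
                "sample_weight": sample_weight,  # trusted less than a chargeback
            })
    return pseudo


recent = [
    {"transaction_id": "t1", "score": 0.99},
    {"transaction_id": "t2", "score": 0.60},
    {"transaction_id": "t3", "score": 0.96},
]
print([p["transaction_id"] for p in select_pseudo_labels(recent)])  # ['t1', 't3']
```

Transactions below the cutoff are left out entirely rather than labeled as negatives, since their true labels have not arrived yet.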

Q3: How does graph-based fraud detection complement individual transaction scoring?

Individual transaction scoring operates on a single transaction's features. It misses patterns that only become visible across multiple accounts and transactions. Fraud rings - groups of coordinated fraudsters using many synthetic accounts - are invisible to per-transaction models but visible in the account graph.

Graph-based detection builds a bipartite graph: accounts on one side, shared attributes (device IDs, IP addresses, phone numbers, email domains) on the other. Edges connect accounts to their attributes. A fraud ring appears as a dense subgraph: many accounts sharing few devices and IP addresses. The account-to-device ratio is the key signal - legitimate users rarely share devices; fraud rings share them extensively.

Graph Neural Networks (GNNs) extend this to learn richer representations. The GNN embeds each account node using its own features plus the features of its neighbors. A legitimate account surrounded by legitimate accounts has a different embedding than a suspicious account surrounded by other suspicious accounts. The GNN output is a risk score that captures the "guilt by association" signal that individual transaction models miss.
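The "guilt by association" idea can be illustrated with a single round of neighbor averaging. This is not a trained GNN: there are no learned weights, and the function `propagate_risk`, the `self_weight` mixing factor, and the toy scores are all assumptions. Neighbors here stand for accounts that share a device or IP address with the account in question.

```python
def propagate_risk(own_risk: dict, neighbors: dict, self_weight: float = 0.5) -> dict:
    """Blend each account's risk with the mean risk of its neighbors."""
    updated = {}
    for node, risk in own_risk.items():
        nbrs = neighbors.get(node, [])
        if nbrs:
            nbr_mean = sum(own_risk[n] for n in nbrs) / len(nbrs)
            updated[node] = self_weight * risk + (1 - self_weight) * nbr_mean
        else:
            updated[node] = risk  # isolated accounts keep their own score
    return updated


own_risk = {"a": 0.2, "b": 0.9, "c": 0.8, "d": 0.2}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"], "d": []}
print(propagate_risk(own_risk, neighbors))
```

Accounts `a` and `d` start with the same score, but `a` ends up noticeably riskier because it is linked to two high-risk accounts. A GNN layer generalizes this by learning how to weight and transform the neighbor features instead of averaging a single number.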

Q4: What is concept drift in fraud detection and how do you detect it?

Concept drift is when the statistical relationship between features and fraud changes over time. Fraudsters adapt: if the model learns that VPN usage predicts fraud, fraudsters stop using VPNs and start using residential proxies. The model's features stop being predictive, and recall degrades.

Drift detection: monitor the ratio of predicted fraud rate to actual fraud rate (with a 30-day lag for label arrival). A healthy model has a predicted/actual ratio close to 1.0. As drift occurs, recall drops - the model misses fraud that it previously caught - and the actual rate rises while predicted rate stays low.

Also monitor feature distribution drift: for each feature, run a KS test between the current distribution and the training-time distribution. Features that have drifted significantly (KS statistic above 0.2) may have lost predictive power.
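The two-sample KS statistic is the largest vertical gap between the two empirical distribution functions. The sketch below computes it with the standard library only so that it stays self-contained; in practice `scipy.stats.ks_2samp` is the usual choice and also returns a p-value. The function name `ks_statistic` and the toy samples are illustrative.

```python
from bisect import bisect_right


def ks_statistic(reference: list, current: list) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: max gap between the ECDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    largest_gap = 0.0
    # The ECDFs only change at observed values, so checking those is enough.
    for x in ref + cur:
        f_ref = bisect_right(ref, x) / len(ref)
        f_cur = bisect_right(cur, x) / len(cur)
        largest_gap = max(largest_gap, abs(f_ref - f_cur))
    return largest_gap


training_values = list(range(1, 11))   # feature values at training time
serving_values = list(range(6, 16))    # same feature, shifted upward in production
print(ks_statistic(training_values, serving_values))  # 0.5
```

With the 0.2 cutoff mentioned above, this feature would be flagged as drifted. On large production samples even tiny shifts become statistically significant, so thresholding the statistic itself is often more practical than thresholding the p-value.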

Mitigation: retrain frequently (weekly or on-demand when drift is detected), use online learning for the fastest adaptation (gradient updates after each batch of confirmed fraud labels), and maintain a rule engine that can catch known novel attack patterns immediately while the ML model is retrained.

Q5: How would you design the serving infrastructure to meet the 100ms latency requirement at 10,000 TPS?

At 10,000 TPS, the fraud scoring service must process 10,000 transactions per second with p99 under 100ms. Breakdown of the latency budget: feature retrieval from Redis (5ms), rule engine (1ms), ML model scoring (10ms), ensemble and decision logic (2ms), network overhead (5ms) - total approximately 23ms. The 100ms target is comfortably met with this breakdown.

Infrastructure sizing: a single model server (8 vCPUs, XGBoost with 500 trees) can score roughly 5,000 transactions per second. Deploy 4 model servers behind a load balancer for 2x headroom at 10,000 TPS (20,000 TPS of capacity). Redis with 32 GB RAM and read replicas handles feature retrieval for this volume easily. The rule engine runs in-memory in each scoring pod - no external dependency.
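The latency budget and the server count are both simple arithmetic, sketched below using the figures quoted in this answer (5,000 TPS per server, 10,000 TPS peak). The helper name `servers_needed` and the headroom factors are assumptions; the per-server throughput should be measured rather than taken from this sketch.

```python
import math

# Per-stage latency budget in milliseconds, as listed above.
budget_ms = {
    "feature_retrieval": 5,
    "rule_engine": 1,
    "model_scoring": 10,
    "ensemble_and_decision": 2,
    "network_overhead": 5,
}
total_ms = sum(budget_ms.values())
print(total_ms, "ms used of a 100 ms target")  # 23 ms used of a 100 ms target


def servers_needed(peak_tps: float, per_server_tps: float, headroom: float) -> int:
    """Servers required to serve peak traffic with a given headroom factor."""
    return math.ceil(peak_tps * headroom / per_server_tps)


print(servers_needed(10_000, 5_000, headroom=2.0))   # 4
print(servers_needed(10_000, 5_000, headroom=10.0))  # 20
```

Four servers therefore correspond to 2x headroom over peak, and true 10x headroom would take twenty, which matches the autoscaler ceiling of 20 pods mentioned for the deployment.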

For deployment: containerize the scoring service with a Docker image, deploy to Kubernetes with HPA (Horizontal Pod Autoscaler) keyed on CPU utilization. Set a minimum of 4 pods and maximum of 20. The scoring service should be stateless - all state lives in Redis. Monitor p99 latency at each stage using Prometheus and alert if any stage exceeds 80% of its latency budget.

Summary

A production fraud detection system combines a rule engine (fast, known patterns, about 1 ms), an ML model (XGBoost/LightGBM with 200+ behavioral and device features, handling imbalance with scale_pos_weight), and graph analysis (detecting fraud rings via connected component analysis, running asynchronously). Feature engineering is the highest-leverage investment: velocity features from Redis, device fingerprinting, and behavioral history explain most of the model's predictive power. Delayed labels require disciplined training data preparation - only transactions old enough to have received labels. The operating threshold is set by precision first (at least 99.9% of blocked transactions are truly fraudulent), then recall. Concept drift requires at least weekly retraining and continuous monitoring of the predicted versus actual fraud rate gap.
