Fraud Detection Systems
99.9% Precision, 50ms Latency, $4 Trillion in Transactions
In 2023, Stripe processed approximately 817 million in fraudulent transactions annually while ensuring that legitimate merchants and their customers experience essentially no friction. The cost of a false negative (missing a fraudulent transaction) is the chargeback plus processing cost. The cost of a false positive (blocking a legitimate transaction) is the lost sale, the customer support cost, and potential merchant churn.
A typical fraud model at this scale must operate at 99.9% precision for high-value transactions: of every 1,000 transactions it blocks, at most one should be legitimate. It must make this decision in under 50 milliseconds. It must handle adversarial actors who study the system and adapt their fraud patterns to evade detection. And it must do all of this continuously as fraud patterns evolve weekly.
The payment fraud problem is one of the hardest in applied ML: extreme class imbalance (fraud rate under 0.1%), adversarial inputs (fraudsters adapt to the model), delayed labels (chargebacks arrive weeks after the transaction), and precision requirements that most classification literature ignores.
This case study walks through the full system design, from the rule-based baseline that every production fraud system starts with, through the graph-based fraud detection that catches ring fraud, to the adversarial robustness measures that make the system resilient to model probing.
Requirements Analysis
Functional requirements:
- Block fraudulent card-not-present transactions in real-time
- Support three decision outcomes: allow, block, step-up (challenge with 3DS)
- Provide fraud score and feature explanations for dispute resolution
- Support rules management by non-ML analysts
- Alert on concept drift and fraud pattern changes
Non-functional requirements:
- Latency: 50ms p99 for the fraud score decision
- Precision: 99.9% for block decisions (at most 0.1% of blocked transactions are legitimate; at a 0.1% fraud rate this implies a false positive rate far below 0.1%)
- Recall: maximize, subject to precision constraint
- Throughput: 100K transactions per second (peak)
- Label delay: chargebacks arrive 30-120 days after transaction
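Precision and false positive rate are easy to conflate at this base rate, so a short arithmetic sketch may help. The function names are hypothetical and the 50% recall figure is an assumption chosen only for illustration.

```python
def precision_from_rates(base_rate: float, recall: float, fpr: float) -> float:
    """Precision implied by a fraud base rate, recall (TPR), and false positive rate."""
    true_pos = base_rate * recall
    false_pos = (1.0 - base_rate) * fpr
    return true_pos / (true_pos + false_pos)


def max_fpr_for_precision(base_rate: float, recall: float, precision: float) -> float:
    """Largest false positive rate that still meets the target precision."""
    true_pos = base_rate * recall
    allowed_false_pos = true_pos * (1.0 - precision) / precision
    return allowed_false_pos / (1.0 - base_rate)


# Illustrative inputs: 0.1% fraud rate, 50% recall (the recall value is assumed).
print(precision_from_rates(0.001, 0.5, 0.001))   # precision if the FPR were 0.1%
print(max_fpr_for_precision(0.001, 0.5, 0.999))  # FPR needed for 99.9% precision
```

Under these assumptions a 0.1% false positive rate yields a precision of roughly one in three, which is why the precision target has to be evaluated directly rather than through the false positive rate.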
Constraints:
- Cannot retrain models faster than weekly without risking overfitting to noise
- Card network rules limit which signals can be used
- Adversarial: fraudsters actively probe the system to understand decision boundaries
System Architecture
Component 1: Rules Engine (The Non-Negotiable Starting Point)
Every production fraud system begins with rules, and good systems keep rules even after deploying ML. Rules serve purposes that ML cannot:
Hard blocks: Some signals are so strong that no ML uncertainty is appropriate. Sanctioned entity lists (OFAC), known malicious IP addresses, card numbers on block lists, velocity limits exceeding any legitimate business pattern - these block immediately.
Hard allows: Some signals are so strong for legitimacy that the risk of false positives from ML is unacceptable. A returning customer with 5 years of clean transaction history, using their registered device, shipping to their registered address - the cost of challenging this customer is high and the fraud probability is near zero.
Regulatory requirements: Some jurisdictions require specific screening actions regardless of ML score. Rules make these explicit and auditable.
```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    CHALLENGE = "challenge"
    CONTINUE = "continue"  # Pass to ML


@dataclass
class RuleResult:
    decision: Decision
    rule_name: str
    reason: str
    confidence: float = 1.0


class RulesEngine:
    def __init__(self, sanction_list, block_list, velocity_store):
        self.sanction_list = sanction_list
        self.block_list = block_list
        self.velocity_store = velocity_store

    def evaluate(self, transaction: dict) -> RuleResult:
        """
        Evaluate hard rules. Returns BLOCK/ALLOW/CHALLENGE or CONTINUE to ML.
        Rules are evaluated in priority order - first match wins.
        """
        # Hard blocks - regulatory
        if self.sanction_list.check(transaction["card_holder_name"]):
            return RuleResult(Decision.BLOCK, "OFAC_SANCTION", "Sanctioned entity")

        if transaction["card_number"] in self.block_list:
            return RuleResult(Decision.BLOCK, "BLOCK_LIST", "Card on block list")

        # Velocity limits
        txn_count_1h = self.velocity_store.get_count(
            transaction["card_number"], window_minutes=60
        )
        if txn_count_1h > 10:
            return RuleResult(Decision.BLOCK, "VELOCITY_CARD", f"{txn_count_1h} txns in 1h")

        amount_sum_24h = self.velocity_store.get_sum(
            transaction["card_number"], field="amount", window_hours=24
        )
        if amount_sum_24h > 50_000:
            return RuleResult(Decision.BLOCK, "VELOCITY_AMOUNT", "Exceeded 24h limit")

        # Hard allows - trusted customers
        if self._is_trusted_transaction(transaction):
            return RuleResult(Decision.ALLOW, "TRUSTED_CUSTOMER", "High-confidence legitimate")

        # Pass to ML scoring
        return RuleResult(Decision.CONTINUE, "RULES_PASS", "Proceed to ML")

    def _is_trusted_transaction(self, txn: dict) -> bool:
        """Combination of signals that strongly indicate legitimacy."""
        return (
            txn.get("returning_customer_years", 0) > 2
            and txn.get("device_is_registered", False)
            and txn.get("shipping_to_saved_address", False)
            and txn.get("amount", 0) < txn.get("historical_avg_amount", 0) * 3
        )
```
Component 2: Real-Time Feature Engineering
Feature engineering for fraud detection draws on three time horizons at once:
Velocity features (real-time, per second):
- Transactions in last 1h / 24h / 7d by card, merchant, IP, device
- Dollar amount in last 1h / 24h by card
- Declined transactions in last 1h (a strong signal - fraudsters probe)
Session features (current session):
- Time since last transaction on this device
- Number of tabs open / navigation pattern
- Typing speed and mouse movement entropy (behavioral biometrics)
- Time between page load and checkout (too fast = bot)
Historical aggregates (precomputed, periodic update):
- Card's historical merchant category distribution
- User's typical transaction time of day
- Historical chargeback rate for this merchant
- Card's typical transaction amount distribution
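As one illustration of a historical aggregate, the sketch below turns a card's past amounts into the kind of `amount_zscore` feature that the ensemble's tabular feature list refers to. The function signature and the `min_history` cutoff are assumptions, not something the source specifies.

```python
import statistics
from typing import Sequence


def amount_zscore(amount: float, history: Sequence[float], min_history: int = 5) -> float:
    """Z-score of `amount` against the card's past amounts.

    Returns 0.0 when there is too little history or no variance,
    so that new cards do not produce extreme values.
    """
    if len(history) < min_history:
        return 0.0
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    if std == 0.0:
        return 0.0
    return (amount - mean) / std


# Hypothetical card whose past purchases range from 10 to 50.
print(amount_zscore(100.0, [10.0, 20.0, 30.0, 40.0, 50.0]))
```

In a production setting the mean and standard deviation would more likely be precomputed and stored per card rather than recomputed from raw history on each request.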
```python
import time
from typing import Dict, Optional

import redis


class VelocityFeatureStore:
    """
    Redis-backed velocity counter for real-time fraud features.
    Uses sorted sets with timestamps as scores for efficient window queries.
    """

    def __init__(self, redis_client: redis.Redis):
        self.r = redis_client

    def record_transaction(
        self,
        card_number: str,
        merchant_id: str,
        amount: float,
        ip: str,
        device_id: str,
        timestamp: Optional[float] = None,
    ):
        """Record a new transaction event for velocity tracking."""
        # Compare against None so an explicit timestamp of 0.0 is respected
        ts = time.time() if timestamp is None else timestamp
        pipe = self.r.pipeline()

        # Card velocity: sorted set with timestamp as score
        pipe.zadd(f"vel:card:{card_number}", {f"{ts}:{amount}": ts})
        pipe.expire(f"vel:card:{card_number}", 7 * 86400)  # 7-day TTL

        # IP velocity
        pipe.zadd(f"vel:ip:{ip}", {str(ts): ts})
        pipe.expire(f"vel:ip:{ip}", 3600)  # 1-hour TTL

        # Device velocity
        pipe.zadd(f"vel:dev:{device_id}", {str(ts): ts})
        pipe.expire(f"vel:dev:{device_id}", 3600)

        pipe.execute()

    def get_velocity_features(
        self,
        card_number: str,
        ip: str,
        device_id: str,
    ) -> Dict[str, float]:
        """Compute velocity features for fraud scoring."""
        now = time.time()
        pipe = self.r.pipeline()

        # Query card velocity at multiple windows
        for window in [3600, 86400, 604800]:  # 1h, 24h, 7d
            pipe.zcount(f"vel:card:{card_number}", now - window, now)

        # IP and device velocity (1h window only - privacy)
        pipe.zcount(f"vel:ip:{ip}", now - 3600, now)
        pipe.zcount(f"vel:dev:{device_id}", now - 3600, now)

        # Results come back in the order the commands were queued
        results = pipe.execute()

        return {
            "card_txn_count_1h": results[0],
            "card_txn_count_24h": results[1],
            "card_txn_count_7d": results[2],
            "ip_txn_count_1h": results[3],
            "device_txn_count_1h": results[4],
        }
```
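The `RulesEngine` shown earlier calls `velocity_store.get_count(...)` and `velocity_store.get_sum(...)`, which `VelocityFeatureStore` does not define. A minimal in-memory stand-in with that interface, useful mainly for unit tests, might look like the sketch below. The class name, the `record` method, and the `now` parameter are hypothetical.

```python
import time
from bisect import bisect_left
from collections import defaultdict
from typing import List, Optional, Tuple


class InMemoryVelocityStore:
    """Hypothetical test double: keeps sorted (timestamp, amount) pairs per card."""

    def __init__(self):
        self._events = defaultdict(list)  # card_number -> sorted [(ts, amount), ...]

    def record(self, card_number: str, amount: float, timestamp: Optional[float] = None) -> None:
        ts = time.time() if timestamp is None else timestamp
        events = self._events[card_number]
        # Insert in sorted position so window lookups can use binary search
        events.insert(bisect_left(events, (ts, amount)), (ts, amount))

    def _window(self, card_number: str, seconds: float, now: Optional[float]) -> List[Tuple[float, float]]:
        now = time.time() if now is None else now
        events = self._events.get(card_number, [])
        start = bisect_left(events, (now - seconds, float("-inf")))
        return [event for event in events[start:] if event[0] <= now]

    def get_count(self, card_number: str, window_minutes: int, now: Optional[float] = None) -> int:
        return len(self._window(card_number, window_minutes * 60, now))

    def get_sum(self, card_number: str, field: str = "amount",
                window_hours: int = 24, now: Optional[float] = None) -> float:
        if field != "amount":
            raise ValueError("only 'amount' is tracked in this sketch")
        return sum(amount for _, amount in self._window(card_number, window_hours * 3600, now))
```

The explicit `now` argument exists only to make tests deterministic; the rules engine's calls omit it and fall back to the current time.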
Component 3: Graph-Based Fraud Detection
Individual transaction features miss ring fraud - coordinated networks of fraudsters using multiple cards, devices, merchants, and phone numbers in patterns that look legitimate individually but are suspicious in aggregate.
Graph-based fraud detection builds a heterogeneous graph where:
- Nodes: cards, devices, IP addresses, merchants, phone numbers, email addresses
- Edges: "card X was used on device Y," "device Y connected from IP Z," "card X used email E"
Fraud rings appear as dense subgraphs: multiple compromised cards all connected to the same device, or all using the same email domain.
Features derived from the graph (precomputed offline, updated hourly):
- Degree centrality of the card's node in the transaction graph
- Number of distinct cards that share any device with this card in the past 30 days
- Fraud rate of the card's connected component
- Shortest path to any known fraudulent card
```python
from typing import Dict

import networkx as nx


class FraudGraph:
    """
    Heterogeneous graph for fraud ring detection.
    Nodes: cards, devices, IPs, emails, phone numbers.
    Edges: connections between entities observed in transactions.
    """

    def __init__(self):
        self.G = nx.Graph()
        self.fraud_labels = {}  # {node_id: True/False/None}

    def add_transaction(self, transaction: dict):
        """Add transaction to fraud graph, creating edges between entities."""
        card = f"card:{transaction['card_number']}"
        device = f"device:{transaction['device_id']}"
        ip = f"ip:{transaction['ip']}"
        email = f"email:{transaction['email']}"

        self.G.add_edge(card, device, transaction_id=transaction["id"])
        self.G.add_edge(device, ip)
        self.G.add_edge(card, email)

    def mark_fraud(self, card_number: str):
        """Mark a card as fraudulent (from chargeback label)."""
        node_id = f"card:{card_number}"
        self.fraud_labels[node_id] = True

    def get_graph_features(self, card_number: str) -> Dict[str, float]:
        """Compute graph-based features for a card at transaction time."""
        node_id = f"card:{card_number}"

        if node_id not in self.G:
            return {
                "graph_neighbor_fraud_rate": 0.0,
                "shared_device_card_count": 0,
                "component_size": 1,
                "degree": 0,
            }

        # Direct neighbors of a card are devices and emails; keep the devices
        neighbors = list(self.G.neighbors(node_id))
        device_nodes = [n for n in neighbors if n.startswith("device:")]

        # Cards sharing any device with this card
        shared_cards = set()
        for device_node in device_nodes:
            for neighbor in self.G.neighbors(device_node):
                if neighbor.startswith("card:") and neighbor != node_id:
                    shared_cards.add(neighbor)

        # Fraud rate among connected cards that have a label
        labeled_neighbors = {
            c: self.fraud_labels.get(c) for c in shared_cards if c in self.fraud_labels
        }
        fraud_count = sum(1 for v in labeled_neighbors.values() if v is True)
        fraud_rate = fraud_count / len(labeled_neighbors) if labeled_neighbors else 0.0

        # Connected component size
        component = nx.node_connected_component(self.G, node_id)

        return {
            "graph_neighbor_fraud_rate": fraud_rate,
            "shared_device_card_count": len(shared_cards),
            "component_size": len(component),
            "degree": self.G.degree(node_id),
        }
```
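The feature list above mentions the shortest path to any known fraudulent card, which `get_graph_features` does not compute. One way to sketch it is a breadth-first search over a plain adjacency dictionary. The function name, the `max_hops` cutoff, and the dictionary representation are assumptions made to keep the example dependency-free.

```python
from collections import deque
from typing import Dict, Iterable, Optional, Set


def hops_to_nearest_fraud(
    adjacency: Dict[str, Set[str]],
    start: str,
    fraud_nodes: Iterable[str],
    max_hops: int = 6,
) -> Optional[int]:
    """Breadth-first search from `start`.

    Returns the hop count to the closest known fraud node,
    or None if none is reachable within `max_hops`.
    """
    targets = set(fraud_nodes)
    if start in targets:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor in seen:
                continue
            if neighbor in targets:
                return dist + 1
            seen.add(neighbor)
            queue.append((neighbor, dist + 1))
    return None


# Two cards sharing one device; card:B is a known fraud card.
graph = {
    "card:A": {"device:1"},
    "device:1": {"card:A", "card:B"},
    "card:B": {"device:1"},
    "card:C": set(),
}
print(hops_to_nearest_fraud(graph, "card:A", {"card:B"}))  # 2
print(hops_to_nearest_fraud(graph, "card:C", {"card:B"}))  # None
```

The hop limit bounds the work per lookup, which matters because shared infrastructure nodes such as popular IP addresses can make connected components very large.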
Component 4: ML Scoring Ensemble
The ML layer combines multiple model types into an ensemble:
Gradient Boosted Trees (XGBoost/LightGBM): Excellent on tabular features. Fast inference (1-2ms). Handles missing values natively. The workhorse of fraud detection - most production systems rely heavily on GBTs.
Neural Network: Captures non-linear interactions between features that GBTs miss. Takes embedding representations of categorical features (merchant category, country, card BIN). 3-5ms inference.
Anomaly score: An Isolation Forest trained on legitimate transactions. Measures how anomalous the current transaction is relative to the card's historical pattern.
```python
import numpy as np
import torch
import xgboost as xgb
from sklearn.ensemble import IsolationForest


class FraudEnsemble:
    """
    Ensemble of fraud detection models.
    XGBoost handles tabular features.
    Neural network handles embedding interactions.
    Isolation Forest provides anomaly score.
    """

    def __init__(self, xgb_model, neural_model, isolation_forest):
        self.xgb = xgb_model
        self.nn = neural_model
        self.iforest = isolation_forest

        # Ensemble weights - tuned on validation set
        self.weights = {"xgb": 0.50, "nn": 0.35, "iforest": 0.15}

    def score(self, features: dict) -> dict:
        """
        Returns fraud probability and component scores.
        """
        tabular_features = self._prepare_tabular(features)
        embedding_features = self._prepare_embeddings(features)

        # Model scores
        xgb_score = float(self.xgb.predict_proba(tabular_features)[0, 1])
        nn_score = float(self.nn(embedding_features).sigmoid().item())

        # score_samples returns the negated anomaly score (lower = more abnormal),
        # so negating it gives a value in (0, 1] where higher = more anomalous.
        if_score = float(-self.iforest.score_samples(tabular_features)[0])
        # Center on 0.5 before the sigmoid so typical points map below 0.5.
        # This is a monotone rescaling, not a calibrated probability.
        if_prob = 1 / (1 + np.exp(-(if_score - 0.5) * 3))

        # Weighted ensemble
        ensemble_score = (
            self.weights["xgb"] * xgb_score
            + self.weights["nn"] * nn_score
            + self.weights["iforest"] * if_prob
        )

        return {
            "fraud_probability": ensemble_score,
            "xgb_score": xgb_score,
            "nn_score": nn_score,
            "anomaly_score": if_prob,
        }

    def _prepare_tabular(self, features: dict) -> np.ndarray:
        cols = [
            "amount", "hour_of_day", "day_of_week",
            "card_txn_count_1h", "card_txn_count_24h",
            "ip_txn_count_1h", "device_txn_count_1h",
            "graph_neighbor_fraud_rate", "shared_device_card_count",
            "is_international", "amount_zscore",
        ]
        return np.array([[features.get(c, 0.0) for c in cols]])

    def _prepare_embeddings(self, features: dict):
        # Categorical features encoded as IDs for embedding lookup
        return torch.tensor([[
            features.get("merchant_category_id", 0),
            features.get("country_id", 0),
            features.get("card_bin_id", 0),
        ]])
```
Decision Thresholds and False Positive Cost Analysis
The output of the ensemble is a fraud probability score $p$. Converting this to block/challenge/allow decisions requires thresholds calibrated against business costs.
Define:
- $C_{\text{FN}}$: cost of false negative (missed fraud) = chargeback amount + processing fee + penalty
- $C_{\text{FP}}$: cost of false positive (blocked legitimate transaction) = lost sale value + customer support cost
The optimal threshold minimizes expected cost. Blocking is cheaper in expectation than allowing when $p \cdot C_{\text{FN}} > (1 - p) \cdot C_{\text{FP}}$, which gives:
$$\tau^* = \frac{C_{\text{FP}}}{C_{\text{FP}} + C_{\text{FN}}}$$
For a \$100 transaction with $C_{\text{FN}} = \$110$ (\$100 + fee) and $C_{\text{FP}} = \$5$ (lost sale opportunity cost), then:
$$\tau^* = \frac{5}{5 + 110} \approx 0.043$$
Block transactions with fraud probability above 4.3%. This seems low, but it reflects that the cost of missing fraud is much higher than the cost of blocking a legitimate transaction.
In practice, two thresholds split the score range into three zones:
- $p \geq \tau_{\text{block}}$: high confidence fraud, block automatically
- $\tau_{\text{challenge}} \leq p < \tau_{\text{block}}$: uncertain, challenge with 3DS authentication
- Below challenge: allow
The challenge zone is crucial - it moves uncertain cases to 3DS (which adds cardholder authentication) rather than blocking them, preserving the good-faith transaction while adding friction that fraudsters cannot easily bypass.
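The cost-based threshold and the three decision zones can be sketched in a few lines. The default values 0.80 and 0.15 are the illustrative thresholds quoted in the interview answer later in this case study, and the function names are hypothetical.

```python
def cost_optimal_threshold(cost_fn: float, cost_fp: float) -> float:
    """Probability above which blocking has lower expected cost than allowing."""
    return cost_fp / (cost_fp + cost_fn)


def decide(p: float, block_threshold: float = 0.80, challenge_threshold: float = 0.15) -> str:
    """Map a fraud probability to one of the three decision zones."""
    if p >= block_threshold:
        return "block"
    if p >= challenge_threshold:
        return "challenge"
    return "allow"


# Worked example from the text: C_FN = $110, C_FP = $5.
print(round(cost_optimal_threshold(cost_fn=110.0, cost_fp=5.0), 3))  # 0.043
print(decide(0.92), decide(0.40), decide(0.03))  # block challenge allow
```

The single cost-optimal threshold assumes blocking is the only alternative to allowing. Once a challenge option exists, the two zone boundaries are usually tuned separately, because a challenge costs far less than a lost sale.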
Adversarial Robustness
Fraudsters are not passive. Once deployed, ML fraud models are actively probed. Common adversarial strategies:
Low-and-slow probing: Make many small, legitimate-looking transactions to understand the decision boundary. Then make the fraudulent transaction just inside the "allow" region.
Mimicry attack: Study legitimate transaction patterns from compromised cards and mimic them. Use the legitimate card's typical merchant categories, amounts, and timing.
Model inversion: Use multiple probe transactions to infer the feature importance weights of the model. Design transactions that specifically avoid the high-weight features.
Defenses:
Randomization: Add calibrated noise to model outputs in the challenge zone. Instead of a deterministic threshold, decisions near the boundary are probabilistic. This makes it much harder to estimate the decision boundary from probe sequences, because repeating the same probe no longer returns the same answer.
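One possible form of this randomization is sketched below, assuming a linear ramp inside a narrow band around the block threshold. The band width, the threshold value, and the function name are illustrative assumptions rather than a recommended configuration.

```python
import random


def randomized_decision(p: float, block_threshold: float = 0.80,
                        band: float = 0.05, rng=None) -> str:
    """Deterministic outside the band around the block threshold.

    Inside the band, the probability of blocking rises linearly
    from 0 at the lower edge to 1 at the upper edge.
    """
    rng = rng or random
    low, high = block_threshold - band, block_threshold + band
    if p >= high:
        return "block"
    if p < low:
        return "challenge"
    block_prob = (p - low) / (high - low)
    return "block" if rng.random() < block_prob else "challenge"


# A seeded generator keeps the example reproducible.
rng = random.Random(0)
outcomes = [randomized_decision(0.80, rng=rng) for _ in range(1000)]
print(outcomes.count("block") / len(outcomes))  # close to 0.5 at the band midpoint
```

Falling back to a challenge rather than an allow inside the band keeps the randomization from ever letting a high-scoring transaction through unchecked.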
Canary features: Include features that are invisible to fraudsters (e.g., network-layer signals, device attestation certificates) and highly predictive. If a model probe is detected (high velocity of declined transactions), enter a higher-security mode.
Model stacking: Use multiple models with different feature sets. To evade the ensemble, a fraudster must fool all models simultaneously - much harder than fooling one.
Detecting probing behavior: A sequence of small transactions followed by a large transaction, especially across multiple merchants, is a classic probing pattern. Add a probe detection feature explicitly.
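A probe detection feature of this kind could be as simple as the sketch below, which flags a run of small transactions followed by a large jump. The thresholds (`small_limit`, `jump_ratio`, `min_small`) and the function name are hypothetical and would need tuning on real data.

```python
from typing import Sequence


def probe_pattern_score(
    prior_amounts: Sequence[float],
    current_amount: float,
    small_limit: float = 5.0,
    jump_ratio: float = 10.0,
    min_small: int = 3,
) -> float:
    """Return 1.0 if the most recent prior transactions form a run of at least
    `min_small` small amounts and the current amount is a large jump, else 0.0."""
    # Collect the trailing run of small amounts, newest first
    run = []
    for amount in reversed(prior_amounts):
        if amount > small_limit:
            break
        run.append(amount)
    if len(run) < min_small:
        return 0.0
    is_jump = current_amount > small_limit and current_amount >= jump_ratio * max(run)
    return 1.0 if is_jump else 0.0


print(probe_pattern_score([1.0, 2.0, 1.5], 500.0))     # 1.0: small run, then a jump
print(probe_pattern_score([50.0, 60.0, 40.0], 500.0))  # 0.0: no run of small amounts
```

A binary flag is the simplest version; a production feature would more plausibly also account for the number of distinct merchants and the time span of the run.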
Production Engineering Notes
Handling Delayed Labels
Chargebacks arrive 30-120 days after the transaction. This means:
- You cannot evaluate model performance for 30-120 days after deployment
- Training data lags reality by 30-120 days - the model is always learning from the past
Mitigation:
- Proxy labels: Merchant dispute reports and network fraud alerts arrive faster (days, not weeks) and can be used as early training signal
- Online learning: Update model weights continuously on new confirmed fraud cases rather than waiting for weekly batch retraining
- Concept drift detection: Monitor feature distribution shift - if the distribution of merchant categories in fraud transactions changes significantly, trigger an alert even before chargeback labels arrive
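For the drift check in the last bullet, one commonly used statistic is the population stability index (PSI) over a categorical feature such as merchant category. The source does not name a specific metric, so PSI is a stand-in chosen for this sketch, and the function name and category data are made up.

```python
import math
from collections import Counter
from typing import Sequence


def population_stability_index(expected: Sequence[str], actual: Sequence[str],
                               eps: float = 1e-6) -> float:
    """PSI between two samples of a categorical feature.

    Shares are floored at `eps` so categories missing from one sample
    do not produce a division by zero or log of zero.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    exp_counts, act_counts = Counter(expected), Counter(actual)
    psi = 0.0
    for category in set(exp_counts) | set(act_counts):
        e = max(exp_counts[category] / len(expected), eps)
        a = max(act_counts[category] / len(actual), eps)
        psi += (a - e) * math.log(a / e)
    return psi


# Hypothetical merchant categories of flagged transactions, last week vs this week.
baseline = ["grocery"] * 70 + ["electronics"] * 30
current = ["grocery"] * 30 + ["electronics"] * 70
print(population_stability_index(baseline, baseline))  # 0.0
print(population_stability_index(baseline, current))   # clearly above zero
```

A frequently quoted rule of thumb treats values above roughly 0.25 as a significant shift, but the alert threshold is best set from the metric's own week-to-week history.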
Model Explainability for Disputes
When a legitimate customer's transaction is blocked, they call customer service. The agent needs to explain why. The ML model must provide a human-readable explanation of the top-3 factors:
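```python
from typing import List

import numpy as np
import shap


class FraudExplainer:
    def __init__(self, model, feature_names: list):
        self.explainer = shap.TreeExplainer(model)
        self.feature_names = feature_names

    def explain(self, features: np.ndarray, top_k: int = 3) -> List[dict]:
        """Return top-k SHAP value explanations for a fraud decision."""
        shap_values = self.explainer.shap_values(features)
        # Depending on the model type and shap version, the result may be a list
        # of per-class arrays, a 3-D array with a trailing class axis, or a single
        # 2-D array. Select class 1 (fraud) where classes are reported separately.
        if isinstance(shap_values, list):
            shap_values = shap_values[1]
        elif shap_values.ndim == 3:
            shap_values = shap_values[:, :, 1]

        # Rank features of the first (only) row by absolute contribution
        top_indices = np.argsort(np.abs(shap_values[0]))[::-1][:top_k]

        return [
            {
                "feature": self.feature_names[i],
                "value": float(features[0, i]),
                "contribution": float(shap_values[0, i]),
                "direction": "increases risk" if shap_values[0, i] > 0 else "decreases risk",
            }
            for i in top_indices
        ]
```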
Common Mistakes
Mistake: Optimizing for AUC instead of precision at your operating threshold.
High AUC means the model ranks fraud above legitimate transactions on average. But at the extreme precision requirement (99.9%), what matters is not the average - it is the score distribution in the extreme low-false-positive tail. At a 0.1% fraud rate, 99.9% precision requires a false positive rate far below 0.1%. A model with AUC=0.97 may have far better precision in that region than a model with AUC=0.98 if the latter has a flatter ROC curve there. Always evaluate your fraud model at the operating threshold (precision-recall curve at target precision), not just AUC.
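Evaluating at the operating point can be as simple as reporting the best recall achievable at the target precision. The sketch below does this in plain Python on a handful of made-up scores; the function name and data are hypothetical.

```python
from typing import Sequence


def recall_at_precision(scores: Sequence[float], labels: Sequence[int],
                        target_precision: float) -> float:
    """Highest recall of any score threshold whose precision meets the target.

    `labels` are 1 for fraud and 0 for legitimate.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        return 0.0
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best, tp, fp = 0.0, 0, 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Only evaluate at boundaries between distinct score values
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        if tp / (tp + fp) >= target_precision:
            best = max(best, tp / total_pos)
    return best


scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 1, 0, 1, 0]
print(recall_at_precision(scores, labels, 1.0))   # two of three fraud cases
print(recall_at_precision(scores, labels, 0.75))  # all three fraud cases
```

Comparing two candidate models on this number at the target precision can reverse the ranking that AUC alone would suggest.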
Mistake: Not monitoring for fraudster adaptation.
Fraudsters monitor their success rate. When the fraud model improves and starts blocking more of their transactions, they adapt. If you deploy a new model and see your false negative rate spike after 2-4 weeks, fraudsters have adapted. Build a fraud pattern drift monitor: if the feature distribution of transactions classified as fraud changes significantly week-over-week, alert the fraud ML team. Retraining frequency should increase when drift is detected.
Mistake: Training on imbalanced data without proper resampling.
At 0.1% fraud rate, a model that predicts "legitimate" for every transaction achieves 99.9% accuracy. Standard training on imbalanced data produces this degenerate solution. Use class-weighted loss (weight the fraud class by 1/fraud_rate), oversample the minority class with SMOTE, or undersample the majority class. Evaluate on precision-recall curves, not accuracy. Monitor the predicted fraud probability distribution - if it collapses to near-zero for everything, the model has learned to say "not fraud" always.
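One caveat with undersampling is that the model's output probabilities then reflect the resampled class balance rather than the true fraud rate. The sketch below shows the class weight described above and a standard prior correction for a model trained with legitimate transactions kept at rate `keep_rate`. Both function names are hypothetical, and the correction assumes negatives were dropped uniformly at random.

```python
def fraud_class_weight(fraud_rate: float) -> float:
    """Weight for the fraud class so it balances the legitimate class in the loss."""
    return 1.0 / fraud_rate


def correct_undersampled_probability(p_sampled: float, keep_rate: float) -> float:
    """Map a probability from a model trained on undersampled negatives
    back to the original class balance.

    `keep_rate` is the fraction of legitimate transactions kept for training.
    """
    return keep_rate * p_sampled / (keep_rate * p_sampled - p_sampled + 1.0)


# Illustrative check: a true probability of 0.1% with 1% of negatives kept.
true_p, keep_rate = 0.001, 0.01
p_sampled = true_p / (true_p + keep_rate * (1.0 - true_p))  # what the model would output
print(round(p_sampled, 4))                                   # inflated by undersampling
print(round(correct_undersampled_probability(p_sampled, keep_rate), 6))  # back to 0.001
```

Applying the correction before comparing scores to cost-based thresholds keeps those thresholds meaningful as probabilities.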
Interview Q&A
Q: Design a real-time fraud detection system for a payment processor with 100K TPS and 50ms latency.
A: I would design a four-layer system. Layer 1: Rules engine (5ms) handles hard blocks (OFAC sanctions, known bad cards) and hard allows (trusted customers). Most rules fire before ML. Layer 2: Real-time feature extraction (5ms) computes velocity features (card/IP/device transaction counts in sliding windows) from a Redis cluster. Layer 3: ML ensemble (20ms) runs XGBoost on tabular features and a neural network on embedding features, combines with an anomaly score. Graph features (precomputed hourly from a fraud graph) are joined in. Layer 4: Decision engine (5ms) applies thresholds: above 0.80 = block, 0.15-0.80 = 3DS challenge, below 0.15 = allow. The challenge zone is critical - it preserves revenue on uncertain cases while adding authentication friction. Total: 35ms, leaving 15ms for network overhead.
Q: How do you handle the extreme class imbalance in fraud detection (0.1% positive rate)?
A: Multiple strategies in combination. At the data level: undersample the majority class (legitimate transactions) to achieve a 10:1 or 20:1 ratio rather than 1000:1, which makes gradient descent tractable. Add synthetic minority oversampling (SMOTE) for the fraud class. At the training level: use class-weighted cross-entropy loss - give the fraud class weight 1/fraud_rate so each fraud example contributes as much gradient as 1000 legitimate examples. At evaluation level: never use accuracy (meaningless at 0.1% base rate). Use precision-recall curves and AUPRC. Set evaluation thresholds at the target operating point (99.9% precision). Monitor calibration - the model's output probability 0.05 should correspond to a 5% fraud rate in practice.
Q: How do you make a fraud detection model adversarially robust?
A: Adversarial robustness in fraud requires a different mindset than in computer vision. Fraudsters are not gradient-based attackers - they are rational economic agents who probe your API to understand your decision surface. Key defenses: First, add randomization to decisions near the boundary - replace deterministic thresholds with probabilistic outcomes (fraud probability 0.3 → block 30% of the time, allow 70%). This prevents fraudsters from learning the exact decision boundary from probe sequences. Second, use features that are difficult to observe or manipulate - device attestation, behavioral biometrics (typing patterns), network-layer signals. Third, detect probing behavior explicitly - a sequence of small transactions followed by a large one is a probe pattern. Add this as a feature. Fourth, do not expose model scores through any API - even indirect signals like challenge frequency reveal model behavior. Fifth, rotate model architecture periodically - once fraudsters adapt to model A, deploying model B with different features resets their knowledge.
Q: What is graph-based fraud detection and when does it help?
A: Graph-based fraud detection builds a heterogeneous graph connecting entities in transaction data - cards, devices, IP addresses, merchants, email addresses, phone numbers. Edges represent observed co-occurrences (card X was used on device Y, device Y connected from IP Z). Fraud rings - coordinated groups of fraudsters using multiple cards - appear as dense subgraphs: multiple compromised cards all connected to the same device or IP network. Graph features extracted from this structure (number of cards sharing a device with this card, fraud rate of connected components, proximity to known fraudulent nodes) are powerful signals that individual transaction features miss completely. Graph-based features help specifically when fraud is coordinated - new card testing rings, account takeover networks, triangulation fraud involving multiple merchants. They are less useful for isolated fraud (single compromised card used by an individual). At Stripe scale, building and maintaining this graph in real time at 100K TPS requires careful engineering: maintain the graph in a graph database (Neo4j) or compute graph features offline hourly and join them at inference time as precomputed features.
