Online Learning

The Fraud Model That Went Stale While You Slept

The fraud detection model had been in production for six weeks. Accuracy: 94.2%. False positive rate: 0.8%. The fraud team was satisfied.

Then a new attack pattern emerged. A fraud ring had discovered that transaction amounts just below $500 - the model's high-risk threshold - had lower fraud scores. By Tuesday, they were running 30,000 transactions per day at $497-$499. The fraud model had never seen this pattern in training. By the end of the week, the model's fraud detection rate on the new attack had dropped to 61%.

The root cause: the model was trained on historical data from six weeks ago. The fraud ring's behavior was new, post-training. The model could not adapt because it had no mechanism to learn from the transactions it was making decisions on.

The fix the team implemented: online learning. As transactions are processed and their fraud labels are eventually confirmed (by customer dispute, manual review, or automated pattern detection), those labeled transactions are immediately used to update the model. The update happens within hours of the label becoming available, not in a weekly batch retrain.

Within 48 hours of enabling online learning, the fraud detection rate on the new attack pattern had climbed to 88%. By the end of the week, it reached 93.7% - nearly matching the model's overall accuracy. The fraud ring's effectiveness dropped precipitously.

This is what online learning enables: closing the gap between when the world changes and when your model adapts.


Why This Exists - The Staleness Problem

Every ML model is a frozen snapshot of the world at the time of training. The world changes. User behavior evolves. Attackers adapt. Product assortments change. Seasons shift. The trained model does not know about any of this.

The classical solution is periodic retraining: every week (or day, or hour), retrain on recent data. This works for slow-changing domains but has three problems:

Latency: there is an irreducible gap between when a pattern appears and when it is captured in a retraining batch, processed, trained upon, validated, and deployed. This gap is often days to weeks.

Cost: full retraining is expensive, especially for large models. Running a full training job daily for a 1B parameter model can cost hundreds of GPU-hours per day.

Data imbalance: rare recent events may not appear enough in a training batch to learn from. The new fraud pattern may have 100 examples in the most recent 24 hours but be swamped by millions of normal transactions.

Online learning addresses all three: it updates the model continuously as new labeled examples arrive, with immediate effect, at a fraction of the compute cost of full retraining.


Historical Context

Online learning predates modern neural networks. The Perceptron algorithm (Rosenblatt, 1958) was one of the earliest online learning algorithms - it updates weights one example at a time as examples arrive from a stream. Online gradient descent was later analyzed within online convex optimization (Zinkevich, 2003; Hazan, 2006; Shalev-Shwartz, 2007), which provides regret guarantees under mild conditions such as convex losses and bounded gradients.

Vowpal Wabbit (VW, Langford et al., 2007 at Yahoo Research) became the production standard for online learning in recommendation and ad systems. VW's hashing trick (feature names hashed to indices, avoiding dictionary maintenance) and its parallelized SGD implementation enabled online learning at industrial scale - billions of examples per day.

Bandit methods, which date back to Thompson sampling (Thompson, 1933), were rediscovered for recommendation systems in the 2010s. Instead of learning from historical labels, bandit algorithms make decisions and learn from immediate feedback (clicks, purchases), enabling exploration-exploitation tradeoffs that are impossible in offline learning.

Neural network online learning is harder because of catastrophic forgetting (McCloskey and Cohen, 1989): when a neural network is trained on new data, it tends to overwrite weights that encoded old knowledge. The field of continual learning (Parisi et al., 2019) addresses this with techniques like EWC (Elastic Weight Consolidation), replay buffers, and progressive networks.


Online Learning vs Batch Retraining

The choice between online learning and periodic batch retraining depends on the domain's rate of change and the feedback loop latency:

| Aspect | Batch Retraining | Online Learning |
| --- | --- | --- |
| Adaptation speed | Hours to weeks | Seconds to hours |
| Compute cost | High (full training) | Low (incremental) |
| Stability | High (validated batch) | Lower (noisy updates) |
| Catastrophic forgetting | No issue | Major concern |
| Implementation complexity | Low | High |
| Best for | Stable distributions | Fast-changing patterns |

The Online Learning Math

Online learning minimizes the cumulative loss over a stream of examples:

$$\min_w \sum_{t=1}^{T} \ell(w, x_t, y_t)$$

Where $\ell$ is a convex loss function and $(x_t, y_t)$ is the $t$-th example. The online gradient descent update:

$$w_{t+1} = w_t - \eta_t \nabla \ell(w_t, x_t, y_t)$$
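A minimal sketch of this update for the logistic loss, in plain Python. The function names (`predict_proba`, `ogd_step`) and the toy stream are illustrative assumptions, not part of any library.

```python
import math
from typing import List, Sequence


def predict_proba(w: Sequence[float], x: Sequence[float]) -> float:
    """Sigmoid of the dot product w . x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def ogd_step(w: List[float], x: Sequence[float], y: int, eta: float) -> List[float]:
    """One online gradient descent step on the logistic loss.

    For y in {0, 1}, the gradient of the log loss w.r.t. w is (p - y) * x.
    """
    p = predict_proba(w, x)
    return [wi - eta * (p - y) * xi for wi, xi in zip(w, x)]


# Toy stream: the label is 1 when the first feature is positive.
# The second feature is a constant bias term.
stream = [([1.0, 1.0], 1), ([-1.0, 1.0], 0), ([2.0, 1.0], 1), ([-2.0, 1.0], 0)] * 50

w = [0.0, 0.0]
for x_t, y_t in stream:
    w = ogd_step(w, x_t, y_t, eta=0.1)  # w_{t+1} = w_t - eta * gradient

print(w)
```

Each example is used once and then discarded, which is what lets this style of update run on an unbounded stream.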

The learning rate schedule is critical. A large fixed learning rate makes the model chase recent examples and "forget" older patterns. A decaying learning rate (common in batch training) makes the model stop adapting once enough examples have been seen. For non-stationary distributions, a constant but small learning rate is often the best compromise because it keeps the model adaptable. For a known horizon $T$, setting the constant to $\eta \propto 1/\sqrt{T}$ gives the standard regret bound for convex losses:

$$\eta_t = \eta \propto \frac{1}{\sqrt{T}} \quad \Rightarrow \quad \text{regret} = O(\sqrt{T})$$

Regret is the gap between the online model's cumulative loss and that of the best fixed model in hindsight, so the online model's average loss exceeds the best fixed model's by $O(1/\sqrt{T})$ - an acceptable cost for maintaining adaptability.
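The effect of the schedule can be seen on a toy problem: tracking the mean of a stream whose value shifts halfway through. This is a sketch under simplifying assumptions (noise-free data, squared loss, a hypothetical `track_mean` helper), not a benchmark.

```python
def track_mean(stream, lr_schedule):
    """Estimate the stream mean online: w <- w - eta_t * (w - x_t).

    This is gradient descent on the squared loss 0.5 * (w - x_t)^2.
    """
    w = 0.0
    for t, x in enumerate(stream, start=1):
        w -= lr_schedule(t) * (w - x)
    return w


# Noise-free toy stream with an abrupt shift from 0 to 1 halfway through.
stream = [0.0] * 1000 + [1.0] * 1000

w_decay = track_mean(stream, lambda t: 1.0 / t)  # eta_t = 1/t: running average
w_const = track_mean(stream, lambda t: 0.05)     # constant eta

print(f"decaying: {w_decay:.3f}, constant: {w_const:.3f}")
```

With $\eta_t = 1/t$ the estimate is the running average of everything seen, so it ends at about 0.5, halfway between the old and new regimes. The constant rate ends at about 1.0 because older examples are discounted geometrically.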

AdaGrad (Duchi et al., 2011) adapts the learning rate per-feature, giving larger updates to infrequent features:

$$\eta_{t,i} = \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}}$$

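Here $g_{\tau,i}$ is the $i$-th component of the gradient at step $\tau$, so a feature's step size shrinks only as that feature accumulates gradient. This is particularly useful for sparse features (user IDs, item IDs) where most features are rarely seen.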
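A sketch of the per-feature rule for sparse gradients, to make the "larger updates for infrequent features" point concrete. The class name `SparseAdaGrad`, the `eps` term, and the feature names are assumptions made for illustration.

```python
import math
from collections import defaultdict
from typing import Dict


class SparseAdaGrad:
    """Per-feature AdaGrad over sparse gradients {feature_name: gradient}."""

    def __init__(self, eta: float = 0.1, eps: float = 1e-8):
        self.eta = eta
        self.eps = eps  # guards against division by zero (implementation detail)
        self.w: Dict[str, float] = defaultdict(float)
        self.g2: Dict[str, float] = defaultdict(float)  # sum of squared gradients

    def update(self, grad: Dict[str, float]) -> None:
        """Apply one AdaGrad step; only features present in grad are touched."""
        for name, g in grad.items():
            self.g2[name] += g * g
            self.w[name] -= self.eta / (math.sqrt(self.g2[name]) + self.eps) * g


opt = SparseAdaGrad(eta=0.1)
for _ in range(100):
    opt.update({"country_US": 1.0})  # frequent feature
opt.update({"user_98765": 1.0})      # rare feature, seen once

step_frequent = opt.eta / math.sqrt(opt.g2["country_US"])
step_rare = opt.eta / math.sqrt(opt.g2["user_98765"])
print(step_frequent, step_rare)
```

After 100 unit gradients the frequent feature's step size has shrunk to a tenth of the rare feature's, so a newly seen user ID still moves quickly.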


Implementing Online Learning in Production

Mini-Batch Online Learning

Pure sample-by-sample online learning is noisy (high variance updates). Mini-batch online learning processes small batches (16-256 examples) that arrive continuously:

# mini_batch_online_learning.py - production online learning with Kafka
import asyncio
import copy
import json
import time
from typing import List

import torch
import torch.nn as nn
from aiokafka import AIOKafkaConsumer


class OnlineLearner:
    """
    Continuously updates a model from labeled examples arriving via Kafka.
    Designed for fraud detection, recommendation, and click prediction.
    """

    def __init__(
        self,
        model: nn.Module,
        learning_rate: float = 1e-4,
        mini_batch_size: int = 64,
        max_batch_wait_ms: float = 500,  # accumulate up to 500ms of examples
        # Stability controls
        gradient_clip_norm: float = 1.0,
        momentum_alpha: float = 0.99,  # for exponential moving average
    ):
        self.model = model
        self.optimizer = torch.optim.Adam(
            model.parameters(),
            lr=learning_rate,
            betas=(0.9, 0.999),
        )
        self.mini_batch_size = mini_batch_size
        self.max_batch_wait_ms = max_batch_wait_ms
        self.gradient_clip = gradient_clip_norm

        # Exponential moving average of model weights (for stability)
        # EMA model is used for serving; raw model is used for learning
        self.ema_model = self._copy_model(model)
        self.ema_model.eval()  # the serving copy never runs in training mode
        self.ema_alpha = momentum_alpha

        self.total_updates = 0
        self.loss_criterion = nn.BCEWithLogitsLoss()

    async def consume_and_learn(self, kafka_bootstrap_servers: str):
        """Main loop: consume labeled examples and update model continuously."""
        consumer = AIOKafkaConsumer(
            "labeled-transactions",  # fraud labels arrive here
            bootstrap_servers=kafka_bootstrap_servers,
            group_id="online-learner",
            auto_offset_reset="latest",  # start from newest (not historical)
            value_deserializer=lambda v: json.loads(v.decode()),
        )

        await consumer.start()
        buffer: List[dict] = []
        buffer_start_time = time.monotonic()

        try:
            async for message in consumer:
                example = message.value
                buffer.append(example)

                # Flush when batch is full OR max wait time exceeded.
                # Note: this check only runs when a message arrives, so a
                # partial batch waits until the next message before flushing.
                batch_age_ms = (time.monotonic() - buffer_start_time) * 1000
                if (len(buffer) >= self.mini_batch_size or
                        batch_age_ms >= self.max_batch_wait_ms):
                    await self._update_model(buffer)
                    buffer = []
                    buffer_start_time = time.monotonic()
        finally:
            await consumer.stop()

    async def _update_model(self, examples: List[dict]):
        """Run one gradient step on a mini-batch of labeled examples."""
        # get_running_loop() is the supported call inside a coroutine
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, self._gradient_step, examples)

    def _gradient_step(self, examples: List[dict]):
        """CPU/GPU computation - run in thread pool to not block event loop."""
        # Prepare batch
        features = torch.tensor(
            [e["features"] for e in examples], dtype=torch.float32
        )
        labels = torch.tensor(
            [e["label"] for e in examples], dtype=torch.float32
        )

        # Forward pass
        self.model.train()
        self.optimizer.zero_grad()
        # Assumes the model outputs shape (batch, 1). squeeze(-1) keeps the
        # batch dimension even when the batch holds a single example.
        logits = self.model(features).squeeze(-1)
        loss = self.loss_criterion(logits, labels)

        # Backward pass with gradient clipping
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.gradient_clip)
        self.optimizer.step()

        # Update EMA model (used for serving)
        self._update_ema()

        self.total_updates += 1
        if self.total_updates % 100 == 0:
            print(f"Update {self.total_updates}: loss={loss.item():.4f}, "
                  f"batch_size={len(examples)}")

    def _update_ema(self):
        """Update exponential moving average of model weights."""
        with torch.no_grad():
            for ema_param, model_param in zip(
                self.ema_model.parameters(), self.model.parameters()
            ):
                ema_param.mul_(self.ema_alpha).add_(
                    model_param, alpha=1.0 - self.ema_alpha
                )

    def _copy_model(self, model: nn.Module) -> nn.Module:
        """Create a deep copy of the model for EMA."""
        return copy.deepcopy(model)

    def get_serving_model(self) -> nn.Module:
        """Return the EMA model (stable, used for serving)."""
        return self.ema_model
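The flush rule in `consume_and_learn` (batch full, or older than `max_batch_wait_ms`) can be isolated into a small helper, which makes it testable without Kafka. `MiniBatchBuffer` and its injectable `clock` are hypothetical names introduced for this sketch.

```python
import time
from typing import Callable, List, Optional


class MiniBatchBuffer:
    """Accumulates examples and flushes on size or age, whichever comes first."""

    def __init__(self, max_size: int = 64, max_wait_ms: float = 500.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_size = max_size
        self.max_wait_ms = max_wait_ms
        self.clock = clock
        self._items: List[dict] = []
        self._started: Optional[float] = None  # arrival time of the oldest item

    def add(self, example: dict) -> Optional[List[dict]]:
        """Add one example; return a batch if a flush condition is met."""
        if self._started is None:
            self._started = self.clock()
        self._items.append(example)
        return self.poll()

    def poll(self) -> Optional[List[dict]]:
        """Return the pending batch if it is full or too old, else None."""
        if not self._items:
            return None
        age_ms = (self.clock() - self._started) * 1000.0
        if len(self._items) >= self.max_size or age_ms >= self.max_wait_ms:
            batch, self._items, self._started = self._items, [], None
            return batch
        return None


# Usage with a fake clock so the example is deterministic.
now = [0.0]
buf = MiniBatchBuffer(max_size=3, max_wait_ms=500.0, clock=lambda: now[0])
first = buf.add({"id": 1})  # None: batch is neither full nor old
now[0] = 0.6                # 600 ms later, no new messages
flushed = buf.poll()        # age-based flush returns the partial batch
```

Calling `poll()` from a periodic timer flushes a partial batch even when no new messages arrive, which a check driven only by message arrival cannot do.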

Vowpal Wabbit for Production Online Learning

Vowpal Wabbit (VW) is the production standard for online learning at scale, used in ad systems at Yahoo, Microsoft, and other large platforms.

# vowpal_wabbit_online.py - VW for real-time ad click prediction
from typing import Optional

from vowpalwabbit import pyvw

# Initialize VW learner
# --sgd: stochastic gradient descent
# --loss_function logistic: binary classification
# --link logistic: make predict() return a probability instead of a raw score
# --learning_rate: base learning rate
# --power_t 0: disable learning-rate decay so the rate stays constant
# --bit_precision: hash size (2^18 = 262144 feature slots)
# Note: pyvw.vw is the older entry point; newer releases expose it as
# vowpalwabbit.Workspace.
vw = pyvw.vw(
    "--sgd "
    "--loss_function logistic "
    "--link logistic "
    "--learning_rate 0.1 "
    "--power_t 0 "
    "--bit_precision 18 "
    "--l1 0.0 "        # L1 regularization (for sparsity)
    "--l2 0.001 "      # L2 regularization (for stability)
    "--quadratic ua "  # interaction features between 'u' and 'a' namespaces
)


def make_vw_example(
    user_id: str,
    ad_id: str,
    context_features: dict,
    label: Optional[float] = None,  # None for prediction, +1/-1 for learning
) -> str:
    """
    Format a VW example string.
    VW format: [label] [importance] |namespace feature:value ...
    """
    label_str = f"{label} " if label is not None else ""

    # User namespace
    user_ns = f"|u id_{user_id} country_{context_features.get('country', 'unknown')}"

    # Ad namespace
    ad_ns = f"|a id_{ad_id} category_{context_features.get('ad_category', 'other')}"

    # Numeric features - VW handles sparse + dense
    feature_ns = (
        f"|f hour:{context_features.get('hour', 12)} "
        f"price_tier:{int(context_features.get('price', 10) // 10)}"
    )

    return f"{label_str}{user_ns} {ad_ns} {feature_ns}"


def predict(vw_model, user_id: str, ad_id: str, context: dict) -> float:
    """Get click probability prediction."""
    example_str = make_vw_example(user_id, ad_id, context)
    prediction = vw_model.predict(example_str)
    return prediction  # probability in [0, 1] because of --link logistic


def learn(vw_model, user_id: str, ad_id: str, context: dict, clicked: bool):
    """Update model with observed click/no-click label."""
    label = 1 if clicked else -1  # VW uses +1/-1 for logistic loss
    example_str = make_vw_example(user_id, ad_id, context, label=label)
    vw_model.learn(example_str)  # gradient step happens here


# Production pattern: predict → serve ad → observe click → learn
context = {"country": "US", "hour": 14, "ad_category": "electronics", "price": 49.99}

score = predict(vw, user_id="user_12345", ad_id="ad_67890", context=context)
print(f"Predicted CTR: {score:.3f}")

# ... user sees ad and clicks ...

learn(vw, user_id="user_12345", ad_id="ad_67890", context=context, clicked=True)
# Model is immediately updated - the next prediction for a similar
# (user, ad, context) should typically be higher
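The `--bit_precision 18` setting controls the hashing trick: feature names are hashed straight into a fixed-size weight table, so no feature dictionary has to be maintained. The sketch below shows the idea in plain Python. It uses `zlib.crc32` only because it is stable across runs; VW uses its own hashing internally, and the `namespace^feature` token format here is an assumption for illustration.

```python
import zlib
from typing import Dict, List


def hash_features(tokens: List[str], bits: int = 18) -> Dict[int, float]:
    """Map string features to indices in a table of 2**bits slots.

    Features that collide share a slot and their values add up.
    """
    mask = (1 << bits) - 1
    vec: Dict[int, float] = {}
    for tok in tokens:
        idx = zlib.crc32(tok.encode("utf-8")) & mask
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec


x = hash_features(["u^id_user_12345", "u^country_US", "a^id_ad_67890"])
print(len(x), max(x) < 2 ** 18)
```

A previously unseen user ID needs no vocabulary update: it simply hashes to some slot, at the cost of occasional collisions when the table is small relative to the number of distinct features.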

Bandit Algorithms for Online Learning

Multi-armed bandit algorithms combine prediction and online learning in a single framework. Instead of learning from historical labels, bandits make decisions and learn from immediate feedback.

# thompson_sampling.py - Thompson Sampling bandit for recommendation
from dataclasses import dataclass
from typing import Dict, List

import numpy as np


@dataclass
class BanditArm:
    """Represents a recommendation option (item, ad, content variant)."""
    arm_id: str
    alpha: float = 1.0  # Beta distribution parameter (success count + 1)
    beta: float = 1.0   # Beta distribution parameter (failure count + 1)

    def sample(self) -> float:
        """Sample from Beta(alpha, beta) distribution - Thompson Sampling."""
        return np.random.beta(self.alpha, self.beta)

    @property
    def estimated_ctr(self) -> float:
        """Expected value of Beta distribution = alpha / (alpha + beta)."""
        return self.alpha / (self.alpha + self.beta)

    @property
    def uncertainty(self) -> float:
        """Standard deviation of Beta distribution - high early, low after many samples."""
        n = self.alpha + self.beta
        p = self.estimated_ctr
        # Var[Beta(alpha, beta)] = p * (1 - p) / (alpha + beta + 1)
        return float(np.sqrt(p * (1 - p) / (n + 1)))


class ThompsonSamplingBandit:
    """
    Thompson Sampling bandit for real-time recommendation.
    Balances exploration (trying uncertain options) and exploitation (using best known).
    """

    def __init__(self, arms: List[str]):
        self.arms: Dict[str, BanditArm] = {
            arm_id: BanditArm(arm_id=arm_id) for arm_id in arms
        }

    def select(self, n_recommendations: int = 5) -> List[str]:
        """
        Select top-N arms via Thompson Sampling.
        Each arm samples from its Beta distribution - arms with more uncertainty
        have higher variance and will occasionally be selected for exploration.
        """
        # Sample from each arm's Beta distribution
        samples = {
            arm_id: arm.sample()
            for arm_id, arm in self.arms.items()
        }

        # Return top-N by sample value
        top_arms = sorted(samples.keys(), key=lambda k: samples[k], reverse=True)
        return top_arms[:n_recommendations]

    def update(self, arm_id: str, reward: float):
        """
        Update arm statistics after observing reward.
        reward = 1 for click/purchase, 0 for no engagement.
        """
        if arm_id not in self.arms:
            self.arms[arm_id] = BanditArm(arm_id=arm_id)

        arm = self.arms[arm_id]
        # Clamp to [0, 1] so alpha + beta grows by exactly one per observation;
        # for binary rewards this increments either the success or failure count.
        reward = min(max(reward, 0.0), 1.0)
        arm.alpha += reward
        arm.beta += 1.0 - reward

    def add_arm(self, arm_id: str):
        """Add a new content variant - starts with Beta(1,1) = uniform uncertainty."""
        self.arms[arm_id] = BanditArm(arm_id=arm_id)

    def get_stats(self) -> List[dict]:
        """Return estimated CTR and uncertainty for each arm."""
        return [
            {
                "arm_id": arm_id,
                "estimated_ctr": arm.estimated_ctr,
                "uncertainty": arm.uncertainty,
                "n_samples": arm.alpha + arm.beta - 2,  # subtract prior counts
            }
            for arm_id, arm in sorted(
                self.arms.items(),
                key=lambda x: x[1].estimated_ctr,
                reverse=True
            )
        ]
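To see the exploration-exploitation behavior end to end, the following self-contained simulation runs Beta-Bernoulli Thompson Sampling against arms with known click rates. It is a sketch using only the standard library; the arm names, click rates, and the `run_thompson` helper are made up for illustration.

```python
import random
from typing import Dict


def run_thompson(true_ctr: Dict[str, float], rounds: int, seed: int = 0) -> Dict[str, int]:
    """Simulate Beta-Bernoulli Thompson Sampling; return pulls per arm."""
    rng = random.Random(seed)
    alpha = {a: 1.0 for a in true_ctr}  # Beta(1, 1) prior per arm
    beta = {a: 1.0 for a in true_ctr}
    pulls = {a: 0 for a in true_ctr}

    for _ in range(rounds):
        # Draw one plausible CTR per arm and play the arm with the best draw.
        arm = max(true_ctr, key=lambda a: rng.betavariate(alpha[a], beta[a]))
        clicked = rng.random() < true_ctr[arm]  # simulated user feedback
        pulls[arm] += 1
        if clicked:
            alpha[arm] += 1.0
        else:
            beta[arm] += 1.0
    return pulls


pulls = run_thompson({"A": 0.05, "B": 0.10, "C": 0.30}, rounds=5000)
print(pulls)
```

Early rounds spread pulls across all arms while the posteriors are wide. As evidence accumulates, the draws for the weaker arms rarely win, so most traffic shifts to the best arm without any explicit exploration parameter.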

Preventing Catastrophic Forgetting

The central challenge of neural network online learning is that gradient updates for new patterns can overwrite weights encoding old patterns.

Elastic Weight Consolidation (EWC)

EWC adds a regularization term that protects important weights from the previous task:

$$\mathcal{L}_{EWC} = \mathcal{L}_{new} + \frac{\lambda}{2} \sum_i F_i (\theta_i - \theta_i^*)^2$$

Where $F_i$ is the Fisher information for weight $i$ (how important it is for old tasks), $\theta_i^*$ are the weights after learning the old task, and $\lambda$ controls the tradeoff between plasticity and stability.
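A small worked example of the penalty term alone, with made-up numbers, shows how the Fisher weights make the same movement cheap for one weight and expensive for another. The helper name `ewc_penalty` and the values are illustrative assumptions.

```python
from typing import Sequence


def ewc_penalty(theta: Sequence[float], theta_star: Sequence[float],
                fisher: Sequence[float], lam: float) -> float:
    """(lambda / 2) * sum_i F_i * (theta_i - theta_i^*)^2."""
    return 0.5 * lam * sum(
        f * (t - ts) ** 2 for f, t, ts in zip(fisher, theta, theta_star)
    )


theta_star = [1.0, 1.0]  # weights after learning the old task
fisher = [10.0, 0.01]    # weight 0 matters for the old task, weight 1 barely does

# Move each weight by the same amount (0.5) and compare the penalties.
move_important = ewc_penalty([1.5, 1.0], theta_star, fisher, lam=1.0)
move_unimportant = ewc_penalty([1.0, 1.5], theta_star, fisher, lam=1.0)
print(move_important, move_unimportant)  # prints: 1.25 0.00125
```

The optimizer is therefore pushed to absorb new patterns using weights the old task did not rely on.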

# ewc.py - Elastic Weight Consolidation for online learning
from typing import Dict

import torch
import torch.nn as nn


class EWCLearner:
    """
    Online learner with EWC regularization to prevent catastrophic forgetting.
    Maintains a reference checkpoint and Fisher information matrix.
    """

    def __init__(
        self,
        model: nn.Module,
        ewc_lambda: float = 1000.0,  # regularization strength
        fisher_samples: int = 200,   # max batches used for Fisher approximation
    ):
        self.model = model
        self.ewc_lambda = ewc_lambda
        self.fisher_samples = fisher_samples

        self.reference_params: Dict[str, torch.Tensor] = {}
        self.fisher_matrix: Dict[str, torch.Tensor] = {}

    def consolidate(self, stable_dataloader):
        """
        After learning a stable distribution, compute Fisher information
        and store reference weights. Call this periodically (e.g., weekly).
        """
        # Store current weights as reference
        self.reference_params = {
            name: param.detach().clone()
            for name, param in self.model.named_parameters()
        }

        # Compute Fisher information (diagonal approximation)
        self.fisher_matrix = {
            name: torch.zeros_like(param)
            for name, param in self.model.named_parameters()
        }

        was_training = self.model.training
        self.model.eval()
        n_batches = 0
        for inputs, targets in stable_dataloader:
            if n_batches >= self.fisher_samples:
                break

            self.model.zero_grad()
            output = self.model(inputs)
            loss = nn.functional.cross_entropy(output, targets)
            loss.backward()

            for name, param in self.model.named_parameters():
                if param.grad is not None:
                    # Empirical Fisher ≈ E[grad²] (diagonal approximation),
                    # estimated here from batch-averaged gradients
                    self.fisher_matrix[name] += param.grad.detach() ** 2
            n_batches += 1

        # Normalize by the number of batches actually processed
        for name in self.fisher_matrix:
            self.fisher_matrix[name] /= max(n_batches, 1)

        self.model.zero_grad()
        self.model.train(was_training)  # restore the previous mode

    def ewc_loss(self) -> torch.Tensor:
        """Compute EWC regularization term."""
        if not self.reference_params:
            return torch.tensor(0.0)

        ewc_penalty = 0.0
        for name, param in self.model.named_parameters():
            if name in self.fisher_matrix:
                penalty = (
                    self.fisher_matrix[name] *
                    (param - self.reference_params[name]) ** 2
                ).sum()
                ewc_penalty = ewc_penalty + penalty

        return self.ewc_lambda / 2 * ewc_penalty

    def gradient_step(self, loss: torch.Tensor, optimizer: torch.optim.Optimizer):
        """Update model with EWC-regularized loss."""
        total_loss = loss + self.ewc_loss()
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()

Replay Buffer

An alternative to EWC: maintain a buffer of examples from the old distribution and mix them with new examples during updates:

# replay_buffer.py - experience replay to prevent forgetting
import random
from collections import deque
from typing import List, Tuple

import torch


class ReplayBuffer:
    """
    Maintains a buffer of historical examples.
    Mixed with new examples during online learning to prevent forgetting.
    """

    def __init__(self, capacity: int = 10000, replay_ratio: float = 0.3):
        self.buffer = deque(maxlen=capacity)
        self.replay_ratio = replay_ratio  # fraction of batch from replay

    def add(self, features: torch.Tensor, label: torch.Tensor):
        """Add example to replay buffer."""
        self.buffer.append((features.clone(), label.clone()))

    def sample_mixed_batch(
        self,
        new_examples: List[Tuple],
        batch_size: int
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Create a batch mixing new examples with replayed historical ones.
        Prevents catastrophic forgetting by maintaining old distribution.
        """
        n_replay = min(
            int(batch_size * self.replay_ratio),
            len(self.buffer)
        )
        n_new = batch_size - n_replay

        # Sample from replay buffer
        replay_examples = random.sample(list(self.buffer), n_replay)

        # Take from new examples (truncate if more than needed)
        new_subset = new_examples[:n_new]

        # Combine
        all_features = (
            [e[0] for e in new_subset] +
            [e[0] for e in replay_examples]
        )
        all_labels = (
            [e[1] for e in new_subset] +
            [e[1] for e in replay_examples]
        )

        return torch.stack(all_features), torch.stack(all_labels)
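One caveat with a `deque(maxlen=...)` buffer is that it evicts in arrival order, so once it fills it holds only the most recent examples. If the goal is to keep a sample of the whole history, reservoir sampling is a common alternative. The sketch below is that swapped-in technique, not the buffer above; `ReservoirBuffer` is a hypothetical name.

```python
import random
from typing import Any, List


class ReservoirBuffer:
    """Fixed-size buffer holding a uniform random sample of everything seen."""

    def __init__(self, capacity: int, seed: int = 0):
        self.capacity = capacity
        self.items: List[Any] = []
        self.n_seen = 0
        self._rng = random.Random(seed)

    def add(self, example: Any) -> None:
        """Reservoir sampling (Algorithm R).

        After n_seen items, each one is in the buffer with
        probability capacity / n_seen.
        """
        self.n_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            j = self._rng.randrange(self.n_seen)  # uniform in [0, n_seen)
            if j < self.capacity:
                self.items[j] = example

    def sample(self, k: int) -> List[Any]:
        """Draw up to k distinct buffered examples."""
        return self._rng.sample(self.items, min(k, len(self.items)))


buf = ReservoirBuffer(capacity=100)
for t in range(10_000):
    buf.add(t)  # t stands in for a (features, label) pair

old_fraction = sum(1 for t in buf.items if t < 5_000) / len(buf.items)
print(f"fraction of buffer from the first half of the stream: {old_fraction:.2f}")
```

Roughly half the buffer still comes from the first half of the stream, whereas a FIFO buffer of the same size would contain only the last 100 examples.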

Stability vs Adaptability Tradeoffs

The fundamental tension in online learning: learning rate controls both adaptability (high LR = fast adaptation) and stability (low LR = less noise, more consistent).

# adaptive_lr.py - adaptive learning rate based on drift detection
from collections import deque

import numpy as np


class DriftAdaptiveLearner:
    """
    Adjusts learning rate based on detected concept drift.
    High LR when drift detected; low LR for stable periods.
    """

    def __init__(
        self,
        base_lr: float = 1e-4,
        drift_lr_multiplier: float = 10.0,
        drift_window: int = 500,
        drift_threshold_sigma: float = 3.0,
    ):
        self.base_lr = base_lr
        self.drift_lr_multiplier = drift_lr_multiplier
        self.recent_losses = deque(maxlen=drift_window)
        self.drift_threshold = drift_threshold_sigma
        self._drift_active = False

    def update(self, current_loss: float) -> float:
        """
        Record loss and return appropriate learning rate.
        Detects drift with a simple z-score comparison of recent vs. older
        losses (a lightweight stand-in for tests such as Page-Hinkley).
        """
        self.recent_losses.append(current_loss)

        # Need 100 recent losses plus at least 100 older ones to compare against
        if len(self.recent_losses) < 200:
            return self.base_lr  # not enough data for drift detection

        # Simple drift detection: is recent loss significantly higher than historical?
        losses = list(self.recent_losses)
        recent_mean = np.mean(losses[-100:])
        historical_mean = np.mean(losses[:-100])
        historical_std = np.std(losses[:-100]) + 1e-8

        z_score = (recent_mean - historical_mean) / historical_std

        if z_score > self.drift_threshold:
            if not self._drift_active:
                print(f"Drift detected: z={z_score:.2f}, increasing LR")
                self._drift_active = True
            return self.base_lr * self.drift_lr_multiplier
        else:
            if self._drift_active:
                print(f"Drift resolved: z={z_score:.2f}, restoring LR")
                self._drift_active = False
            return self.base_lr
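For comparison, here is a sketch of the Page-Hinkley test itself, which accumulates deviations from the running mean and raises an alarm when the cumulative sum rises far enough above its historical minimum. The parameter defaults (`delta`, `threshold`) and the synthetic loss stream are assumptions chosen for illustration.

```python
class PageHinkley:
    """Page-Hinkley test for an upward shift in the mean of a stream."""

    def __init__(self, delta: float = 0.005, threshold: float = 5.0):
        self.delta = delta          # magnitude of change that is tolerated
        self.threshold = threshold  # alarm threshold (often written lambda)
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0      # m_t: cumulative deviation from the running mean
        self.cum_min = 0.0  # M_t: running minimum of m_t

    def update(self, x: float) -> bool:
        """Feed one observation; return True if drift is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return (self.cum - self.cum_min) > self.threshold


detector = PageHinkley(delta=0.005, threshold=5.0)
losses = [0.30] * 500 + [0.60] * 500  # loss jumps at step 500

# Index of the first observation that triggers the alarm (None if never).
alarm_at = next(
    (t for t, loss in enumerate(losses) if detector.update(loss)), None
)
print(alarm_at)
```

The returned flag could drive the same learning-rate switch as the z-score rule. In practice the detector would be reset after an alarm so that it can detect the next shift.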

Production Engineering Notes

When Online Learning is Not Appropriate

Online learning is not universally better than periodic retraining. Avoid it when:

  • Labels are delayed: fraud chargebacks take 30-60 days; customer churn labels need 90 days to confirm. Online learning on stale labels is counterproductive.
  • Model architecture changes: if the new model has a different architecture, online learning cannot transition smoothly - full retraining is required.
  • Regulatory requirements: in regulated industries (healthcare, finance), model changes must be logged, validated, and auditable. Continuous weight changes are hard to audit.
  • Distribution shifts require complete retraining: if the distribution shift is so severe that the current model's feature representations are no longer relevant, online learning on the broken representation makes things worse.

Safe Online Learning Deployment Pattern

Never deploy raw online-learned models to 100% of traffic. Use:

  1. Train primary model offline → deploy to 95% of traffic
  2. Run online learner in shadow mode (learns but does not serve) → validate on held-out data
  3. Gradually shift traffic to online model (5% → 20% → 50%) while monitoring business metrics
  4. Keep offline model as instant fallback; auto-revert if metrics drop
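The staged rollout with auto-revert can be sketched as a single decision function. Everything here is an assumption for illustration: the function name, the 2% tolerance, and the convention that a higher metric is better.

```python
from typing import Sequence


def next_traffic_share(
    current_share: float,
    online_metric: float,
    offline_metric: float,
    stages: Sequence[float] = (0.05, 0.20, 0.50),
    max_relative_drop: float = 0.02,
) -> float:
    """Return the online model's next traffic share.

    Reverts to 0.0 (the offline model serves everything) if the online model's
    metric is more than max_relative_drop below the offline model's.
    Otherwise advances to the next stage, or holds at the final stage.
    Assumes a higher metric is better.
    """
    if online_metric < offline_metric * (1.0 - max_relative_drop):
        return 0.0
    for stage in stages:
        if stage > current_share:
            return stage
    return stages[-1]


print(next_traffic_share(0.05, online_metric=0.95, offline_metric=0.94))  # advance
print(next_traffic_share(0.20, online_metric=0.80, offline_metric=0.94))  # revert
```

A production version would also require a minimum observation period at each stage before advancing.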

Common Mistakes

:::danger Using Online Learning Without Label Delay Handling
Most real-world labels are delayed. A click label is available immediately; a fraud label may take 7-60 days to confirm. Feeding a model unconfirmed labels (e.g., transactions not yet confirmed as fraud) trains it on noise. Implement an explicit delay buffer: queue examples when they arrive, only pass them to the online learner after the confirmation window has elapsed. "Online" means learning quickly after labels are confirmed - not learning from unconfirmed predictions.
:::
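A minimal sketch of such a delay buffer, using a heap ordered by release time. The class and method names are hypothetical, and the rule that an example with no fraud confirmation by the end of the window counts as legitimate is an assumption that a real system would need to validate.

```python
import heapq
from typing import Dict, List, Tuple


class LabelDelayBuffer:
    """Holds examples until their confirmation window has elapsed."""

    def __init__(self, confirmation_window_s: float):
        self.window = confirmation_window_s
        self._heap: List[Tuple[float, str]] = []  # (release_time, example_id)
        self._features: Dict[str, list] = {}
        self._labels: Dict[str, int] = {}

    def add(self, example_id: str, features: list, now: float) -> None:
        """Queue an example at prediction time; its label is not yet trusted."""
        self._features[example_id] = features
        heapq.heappush(self._heap, (now + self.window, example_id))

    def confirm_fraud(self, example_id: str) -> None:
        """Record a confirmed fraud label (e.g. from a chargeback)."""
        if example_id in self._features:
            self._labels[example_id] = 1

    def release(self, now: float) -> List[Tuple[list, int]]:
        """Return (features, label) pairs whose window has elapsed.

        Examples with no fraud confirmation by then are labeled 0 (legitimate).
        """
        ready = []
        while self._heap and self._heap[0][0] <= now:
            _, example_id = heapq.heappop(self._heap)
            features = self._features.pop(example_id)
            ready.append((features, self._labels.pop(example_id, 0)))
        return ready


week = 7 * 24 * 3600.0
buf = LabelDelayBuffer(confirmation_window_s=week)
buf.add("tx1", [497.0], now=0.0)
buf.add("tx2", [25.0], now=10.0)
buf.confirm_fraud("tx1")
early = buf.release(now=3600.0)        # window not elapsed: nothing released
ready = buf.release(now=week + 100.0)  # both examples released with labels
```

Only the output of `release()` would be forwarded to the online learner's input topic.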

:::danger Forgetting to Validate Against an Offline Holdout
Online learning is hard to evaluate because the training and evaluation sets are temporally interleaved. Always maintain a separate holdout stream (e.g., the last 5% of examples by time) that the online learner never trains on, used only for validation. Monitor key metrics on this holdout hourly to catch regressions before they affect production traffic.
:::
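One way to guarantee that the learner never trains on holdout examples is to decide membership deterministically from the example ID. The sketch below uses a hash-based split, which is a different choice from the time-based example in the note above; the function name and the 5% default are assumptions.

```python
import hashlib


def is_holdout(example_id: str, holdout_fraction: float = 0.05) -> bool:
    """Deterministically assign an example to the holdout stream.

    Hashing the ID (rather than calling random()) gives the same answer on
    every machine and every replay, so a held-out example is never trained on.
    """
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64  # uniform in [0, 1)
    return bucket < holdout_fraction


ids = [f"tx_{i}" for i in range(20_000)]
fraction = sum(is_holdout(i) for i in ids) / len(ids)
print(f"holdout fraction: {fraction:.3f}")
```

The consumer would route examples with `is_holdout(...) == True` to an evaluation job and everything else to the learner.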

:::warning Catastrophic Forgetting on Seasonal Patterns
If you run pure online learning year-round, the model will gradually forget seasonal patterns. A fraud model that learned Christmas purchase patterns in December will have nearly forgotten them by the following December, making it vulnerable to holiday-specific attacks. Solutions: (1) use EWC or replay buffers to retain old patterns; (2) maintain a periodic full retrain that captures the full historical distribution, and use online learning only for recency adaptation; (3) use a two-model ensemble: offline model for stable patterns + online model for recent patterns, blended by recency weight.
:::
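The two-model ensemble in option (3) leaves the blending rule open. One possible interpretation, sketched below, gives the online model more weight as the offline model ages. The half-life form, the weight cap, and all parameter values are assumptions, not a prescribed formula.

```python
def blend_scores(
    offline_score: float,
    online_score: float,
    hours_since_full_retrain: float,
    half_life_hours: float = 72.0,
    max_online_weight: float = 0.7,
) -> float:
    """Blend offline and online fraud scores.

    The online model's weight grows from 0 toward max_online_weight as the
    offline model ages; it reaches half of max_online_weight after one
    half-life.
    """
    staleness = 1.0 - 0.5 ** (hours_since_full_retrain / half_life_hours)
    w_online = max_online_weight * staleness
    return (1.0 - w_online) * offline_score + w_online * online_score


fresh = blend_scores(0.2, 0.8, hours_since_full_retrain=0.0)
aged = blend_scores(0.2, 0.8, hours_since_full_retrain=72.0)
print(fresh, aged)
```

Right after a full retrain the offline score is used as-is. Capping the online weight keeps the stable model's seasonal knowledge in every decision.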


Interview Q&A

Q1: What is online learning and when is it preferable to periodic batch retraining?

A: Online learning updates a model incrementally as new labeled examples arrive, instead of collecting a batch and retraining from scratch. It is preferable when: (a) the data distribution changes faster than your retraining cycle - fraud attack patterns can emerge in hours; (b) immediate feedback is available - click/no-click labels for ad systems arrive in seconds; (c) full retraining is prohibitively expensive for the frequency needed - retraining a large model hourly is often impractical. Batch retraining is preferable when: labels are delayed (fraud chargebacks take weeks), model architecture changes are needed, regulatory auditability requires discrete versioned updates, or the distribution shift requires access to the full historical distribution to recover.

Q2: What is catastrophic forgetting and how do you mitigate it in production online learning?

A: Catastrophic forgetting happens when a neural network trained on new data overwrites the weights that encoded old patterns. It occurs because gradient descent optimizes for the current loss - weights for old patterns are "collateral damage." Three techniques help: (1) Replay buffers - maintain a buffer of historical examples, mix them with new examples during every gradient step. The model sees old data regularly and cannot forget it completely. Effective and simple; main cost is storing the replay buffer. (2) EWC (Elastic Weight Consolidation) - after learning a stable distribution, compute the Fisher information (which weights are most important) and add a penalty that prevents those important weights from moving far. Allows learning new patterns while protecting old ones. Computationally moderate. (3) EMA (Exponential Moving Average) - the serving model is a moving average of recent training weights, not the raw trained model. Strictly, this is a stabilizer rather than a forgetting fix: it smooths out noisy updates and prevents a single bad batch from degrading the serving model significantly, but it does not retain old patterns on its own.

Q3: Explain the Thompson Sampling bandit algorithm and why it works better than epsilon-greedy for recommendation.

A: Thompson Sampling maintains a probability distribution over each arm's true reward probability, represented as a Beta distribution. At each decision, it samples once from each arm's distribution and picks the arm with the highest sample. Arms with high uncertainty (high variance in their Beta distribution - few observations) are sometimes sampled with high values, causing exploration. Arms with low uncertainty (many observations, narrow distribution) are sampled predictably near their true CTR, causing exploitation. This automatically balances exploration and exploitation: highly uncertain arms get more exploration proportionally to their uncertainty, high-confidence good arms get exploited. Epsilon-greedy always explores with probability epsilon, regardless of uncertainty - it explores a well-characterized bad arm as often as a new unknown arm. Thompson Sampling is more sample-efficient because it focuses exploration on genuinely uncertain options.

Q4: How do you handle the delayed label problem in online learning for fraud detection?

A: Delayed labels require a temporal buffer. The implementation: (1) At transaction time, record the features and a "pending" status in a label queue keyed by transaction_id and expected_label_time. (2) Run a label resolution process that checks pending transactions for resolved labels - chargebacks from customers, manual review decisions, automated pattern detection. (3) When a label is resolved (confirmed fraud or confirmed legitimate), push the (features, label) pair to the online learning Kafka topic. (4) The online learner consumes from this topic, ensuring it only trains on confirmed labels. The key design parameter is the maximum delay you are willing to wait - 7-day delay gives high label coverage for fraud; 30-minute delay gives fast adaptation but incomplete labels. For most fraud detection systems, a tiered approach works: train immediately on high-confidence automated labels (transaction reversed by issuer = fraud, definitely), and separately train with 7-day delay on slow confirmation labels.

Q5: What is the stability-adaptability tradeoff in online learning, and how do you configure the learning rate?

A: The learning rate $\eta$ controls this tradeoff directly. A high $\eta$ means the model rapidly adapts to new patterns but forgets old ones quickly and is sensitive to noise (individual mislabeled examples can significantly perturb the model). A low $\eta$ means slow adaptation but high stability - outlier examples barely affect the model, but genuine distribution shifts take many examples to learn. For stationary distributions, a decaying learning rate is theoretically optimal (SGD convergence theory). For non-stationary distributions (which most real-world online learning scenarios are), a constant learning rate maintains adaptability indefinitely. A practical approach: start with $\eta = 10^{-4}$ to $10^{-3}$, monitor the model's performance on a held-out stream, and use drift detection (e.g., ADWIN, Page-Hinkley test) to temporarily increase $\eta$ when a distribution shift is detected. This gives baseline stability with accelerated adaptation when the world changes.

© 2026 EngineersOfAI. All rights reserved.