Skip to main content

Online Learning Theory

Reading time: ~25 minutes | Level: Statistical Learning Theory → Production ML Systems

The Streaming Prediction Problem

You're running an ad auction system at a major platform. 10 million ad auctions happen per second. For each auction, you must predict the click-through rate (CTR) of each ad - immediately, in under 5ms - before collecting the feedback on whether the user actually clicked.

You cannot batch up data and retrain weekly. The world changes too fast: advertisers launch new campaigns, trends shift, user behavior evolves. You need a model that:

  1. Makes a prediction for each incoming auction
  2. Receives feedback (click or no-click) immediately after
  3. Updates its parameters instantly
  4. Adapts to distribution shift without explicit retraining

This is the online learning model - fundamentally different from batch learning. The goal is not to minimize test error (you don't know the test distribution), but to minimize regret: how much worse you did compared to the best fixed strategy you could have chosen in hindsight.

What You Will Learn

  • The online learning protocol: how regret is defined and why sublinear regret is the right goal
  • The Perceptron algorithm: the original online learner, with the convergence theorem and proof
  • Follow-The-Leader and its failure mode on adversarial sequences
  • Follow-The-Regularised-Leader: the fix that achieves O(T)O(\sqrt{T}) regret
  • The Hedge algorithm: online learning over NN experts with O(TlnN)O(\sqrt{T \ln N}) regret
  • How online learning theory connects to SGD, AdaGrad, and streaming ML systems
  • Five interview questions with full worked answers

The Online Learning Protocol

The online learning game between a learner and an adversary proceeds as follows:

For each round t=1,2,,Tt = 1, 2, \ldots, T:

  1. Learner chooses hypothesis htHh_t \in \mathcal{H} (or distribution over H\mathcal{H})
  2. Environment/adversary reveals example (xt,yt)(x_t, y_t)
  3. Learner makes prediction y^t=ht(xt)\hat{y}_t = h_t(x_t)
  4. Learner suffers loss t=(ht(xt),yt)\ell_t = \ell(h_t(x_t), y_t)
  5. Learner updates to ht+1h_{t+1}

Cumulative loss of the learner: LT=t=1T(ht(xt),yt)L_T = \sum_{t=1}^T \ell(h_t(x_t), y_t)

Best fixed hypothesis in hindsight: h=argminhHt=1T(h(xt),yt)h^* = \arg\min_{h \in \mathcal{H}} \sum_{t=1}^T \ell(h(x_t), y_t), with loss LT=minhHt(h(xt),yt)L_T^* = \min_{h \in \mathcal{H}} \sum_t \ell(h(x_t), y_t)

Regret: The difference between the learner's cumulative loss and the best fixed hypothesis:

RegretT=LTLT=t=1T(ht(xt),yt)minhHt=1T(h(xt),yt)\text{Regret}_T = L_T - L_T^* = \sum_{t=1}^T \ell(h_t(x_t), y_t) - \min_{h \in \mathcal{H}} \sum_{t=1}^T \ell(h(x_t), y_t)

Goal: Achieve sublinear regret - RegretT=o(T)\text{Regret}_T = o(T) - meaning the per-round regret goes to zero:

RegretTT0 as T\frac{\text{Regret}_T}{T} \to 0 \text{ as } T \to \infty

A no-regret algorithm performs as well as the best fixed strategy in hindsight, on average.

:::note Why Regret, Not Test Error? In online learning, there's no fixed test distribution to evaluate against. The adversary can be adaptive - choosing examples that are hard for your current model. Regret is a comparison to the best fixed strategy in hindsight, without requiring any distributional assumptions. Sublinear regret means: even against an adaptive adversary, your algorithm learns to perform well. :::

The Perceptron: The Original Online Learner

The Perceptron algorithm (Rosenblatt, 1958) is the first online learning algorithm. It's also the foundation of modern neural networks.

Algorithm: For binary classification (y{1,+1}y \in \{-1, +1\}):

  1. Initialize: w1=0\mathbf{w}_1 = \mathbf{0}
  2. At round tt: predict y^t=sign(wtxt)\hat{y}_t = \text{sign}(\mathbf{w}_t^\top x_t)
  3. If y^tyt\hat{y}_t \neq y_t (mistake): update wt+1=wt+ytxt\mathbf{w}_{t+1} = \mathbf{w}_t + y_t x_t
  4. If y^t=yt\hat{y}_t = y_t (correct): wt+1=wt\mathbf{w}_{t+1} = \mathbf{w}_t (no update)

Perceptron Convergence Theorem (Novikoff, 1962):

Suppose the data is linearly separable with margin γ\gamma: there exists w\mathbf{w}^* with w=1\|\mathbf{w}^*\| = 1 such that yt(w)xtγy_t (\mathbf{w}^*)^\top x_t \geq \gamma for all tt, and xtR\|x_t\| \leq R for all tt.

Then the Perceptron makes at most M(R/γ)2M \leq (R/\gamma)^2 mistakes.

Proof sketch: Track two quantities:

  • wtw\mathbf{w}_t \cdot \mathbf{w}^* grows by γ\gamma per mistake (each update improves alignment with w\mathbf{w}^*)
  • wt2\|\mathbf{w}_t\|^2 grows by at most R2R^2 per mistake

After MM mistakes: wM+1wMγ\mathbf{w}_{M+1} \cdot \mathbf{w}^* \geq M\gamma and wM+1RM\|\mathbf{w}_{M+1}\| \leq R\sqrt{M}.

By Cauchy-Schwarz: MγwM+1wwM+1wRMM\gamma \leq \mathbf{w}_{M+1} \cdot \mathbf{w}^* \leq \|\mathbf{w}_{M+1}\| \|\mathbf{w}^*\| \leq R\sqrt{M}.

Therefore: M(R/γ)2M \leq (R/\gamma)^2.

import numpy as np
import matplotlib.pyplot as plt

class Perceptron:
"""
Online Perceptron algorithm.
Provably converges in at most (R/gamma)^2 mistakes
if data is linearly separable with margin gamma.
"""
def __init__(self, d):
self.w = np.zeros(d)
self.n_mistakes = 0
self.n_updates = 0

def predict(self, x):
return np.sign(self.w @ x)

def update(self, x, y):
"""Online update: only update on mistakes."""
y_hat = self.predict(x)
if y_hat != y:
self.w = self.w + y * x
self.n_mistakes += 1
return True # mistake
return False # correct

def run(self, X, y):
"""Run perceptron over a sequence of examples."""
mistake_history = []
for xi, yi in zip(X, y):
made_mistake = self.update(xi, yi)
mistake_history.append(made_mistake)
return mistake_history

# Generate linearly separable data with known margin
def generate_separable_data(n, d, margin, noise=0.0):
np.random.seed(42)
w_true = np.random.randn(d)
w_true = w_true / np.linalg.norm(w_true) # unit norm

X = np.random.randn(n, d)
scores = X @ w_true

# Push points to have margin gamma
labels = np.sign(scores)
# Only keep points with |score| > margin (guaranteed separability)
mask = np.abs(scores) > margin
X, labels, scores = X[mask], labels[mask], scores[mask]

if noise > 0:
flip_idx = np.random.choice(len(labels), int(noise * len(labels)), replace=False)
labels[flip_idx] *= -1

return X, labels, w_true

# Test Perceptron with different margins
d = 10
for margin in [0.1, 0.5, 1.0]:
X, y, w_true = generate_separable_data(n=2000, d=d, margin=margin)
R = np.max(np.linalg.norm(X, axis=1))

perceptron = Perceptron(d)
mistakes = perceptron.run(X, y)

theoretical_bound = (R / margin)**2
print(f"Margin γ={margin:.1f}: actual mistakes={perceptron.n_mistakes}, "
f"bound=(R/γ)²={theoretical_bound:.1f}, "
f"n_examples={len(X)}")

Follow-The-Leader (FTL)

The simplest online learning algorithm: at each round, choose the hypothesis that would have been best on all past data.

ht+1=argminhHs=1ts(h)h_{t+1} = \arg\min_{h \in \mathcal{H}} \sum_{s=1}^t \ell_s(h)

When FTL works: For strongly convex losses (e.g., squared loss), FTL achieves regret O(logT)O(\log T).

When FTL fails: For adversarial sequences with non-strongly-convex losses. Example: predict 1 or -1. The adversary always labels opposite to your current prediction. FTL switches every round, achieving linear regret - as bad as possible.

Follow-The-Regularised-Leader (FTRL)

The fix: add a regularisation term to stabilise the solution:

ht+1=argminhH[s=1ts(h)+1ηΩ(h)]h_{t+1} = \arg\min_{h \in \mathcal{H}} \left[\sum_{s=1}^t \ell_s(h) + \frac{1}{\eta}\Omega(h)\right]

where Ω(h)\Omega(h) is a regulariser (e.g., Ω(w)=12w22\Omega(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_2^2) and η>0\eta > 0 is the learning rate.

FTRL achieves regret O(T)O(\sqrt{T}) for convex losses.

The connection to online gradient descent (OGD): OGD is equivalent to FTRL with a quadratic regulariser and linear (gradient) approximation of the loss.

class FTRL:
"""
Follow-The-Regularised-Leader for online linear regression.
Equivalent to online gradient descent with squared regulariser.
"""
def __init__(self, d, eta=0.01, lambda_reg=0.1):
self.w = np.zeros(d)
self.eta = eta
self.lambda_reg = lambda_reg
self.cumulative_grad = np.zeros(d)
self.t = 0

def predict(self, x):
return self.w @ x

def update(self, x, y):
"""FTRL-Proximal update for squared loss."""
self.t += 1
y_hat = self.predict(x)
grad = (y_hat - y) * x # gradient of squared loss
self.cumulative_grad += grad

# FTRL closed form: minimize cumulative loss + regulariser
self.w = -self.eta * self.cumulative_grad / (1 + self.lambda_reg * self.t)
return (y_hat - y)**2


class OnlineGradientDescent:
"""
Online gradient descent (equivalent to FTRL with quadratic regulariser).
"""
def __init__(self, d, eta=0.01):
self.w = np.zeros(d)
self.eta = eta

def predict(self, x):
return self.w @ x

def update(self, x, y):
y_hat = self.predict(x)
grad = (y_hat - y) * x
self.w = self.w - self.eta * grad
return (y_hat - y)**2

# Online regression simulation
np.random.seed(42)
T = 10000
d = 5
w_true = np.random.randn(d)

ftrl = FTRL(d, eta=0.01, lambda_reg=0.01)
ogd = OnlineGradientDescent(d, eta=0.005)

ftrl_losses = []
ogd_losses = []

for t in range(T):
x = np.random.randn(d)
y = w_true @ x + np.random.randn() * 0.1

ftrl_losses.append(ftrl.update(x, y))
ogd_losses.append(ogd.update(x, y))

# Regret = cumulative loss - best fixed hypothesis loss
best_loss = 0.01 # noise variance (achievable by best fixed w*)
ftrl_regret = np.cumsum(ftrl_losses) - np.arange(1, T+1) * best_loss
ogd_regret = np.cumsum(ogd_losses) - np.arange(1, T+1) * best_loss

print(f"FTRL total regret: {ftrl_regret[-1]:.2f}")
print(f"OGD total regret: {ogd_regret[-1]:.2f}")
print(f"FTRL average regret per round: {ftrl_regret[-1]/T:.4f}")
print(f"OGD average regret per round: {ogd_regret[-1]/T:.4f}")
print(f"Theory predicts O(sqrt(T)/T) = O(1/sqrt(T)) = {1/np.sqrt(T):.4f}")

Hedge: Online Learning Over Experts

The expert setting: rather than choosing a single hypothesis, choose a distribution over NN expert predictions.

Hedge algorithm (multiplicative weights update):

  1. Initialize: uniform distribution p1=(1/N,,1/N)p_1 = (1/N, \ldots, 1/N) over NN experts
  2. At round tt: predict y^t=ipt(i)y^t(i)\hat{y}_t = \sum_i p_t(i) \hat{y}_t^{(i)} (weighted average)
  3. Observe losses t(1),,t(N)\ell_t^{(1)}, \ldots, \ell_t^{(N)} for each expert
  4. Update weights: pt+1(i)pt(i)eηt(i)p_{t+1}(i) \propto p_t(i) \cdot e^{-\eta \ell_t^{(i)}} (reduce weight of experts that made large mistakes)
  5. Renormalize: pt+1(i)=pt(i)eηt(i)jpt(j)eηt(j)p_{t+1}(i) = \frac{p_t(i) e^{-\eta \ell_t^{(i)}}}{\sum_j p_t(j) e^{-\eta \ell_t^{(j)}}}

Regret bound: With learning rate η=8lnN/T\eta = \sqrt{8\ln N / T}, Hedge achieves:

RegretTTlnN2\text{Regret}_T \leq \sqrt{\frac{T \ln N}{2}}

Optimal rate: O(TlnN)O(\sqrt{T \ln N}) - sublinear in TT, logarithmic in NN.

class HedgeAlgorithm:
"""
Multiplicative weights (Hedge) for online learning over experts.
Achieves regret O(sqrt(T log N)).
"""
def __init__(self, n_experts, eta=None, T=None):
self.n_experts = n_experts
# Optimal eta: sqrt(8 * log(N) / T) if T is known
self.eta = eta if eta else np.sqrt(8 * np.log(n_experts) / max(T or 1000, 1))
self.weights = np.ones(n_experts) / n_experts
self.t = 0
self.total_regret = 0

def predict_distribution(self):
"""Return current distribution over experts."""
return self.weights.copy()

def update(self, expert_losses):
"""
Update weights based on expert losses.
expert_losses: array of losses for each expert this round.
"""
self.t += 1
# Predicted loss (mixture of experts)
predicted_loss = self.weights @ expert_losses

# Update weights: penalise experts with high loss
self.weights *= np.exp(-self.eta * expert_losses)
self.weights /= self.weights.sum() # renormalise

# Best expert so far loss (for regret computation)
return predicted_loss

# Simulation: N weather prediction models, adversarial temperature
np.random.seed(42)
T, N = 1000, 10

# Expert predictions: some are biased, some are random, one is near-optimal
true_temps = 10 + 5 * np.sin(np.linspace(0, 4*np.pi, T)) + np.random.randn(T)
expert_predictions = np.zeros((T, N))
for i in range(N):
bias = np.random.uniform(-5, 5)
noise = np.random.uniform(0.5, 3.0)
expert_predictions[:, i] = true_temps + bias + np.random.randn(T) * noise

# One expert is nearly perfect
expert_predictions[:, 5] = true_temps + np.random.randn(T) * 0.5

hedge = HedgeAlgorithm(n_experts=N, T=T)
hedge_losses = []
best_expert_losses = [[] for _ in range(N)]

for t in range(T):
dist = hedge.predict_distribution()
mixture_pred = dist @ expert_predictions[t]
expert_losses = (expert_predictions[t] - true_temps[t])**2 # squared loss

hedge_loss = (mixture_pred - true_temps[t])**2
hedge_losses.append(hedge_loss)
for i in range(N): best_expert_losses[i].append(expert_losses[i])

hedge.update(expert_losses)

# Compute regret
total_hedge_loss = sum(hedge_losses)
total_expert_losses = [sum(losses) for losses in best_expert_losses]
best_expert_idx = np.argmin(total_expert_losses)
best_expert_loss = total_expert_losses[best_expert_idx]

regret = total_hedge_loss - best_expert_loss
theoretical_bound = np.sqrt(T * np.log(N) / 2)

print(f"Hedge: total loss = {total_hedge_loss:.2f}")
print(f"Best expert (#{best_expert_idx}): total loss = {best_expert_loss:.2f}")
print(f"Regret = {regret:.2f}")
print(f"Theoretical bound = O(sqrt(T*log(N))) = {theoretical_bound:.2f}")
print(f"Average regret per round = {regret/T:.4f}")

Connection to ML Practice

Online Learning ConceptML Practice Equivalent
Online gradient descentStochastic gradient descent (SGD)
FTRL-ProximalAdaGrad, Adam (adaptive learning rate)
Hedge algorithmGradient boosting (weights on training examples)
Expert settingEnsemble methods, model selection
Regret bound O(T)O(\sqrt{T})Why SGD converges: O(1/T)O(1/\sqrt{T}) average regret
No-regret algorithmConvergence guarantee for SGD-based optimizers
Online-to-batch conversionTraining on streaming data with theoretical guarantees

AdaGrad (Duchi et al., 2011) is directly derived from FTRL: it uses a diagonal adaptive learning rate that accumulates gradient magnitudes. This gives O(T)O(\sqrt{T}) regret, matching the FTRL bound, but with learning rates automatically scaled to each feature's gradient history.

Online Learning in Production: Key Design Decisions

When building a streaming ML system, online learning theory guides several decisions:

1. Regret vs. Accuracy: For non-stationary distributions (user preferences shift, concept drift), regret is the right objective - it measures performance relative to the best adaptive strategy. For stable distributions, batch learning with periodic retraining may be sufficient.

2. Learning rate schedule: The O(T)O(\sqrt{T}) regret bound uses ηt=O(1/t)\eta_t = O(1/\sqrt{t}) learning rates. Fixed learning rates give O(T)O(T) regret (linear, unacceptable). Adaptive rates (AdaGrad, Adam) achieve the optimal bound with feature-specific scaling.

3. Expert ensemble size: The Hedge bound O(TlnN)O(\sqrt{T \ln N}) shows you can maintain NN models with only logarithmic cost in NN. Maintaining 100 candidate model configurations costs only ln1002.1×\sqrt{\ln 100} \approx 2.1\times more than 1 configuration - compelling argument for online model selection.

4. Distribution shift handling: Online learning with a sliding window (forget old data) achieves O(T2/3)O(T^{2/3}) adaptive regret against shifting targets - useful for seasonal patterns, breaking news, or user preference drift.

:::note Role-specific relevance ML Engineers: SGD's convergence guarantees come directly from online learning theory. The O(1/T)O(1/\sqrt{T}) convergence rate for SGD on convex losses IS the O(T)O(\sqrt{T}) regret bound divided by TT. When you tune learning rates, you're implicitly doing what FTRL suggests: balance accumulated gradient history against stability.

Research Engineers / Scientists: Online learning is foundational for understanding modern adaptive optimizers. Adam's moment estimates approximate the FTRL-Proximal solution with a diagonal adaptive matrix. The PAC-online connection (converting online algorithms to batch algorithms via online-to-batch conversion) appears in NeurIPS papers on learning theory.

AI Engineers / Production Systems: Ad auction CTR prediction, fraud detection, personalization - any system that must make predictions on a stream of events with immediate feedback uses online learning principles. The Hedge algorithm underlies ensemble methods in streaming settings. Regret bounds provide service-level guarantees: "our CTR predictor achieves at most X% worse performance than any fixed model in hindsight."

Data Engineers: The online learning model is the theoretical foundation for streaming data pipelines. Kafka + feature streaming + online model updates implement the online learning protocol. Regret bounds tell you how quickly the online model adapts to distribution shift - important for SLA planning. :::

Interview Questions

Q1: What is regret in online learning, and why is sublinear regret the goal?

Regret =t=1T(ht(xt),yt)minhHt=1T(h(xt),yt)= \sum_{t=1}^T \ell(h_t(x_t), y_t) - \min_{h \in \mathcal{H}} \sum_{t=1}^T \ell(h(x_t), y_t) is the difference between the learner's cumulative loss and the loss of the best fixed hypothesis chosen with full hindsight. Sublinear regret means RegretT=o(T)\text{Regret}_T = o(T), so per-round regret 0\to 0: the learner's average per-round performance approaches the best fixed strategy's average performance. This is the right notion of learning in adversarial or non-stationary settings - no distributional assumption is required, and the algorithm provably "learns" in the sense that it eventually performs as well as the best strategy in the hypothesis class.

Q2: State the Perceptron Convergence Theorem and explain each component.

If data is linearly separable with margin γ\gamma (there exists w=1\|\mathbf{w}^*\| = 1 with yt(w)xtγy_t(\mathbf{w}^*)^\top x_t \geq \gamma for all tt) and data radius R=maxtxtR = \max_t\|x_t\|, then the Perceptron makes at most (R/γ)2(R/\gamma)^2 mistakes. The components: RR is the data radius (larger radius = harder geometry); γ\gamma is the margin (smaller margin = data points closer to the decision boundary = harder). The bound (R/γ)2(R/\gamma)^2 is tight - adversarial examples can force exactly this many mistakes. The theorem shows the Perceptron is equivalent to an SVM in the mistake-bound model: large-margin data requires few mistakes to learn. In practice: after (R/γ)2(R/\gamma)^2 mistakes, the Perceptron has converged to a separator; subsequent examples in the separable case cause zero mistakes.

Q3: Why does Follow-The-Leader fail but Follow-The-Regularised-Leader succeeds?

FTL fails because its solutions can be unstable: if the optimal cumulative-loss hypothesis changes drastically with each new example, FTL switches predictions rapidly and incurs linear regret. Example: predict +1+1 or 1-1 on a sequence where the adversary always chooses the opposite of FTL's next prediction - FTL flips every round for linear regret. FTRL adds a regulariser 1ηΩ(h)\frac{1}{\eta}\Omega(h) that penalises large changes from the origin. This stabilises the solution: even if the new loss function strongly prefers a different hypothesis, the regulariser prevents the solution from jumping too far. The resulting solutions change slowly and smoothly, giving the "be stable, don't change too much each round" property needed for sublinear regret. Technically: FTRL achieves low regret because the regulariser bounds the "stability" of consecutive solutions.

Q4: How does online gradient descent relate to batch SGD in neural network training?

Online gradient descent is: at each round, receive loss t\ell_t, compute gradient gt=wt(wt)g_t = \nabla_\mathbf{w} \ell_t(\mathbf{w}_t), update wt+1=wtηtgt\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t g_t. SGD for neural networks is identical: process one minibatch (or one example), compute the loss gradient, update weights. The regret bound for OGD with a convex Lipschitz loss is O(T)O(\sqrt{T}), meaning average regret O(1/T)0O(1/\sqrt{T}) \to 0. This is the theoretical guarantee that SGD for convex problems converges at rate O(1/T)O(1/\sqrt{T}). For non-convex neural networks, the regret framework is modified: we instead bound convergence to first-order stationary points (gradient norm 0\to 0). AdaGrad uses FTRL-Proximal with adaptive diagonal preconditioning, which can achieve O(T)O(\sqrt{T}) regret even when the effective learning rate varies across coordinates - this is particularly useful for sparse gradients (NLP tasks).

Q5: Explain the Hedge algorithm and give a real ML example where it applies.

Hedge maintains a distribution over NN experts, updating weights multiplicatively: pt+1(i)pt(i)eηt(i)p_{t+1}(i) \propto p_t(i) \cdot e^{-\eta \ell_t^{(i)}}. Experts with low loss retain high weight; experts with high loss are exponentially down-weighted. With optimal η=8lnN/T\eta = \sqrt{8\ln N / T}, regret is O(TlnN)O(\sqrt{T\ln N}). Real ML example: online hyperparameter selection. During training, maintain NN candidate learning rate schedules (experts). At each epoch, compute validation loss for each schedule. Upweight schedules with low validation loss using Hedge. The final ensemble of parameter settings achieves near-best-single-schedule performance. This is used in practice in neural architecture search and automated ML. Another example: online ensemble of base models in streaming prediction - each base model (random forest, GBM, neural net) is an expert; Hedge weights them adaptively based on recent prediction errors, without requiring knowledge of which will be best in advance.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Online Learning demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.