Online Learning Theory

Reading time: ~25 minutes | Level: Statistical Learning Theory → Production ML Systems

The Streaming Prediction Problem

You're running an ad auction system at a major platform. 10 million ad auctions happen per second. For each auction, you must predict the click-through rate (CTR) of each ad - immediately, in under 5ms - before collecting the feedback on whether the user actually clicked.

You cannot batch up data and retrain weekly. The world changes too fast: advertisers launch new campaigns, trends shift, user behavior evolves. You need a model that:

Makes a prediction for each incoming auction
Receives feedback (click or no-click) immediately after
Updates its parameters instantly
Adapts to distribution shift without explicit retraining

This is the online learning model - fundamentally different from batch learning. The goal is not to minimize test error (you don't know the test distribution), but to minimize regret: how much worse you did compared to the best fixed strategy you could have chosen in hindsight.

What You Will Learn

The online learning protocol: how regret is defined and why sublinear regret is the right goal
The Perceptron algorithm: the original online learner, with the convergence theorem and proof
Follow-The-Leader and its failure mode on adversarial sequences
Follow-The-Regularised-Leader: the fix that achieves $O(\sqrt{T})$ regret
The Hedge algorithm: online learning over $N$ experts with $O(\sqrt{T \ln N})$ regret
How online learning theory connects to SGD, AdaGrad, and streaming ML systems
Five interview questions with full worked answers

The Online Learning Protocol

The online learning game between a learner and an adversary proceeds as follows:

For each round $t = 1, 2, \ldots, T$ :

Learner chooses hypothesis $h_t \in \mathcal{H}$ (or distribution over $\mathcal{H}$ )
Environment/adversary reveals example $(x_t, y_t)$
Learner makes prediction $\hat{y}_t = h_t(x_t)$
Learner suffers loss $\ell_t = \ell(h_t(x_t), y_t)$
Learner updates to $h_{t+1}$

Cumulative loss of the learner: $L_T = \sum_{t=1}^T \ell(h_t(x_t), y_t)$

Best fixed hypothesis in hindsight: $h^* = \arg\min_{h \in \mathcal{H}} \sum_{t=1}^T \ell(h(x_t), y_t)$ , with loss $L_T^* = \min_{h \in \mathcal{H}} \sum_t \ell(h(x_t), y_t)$

Regret: The difference between the learner's cumulative loss and the best fixed hypothesis:

$\text{Regret}_T = L_T - L_T^* = \sum_{t=1}^T \ell(h_t(x_t), y_t) - \min_{h \in \mathcal{H}} \sum_{t=1}^T \ell(h(x_t), y_t)$

Goal: Achieve sublinear regret - $\text{Regret}_T = o(T)$ - meaning the per-round regret goes to zero:

$\frac{\text{Regret}_T}{T} \to 0 \text{ as } T \to \infty$

A no-regret algorithm performs as well as the best fixed strategy in hindsight, on average.

:::note Why Regret, Not Test Error? In online learning, there's no fixed test distribution to evaluate against. The adversary can be adaptive - choosing examples that are hard for your current model. Regret is a comparison to the best fixed strategy in hindsight, without requiring any distributional assumptions. Sublinear regret means: even against an adaptive adversary, your algorithm learns to perform well. :::

The Perceptron: The Original Online Learner

The Perceptron algorithm (Rosenblatt, 1958) is the first online learning algorithm. It's also the foundation of modern neural networks.

Algorithm: For binary classification ( $y \in \{-1, +1\}$ ):

Initialize: $\mathbf{w}_1 = \mathbf{0}$
At round $t$ : predict $\hat{y}_t = \text{sign}(\mathbf{w}_t^\top x_t)$
If $\hat{y}_t \neq y_t$ (mistake): update $\mathbf{w}_{t+1} = \mathbf{w}_t + y_t x_t$
If $\hat{y}_t = y_t$ (correct): $\mathbf{w}_{t+1} = \mathbf{w}_t$ (no update)

Perceptron Convergence Theorem (Novikoff, 1962):

Suppose the data is linearly separable with margin $\gamma$ : there exists $\mathbf{w}^*$ with $\|\mathbf{w}^*\| = 1$ such that $y_t (\mathbf{w}^*)^\top x_t \geq \gamma$ for all $t$ , and $\|x_t\| \leq R$ for all $t$ .

Then the Perceptron makes at most $M \leq (R/\gamma)^2$ mistakes.

Proof sketch: Track two quantities:

$\mathbf{w}_t \cdot \mathbf{w}^*$ grows by $\gamma$ per mistake (each update improves alignment with $\mathbf{w}^*$ )
$\|\mathbf{w}_t\|^2$ grows by at most $R^2$ per mistake

After $M$ mistakes: $\mathbf{w}_{M+1} \cdot \mathbf{w}^* \geq M\gamma$ and $\|\mathbf{w}_{M+1}\| \leq R\sqrt{M}$ .

By Cauchy-Schwarz: $M\gamma \leq \mathbf{w}_{M+1} \cdot \mathbf{w}^* \leq \|\mathbf{w}_{M+1}\| \|\mathbf{w}^*\| \leq R\sqrt{M}$ .

Therefore: $M \leq (R/\gamma)^2$ .

import numpy as np
import matplotlib.pyplot as plt

class Perceptron:
    """
    Online Perceptron algorithm.
    Provably converges in at most (R/gamma)^2 mistakes
    if data is linearly separable with margin gamma.
    """
    def __init__(self, d):
        self.w = np.zeros(d)
        self.n_mistakes = 0
        self.n_updates = 0

    def predict(self, x):
        return np.sign(self.w @ x)

    def update(self, x, y):
        """Online update: only update on mistakes."""
        y_hat = self.predict(x)
        if y_hat != y:
            self.w = self.w + y * x
            self.n_mistakes += 1
            return True  # mistake
        return False  # correct

    def run(self, X, y):
        """Run perceptron over a sequence of examples."""
        mistake_history = []
        for xi, yi in zip(X, y):
            made_mistake = self.update(xi, yi)
            mistake_history.append(made_mistake)
        return mistake_history

# Generate linearly separable data with known margin
def generate_separable_data(n, d, margin, noise=0.0):
    np.random.seed(42)
    w_true = np.random.randn(d)
    w_true = w_true / np.linalg.norm(w_true)  # unit norm

    X = np.random.randn(n, d)
    scores = X @ w_true

    # Push points to have margin gamma
    labels = np.sign(scores)
    # Only keep points with |score| > margin (guaranteed separability)
    mask = np.abs(scores) > margin
    X, labels, scores = X[mask], labels[mask], scores[mask]

    if noise > 0:
        flip_idx = np.random.choice(len(labels), int(noise * len(labels)), replace=False)
        labels[flip_idx] *= -1

    return X, labels, w_true

# Test Perceptron with different margins
d = 10
for margin in [0.1, 0.5, 1.0]:
    X, y, w_true = generate_separable_data(n=2000, d=d, margin=margin)
    R = np.max(np.linalg.norm(X, axis=1))

    perceptron = Perceptron(d)
    mistakes = perceptron.run(X, y)

    theoretical_bound = (R / margin)**2
    print(f"Margin γ={margin:.1f}: actual mistakes={perceptron.n_mistakes}, "
          f"bound=(R/γ)²={theoretical_bound:.1f}, "
          f"n_examples={len(X)}")

Follow-The-Leader (FTL)

The simplest online learning algorithm: at each round, choose the hypothesis that would have been best on all past data.

$h_{t+1} = \arg\min_{h \in \mathcal{H}} \sum_{s=1}^t \ell_s(h)$

When FTL works: For strongly convex losses (e.g., squared loss), FTL achieves regret $O(\log T)$ .

When FTL fails: For adversarial sequences with non-strongly-convex losses. Example: predict 1 or -1. The adversary always labels opposite to your current prediction. FTL switches every round, achieving linear regret - as bad as possible.

Follow-The-Regularised-Leader (FTRL)

The fix: add a regularisation term to stabilise the solution:

$h_{t+1} = \arg\min_{h \in \mathcal{H}} \left[\sum_{s=1}^t \ell_s(h) + \frac{1}{\eta}\Omega(h)\right]$

where $\Omega(h)$ is a regulariser (e.g., $\Omega(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_2^2$ ) and $\eta > 0$ is the learning rate.

FTRL achieves regret $O(\sqrt{T})$ for convex losses.

The connection to online gradient descent (OGD): OGD is equivalent to FTRL with a quadratic regulariser and linear (gradient) approximation of the loss.

class FTRL:
    """
    Follow-The-Regularised-Leader for online linear regression.
    Equivalent to online gradient descent with squared regulariser.
    """
    def __init__(self, d, eta=0.01, lambda_reg=0.1):
        self.w = np.zeros(d)
        self.eta = eta
        self.lambda_reg = lambda_reg
        self.cumulative_grad = np.zeros(d)
        self.t = 0

    def predict(self, x):
        return self.w @ x

    def update(self, x, y):
        """FTRL-Proximal update for squared loss."""
        self.t += 1
        y_hat = self.predict(x)
        grad = (y_hat - y) * x  # gradient of squared loss
        self.cumulative_grad += grad

        # FTRL closed form: minimize cumulative loss + regulariser
        self.w = -self.eta * self.cumulative_grad / (1 + self.lambda_reg * self.t)
        return (y_hat - y)**2


class OnlineGradientDescent:
    """
    Online gradient descent (equivalent to FTRL with quadratic regulariser).
    """
    def __init__(self, d, eta=0.01):
        self.w = np.zeros(d)
        self.eta = eta

    def predict(self, x):
        return self.w @ x

    def update(self, x, y):
        y_hat = self.predict(x)
        grad = (y_hat - y) * x
        self.w = self.w - self.eta * grad
        return (y_hat - y)**2

# Online regression simulation
np.random.seed(42)
T = 10000
d = 5
w_true = np.random.randn(d)

ftrl = FTRL(d, eta=0.01, lambda_reg=0.01)
ogd = OnlineGradientDescent(d, eta=0.005)

ftrl_losses = []
ogd_losses = []

for t in range(T):
    x = np.random.randn(d)
    y = w_true @ x + np.random.randn() * 0.1

    ftrl_losses.append(ftrl.update(x, y))
    ogd_losses.append(ogd.update(x, y))

# Regret = cumulative loss - best fixed hypothesis loss
best_loss = 0.01  # noise variance (achievable by best fixed w*)
ftrl_regret = np.cumsum(ftrl_losses) - np.arange(1, T+1) * best_loss
ogd_regret = np.cumsum(ogd_losses) - np.arange(1, T+1) * best_loss

print(f"FTRL total regret: {ftrl_regret[-1]:.2f}")
print(f"OGD total regret:  {ogd_regret[-1]:.2f}")
print(f"FTRL average regret per round: {ftrl_regret[-1]/T:.4f}")
print(f"OGD average regret per round:  {ogd_regret[-1]/T:.4f}")
print(f"Theory predicts O(sqrt(T)/T) = O(1/sqrt(T)) = {1/np.sqrt(T):.4f}")

Hedge: Online Learning Over Experts

The expert setting: rather than choosing a single hypothesis, choose a distribution over $N$ expert predictions.

Hedge algorithm (multiplicative weights update):

Initialize: uniform distribution $p_1 = (1/N, \ldots, 1/N)$ over $N$ experts
At round $t$ : predict $\hat{y}_t = \sum_i p_t(i) \hat{y}_t^{(i)}$ (weighted average)
Observe losses $\ell_t^{(1)}, \ldots, \ell_t^{(N)}$ for each expert
Update weights: $p_{t+1}(i) \propto p_t(i) \cdot e^{-\eta \ell_t^{(i)}}$ (reduce weight of experts that made large mistakes)
Renormalize: $p_{t+1}(i) = \frac{p_t(i) e^{-\eta \ell_t^{(i)}}}{\sum_j p_t(j) e^{-\eta \ell_t^{(j)}}}$

Regret bound: With learning rate $\eta = \sqrt{8\ln N / T}$ , Hedge achieves:

$\text{Regret}_T \leq \sqrt{\frac{T \ln N}{2}}$

Optimal rate: $O(\sqrt{T \ln N})$ - sublinear in $T$ , logarithmic in $N$ .

class HedgeAlgorithm:
    """
    Multiplicative weights (Hedge) for online learning over experts.
    Achieves regret O(sqrt(T log N)).
    """
    def __init__(self, n_experts, eta=None, T=None):
        self.n_experts = n_experts
        # Optimal eta: sqrt(8 * log(N) / T) if T is known
        self.eta = eta if eta else np.sqrt(8 * np.log(n_experts) / max(T or 1000, 1))
        self.weights = np.ones(n_experts) / n_experts
        self.t = 0
        self.total_regret = 0

    def predict_distribution(self):
        """Return current distribution over experts."""
        return self.weights.copy()

    def update(self, expert_losses):
        """
        Update weights based on expert losses.
        expert_losses: array of losses for each expert this round.
        """
        self.t += 1
        # Predicted loss (mixture of experts)
        predicted_loss = self.weights @ expert_losses

        # Update weights: penalise experts with high loss
        self.weights *= np.exp(-self.eta * expert_losses)
        self.weights /= self.weights.sum()  # renormalise

        # Best expert so far loss (for regret computation)
        return predicted_loss

# Simulation: N weather prediction models, adversarial temperature
np.random.seed(42)
T, N = 1000, 10

# Expert predictions: some are biased, some are random, one is near-optimal
true_temps = 10 + 5 * np.sin(np.linspace(0, 4*np.pi, T)) + np.random.randn(T)
expert_predictions = np.zeros((T, N))
for i in range(N):
    bias = np.random.uniform(-5, 5)
    noise = np.random.uniform(0.5, 3.0)
    expert_predictions[:, i] = true_temps + bias + np.random.randn(T) * noise

# One expert is nearly perfect
expert_predictions[:, 5] = true_temps + np.random.randn(T) * 0.5

hedge = HedgeAlgorithm(n_experts=N, T=T)
hedge_losses = []
best_expert_losses = [[] for _ in range(N)]

for t in range(T):
    dist = hedge.predict_distribution()
    mixture_pred = dist @ expert_predictions[t]
    expert_losses = (expert_predictions[t] - true_temps[t])**2  # squared loss

    hedge_loss = (mixture_pred - true_temps[t])**2
    hedge_losses.append(hedge_loss)
    for i in range(N): best_expert_losses[i].append(expert_losses[i])

    hedge.update(expert_losses)

# Compute regret
total_hedge_loss = sum(hedge_losses)
total_expert_losses = [sum(losses) for losses in best_expert_losses]
best_expert_idx = np.argmin(total_expert_losses)
best_expert_loss = total_expert_losses[best_expert_idx]

regret = total_hedge_loss - best_expert_loss
theoretical_bound = np.sqrt(T * np.log(N) / 2)

print(f"Hedge: total loss = {total_hedge_loss:.2f}")
print(f"Best expert (#{best_expert_idx}): total loss = {best_expert_loss:.2f}")
print(f"Regret = {regret:.2f}")
print(f"Theoretical bound = O(sqrt(T*log(N))) = {theoretical_bound:.2f}")
print(f"Average regret per round = {regret/T:.4f}")

Connection to ML Practice

Online Learning Concept	ML Practice Equivalent
Online gradient descent	Stochastic gradient descent (SGD)
FTRL-Proximal	AdaGrad, Adam (adaptive learning rate)
Hedge algorithm	Gradient boosting (weights on training examples)
Expert setting	Ensemble methods, model selection
Regret bound $O(\sqrt{T})$	Why SGD converges: $O(1/\sqrt{T})$ average regret
No-regret algorithm	Convergence guarantee for SGD-based optimizers
Online-to-batch conversion	Training on streaming data with theoretical guarantees

AdaGrad (Duchi et al., 2011) is directly derived from FTRL: it uses a diagonal adaptive learning rate that accumulates gradient magnitudes. This gives $O(\sqrt{T})$ regret, matching the FTRL bound, but with learning rates automatically scaled to each feature's gradient history.

Online Learning in Production: Key Design Decisions

When building a streaming ML system, online learning theory guides several decisions:

1. Regret vs. Accuracy: For non-stationary distributions (user preferences shift, concept drift), regret is the right objective - it measures performance relative to the best adaptive strategy. For stable distributions, batch learning with periodic retraining may be sufficient.

2. Learning rate schedule: The $O(\sqrt{T})$ regret bound uses $\eta_t = O(1/\sqrt{t})$ learning rates. Fixed learning rates give $O(T)$ regret (linear, unacceptable). Adaptive rates (AdaGrad, Adam) achieve the optimal bound with feature-specific scaling.

3. Expert ensemble size: The Hedge bound $O(\sqrt{T \ln N})$ shows you can maintain $N$ models with only logarithmic cost in $N$ . Maintaining 100 candidate model configurations costs only $\sqrt{\ln 100} \approx 2.1\times$ more than 1 configuration - compelling argument for online model selection.

4. Distribution shift handling: Online learning with a sliding window (forget old data) achieves $O(T^{2/3})$ adaptive regret against shifting targets - useful for seasonal patterns, breaking news, or user preference drift.

:::note Role-specific relevance ML Engineers: SGD's convergence guarantees come directly from online learning theory. The $O(1/\sqrt{T})$ convergence rate for SGD on convex losses IS the $O(\sqrt{T})$ regret bound divided by $T$ . When you tune learning rates, you're implicitly doing what FTRL suggests: balance accumulated gradient history against stability.

Research Engineers / Scientists: Online learning is foundational for understanding modern adaptive optimizers. Adam's moment estimates approximate the FTRL-Proximal solution with a diagonal adaptive matrix. The PAC-online connection (converting online algorithms to batch algorithms via online-to-batch conversion) appears in NeurIPS papers on learning theory.

AI Engineers / Production Systems: Ad auction CTR prediction, fraud detection, personalization - any system that must make predictions on a stream of events with immediate feedback uses online learning principles. The Hedge algorithm underlies ensemble methods in streaming settings. Regret bounds provide service-level guarantees: "our CTR predictor achieves at most X% worse performance than any fixed model in hindsight."

Data Engineers: The online learning model is the theoretical foundation for streaming data pipelines. Kafka + feature streaming + online model updates implement the online learning protocol. Regret bounds tell you how quickly the online model adapts to distribution shift - important for SLA planning. :::

Interview Questions

Q1: What is regret in online learning, and why is sublinear regret the goal?

Regret $= \sum_{t=1}^T \ell(h_t(x_t), y_t) - \min_{h \in \mathcal{H}} \sum_{t=1}^T \ell(h(x_t), y_t)$ is the difference between the learner's cumulative loss and the loss of the best fixed hypothesis chosen with full hindsight. Sublinear regret means $\text{Regret}_T = o(T)$ , so per-round regret $\to 0$ : the learner's average per-round performance approaches the best fixed strategy's average performance. This is the right notion of learning in adversarial or non-stationary settings - no distributional assumption is required, and the algorithm provably "learns" in the sense that it eventually performs as well as the best strategy in the hypothesis class.

Q2: State the Perceptron Convergence Theorem and explain each component.

If data is linearly separable with margin $\gamma$ (there exists $\|\mathbf{w}^*\| = 1$ with $y_t(\mathbf{w}^*)^\top x_t \geq \gamma$ for all $t$ ) and data radius $R = \max_t\|x_t\|$ , then the Perceptron makes at most $(R/\gamma)^2$ mistakes. The components: $R$ is the data radius (larger radius = harder geometry); $\gamma$ is the margin (smaller margin = data points closer to the decision boundary = harder). The bound $(R/\gamma)^2$ is tight - adversarial examples can force exactly this many mistakes. The theorem shows the Perceptron is equivalent to an SVM in the mistake-bound model: large-margin data requires few mistakes to learn. In practice: after $(R/\gamma)^2$ mistakes, the Perceptron has converged to a separator; subsequent examples in the separable case cause zero mistakes.

Q3: Why does Follow-The-Leader fail but Follow-The-Regularised-Leader succeeds?

FTL fails because its solutions can be unstable: if the optimal cumulative-loss hypothesis changes drastically with each new example, FTL switches predictions rapidly and incurs linear regret. Example: predict $+1$ or $-1$ on a sequence where the adversary always chooses the opposite of FTL's next prediction - FTL flips every round for linear regret. FTRL adds a regulariser $\frac{1}{\eta}\Omega(h)$ that penalises large changes from the origin. This stabilises the solution: even if the new loss function strongly prefers a different hypothesis, the regulariser prevents the solution from jumping too far. The resulting solutions change slowly and smoothly, giving the "be stable, don't change too much each round" property needed for sublinear regret. Technically: FTRL achieves low regret because the regulariser bounds the "stability" of consecutive solutions.

Q4: How does online gradient descent relate to batch SGD in neural network training?

Online gradient descent is: at each round, receive loss $\ell_t$ , compute gradient $g_t = \nabla_\mathbf{w} \ell_t(\mathbf{w}_t)$ , update $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t g_t$ . SGD for neural networks is identical: process one minibatch (or one example), compute the loss gradient, update weights. The regret bound for OGD with a convex Lipschitz loss is $O(\sqrt{T})$ , meaning average regret $O(1/\sqrt{T}) \to 0$ . This is the theoretical guarantee that SGD for convex problems converges at rate $O(1/\sqrt{T})$ . For non-convex neural networks, the regret framework is modified: we instead bound convergence to first-order stationary points (gradient norm $\to 0$ ). AdaGrad uses FTRL-Proximal with adaptive diagonal preconditioning, which can achieve $O(\sqrt{T})$ regret even when the effective learning rate varies across coordinates - this is particularly useful for sparse gradients (NLP tasks).

Q5: Explain the Hedge algorithm and give a real ML example where it applies.

Hedge maintains a distribution over $N$ experts, updating weights multiplicatively: $p_{t+1}(i) \propto p_t(i) \cdot e^{-\eta \ell_t^{(i)}}$ . Experts with low loss retain high weight; experts with high loss are exponentially down-weighted. With optimal $\eta = \sqrt{8\ln N / T}$ , regret is $O(\sqrt{T\ln N})$ . Real ML example: online hyperparameter selection. During training, maintain $N$ candidate learning rate schedules (experts). At each epoch, compute validation loss for each schedule. Upweight schedules with low validation loss using Hedge. The final ensemble of parameter settings achieves near-best-single-schedule performance. This is used in practice in neural architecture search and automated ML. Another example: online ensemble of base models in streaming prediction - each base model (random forest, GBM, neural net) is an expert; Hedge weights them adaptively based on recent prediction errors, without requiring knowledge of which will be best in advance.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Online Learning demo on the EngineersOfAI Playground - no code required.

:::

The Streaming Prediction Problem​

What You Will Learn​

The Online Learning Protocol​

The Perceptron: The Original Online Learner​

Follow-The-Leader (FTL)​

Follow-The-Regularised-Leader (FTRL)​

Hedge: Online Learning Over Experts​

Connection to ML Practice​

Online Learning in Production: Key Design Decisions​

Interview Questions​