Bayesian Updating

The Stream Never Stops

You're running a fraud detection system. 50,000 transactions arrive every minute. You can't store them all and retrain from scratch. You need a model that updates its beliefs in real time, incorporating each new transaction as it arrives.

Or you're tracking a drone with a radar system. Every 100ms you get a noisy position measurement. You need to maintain a running estimate of the drone's position and velocity, combining the physics of motion (your prior prediction) with the noisy sensor readings (the likelihood).

Or you're running an online recommendation system. Each user interaction - click, skip, dwell time - updates your belief about user preferences. You don't batch up interactions and retrain weekly; you update continuously.

These scenarios share a common structure: beliefs must be updated sequentially as new data arrives. Bayesian updating is the principled mathematical framework for exactly this problem.

The Sequential Updating Principle

The fundamental insight of Bayesian updating: today's posterior is tomorrow's prior.

Given data arriving in sequence $x_1, x_2, \ldots, x_n$ , assuming conditional independence (each $x_i$ is independent given $\theta$ ):

$P(\theta \mid x_1, \ldots, x_n) = P(\theta \mid x_1, \ldots, x_{n-1}) \cdot \frac{P(x_n \mid \theta)}{P(x_n \mid x_1, \ldots, x_{n-1})}$

In practice, we use the proportionality form and normalize after:

$P(\theta \mid x_1, \ldots, x_n) \propto P(\theta \mid x_1, \ldots, x_{n-1}) \cdot P(x_n \mid \theta)$

Key property: under the conditional independence assumption, sequential updating gives exactly the same posterior as batch updating, and the order of the observations doesn't matter. This is because:

$P(\theta \mid x_1, \ldots, x_n) \propto P(x_1, \ldots, x_n \mid \theta) P(\theta) = \left[\prod_{i=1}^n P(x_i \mid \theta)\right] P(\theta)$

Whether you multiply all likelihoods at once (batch) or one at a time (sequential), you get the same product.
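The equivalence is easy to check numerically. The following is a minimal sketch, assuming a Bernoulli likelihood and a flat prior on a discrete grid of $\theta$ values (the grid, the seed, and the helper name `likelihood` are illustrative choices, not part of the original text). It compares one-at-a-time updating, a single batch update, and one-at-a-time updating in shuffled order.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.3, size=50)          # 50 Bernoulli(0.3) draws

theta = np.linspace(0.001, 0.999, 999)        # grid over the parameter
prior = np.ones_like(theta) / theta.size      # flat prior on the grid

def likelihood(x, theta):
    """Bernoulli likelihood of a single observation at every grid point."""
    return theta if x == 1 else 1.0 - theta

def sequential_posterior(observations):
    """Update one observation at a time; each posterior becomes the next prior."""
    belief = prior.copy()
    for x in observations:
        belief = belief * likelihood(x, theta)
        belief /= belief.sum()                # normalize after every step
    return belief

# Batch: multiply all likelihood terms at once, normalize once
heads = data.sum()
tails = data.size - heads
batch = prior * theta**heads * (1.0 - theta)**tails
batch /= batch.sum()

seq = sequential_posterior(data)
shuffled = sequential_posterior(rng.permutation(data))

print("sequential == batch:   ", np.allclose(seq, batch))
print("order does not matter: ", np.allclose(seq, shuffled))
```

Both comparisons should print `True`, up to floating-point rounding. Normalizing after every step only rescales the belief, so the final shape is unchanged.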

Beta-Bernoulli: Sequential Updating in Action

The Beta-Bernoulli model is the perfect illustration because updates are so simple.

Model:

Prior: $\theta \sim \text{Beta}(\alpha_0, \beta_0)$
Likelihood: $x_i \mid \theta \sim \text{Bernoulli}(\theta)$
Update rule: $\text{Beta}(\alpha, \beta) \xrightarrow{x=1} \text{Beta}(\alpha+1, \beta)$ or $\xrightarrow{x=0} \text{Beta}(\alpha, \beta+1)$

The parameters $\alpha$ and $\beta$ literally count the successes and failures you've seen, including the pseudo-counts from the prior.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

class BetaBernoulliStream:
    """
    Online Bayesian updating for a Bernoulli parameter.
    Processes data one observation at a time.
    """
    def __init__(self, alpha0=1.0, beta0=1.0):
        self.alpha = alpha0
        self.beta = beta0
        self.n_obs = 0
        self.history = [(alpha0, beta0, None)]

    def update(self, x: int):
        """Update belief with a single Bernoulli observation (0 or 1)."""
        assert x in (0, 1), "Observation must be 0 or 1"
        self.alpha += x
        self.beta += (1 - x)
        self.n_obs += 1
        self.history.append((self.alpha, self.beta, x))
        return self

    @property
    def posterior_mean(self):
        return self.alpha / (self.alpha + self.beta)

    @property
    def posterior_std(self):
        a, b = self.alpha, self.beta
        return np.sqrt(a * b / ((a + b)**2 * (a + b + 1)))

    @property
    def credible_interval_95(self):
        return stats.beta(self.alpha, self.beta).ppf([0.025, 0.975])

    def summary(self):
        ci = self.credible_interval_95
        return (f"n={self.n_obs}, alpha={self.alpha:.1f}, beta={self.beta:.1f}, "
                f"mean={self.posterior_mean:.4f}, "
                f"95% CI=[{ci[0]:.4f}, {ci[1]:.4f}]")

# Simulate fraud detection stream
np.random.seed(42)
true_fraud_rate = 0.08  # 8% of transactions are fraudulent

# Start with weak prior: Beta(2, 23) ~ roughly 8% rate
stream = BetaBernoulliStream(alpha0=2.0, beta0=23.0)
print(f"Prior: {stream.summary()}")
print()

# Process transactions one at a time
# NumPy has no bernoulli() sampler; a Binomial with n=1 is a Bernoulli draw
transactions = np.random.binomial(1, true_fraud_rate, size=1000)
checkpoints = [1, 5, 20, 100, 500, 1000]

for i, x in enumerate(transactions, 1):
    stream.update(x)
    if i in checkpoints:
        print(f"After {i:4d} transactions: {stream.summary()}")

# Key insight: notice how the 95% CI shrinks as more data arrives
# and how the posterior mean converges to the true rate (0.08)

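What to observe: The posterior mean starts at the prior mean (2/25 = 0.08, which in this simulation happens to equal the true rate), fluctuates as the first few observations arrive, and settles near the true rate as data accumulates. The credible interval width shrinks as $O(1/\sqrt{n})$ . The prior's influence fades at large $n$ , once the observed counts dwarf its 25 pseudo-counts.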
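To see the $1/\sqrt{n}$ scaling without simulation noise, here is a small sketch that assumes the observed fraud fraction is exactly 8% at every sample size (an idealization) and reuses the Beta(2, 23) prior from above. If the width really scales as $1/\sqrt{n}$ , quadrupling $n$ should roughly halve the interval, and `width * sqrt(n)` should level off.

```python
import numpy as np
from scipy import stats

alpha0, beta0 = 2.0, 23.0     # same prior as the fraud example
rate = 0.08                   # assumed observed fraction of fraudulent transactions

widths = {}
for n in [100, 400, 1600, 6400]:
    k = rate * n                                  # idealized success count
    posterior = stats.beta(alpha0 + k, beta0 + n - k)
    lo, hi = posterior.ppf([0.025, 0.975])        # equal-tailed 95% interval
    widths[n] = hi - lo
    print(f"n={n:5d}  width={widths[n]:.4f}  "
          f"width*sqrt(n)={widths[n] * np.sqrt(n):.3f}")
```

At small $n$ the prior's 25 pseudo-counts still contribute noticeably, so the halving is only approximate there.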

Visualizing the Evolution of Beliefs

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def plot_belief_evolution():
    """Show how posterior evolves over sequential observations."""
    np.random.seed(123)
    true_p = 0.3
    # NumPy has no bernoulli() sampler; a Binomial with n=1 is a Bernoulli draw
    observations = np.random.binomial(1, true_p, size=100)

    # Starting prior
    alpha0, beta0 = 2, 2
    theta_grid = np.linspace(0, 1, 500)

    fig, axes = plt.subplots(2, 3, figsize=(14, 8))
    axes = axes.flatten()

    checkpoints = [0, 2, 5, 15, 50, 100]
    for idx, n in enumerate(checkpoints):
        ax = axes[idx]
        # Posterior after n observations
        h = observations[:n].sum()
        t = n - h
        alpha_n = alpha0 + h
        beta_n = beta0 + t
        posterior = stats.beta(alpha_n, beta_n)

        ax.plot(theta_grid, posterior.pdf(theta_grid), 'b-', linewidth=2)
        ax.axvline(true_p, color='red', linestyle='--', label=f'True p={true_p}', linewidth=2)
        ax.axvline(posterior.mean(), color='blue', linestyle=':', label=f'Post. mean={posterior.mean():.3f}')
        ci = posterior.ppf([0.025, 0.975])
        ax.fill_between(theta_grid, posterior.pdf(theta_grid),
                       where=(theta_grid >= ci[0]) & (theta_grid <= ci[1]),
                       alpha=0.3, color='blue', label='95% CI')
        ax.set_title(f'n={n}: Beta({alpha_n:.0f},{beta_n:.0f})')
        ax.set_xlabel('θ')
        ax.set_ylabel('Posterior density')
        ax.legend(fontsize=7)
        ax.set_xlim(0, 1)

    plt.suptitle('Sequential Bayesian Updating: Belief Evolution Over Time', fontsize=13)
    plt.tight_layout()
    plt.savefig('bayesian_sequential_updates.png', dpi=150)
    print("Plot saved.")

plot_belief_evolution()

The Forgetting Problem: Concept Drift

Standard Bayesian updating assumes the true parameter $\theta$ is constant over time. In real ML systems, this assumption fails - user preferences change, fraud patterns evolve, distributions shift.

Solution: Introduce a discount factor or sliding window to allow the model to "forget" old observations:

Exponential Forgetting

Instead of accumulating all counts, discount old observations:

$\alpha_t = \lambda \alpha_{t-1} + x_t, \quad \beta_t = \lambda \beta_{t-1} + (1 - x_t)$

where $\lambda \in (0, 1)$ is the forgetting factor (e.g., $\lambda = 0.99$ ).

class AdaptiveBetaBernoulliStream:
    """
    Beta-Bernoulli with exponential forgetting for non-stationary data.
    lambda_factor: discount factor (0.99 = remember ~100 recent observations)
    """
    def __init__(self, alpha0=1.0, beta0=1.0, lambda_factor=0.99):
        self.alpha = alpha0
        self.beta = beta0
        self.lambda_factor = lambda_factor
        self.n_effective = 0

    def update(self, x: int):
        # Discount old observations
        self.alpha = self.lambda_factor * self.alpha + x
        self.beta = self.lambda_factor * self.beta + (1 - x)
        # Effective sample size: discount the running weight, then add this
        # observation. Converges to 1 / (1 - lambda_factor) as updates accumulate.
        self.n_effective = self.lambda_factor * self.n_effective + 1
        return self

    @property
    def posterior_mean(self):
        return self.alpha / (self.alpha + self.beta)

# Simulate non-stationary fraud: rate changes from 5% to 20% at t=500
np.random.seed(42)
# NumPy has no bernoulli() sampler; a Binomial with n=1 is a Bernoulli draw
obs1 = np.random.binomial(1, 0.05, size=500)
obs2 = np.random.binomial(1, 0.20, size=500)
all_obs = np.concatenate([obs1, obs2])

# Compare static vs adaptive
static_stream = BetaBernoulliStream(alpha0=2, beta0=38)
adaptive_stream = AdaptiveBetaBernoulliStream(alpha0=2, beta0=38, lambda_factor=0.99)

static_means = []
adaptive_means = []

for x in all_obs:
    static_stream.update(x)
    adaptive_stream.update(x)
    static_means.append(static_stream.posterior_mean)
    adaptive_means.append(adaptive_stream.posterior_mean)

print(f"Static model at t=999: mean = {static_means[-1]:.4f}")
print(f"Adaptive model at t=999: mean = {adaptive_means[-1]:.4f}")
print(f"True rate at t=999: 0.20")
# Adaptive model tracks concept drift; static model is slow to adapt

The Kalman Filter as Bayesian Updating

The Kalman filter is the most practically important example of sequential Bayesian updating. It powers GPS, robotics, drone control, financial trading systems, and self-driving vehicles.

The State Space Model

$\text{State transition:} \quad \mathbf{x}_t = \mathbf{F}\mathbf{x}_{t-1} + \mathbf{w}_t, \quad \mathbf{w}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{Q})$

$\text{Observation:} \quad \mathbf{z}_t = \mathbf{H}\mathbf{x}_t + \mathbf{v}_t, \quad \mathbf{v}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{R})$

$\mathbf{x}_t$ : hidden state (e.g., position + velocity)
$\mathbf{F}$ : state transition matrix (physics model)
$\mathbf{Q}$ : process noise covariance
$\mathbf{z}_t$ : noisy measurement
$\mathbf{H}$ : observation matrix
$\mathbf{R}$ : measurement noise covariance

The Kalman Filter as Sequential Bayesian Update

The Kalman filter maintains a Gaussian posterior over the state:

$P(\mathbf{x}_t \mid \mathbf{z}_{1:t}) = \mathcal{N}(\mathbf{x}_t \mid \boldsymbol{\mu}_t, \boldsymbol{\Sigma}_t)$

At each time step, it performs two operations:

1. Predict (prior update using dynamics model):

$\boldsymbol{\mu}_{t|t-1} = \mathbf{F}\boldsymbol{\mu}_{t-1}$

$\boldsymbol{\Sigma}_{t|t-1} = \mathbf{F}\boldsymbol{\Sigma}_{t-1}\mathbf{F}^\top + \mathbf{Q}$

2. Update (posterior update using new measurement):

$\mathbf{K}_t = \boldsymbol{\Sigma}_{t|t-1}\mathbf{H}^\top (\mathbf{H}\boldsymbol{\Sigma}_{t|t-1}\mathbf{H}^\top + \mathbf{R})^{-1} \quad \text{(Kalman gain)}$

$\boldsymbol{\mu}_t = \boldsymbol{\mu}_{t|t-1} + \mathbf{K}_t(\mathbf{z}_t - \mathbf{H}\boldsymbol{\mu}_{t|t-1})$

$\boldsymbol{\Sigma}_t = (\mathbf{I} - \mathbf{K}_t\mathbf{H})\boldsymbol{\Sigma}_{t|t-1}$

The Kalman gain $\mathbf{K}_t$ is the Bayesian weight: it's large when measurement noise ( $\mathbf{R}$ ) is small relative to state uncertainty ( $\boldsymbol{\Sigma}_{t|t-1}$ ), meaning we trust the measurement. It's small when the measurement is noisy, meaning we trust the dynamics model.
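In the scalar case with $\mathbf{H} = 1$ , the gain reduces to a single number between 0 and 1, which makes the weighting explicit. The sketch below assumes that scalar setting. The function names and example numbers are illustrative. It also checks that the Kalman update matches the precision-weighted average of prediction and measurement.

```python
def scalar_gain(prior_var, meas_var):
    """Kalman gain for a scalar state observed directly (H = 1)."""
    return prior_var / (prior_var + meas_var)

def scalar_update(prior_mean, prior_var, z, meas_var):
    """One measurement update: returns posterior mean, variance, and gain."""
    k = scalar_gain(prior_var, meas_var)
    post_mean = prior_mean + k * (z - prior_mean)   # move toward the measurement
    post_var = (1.0 - k) * prior_var                # uncertainty always shrinks
    return post_mean, post_var, k

# Prediction: 10 with variance 4. Measurement: 12 with noise variance 1.
post_mean, post_var, k = scalar_update(10.0, 4.0, 12.0, 1.0)

# Same answer from the Gaussian conjugate formula (precision-weighted average)
precision = 1.0 / 4.0 + 1.0 / 1.0
weighted_mean = (10.0 / 4.0 + 12.0 / 1.0) / precision

print(f"gain={k:.3f}  mean={post_mean:.3f}  var={post_var:.3f}")
print(f"precision-weighted mean={weighted_mean:.3f}  var={1.0 / precision:.3f}")
```

Here the measurement is four times more precise than the prediction, so the gain is 0.8 and the estimate moves 80% of the way from 10 to 12, landing at 11.6 with variance 0.8.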

import numpy as np

class KalmanFilter1D:
    """
    1D Kalman filter for tracking a moving object.
    State: [position, velocity]
    """
    def __init__(self, dt=1.0, process_noise=0.1, measurement_noise=2.0):
        self.dt = dt
        # State transition: constant velocity model
        self.F = np.array([[1, dt], [0, 1]])
        # Observation: we observe position only
        self.H = np.array([[1, 0]])
        # Process noise covariance
        self.Q = process_noise * np.array([[dt**4/4, dt**3/2],
                                            [dt**3/2, dt**2]])
        # Measurement noise covariance
        self.R = np.array([[measurement_noise**2]])
        # Initial state: position=0, velocity=0
        self.mu = np.array([0.0, 0.0])
        # Initial uncertainty: high uncertainty
        self.Sigma = np.eye(2) * 100.0

    def predict(self):
        """Predict next state using dynamics model (prior step)."""
        self.mu = self.F @ self.mu
        self.Sigma = self.F @ self.Sigma @ self.F.T + self.Q
        return self

    def update(self, z):
        """Update state estimate using new measurement (likelihood step)."""
        z = np.array([[z]])
        # Innovation: difference between measurement and prediction
        innovation = z - self.H @ self.mu.reshape(-1, 1)
        # Innovation covariance
        S = self.H @ self.Sigma @ self.H.T + self.R
        # Kalman gain (Bayesian posterior weight)
        K = self.Sigma @ self.H.T @ np.linalg.inv(S)
        # Update mean and covariance
        self.mu = self.mu + (K @ innovation).flatten()
        self.Sigma = (np.eye(2) - K @ self.H) @ self.Sigma
        return self

# Simulate: object moving at constant velocity v=1.0 m/s
# with noisy position measurements
np.random.seed(42)
true_velocity = 1.0
true_positions = np.arange(0, 50, true_velocity)
measurements = true_positions + np.random.normal(0, 2.0, len(true_positions))

kf = KalmanFilter1D(dt=1.0, process_noise=0.1, measurement_noise=2.0)
estimated_positions = []
uncertainties = []

for z in measurements:
    kf.predict()
    kf.update(z)
    estimated_positions.append(kf.mu[0])
    uncertainties.append(np.sqrt(kf.Sigma[0, 0]))

print(f"True final position: {true_positions[-1]:.1f}")
print(f"Kalman estimated final position: {estimated_positions[-1]:.3f}")
print(f"Final position uncertainty (1-sigma): {uncertainties[-1]:.4f}")
print(f"Final measurement: {measurements[-1]:.3f}")
print()
print("The Kalman estimate is smoother than raw measurements")
print("and the uncertainty quickly converges to a steady-state value")

:::note The Bayesian Interpretation of Kalman Gain

The Kalman gain $\mathbf{K}_t = \boldsymbol{\Sigma}_{t|t-1}\mathbf{H}^\top (\mathbf{H}\boldsymbol{\Sigma}_{t|t-1}\mathbf{H}^\top + \mathbf{R})^{-1}$ is exactly the Gaussian posterior update formula. When $\mathbf{R} \to 0$ (perfect measurement), $\mathbf{H}\mathbf{K}_t \to \mathbf{I}$ provided $\mathbf{H}\boldsymbol{\Sigma}_{t|t-1}\mathbf{H}^\top$ is invertible (and $\mathbf{K}_t \to \mathbf{H}^{-1}$ when $\mathbf{H}$ itself is square and invertible), so the updated estimate reproduces the measurement and we trust it completely. When $\mathbf{R} \to \infty$ (terrible measurement), $\mathbf{K}_t \to 0$ , and we ignore the measurement. This is Bayesian uncertainty weighting: trust each source in proportion to its precision.

:::
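These limits can be checked numerically. The sketch below assumes the position-only observation matrix $\mathbf{H} = [1, 0]$ from the tracking example and a hypothetical predicted covariance. Because this $\mathbf{H}$ is not square, the limit is easiest to read off the product $\mathbf{H}\mathbf{K}_t$ , which is the fraction of the position innovation that gets applied.

```python
import numpy as np

H = np.array([[1.0, 0.0]])                    # observe position only
Sigma_pred = np.array([[4.0, 1.0],            # hypothetical predicted covariance
                       [1.0, 2.0]])

def kalman_gain(Sigma_pred, H, R):
    """K = Sigma H^T (H Sigma H^T + R)^{-1}"""
    S = H @ Sigma_pred @ H.T + R              # innovation covariance
    return Sigma_pred @ H.T @ np.linalg.inv(S)

results = {}
for r in [1e-8, 1.0, 1e8]:
    K = kalman_gain(Sigma_pred, H, np.array([[r]]))
    results[r] = (K.ravel(), (H @ K).item())
    print(f"R={r:8.0e}  K={K.ravel()}  HK={(H @ K).item():.6f}")
```

With near-zero measurement noise, `HK` approaches 1, so the position estimate jumps to the measurement. The velocity component of the gain approaches 0.25, the ratio of the position-velocity covariance to the position variance, so the measurement also corrects the unobserved velocity through that correlation. With huge measurement noise, both components go to zero.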

Online Learning Connection

Bayesian sequential updating is the theoretical foundation for online learning algorithms. The connection:

| Bayesian Concept | Online Learning Equivalent |
| --- | --- |
| Prior $P(\theta)$ | Regularization / initial hypothesis |
| Posterior update $P(\theta \mid x_t)$ | Model update from single example |
| Posterior mean | Current best parameter estimate |
| Posterior variance | Uncertainty / exploration bonus |
| Conjugate prior update | Closed-form online update rule |
| Sequential Bayesian = Batch Bayesian | Equivalent to full-dataset training |

Gaussian process online regression is a direct extension of sequential Bayesian updating to function spaces - after each observation, the GP posterior is updated exactly, giving a running estimate of the unknown function.

Summary: When to Use Bayesian Sequential Updating

| Scenario | Recommended Approach |
| --- | --- |
| Bernoulli/binomial stream (CTR, fraud rates) | Beta-Bernoulli with conjugate updates |
| Gaussian stream (sensor readings, latency) | Gaussian-Gaussian conjugate updates |
| Linear dynamical system (tracking, control) | Kalman filter |
| Nonlinear dynamics, continuous state | Extended Kalman Filter (EKF) or particle filter |
| Non-stationary stream (concept drift) | Exponential forgetting + conjugate updates |
| Complex posterior (neural network weights) | Online variational inference (VOGN) |
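The Gaussian-Gaussian row has no worked example above, so here is a minimal sketch. It assumes the simplest version of that model, a Gaussian prior on an unknown mean with known observation noise variance. The class name `GaussianMeanStream` and all numbers are illustrative. The update adds precisions and takes a precision-weighted average of the means, the same arithmetic as the Kalman measurement update in one dimension.

```python
import numpy as np

class GaussianMeanStream:
    """Online posterior over the mean of a Gaussian with known noise variance."""
    def __init__(self, mu0=0.0, var0=100.0, noise_var=1.0):
        self.mu = mu0                 # prior / posterior mean
        self.var = var0               # prior / posterior variance of the mean
        self.noise_var = noise_var    # assumed known observation variance

    def update(self, x):
        # Precisions (inverse variances) add; means combine precision-weighted
        precision = 1.0 / self.var + 1.0 / self.noise_var
        self.mu = (self.mu / self.var + x / self.noise_var) / precision
        self.var = 1.0 / precision
        return self

rng = np.random.default_rng(1)
readings = rng.normal(loc=5.0, scale=2.0, size=200)    # simulated sensor stream

sensor = GaussianMeanStream(mu0=0.0, var0=100.0, noise_var=4.0)
for x in readings:
    sensor.update(x)

# Batch posterior from the closed form, for comparison
batch_precision = 1.0 / 100.0 + readings.size / 4.0
batch_mean = (0.0 / 100.0 + readings.sum() / 4.0) / batch_precision

print(f"sequential: mean={sensor.mu:.4f}, var={sensor.var:.5f}")
print(f"batch:      mean={batch_mean:.4f}, var={1.0 / batch_precision:.5f}")
```

The two lines should agree up to rounding, which is the sequential-equals-batch property again, now for a Gaussian stream.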

Interview Questions

Q1: Why is sequential Bayesian updating equivalent to batch Bayesian updating?

Because of conditional independence. Given $\theta$ , observations $x_1, \ldots, x_n$ are independent, so $P(x_1, \ldots, x_n|\theta) = \prod_i P(x_i|\theta)$ . Whether you compute $\prod_i P(x_i|\theta)$ all at once or one factor at a time doesn't change the product. Formally: starting from prior $P(\theta)$ , updating with $x_1$ gives $P(\theta|x_1) \propto P(x_1|\theta)P(\theta)$ . Updating this with $x_2$ : $P(\theta|x_1,x_2) \propto P(x_2|\theta)P(\theta|x_1) \propto P(x_2|\theta)P(x_1|\theta)P(\theta) = P(x_1,x_2|\theta)P(\theta)$ . This equals the batch posterior. The order of observations doesn't matter either.

Q2: What is the Kalman gain and what does it represent geometrically?

The Kalman gain $\mathbf{K}_t = \boldsymbol{\Sigma}_{t|t-1}\mathbf{H}^\top(\mathbf{H}\boldsymbol{\Sigma}_{t|t-1}\mathbf{H}^\top + \mathbf{R})^{-1}$ is the weight assigned to the new measurement in the posterior mean update. Geometrically, it's the ratio of prior state uncertainty (projected into observation space) to total uncertainty (state uncertainty + measurement noise). When measurement noise $\mathbf{R}$ is small relative to state uncertainty $\boldsymbol{\Sigma}_{t|t-1}$ , the gain is large - we update strongly toward the measurement. When state uncertainty is small (we already know the state well), the gain is small - we don't update much. It's the optimal Bayesian linear combination of prior prediction and noisy measurement.

Q3: How would you handle concept drift in a Bayesian sequential model?

Several approaches: (1) Exponential forgetting - discount old pseudo-counts by $\lambda < 1$ at each step: $\alpha_t = \lambda\alpha_{t-1} + x_t$ , and likewise for $\beta_t$ . The effective window size is $1/(1-\lambda)$ . (2) Sliding window - only count observations from the last $W$ time steps, so the posterior is the original prior updated with just those $W$ observations; old data is dropped abruptly rather than down-weighted gradually. (3) Change point detection - explicitly model the possibility of a parameter change using a hierarchical Bayesian model; the Bayesian Online Changepoint Detection (BOCD) algorithm maintains a posterior over when the last change point occurred. (4) Heavy-tailed priors - use Student-t instead of Gaussian process noise to be robust to occasional large jumps. In practice for ML systems, exponential forgetting is simple and effective; BOCD is best when change points are sparse and you need to detect them explicitly.
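Exponential forgetting is implemented earlier in this article. The sliding-window variant can be sketched in a few lines, shown here for the Beta-Bernoulli case. The class name `SlidingWindowBetaBernoulli` and the window length are illustrative assumptions, and the posterior is always the prior plus the counts currently inside the window.

```python
from collections import deque

class SlidingWindowBetaBernoulli:
    """Beta-Bernoulli posterior built from only the last `window` observations."""
    def __init__(self, alpha0=1.0, beta0=1.0, window=100):
        self.alpha0 = alpha0
        self.beta0 = beta0
        self.buffer = deque(maxlen=window)   # oldest item is evicted automatically
        self.successes = 0

    def update(self, x: int):
        if len(self.buffer) == self.buffer.maxlen:
            self.successes -= self.buffer[0]  # this value is about to be evicted
        self.buffer.append(x)
        self.successes += x
        return self

    @property
    def alpha(self):
        return self.alpha0 + self.successes

    @property
    def beta(self):
        return self.beta0 + len(self.buffer) - self.successes

    @property
    def posterior_mean(self):
        return self.alpha / (self.alpha + self.beta)

# Deterministic drift: 100 non-events followed by 100 events
model = SlidingWindowBetaBernoulli(alpha0=1.0, beta0=1.0, window=50)
for x in [0] * 100 + [1] * 100:
    model.update(x)

print(f"alpha={model.alpha}, beta={model.beta}, mean={model.posterior_mean:.4f}")
```

After the stream ends, the window contains only ones, so the posterior is Beta(51, 1). The cost of this approach is memory proportional to $W$ , whereas exponential forgetting needs only two numbers.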

Q4: What is the Extended Kalman Filter and when would you use it?

The Kalman filter assumes both the state transition and observation model are linear. The Extended Kalman Filter (EKF) handles nonlinear dynamics by linearizing around the current estimate using a first-order Taylor expansion (Jacobian). Specifically, if $\mathbf{x}_t = f(\mathbf{x}_{t-1}) + \mathbf{w}_t$ is nonlinear, the EKF approximates it as $\mathbf{x}_t \approx f(\boldsymbol{\mu}_{t-1}) + \mathbf{F}_t(\mathbf{x}_{t-1} - \boldsymbol{\mu}_{t-1}) + \mathbf{w}_t$ where $\mathbf{F}_t = \nabla f|_{\boldsymbol{\mu}_{t-1}}$ . Use the EKF for robot localization with nonlinear sensor models, vehicle tracking with nonlinear motion models, and GPS with spherical Earth geometry. The Unscented Kalman Filter (UKF) is often preferred over EKF because it doesn't require computing Jacobians and handles strong nonlinearities better.
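As a rough illustration of the linearization step, here is a sketch of a single EKF predict-and-update cycle, applied to a pendulum whose horizontal displacement is measured. The pendulum model, the Euler discretization, the noise levels, and every function name are assumptions made for this example. Nothing here comes from a particular library.

```python
import numpy as np

def ekf_step(mu, Sigma, z, f, F_jac, h, H_jac, Q, R):
    """One EKF cycle: nonlinear mean propagation, Jacobian-based covariances."""
    # Predict: push the mean through f, the covariance through the Jacobian of f
    F = F_jac(mu)
    mu_pred = f(mu)
    Sigma_pred = F @ Sigma @ F.T + Q
    # Update: linearize the measurement model around the predicted mean
    H = H_jac(mu_pred)
    S = H @ Sigma_pred @ H.T + R
    K = Sigma_pred @ H.T @ np.linalg.inv(S)
    mu_new = mu_pred + K @ (z - h(mu_pred))
    Sigma_new = (np.eye(len(mu)) - K @ H) @ Sigma_pred
    return mu_new, Sigma_new

# Pendulum state: [angle, angular velocity], Euler-discretized
dt, g_over_l = 0.05, 9.81

def f(x):
    return np.array([x[0] + x[1] * dt, x[1] - g_over_l * np.sin(x[0]) * dt])

def F_jac(x):
    return np.array([[1.0, dt], [-g_over_l * np.cos(x[0]) * dt, 1.0]])

def h(x):                        # measure horizontal displacement sin(angle)
    return np.array([np.sin(x[0])])

def H_jac(x):
    return np.array([[np.cos(x[0]), 0.0]])

Q = 1e-4 * np.eye(2)             # assumed process noise covariance
R = np.array([[0.01]])           # measurement noise variance (std 0.1)

rng = np.random.default_rng(0)
x_true = np.array([0.8, 0.0])
mu, Sigma = np.array([0.5, 0.0]), np.eye(2)   # deliberately wrong initial guess

for _ in range(200):
    x_true = f(x_true)
    z = h(x_true) + rng.normal(0.0, 0.1, size=1)
    mu, Sigma = ekf_step(mu, Sigma, z, f, F_jac, h, H_jac, Q, R)

print(f"true angle={x_true[0]:.3f}  estimated angle={mu[0]:.3f}")
```

When `f` and `h` are linear, the Jacobians are constant and `ekf_step` reduces to the ordinary Kalman predict and update equations.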

Q5: How does the Beta-Bernoulli model connect to Thompson Sampling for recommendation?

Thompson Sampling is a Bayesian bandit algorithm for exploration vs exploitation. For each item/arm with unknown click-through rate $\theta_k$ , maintain a Beta posterior $\text{Beta}(\alpha_k, \beta_k)$ . At each decision: (1) sample one value $\hat{\theta}_k \sim \text{Beta}(\alpha_k, \beta_k)$ for each arm; (2) choose the arm with the highest sampled value; (3) observe reward (click or no-click); (4) update: $\alpha_k \mathrel{+}= 1$ if clicked, $\beta_k \mathrel{+}= 1$ if not. This is pure Bayesian sequential updating applied to multi-armed bandits. For Bernoulli bandits, Thompson Sampling is asymptotically optimal (its regret matches the Lai-Robbins lower bound), and it handles exploration naturally - arms with high uncertainty (wide Beta distributions) still get chosen from time to time because wide distributions occasionally produce large samples, while arms that are confidently poor are rarely chosen. Variants of this approach are reportedly used for real-time recommendation at companies such as Netflix, LinkedIn, and Pinterest.
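The four steps translate almost line for line into code. This is a minimal sketch on simulated data: the three click-through rates, the uniform Beta(1, 1) priors, the seed, and the number of rounds are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
true_ctr = np.array([0.02, 0.05, 0.15])   # hypothetical rates, unknown to the agent
n_arms = true_ctr.size

alpha = np.ones(n_arms)                   # Beta(1, 1) prior for every arm
beta = np.ones(n_arms)
pulls = np.zeros(n_arms, dtype=int)

for _ in range(5000):
    samples = rng.beta(alpha, beta)               # (1) one posterior draw per arm
    arm = int(np.argmax(samples))                 # (2) play the best-looking arm
    reward = int(rng.random() < true_ctr[arm])    # (3) simulated click / no-click
    alpha[arm] += reward                          # (4) conjugate Beta update
    beta[arm] += 1 - reward
    pulls[arm] += 1

posterior_means = alpha / (alpha + beta)
print("pulls per arm:  ", pulls)
print("posterior means:", np.round(posterior_means, 3))
```

Most pulls should go to the arm with the highest true rate. The weaker arms get just enough traffic for their posteriors to narrow below the leader's.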

