Bayesian Updating

The Stream Never Stops

You're running a fraud detection system. 50,000 transactions arrive every minute. You can't store them all and retrain from scratch. You need a model that updates its beliefs in real time, incorporating each new transaction as it arrives.

Or you're tracking a drone with a radar system. Every 100ms you get a noisy position measurement. You need to maintain a running estimate of the drone's position and velocity, combining the physics of motion (your prior prediction) with the noisy sensor readings (the likelihood).

Or you're running an online recommendation system. Each user interaction - click, skip, dwell time - updates your belief about user preferences. You don't batch up interactions and retrain weekly; you update continuously.

These scenarios share a common structure: beliefs must be updated sequentially as new data arrives. Bayesian updating is the principled mathematical framework for exactly this problem.

The Sequential Updating Principle

The fundamental insight of Bayesian updating: today's posterior is tomorrow's prior.

Given data arriving in sequence $x_1, x_2, \ldots, x_n$, assuming conditional independence (each $x_i$ is independent given $\theta$):

$$P(\theta \mid x_1, \ldots, x_n) = P(\theta \mid x_1, \ldots, x_{n-1}) \cdot \frac{P(x_n \mid \theta)}{P(x_n \mid x_1, \ldots, x_{n-1})}$$

In practice, we use the proportionality form and normalize after:

$$P(\theta \mid x_1, \ldots, x_n) \propto P(\theta \mid x_1, \ldots, x_{n-1}) \cdot P(x_n \mid \theta)$$

Key property: Sequential updating gives exactly the same result as batch updating. The order doesn't matter. This is because:

$$P(\theta \mid x_1, \ldots, x_n) \propto P(x_1, \ldots, x_n \mid \theta)\, P(\theta) = \left[\prod_{i=1}^n P(x_i \mid \theta)\right] P(\theta)$$

Whether you multiply all likelihoods at once (batch) or one at a time (sequential), you get the same product.
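As a quick numerical check of this equivalence, the sketch below evaluates a Bernoulli posterior on a grid three ways: one observation at a time, all at once, and one at a time in shuffled order. The grid, the flat prior, the simulated data, and the helper name `likelihood` are illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.3, size=50)       # simulated 0/1 stream (assumption)

theta = np.linspace(0.001, 0.999, 999)     # grid of candidate theta values
prior = np.full(theta.size, 1.0 / theta.size)  # flat prior on the grid

def likelihood(x, theta):
    """Bernoulli likelihood P(x | theta) for x in {0, 1}."""
    return theta**x * (1 - theta)**(1 - x)

def sequential_posterior(observations):
    """Fold in one observation at a time, renormalizing after each step."""
    post = prior.copy()
    for x in observations:
        post = post * likelihood(x, theta)
        post /= post.sum()
    return post

# Sequential: yesterday's posterior is today's prior.
seq_post = sequential_posterior(data)

# Batch: multiply every likelihood factor at once, normalize once.
batch_post = prior * np.prod(likelihood(data[:, None], theta[None, :]), axis=0)
batch_post /= batch_post.sum()

# Sequential again, with the observations in a different order.
shuffled_post = sequential_posterior(rng.permutation(data))

print("sequential == batch:   ", np.allclose(seq_post, batch_post))
print("sequential == shuffled:", np.allclose(seq_post, shuffled_post))
```

All three posteriors agree up to floating-point rounding, because each is the same product of likelihood factors times the prior.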

Beta-Bernoulli: Sequential Updating in Action

The Beta-Bernoulli model is the perfect illustration because updates are so simple.

Model:

  • Prior: $\theta \sim \text{Beta}(\alpha_0, \beta_0)$
  • Likelihood: $x_i \mid \theta \sim \text{Bernoulli}(\theta)$
  • Update rule: $\text{Beta}(\alpha, \beta) \xrightarrow{x=1} \text{Beta}(\alpha+1, \beta)$ or $\text{Beta}(\alpha, \beta) \xrightarrow{x=0} \text{Beta}(\alpha, \beta+1)$

The parameters $\alpha$ and $\beta$ literally count the successes and failures you've seen, including the pseudo-counts from the prior.

import numpy as np
from scipy import stats

class BetaBernoulliStream:
    """
    Online Bayesian updating for a Bernoulli parameter.
    Processes data one observation at a time.
    """
    def __init__(self, alpha0=1.0, beta0=1.0):
        self.alpha = alpha0
        self.beta = beta0
        self.n_obs = 0
        # Each entry: (alpha, beta, observation that produced this state)
        self.history = [(alpha0, beta0, None)]

    def update(self, x: int):
        """Update belief with a single Bernoulli observation (0 or 1)."""
        assert x in (0, 1), "Observation must be 0 or 1"
        self.alpha += x
        self.beta += (1 - x)
        self.n_obs += 1
        self.history.append((self.alpha, self.beta, x))
        return self

    @property
    def posterior_mean(self):
        return self.alpha / (self.alpha + self.beta)

    @property
    def posterior_std(self):
        a, b = self.alpha, self.beta
        return np.sqrt(a * b / ((a + b)**2 * (a + b + 1)))

    @property
    def credible_interval_95(self):
        return stats.beta(self.alpha, self.beta).ppf([0.025, 0.975])

    def summary(self):
        ci = self.credible_interval_95
        return (f"n={self.n_obs}, alpha={self.alpha:.1f}, beta={self.beta:.1f}, "
                f"mean={self.posterior_mean:.4f}, "
                f"95% CI=[{ci[0]:.4f}, {ci[1]:.4f}]")

# Simulate fraud detection stream
np.random.seed(42)
true_fraud_rate = 0.08  # 8% of transactions are fraudulent

# Weak prior: Beta(2, 23) has mean 2/25 = 8%
stream = BetaBernoulliStream(alpha0=2.0, beta0=23.0)
print(f"Prior: {stream.summary()}")
print()

# Process transactions one at a time.
# NumPy has no np.random.bernoulli; a Bernoulli draw is a binomial with n=1.
transactions = np.random.binomial(1, true_fraud_rate, size=1000)
checkpoints = [1, 5, 20, 100, 500, 1000]

for i, x in enumerate(transactions, 1):
    stream.update(x)
    if i in checkpoints:
        print(f"After {i:4d} transactions: {stream.summary()}")

# Key insight: notice how the 95% CI shrinks as more data arrives
# and how the posterior mean converges to the true rate (0.08)

What to observe: The posterior mean starts near the prior mean, then gradually converges toward the true rate as more data arrives. The credible interval width shrinks as $O(1/\sqrt{n})$. The prior's influence fades as $n$ grows, because its 25 pseudo-counts ($\alpha_0 + \beta_0$) are swamped by the observed counts.
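To make the $1/\sqrt{n}$ scaling concrete, here is a small check that reuses the Beta(2, 23) prior from the fraud example. It assumes idealized data in which exactly 8% of observations are successes (no sampling noise), so that only the effect of $n$ is visible; the sample sizes are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

alpha0, beta0 = 2.0, 23.0     # prior from the fraud example
rate = 0.08                   # assumed fraction of successes in the data

widths = {}
for n in (100, 400, 1600, 6400):
    successes = round(rate * n)          # idealized counts, no sampling noise
    posterior = stats.beta(alpha0 + successes, beta0 + n - successes)
    lo, hi = posterior.ppf([0.025, 0.975])
    widths[n] = hi - lo
    print(f"n={n:5d}  95% CI width={widths[n]:.4f}  "
          f"width*sqrt(n)={widths[n] * np.sqrt(n):.3f}")

# Quadrupling n should roughly halve the width if width ~ 1/sqrt(n).
ratio = widths[1600] / widths[6400]
print(f"width(1600) / width(6400) = {ratio:.3f}")
```

The ratio is close to 2 at large $n$; at small $n$ the prior's pseudo-counts still contribute noticeably, so the scaling is only approximate there.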

Visualizing the Evolution of Beliefs

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def plot_belief_evolution():
    """Show how posterior evolves over sequential observations."""
    np.random.seed(123)
    true_p = 0.3
    # NumPy has no np.random.bernoulli; use a binomial with n=1.
    observations = np.random.binomial(1, true_p, size=100)

    # Starting prior
    alpha0, beta0 = 2, 2
    theta_grid = np.linspace(0, 1, 500)

    fig, axes = plt.subplots(2, 3, figsize=(14, 8))
    axes = axes.flatten()

    checkpoints = [0, 2, 5, 15, 50, 100]
    for idx, n in enumerate(checkpoints):
        ax = axes[idx]
        # Posterior after n observations: h successes, t failures
        h = observations[:n].sum()
        t = n - h
        alpha_n = alpha0 + h
        beta_n = beta0 + t
        posterior = stats.beta(alpha_n, beta_n)

        ax.plot(theta_grid, posterior.pdf(theta_grid), 'b-', linewidth=2)
        ax.axvline(true_p, color='red', linestyle='--', label=f'True p={true_p}', linewidth=2)
        ax.axvline(posterior.mean(), color='blue', linestyle=':', label=f'Post. mean={posterior.mean():.3f}')
        # Shade the central 95% credible interval
        ci = posterior.ppf([0.025, 0.975])
        ax.fill_between(theta_grid, posterior.pdf(theta_grid),
                        where=(theta_grid >= ci[0]) & (theta_grid <= ci[1]),
                        alpha=0.3, color='blue', label='95% CI')
        ax.set_title(f'n={n}: Beta({alpha_n:.0f},{beta_n:.0f})')
        ax.set_xlabel('θ')
        ax.set_ylabel('Posterior density')
        ax.legend(fontsize=7)
        ax.set_xlim(0, 1)

    plt.suptitle('Sequential Bayesian Updating: Belief Evolution Over Time', fontsize=13)
    plt.tight_layout()
    plt.savefig('bayesian_sequential_updates.png', dpi=150)
    print("Plot saved.")

plot_belief_evolution()

The Forgetting Problem: Concept Drift

Standard Bayesian updating assumes the true parameter $\theta$ is constant over time. In real ML systems, this assumption fails - user preferences change, fraud patterns evolve, distributions shift.

Solution: Introduce a discount factor or sliding window to allow the model to "forget" old observations:

Exponential Forgetting

Instead of accumulating all counts, discount old observations:

$$\alpha_t = \lambda \alpha_{t-1} + x_t, \quad \beta_t = \lambda \beta_{t-1} + (1 - x_t)$$

where $\lambda \in (0, 1)$ is the forgetting factor (e.g., $\lambda = 0.99$).
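One way to read the forgetting factor: under this recursion the total pseudo-count $n_t = \alpha_t + \beta_t$ obeys $n_t = \lambda n_{t-1} + 1$, whose fixed point is $1/(1-\lambda)$. The sketch below checks that numerically on a simulated stream; the starting counts, the stream's rate, and its length are arbitrary illustrative choices.

```python
import numpy as np

lam = 0.99
alpha, beta = 1.0, 1.0                       # illustrative starting pseudo-counts
rng = np.random.default_rng(0)

for x in rng.binomial(1, 0.1, size=2000):    # simulated 0/1 stream
    alpha = lam * alpha + x
    beta = lam * beta + (1 - x)

total = alpha + beta                          # effective number of observations
print(f"alpha + beta after 2000 steps: {total:.4f}")
print(f"1 / (1 - lambda):              {1 / (1 - lam):.4f}")
```

With $\lambda = 0.99$ the posterior therefore behaves as if it had seen roughly the last 100 observations, regardless of how long the stream has been running.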

import numpy as np

class AdaptiveBetaBernoulliStream:
    """
    Beta-Bernoulli with exponential forgetting for non-stationary data.
    lambda_factor: discount factor (0.99 = remember ~100 recent observations)
    """
    def __init__(self, alpha0=1.0, beta0=1.0, lambda_factor=0.99):
        self.alpha = alpha0
        self.beta = beta0
        self.lambda_factor = lambda_factor
        self.n_effective = 0.0

    def update(self, x: int):
        # Discount old observations (and the prior), then add the new one
        self.alpha = self.lambda_factor * self.alpha + x
        self.beta = self.lambda_factor * self.beta + (1 - x)
        # Effective sample size: 1 + lambda + lambda^2 + ..., which
        # converges to 1 / (1 - lambda)
        self.n_effective = self.lambda_factor * self.n_effective + 1
        return self

    @property
    def posterior_mean(self):
        return self.alpha / (self.alpha + self.beta)

# Simulate non-stationary fraud: rate changes from 5% to 20% at t=500
# (np.random.binomial with n=1 draws Bernoulli samples)
np.random.seed(42)
obs1 = np.random.binomial(1, 0.05, size=500)
obs2 = np.random.binomial(1, 0.20, size=500)
all_obs = np.concatenate([obs1, obs2])

# Compare static vs adaptive
# (BetaBernoulliStream is the class defined in the Beta-Bernoulli section)
static_stream = BetaBernoulliStream(alpha0=2, beta0=38)
adaptive_stream = AdaptiveBetaBernoulliStream(alpha0=2, beta0=38, lambda_factor=0.99)

static_means = []
adaptive_means = []

for x in all_obs:
    static_stream.update(x)
    adaptive_stream.update(x)
    static_means.append(static_stream.posterior_mean)
    adaptive_means.append(adaptive_stream.posterior_mean)

print(f"Static model at t=999: mean = {static_means[-1]:.4f}")
print(f"Adaptive model at t=999: mean = {adaptive_means[-1]:.4f}")
print("True rate at t=999: 0.20")
# Adaptive model tracks concept drift; static model is slow to adapt

The Kalman Filter as Bayesian Updating

The Kalman filter is one of the most practically important examples of sequential Bayesian updating. It is used in GPS, robotics, drone control, financial trading systems, and self-driving vehicles.

The State Space Model

$$\text{State transition:} \quad \mathbf{x}_t = \mathbf{F}\mathbf{x}_{t-1} + \mathbf{w}_t, \quad \mathbf{w}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{Q})$$

$$\text{Observation:} \quad \mathbf{z}_t = \mathbf{H}\mathbf{x}_t + \mathbf{v}_t, \quad \mathbf{v}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{R})$$

  • $\mathbf{x}_t$: hidden state (e.g., position + velocity)
  • $\mathbf{F}$: state transition matrix (physics model)
  • $\mathbf{Q}$: process noise covariance
  • $\mathbf{z}_t$: noisy measurement
  • $\mathbf{H}$: observation matrix
  • $\mathbf{R}$: measurement noise covariance

The Kalman Filter as Sequential Bayesian Update

The Kalman filter maintains a Gaussian posterior over the state:

$$P(\mathbf{x}_t \mid \mathbf{z}_{1:t}) = \mathcal{N}(\mathbf{x}_t \mid \boldsymbol{\mu}_t, \boldsymbol{\Sigma}_t)$$

At each time step, it performs two operations:

1. Predict (prior update using dynamics model):

$$\boldsymbol{\mu}_{t|t-1} = \mathbf{F}\boldsymbol{\mu}_{t-1}$$

$$\boldsymbol{\Sigma}_{t|t-1} = \mathbf{F}\boldsymbol{\Sigma}_{t-1}\mathbf{F}^\top + \mathbf{Q}$$

2. Update (posterior update using new measurement):

$$\mathbf{K}_t = \boldsymbol{\Sigma}_{t|t-1}\mathbf{H}^\top (\mathbf{H}\boldsymbol{\Sigma}_{t|t-1}\mathbf{H}^\top + \mathbf{R})^{-1} \quad \text{(Kalman gain)}$$

$$\boldsymbol{\mu}_t = \boldsymbol{\mu}_{t|t-1} + \mathbf{K}_t(\mathbf{z}_t - \mathbf{H}\boldsymbol{\mu}_{t|t-1})$$

$$\boldsymbol{\Sigma}_t = (\mathbf{I} - \mathbf{K}_t\mathbf{H})\boldsymbol{\Sigma}_{t|t-1}$$

The Kalman gain $\mathbf{K}_t$ is the Bayesian weight: it's large when measurement noise ($\mathbf{R}$) is small relative to state uncertainty ($\boldsymbol{\Sigma}_{t|t-1}$), meaning we trust the measurement. It's small when the measurement is noisy, meaning we trust the dynamics model.
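The weighting is easiest to see in the scalar case, where the state is observed directly ($\mathbf{H} = 1$) and the gain reduces to prior variance divided by total variance. The sketch below is a hypothetical helper (`scalar_kalman_update` is not part of any library) applied to made-up numbers: the same prior and measurement, once with a precise sensor and once with a noisy one.

```python
def scalar_kalman_update(mu_prior, var_prior, z, var_meas):
    """One Kalman update for a scalar state observed directly (H = 1)."""
    gain = var_prior / (var_prior + var_meas)     # Kalman gain, between 0 and 1
    mu_post = mu_prior + gain * (z - mu_prior)    # move toward the measurement
    var_post = (1.0 - gain) * var_prior           # posterior variance never grows
    return mu_post, var_post, gain

# Prior N(0, 4); measurement z = 10 from a precise sensor (variance 1)
mu_a, var_a, gain_a = scalar_kalman_update(0.0, 4.0, 10.0, 1.0)
# Same prior and measurement, but a noisy sensor (variance 100)
mu_b, var_b, gain_b = scalar_kalman_update(0.0, 4.0, 10.0, 100.0)

print(f"precise sensor: gain={gain_a:.3f}, mean={mu_a:.3f}, var={var_a:.3f}")
print(f"noisy sensor:   gain={gain_b:.3f}, mean={mu_b:.3f}, var={var_b:.3f}")
```

With the precise sensor the gain is 0.8 and the estimate moves most of the way to the measurement; with the noisy sensor the gain is about 0.04 and the estimate barely moves. In both cases the posterior precision equals the sum of the prior and measurement precisions.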

import numpy as np

class KalmanFilter1D:
    """
    1D Kalman filter for tracking a moving object.
    State: [position, velocity]
    """
    def __init__(self, dt=1.0, process_noise=0.1, measurement_noise=2.0):
        self.dt = dt
        # State transition: constant velocity model
        self.F = np.array([[1, dt], [0, 1]])
        # Observation: we observe position only
        self.H = np.array([[1, 0]])
        # Process noise covariance
        self.Q = process_noise * np.array([[dt**4/4, dt**3/2],
                                           [dt**3/2, dt**2]])
        # Measurement noise covariance (measurement_noise is a standard deviation)
        self.R = np.array([[measurement_noise**2]])
        # Initial state: position=0, velocity=0
        self.mu = np.array([0.0, 0.0])
        # Initial uncertainty: high uncertainty
        self.Sigma = np.eye(2) * 100.0

    def predict(self):
        """Predict next state using dynamics model (prior step)."""
        self.mu = self.F @ self.mu
        self.Sigma = self.F @ self.Sigma @ self.F.T + self.Q
        return self

    def update(self, z):
        """Update state estimate using new measurement (likelihood step)."""
        z = np.array([[z]])
        # Innovation: difference between measurement and prediction
        innovation = z - self.H @ self.mu.reshape(-1, 1)
        # Innovation covariance
        S = self.H @ self.Sigma @ self.H.T + self.R
        # Kalman gain (Bayesian posterior weight)
        K = self.Sigma @ self.H.T @ np.linalg.inv(S)
        # Update mean and covariance
        self.mu = self.mu + (K @ innovation).flatten()
        self.Sigma = (np.eye(2) - K @ self.H) @ self.Sigma
        return self

# Simulate: object moving at constant velocity v=1.0 m/s
# with noisy position measurements
np.random.seed(42)
true_velocity = 1.0
true_positions = np.arange(0, 50, true_velocity)
measurements = true_positions + np.random.normal(0, 2.0, len(true_positions))

kf = KalmanFilter1D(dt=1.0, process_noise=0.1, measurement_noise=2.0)
estimated_positions = []
uncertainties = []

for z in measurements:
    kf.predict()
    kf.update(z)
    estimated_positions.append(kf.mu[0])
    uncertainties.append(np.sqrt(kf.Sigma[0, 0]))

print(f"True final position: {true_positions[-1]:.1f}")
print(f"Kalman estimated final position: {estimated_positions[-1]:.3f}")
print(f"Final position uncertainty (1-sigma): {uncertainties[-1]:.4f}")
print(f"Final measurement: {measurements[-1]:.3f}")
print()
print("The Kalman estimate is smoother than raw measurements")
print("and the uncertainty quickly converges to a steady-state value")

:::note The Bayesian Interpretation of Kalman Gain
The Kalman gain $\mathbf{K}_t = \boldsymbol{\Sigma}_{t|t-1}\mathbf{H}^\top (\mathbf{H}\boldsymbol{\Sigma}_{t|t-1}\mathbf{H}^\top + \mathbf{R})^{-1}$ is exactly the Gaussian posterior update formula. When $\mathbf{R} \to 0$ (perfect measurement), $\mathbf{H}\mathbf{K}_t \to \mathbf{I}$, so the updated estimate reproduces the measurement exactly and we trust the measurement completely (if $\mathbf{H}$ is square and invertible, this means $\mathbf{K}_t \to \mathbf{H}^{-1}$). When $\mathbf{R} \to \infty$ (terrible measurement), $\mathbf{K}_t \to \mathbf{0}$, and we ignore the measurement. This is Bayesian uncertainty weighting: trust each source in proportion to its precision.
:::

Online Learning Connection

Bayesian sequential updating is the theoretical foundation for online learning algorithms. The connection:

| Bayesian Concept | Online Learning Equivalent |
| --- | --- |
| Prior $P(\theta)$ | Regularization / initial hypothesis |
| Posterior update $P(\theta \mid x_t)$ | Model update from single example |
| Posterior mean | Current best parameter estimate |
| Posterior variance | Uncertainty / exploration bonus |
| Conjugate prior update | Closed-form online update rule |
| Sequential Bayesian = Batch Bayesian | Equivalent to full-dataset training |

Gaussian process online regression is a direct extension of sequential Bayesian updating to function spaces - after each observation, the GP posterior is updated exactly, giving a running estimate of the unknown function.

Summary: When to Use Bayesian Sequential Updating

| Scenario | Recommended Approach |
| --- | --- |
| Bernoulli/binomial stream (CTR, fraud rates) | Beta-Bernoulli with conjugate updates |
| Gaussian stream (sensor readings, latency) | Gaussian-Gaussian conjugate updates |
| Linear dynamical system (tracking, control) | Kalman filter |
| Nonlinear dynamics, continuous state | Extended Kalman Filter (EKF) or particle filter |
| Non-stationary stream (concept drift) | Exponential forgetting + conjugate updates |
| Complex posterior (neural network weights) | Online variational inference (VOGN) |
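The Gaussian-Gaussian row is the one conjugate pair in this table without an example earlier in the section, so here is a minimal sketch. It assumes the observation variance is known and only the mean is being learned; the class name `GaussianMeanStream`, the prior, and the simulated sensor readings are all illustrative assumptions.

```python
import numpy as np

class GaussianMeanStream:
    """Online posterior over an unknown mean, with known observation variance."""
    def __init__(self, mu0=0.0, var0=100.0, obs_var=1.0):
        self.mu, self.var, self.obs_var = mu0, var0, obs_var

    def update(self, x):
        # Precisions (inverse variances) add; the means combine precision-weighted.
        post_precision = 1.0 / self.var + 1.0 / self.obs_var
        self.mu = (self.mu / self.var + x / self.obs_var) / post_precision
        self.var = 1.0 / post_precision
        return self

rng = np.random.default_rng(0)
readings = rng.normal(loc=20.0, scale=2.0, size=200)   # simulated sensor stream

stream = GaussianMeanStream(mu0=0.0, var0=100.0, obs_var=4.0)
for x in readings:
    stream.update(x)

# Batch posterior from the closed-form conjugate formula, for comparison
batch_precision = 1.0 / 100.0 + len(readings) / 4.0
batch_mu = (0.0 / 100.0 + readings.sum() / 4.0) / batch_precision

print(f"sequential: mean={stream.mu:.4f}, var={stream.var:.5f}")
print(f"batch:      mean={batch_mu:.4f}, var={1.0 / batch_precision:.5f}")
```

As with the Beta-Bernoulli case, the sequential and batch posteriors coincide; only two numbers (mean and variance) need to be stored between observations.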

Interview Questions

Q1: Why is sequential Bayesian updating equivalent to batch Bayesian updating?

Because of conditional independence. Given $\theta$, observations $x_1, \ldots, x_n$ are independent, so $P(x_1, \ldots, x_n \mid \theta) = \prod_i P(x_i \mid \theta)$. Whether you compute $\prod_i P(x_i \mid \theta)$ all at once or one factor at a time doesn't change the product. Formally: starting from prior $P(\theta)$, updating with $x_1$ gives $P(\theta \mid x_1) \propto P(x_1 \mid \theta)P(\theta)$. Updating this with $x_2$: $P(\theta \mid x_1, x_2) \propto P(x_2 \mid \theta)P(\theta \mid x_1) \propto P(x_2 \mid \theta)P(x_1 \mid \theta)P(\theta) = P(x_1, x_2 \mid \theta)P(\theta)$. This equals the batch posterior. The order of observations doesn't matter either.

Q2: What is the Kalman gain and what does it represent geometrically?

The Kalman gain $\mathbf{K}_t = \boldsymbol{\Sigma}_{t|t-1}\mathbf{H}^\top(\mathbf{H}\boldsymbol{\Sigma}_{t|t-1}\mathbf{H}^\top + \mathbf{R})^{-1}$ is the weight assigned to the new measurement in the posterior mean update. Geometrically, it's the ratio of prior state uncertainty (projected into observation space) to total uncertainty (state uncertainty + measurement noise). When measurement noise $\mathbf{R}$ is small relative to state uncertainty $\boldsymbol{\Sigma}_{t|t-1}$, the gain is large - we update strongly toward the measurement. When state uncertainty is small (we already know the state well), the gain is small - we don't update much. It's the optimal Bayesian linear combination of prior prediction and noisy measurement.

Q3: How would you handle concept drift in a Bayesian sequential model?

Several approaches: (1) Exponential forgetting - discount old pseudo-counts by $\lambda < 1$ at each step: $\alpha_t = \lambda\alpha_{t-1} + x_t$. The effective window size is $1/(1-\lambda)$. (2) Sliding window - only count observations from the last $W$ time steps, so the posterior is the original prior updated with just those $W$ observations. (3) Change point detection - explicitly model the possibility of a parameter change using a hierarchical Bayesian model; the Bayesian Online Changepoint Detection (BOCD) algorithm maintains a posterior over when the last change point occurred. (4) Heavy-tailed priors - use Student-t instead of Gaussian process noise to be robust to occasional large jumps. In practice for ML systems, exponential forgetting is simple and effective; BOCD is best when change points are sparse and you need to detect them explicitly.
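A minimal sketch of the sliding-window option for the Beta-Bernoulli case follows. It keeps the last `window` observations in a `collections.deque` and forms the posterior from the prior plus only those observations. The class name and the test stream (all zeros followed by all ones) are hypothetical, chosen so the result is easy to verify by hand.

```python
from collections import deque

class SlidingWindowBetaBernoulli:
    """Beta posterior built from the prior plus the last `window` observations."""
    def __init__(self, alpha0=1.0, beta0=1.0, window=100):
        self.alpha0, self.beta0 = alpha0, beta0
        self.recent = deque(maxlen=window)   # oldest item is dropped automatically
        self.successes = 0                   # running count of 1s inside the window

    def update(self, x: int):
        if len(self.recent) == self.recent.maxlen:
            self.successes -= self.recent[0]  # this observation is about to fall out
        self.recent.append(x)
        self.successes += x
        return self

    @property
    def alpha(self):
        return self.alpha0 + self.successes

    @property
    def beta(self):
        return self.beta0 + len(self.recent) - self.successes

    @property
    def posterior_mean(self):
        return self.alpha / (self.alpha + self.beta)

model = SlidingWindowBetaBernoulli(alpha0=1.0, beta0=1.0, window=50)
for x in [0] * 100 + [1] * 100:   # abrupt change from all failures to all successes
    model.update(x)

print(f"alpha={model.alpha}, beta={model.beta}, mean={model.posterior_mean:.4f}")
```

After the change, the window contains only successes, so the 100 earlier failures have no influence at all. Compared with exponential forgetting this costs $O(W)$ memory, but old data is dropped completely rather than down-weighted.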

Q4: What is the Extended Kalman Filter and when would you use it?

The Kalman filter assumes both the state transition and observation model are linear. The Extended Kalman Filter (EKF) handles nonlinear dynamics by linearizing around the current estimate using a first-order Taylor expansion (Jacobian). Specifically, if $\mathbf{x}_t = f(\mathbf{x}_{t-1}) + \mathbf{w}_t$ is nonlinear, the EKF approximates it as $\mathbf{x}_t \approx f(\boldsymbol{\mu}_{t-1}) + \mathbf{F}_t(\mathbf{x}_{t-1} - \boldsymbol{\mu}_{t-1}) + \mathbf{w}_t$ where $\mathbf{F}_t = \nabla f|_{\boldsymbol{\mu}_{t-1}}$. Use the EKF for robot localization with nonlinear sensor models, vehicle tracking with nonlinear motion models, and GPS with spherical Earth geometry. The Unscented Kalman Filter (UKF) is often preferred over EKF because it doesn't require computing Jacobians and handles strong nonlinearities better.
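The linearization step can be sketched for a toy system. The example below assumes a discretized pendulum with state [angle, angular velocity] and unit gravity-to-length ratio; the dynamics `f`, its hand-derived Jacobian `f_jacobian`, the time step, and the covariances are all illustrative assumptions. It shows only the EKF predict step: the mean goes through the nonlinear function, while the covariance goes through the Jacobian evaluated at the current mean.

```python
import numpy as np

DT = 0.1  # time step (assumed)

def f(x):
    """Nonlinear transition for a toy pendulum: x = [angle, angular velocity]."""
    angle, omega = x
    return np.array([angle + DT * omega, omega - DT * np.sin(angle)])

def f_jacobian(x):
    """Jacobian of f with respect to the state, evaluated at x."""
    angle, _ = x
    return np.array([[1.0, DT],
                     [-DT * np.cos(angle), 1.0]])

def ekf_predict(mu, Sigma, Q):
    """EKF predict: nonlinear mean propagation, linearized covariance propagation."""
    F = f_jacobian(mu)
    mu_pred = f(mu)
    Sigma_pred = F @ Sigma @ F.T + Q
    return mu_pred, Sigma_pred

mu = np.array([0.5, 0.0])
Sigma = np.eye(2) * 0.1
Q = np.eye(2) * 0.01
mu_pred, Sigma_pred = ekf_predict(mu, Sigma, Q)

# Sanity check: compare the analytic Jacobian with central finite differences.
eps = 1e-6
F_numeric = np.column_stack(
    [(f(mu + eps * e) - f(mu - eps * e)) / (2 * eps) for e in np.eye(2)]
)

print("predicted mean:", mu_pred)
print("Jacobian matches finite differences:",
      np.allclose(F_numeric, f_jacobian(mu), atol=1e-6))
```

The update step is handled the same way when the observation function is nonlinear: the measurement prediction uses the nonlinear function and the gain uses its Jacobian in place of $\mathbf{H}$.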

Q5: How does the Beta-Bernoulli model connect to Thompson Sampling for recommendation?

Thompson Sampling is a Bayesian bandit algorithm for exploration vs exploitation. For each item/arm with unknown click-through rate $\theta_k$, maintain a Beta posterior $\text{Beta}(\alpha_k, \beta_k)$. At each decision: (1) sample one value $\hat{\theta}_k \sim \text{Beta}(\alpha_k, \beta_k)$ for each arm; (2) choose the arm with the highest sampled value; (3) observe reward (click or no-click); (4) update: $\alpha_k \mathrel{+}= 1$ if clicked, $\beta_k \mathrel{+}= 1$ if not. This is pure Bayesian sequential updating applied to multi-armed bandits. For Bernoulli rewards, Thompson Sampling is asymptotically optimal (its regret matches the Lai-Robbins lower bound), and it naturally handles exploration - arms with high uncertainty (wide Beta distributions) still get chosen from time to time because a wide distribution occasionally produces a high sample. Netflix, LinkedIn, and Pinterest use variants of this approach for real-time recommendation.
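The four steps above can be sketched in a few lines. This is a minimal simulation under stated assumptions: three arms with made-up click-through rates, uniform Beta(1, 1) priors, and simulated clicks; the function name `thompson_sampling` is hypothetical.

```python
import numpy as np

def thompson_sampling(true_rates, n_rounds=5000, seed=0):
    """Simulate Beta-Bernoulli Thompson Sampling on arms with the given rates."""
    rng = np.random.default_rng(seed)
    k = len(true_rates)
    alpha = np.ones(k)              # Beta(1, 1) prior for every arm
    beta = np.ones(k)
    pulls = np.zeros(k, dtype=int)

    for _ in range(n_rounds):
        samples = rng.beta(alpha, beta)                # (1) one draw per arm
        arm = int(np.argmax(samples))                  # (2) pick the highest draw
        reward = int(rng.random() < true_rates[arm])   # (3) simulated click
        alpha[arm] += reward                           # (4) conjugate update
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return alpha, beta, pulls

true_rates = [0.05, 0.10, 0.30]     # hypothetical click-through rates
alpha, beta, pulls = thompson_sampling(true_rates)

for k, rate in enumerate(true_rates):
    mean = alpha[k] / (alpha[k] + beta[k])
    print(f"arm {k}: true={rate:.2f}, pulls={pulls[k]:5d}, posterior mean={mean:.3f}")
```

Most pulls go to the best arm, while the weaker arms receive only enough traffic to establish that they are worse. Their posteriors stay wide, which is acceptable because the algorithm only needs to rank the arms, not estimate every rate precisely.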

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Prior to Posterior demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.