Bayesian Updating
The Stream Never Stops
You're running a fraud detection system. 50,000 transactions arrive every minute. You can't store them all and retrain from scratch. You need a model that updates its beliefs in real time, incorporating each new transaction as it arrives.
Or you're tracking a drone with a radar system. Every 100ms you get a noisy position measurement. You need to maintain a running estimate of the drone's position and velocity, combining the physics of motion (your prior prediction) with the noisy sensor readings (the likelihood).
Or you're running an online recommendation system. Each user interaction - click, skip, dwell time - updates your belief about user preferences. You don't batch up interactions and retrain weekly; you update continuously.
These scenarios share a common structure: beliefs must be updated sequentially as new data arrives. Bayesian updating is the principled mathematical framework for exactly this problem.
The Sequential Updating Principle
The fundamental insight of Bayesian updating: today's posterior is tomorrow's prior.
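Given data arriving in sequence $x_1, x_2, \ldots, x_n$, assuming conditional independence (each $x_i$ is independent of the others given $\theta$):

$$p(\theta \mid x_{1:n}) = \frac{p(x_n \mid \theta)\, p(\theta \mid x_{1:n-1})}{p(x_n \mid x_{1:n-1})}$$

In practice, we use the proportionality form and normalize after:

$$p(\theta \mid x_{1:n}) \propto p(x_n \mid \theta)\, p(\theta \mid x_{1:n-1})$$

Key property: Under conditional independence, sequential updating gives exactly the same result as batch updating. The order doesn't matter. This is because:

$$p(\theta \mid x_{1:n}) \propto p(\theta) \prod_{i=1}^{n} p(x_i \mid \theta)$$

Whether you multiply all likelihoods at once (batch) or one at a time (sequential), you get the same product.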
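A quick numerical check makes the equivalence concrete. The sketch below is illustrative only: the grid, the uniform prior, and the simulated coin flips are assumptions chosen for the demo, and the helper names `grid_posterior` and `batch_posterior` are hypothetical. It compares one-at-a-time updating, batch updating, and updating on a shuffled copy of the same data.

```python
import random

def grid_posterior(prior, grid, observations):
    """Sequential form: multiply by one Bernoulli likelihood at a time,
    renormalizing after every observation."""
    post = list(prior)
    for x in observations:
        post = [p * (t if x == 1 else 1.0 - t) for p, t in zip(post, grid)]
        total = sum(post)
        post = [p / total for p in post]
    return post

def batch_posterior(prior, grid, observations):
    """Batch form: multiply by the full likelihood product, normalize once."""
    heads = sum(observations)
    tails = len(observations) - heads
    post = [p * t**heads * (1.0 - t)**tails for p, t in zip(prior, grid)]
    total = sum(post)
    return [p / total for p in post]

# Assumed setup: uniform prior on an interior grid, 40 simulated Bernoulli draws.
grid = [i / 100 for i in range(1, 100)]
prior = [1.0 / len(grid)] * len(grid)
rng = random.Random(0)
data = [1 if rng.random() < 0.3 else 0 for _ in range(40)]
shuffled = data[:]
rng.shuffle(shuffled)

sequential = grid_posterior(prior, grid, data)
reordered = grid_posterior(prior, grid, shuffled)
batch = batch_posterior(prior, grid, data)

max_gap_batch = max(abs(a - b) for a, b in zip(sequential, batch))
max_gap_order = max(abs(a - b) for a, b in zip(sequential, reordered))
print(f"max |sequential - batch|     = {max_gap_batch:.2e}")
print(f"max |sequential - reordered| = {max_gap_order:.2e}")
```

Both gaps should be at the level of floating-point rounding, which is what the product formula predicts.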
Beta-Bernoulli: Sequential Updating in Action
The Beta-Bernoulli model is the perfect illustration because updates are so simple.
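Model:
- Prior: $\theta \sim \text{Beta}(\alpha, \beta)$
- Likelihood: $x_i \mid \theta \sim \text{Bernoulli}(\theta)$
- Update rule: $\alpha \leftarrow \alpha + 1$ if $x_i = 1$, or $\beta \leftarrow \beta + 1$ if $x_i = 0$
The parameters $\alpha$ and $\beta$ literally count the successes and failures you've seen, including the pseudo-counts from the prior.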
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt


class BetaBernoulliStream:
    """
    Online Bayesian updating for a Bernoulli parameter.
    Processes data one observation at a time.
    """

    def __init__(self, alpha0=1.0, beta0=1.0):
        self.alpha = alpha0
        self.beta = beta0
        self.n_obs = 0
        # Each entry: (alpha, beta, observation that produced this state)
        self.history = [(alpha0, beta0, None)]

    def update(self, x: int):
        """Update belief with a single Bernoulli observation (0 or 1)."""
        assert x in (0, 1), "Observation must be 0 or 1"
        self.alpha += x
        self.beta += (1 - x)
        self.n_obs += 1
        self.history.append((self.alpha, self.beta, x))
        return self

    @property
    def posterior_mean(self):
        return self.alpha / (self.alpha + self.beta)

    @property
    def posterior_std(self):
        a, b = self.alpha, self.beta
        return np.sqrt(a * b / ((a + b)**2 * (a + b + 1)))

    @property
    def credible_interval_95(self):
        return stats.beta(self.alpha, self.beta).ppf([0.025, 0.975])

    def summary(self):
        ci = self.credible_interval_95
        return (f"n={self.n_obs}, alpha={self.alpha:.1f}, beta={self.beta:.1f}, "
                f"mean={self.posterior_mean:.4f}, "
                f"95% CI=[{ci[0]:.4f}, {ci[1]:.4f}]")


# Simulate fraud detection stream
np.random.seed(42)
true_fraud_rate = 0.08  # 8% of transactions are fraudulent

# Weakly informative prior: Beta(2, 23) has mean 2/25 = 8%
stream = BetaBernoulliStream(alpha0=2.0, beta0=23.0)
print(f"Prior: {stream.summary()}")
print()

# Process transactions one at a time.
# NumPy has no `bernoulli` sampler; a Bernoulli draw is a binomial with n=1.
transactions = np.random.binomial(1, true_fraud_rate, size=1000)
checkpoints = [1, 5, 20, 100, 500, 1000]
for i, x in enumerate(transactions, 1):
    stream.update(x)
    if i in checkpoints:
        print(f"After {i:4d} transactions: {stream.summary()}")

# Key insight: notice how the 95% CI shrinks as more data arrives
# and how the posterior mean converges to the true rate (0.08)
```
What to observe: The posterior mean starts near the prior mean, then gradually converges toward the true rate as more data arrives. The credible interval width shrinks roughly as $1/\sqrt{n}$. The prior's influence fades at large $n$, because its fixed pseudo-counts are swamped by the growing observed counts.
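Both effects can be checked with the closed-form Beta moments alone. The following is a minimal sketch under stated assumptions: the two priors and the idealized data (exactly 8% successes at every sample size) are made up for illustration, and `beta_mean_std` is a hypothetical helper.

```python
import math

def beta_mean_std(alpha, beta):
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    std = math.sqrt(alpha * beta / (total**2 * (total + 1)))
    return mean, std

# Two deliberately different priors (assumed for the demo), 100 pseudo-counts each.
priors = {"low prior Beta(1, 99)": (1.0, 99.0),
          "high prior Beta(20, 80)": (20.0, 80.0)}

gaps = {}
scaled_std = {}
for n in (100, 400, 1600, 6400, 25600):
    successes = round(0.08 * n)  # idealized data: exactly 8% successes
    failures = n - successes
    (m1, s1), (m2, s2) = [beta_mean_std(a0 + successes, b0 + failures)
                          for a0, b0 in priors.values()]
    gaps[n] = abs(m1 - m2)
    scaled_std[n] = s1 * math.sqrt(n)
    print(f"n={n:6d}  mean gap between priors={gaps[n]:.4f}  "
          f"std={s1:.5f}  std*sqrt(n)={scaled_std[n]:.3f}")
```

The gap between the two posterior means shrinks toward zero, while `std * sqrt(n)` settles near $\sqrt{0.08 \times 0.92}$, which is the $1/\sqrt{n}$ scaling in numbers.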
Visualizing the Evolution of Beliefs
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt


def plot_belief_evolution():
    """Show how posterior evolves over sequential observations."""
    np.random.seed(123)
    true_p = 0.3
    # A Bernoulli draw is a binomial with n=1 (NumPy has no `bernoulli` sampler)
    observations = np.random.binomial(1, true_p, size=100)

    # Starting prior
    alpha0, beta0 = 2, 2
    theta_grid = np.linspace(0, 1, 500)

    fig, axes = plt.subplots(2, 3, figsize=(14, 8))
    axes = axes.flatten()
    checkpoints = [0, 2, 5, 15, 50, 100]

    for idx, n in enumerate(checkpoints):
        ax = axes[idx]

        # Posterior after n observations: h successes, t failures
        h = observations[:n].sum()
        t = n - h
        alpha_n = alpha0 + h
        beta_n = beta0 + t
        posterior = stats.beta(alpha_n, beta_n)

        ax.plot(theta_grid, posterior.pdf(theta_grid), 'b-', linewidth=2)
        ax.axvline(true_p, color='red', linestyle='--',
                   label=f'True p={true_p}', linewidth=2)
        ax.axvline(posterior.mean(), color='blue', linestyle=':',
                   label=f'Post. mean={posterior.mean():.3f}')

        # Shade the central 95% credible interval
        ci = posterior.ppf([0.025, 0.975])
        ax.fill_between(theta_grid, posterior.pdf(theta_grid),
                        where=(theta_grid >= ci[0]) & (theta_grid <= ci[1]),
                        alpha=0.3, color='blue', label='95% CI')

        ax.set_title(f'n={n}: Beta({alpha_n:.0f},{beta_n:.0f})')
        ax.set_xlabel('θ')
        ax.set_ylabel('Posterior density')
        ax.legend(fontsize=7)
        ax.set_xlim(0, 1)

    plt.suptitle('Sequential Bayesian Updating: Belief Evolution Over Time', fontsize=13)
    plt.tight_layout()
    plt.savefig('bayesian_sequential_updates.png', dpi=150)
    print("Plot saved.")


plot_belief_evolution()
```
The Forgetting Problem: Concept Drift
Standard Bayesian updating assumes the true parameter is constant over time. In real ML systems, this assumption fails - user preferences change, fraud patterns evolve, distributions shift.
Solution: Introduce a discount factor or sliding window to allow the model to "forget" old observations:
Exponential Forgetting
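Instead of accumulating all counts, discount old observations:

$$\alpha_t = \lambda\, \alpha_{t-1} + x_t, \qquad \beta_t = \lambda\, \beta_{t-1} + (1 - x_t)$$

where $\lambda \in (0, 1)$ is the forgetting factor (e.g., $\lambda = 0.99$).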
```python
import numpy as np

# Reuses BetaBernoulliStream from the Beta-Bernoulli example earlier in this section.


class AdaptiveBetaBernoulliStream:
    """
    Beta-Bernoulli with exponential forgetting for non-stationary data.
    lambda_factor: discount factor (0.99 = remember ~100 recent observations)
    """

    def __init__(self, alpha0=1.0, beta0=1.0, lambda_factor=0.99):
        self.alpha = alpha0
        self.beta = beta0
        self.lambda_factor = lambda_factor
        self.n_effective = 0.0

    def update(self, x: int):
        # Discount old observations (and the prior pseudo-counts), then add the new one
        self.alpha = self.lambda_factor * self.alpha + x
        self.beta = self.lambda_factor * self.beta + (1 - x)
        # Effective sample size: after t updates this equals
        # (1 - lambda^t) / (1 - lambda), which approaches 1 / (1 - lambda)
        self.n_effective = self.lambda_factor * self.n_effective + 1
        return self

    @property
    def posterior_mean(self):
        return self.alpha / (self.alpha + self.beta)


# Simulate non-stationary fraud: rate changes from 5% to 20% at t=500
np.random.seed(42)
obs1 = np.random.binomial(1, 0.05, size=500)  # Bernoulli draws = binomial with n=1
obs2 = np.random.binomial(1, 0.20, size=500)
all_obs = np.concatenate([obs1, obs2])

# Compare static vs adaptive
static_stream = BetaBernoulliStream(alpha0=2, beta0=38)
adaptive_stream = AdaptiveBetaBernoulliStream(alpha0=2, beta0=38, lambda_factor=0.99)
static_means = []
adaptive_means = []
for x in all_obs:
    static_stream.update(x)
    adaptive_stream.update(x)
    static_means.append(static_stream.posterior_mean)
    adaptive_means.append(adaptive_stream.posterior_mean)

print(f"Static model at t=999: mean = {static_means[-1]:.4f}")
print(f"Adaptive model at t=999: mean = {adaptive_means[-1]:.4f}")
print(f"True rate at t=999: 0.20")
# Adaptive model tracks concept drift; static model is slow to adapt
```
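The sliding-window alternative mentioned at the start of this section can be sketched in the same style. This is an illustrative sketch, not a reference implementation: the class name `SlidingWindowBetaBernoulli`, the window length of 200, and the simulated drift are all assumptions. The prior pseudo-counts are kept fixed; only observations that fall out of the window are forgotten.

```python
from collections import deque
import random

class SlidingWindowBetaBernoulli:
    """Beta-Bernoulli posterior computed from only the last `window` observations.

    Hypothetical helper for illustration. The prior pseudo-counts are never
    discarded; only observations that leave the window are.
    """

    def __init__(self, alpha0=1.0, beta0=1.0, window=100):
        self.alpha0 = alpha0
        self.beta0 = beta0
        self.recent = deque(maxlen=window)  # oldest item is evicted automatically
        self.successes = 0

    def update(self, x):
        if x not in (0, 1):
            raise ValueError("Observation must be 0 or 1")
        if len(self.recent) == self.recent.maxlen:
            self.successes -= self.recent[0]  # value about to be evicted
        self.recent.append(x)
        self.successes += x
        return self

    @property
    def posterior_mean(self):
        alpha = self.alpha0 + self.successes
        beta = self.beta0 + len(self.recent) - self.successes
        return alpha / (alpha + beta)

# Assumed drift scenario: rate jumps from 5% to 20% halfway through the stream.
rng = random.Random(42)
stream = [int(rng.random() < 0.05) for _ in range(500)]
stream += [int(rng.random() < 0.20) for _ in range(500)]

model = SlidingWindowBetaBernoulli(alpha0=1.0, beta0=1.0, window=200)
for x in stream:
    model.update(x)
print(f"Sliding-window mean after drift: {model.posterior_mean:.4f} (true rate 0.20)")
```

Compared with exponential forgetting, the window forgets abruptly rather than gradually and needs memory proportional to the window length, which is the main design trade-off.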
The Kalman Filter as Bayesian Updating
The Kalman filter is the most practically important example of sequential Bayesian updating. It powers GPS, robotics, drone control, financial trading systems, and self-driving vehicles.
The State Space Model
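$$x_t = F x_{t-1} + w_t, \qquad w_t \sim \mathcal{N}(0, Q)$$

$$z_t = H x_t + v_t, \qquad v_t \sim \mathcal{N}(0, R)$$

- $x_t$: hidden state (e.g., position + velocity)
- $F$: state transition matrix (physics model)
- $Q$: process noise covariance
- $z_t$: noisy measurement
- $H$: observation matrix
- $R$: measurement noise covariance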
The Kalman Filter as Sequential Bayesian Update
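The Kalman filter maintains a Gaussian posterior over the state:

$$p(x_t \mid z_{1:t}) = \mathcal{N}(x_t;\ \mu_t, \Sigma_t)$$

At each time step, it performs two operations:
1. Predict (prior update using dynamics model):

$$\mu_{t|t-1} = F \mu_{t-1}, \qquad \Sigma_{t|t-1} = F \Sigma_{t-1} F^\top + Q$$

2. Update (posterior update using new measurement):

$$K_t = \Sigma_{t|t-1} H^\top \left(H \Sigma_{t|t-1} H^\top + R\right)^{-1}$$

$$\mu_t = \mu_{t|t-1} + K_t \left(z_t - H \mu_{t|t-1}\right), \qquad \Sigma_t = (I - K_t H)\, \Sigma_{t|t-1}$$

The Kalman gain $K_t$ is the Bayesian weight: it's large when measurement noise ($R$) is small relative to state uncertainty ($\Sigma_{t|t-1}$), meaning we trust the measurement. It's small when the measurement is noisy, meaning we trust the dynamics model.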
```python
import numpy as np


class KalmanFilter1D:
    """
    1D Kalman filter for tracking a moving object.
    State: [position, velocity]
    """

    def __init__(self, dt=1.0, process_noise=0.1, measurement_noise=2.0):
        self.dt = dt
        # State transition: constant velocity model
        self.F = np.array([[1, dt], [0, 1]])
        # Observation: we observe position only
        self.H = np.array([[1, 0]])
        # Process noise covariance (process_noise scales random-acceleration noise)
        self.Q = process_noise * np.array([[dt**4/4, dt**3/2],
                                           [dt**3/2, dt**2]])
        # Measurement noise covariance (measurement_noise is a standard deviation)
        self.R = np.array([[measurement_noise**2]])
        # Initial state: position=0, velocity=0
        self.mu = np.array([0.0, 0.0])
        # Initial uncertainty: high uncertainty
        self.Sigma = np.eye(2) * 100.0

    def predict(self):
        """Predict next state using dynamics model (prior step)."""
        self.mu = self.F @ self.mu
        self.Sigma = self.F @ self.Sigma @ self.F.T + self.Q
        return self

    def update(self, z):
        """Update state estimate using new measurement (likelihood step)."""
        z = np.array([[z]])
        # Innovation: difference between measurement and prediction
        innovation = z - self.H @ self.mu.reshape(-1, 1)
        # Innovation covariance
        S = self.H @ self.Sigma @ self.H.T + self.R
        # Kalman gain (Bayesian posterior weight)
        K = self.Sigma @ self.H.T @ np.linalg.inv(S)
        # Update mean and covariance
        self.mu = self.mu + (K @ innovation).flatten()
        self.Sigma = (np.eye(2) - K @ self.H) @ self.Sigma
        return self


# Simulate: object moving at constant velocity v=1.0 m/s
# with noisy position measurements
np.random.seed(42)
true_velocity = 1.0
true_positions = np.arange(0, 50, true_velocity)
measurements = true_positions + np.random.normal(0, 2.0, len(true_positions))

kf = KalmanFilter1D(dt=1.0, process_noise=0.1, measurement_noise=2.0)
estimated_positions = []
uncertainties = []
for z in measurements:
    kf.predict()
    kf.update(z)
    estimated_positions.append(kf.mu[0])
    uncertainties.append(np.sqrt(kf.Sigma[0, 0]))

print(f"True final position: {true_positions[-1]:.1f}")
print(f"Kalman estimated final position: {estimated_positions[-1]:.3f}")
print(f"Final position uncertainty (1-sigma): {uncertainties[-1]:.4f}")
print(f"Final measurement: {measurements[-1]:.3f}")
print()
print("The Kalman estimate is smoother than raw measurements")
print("and the uncertainty quickly converges to a steady-state value")
```
:::note The Bayesian Interpretation of Kalman Gain
The Kalman gain $K = \Sigma H^\top (H \Sigma H^\top + R)^{-1}$ is exactly the Gaussian posterior update formula. When $R \to 0$ (perfect measurement), $HK \to I$ (in the scalar case with $H = 1$, $K \to 1$), and we trust the measurement completely. When $R \to \infty$ (terrible measurement), $K \to 0$, and we ignore the measurement. This is Bayesian uncertainty weighting: trust sources in proportion to their precision.
:::
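In the scalar case the claim is easy to verify directly. The sketch below assumes a state observed directly ($H = 1$) and uses made-up numbers; the function names are hypothetical. It compares the Kalman update with the Gaussian-Gaussian conjugate posterior written in terms of precisions, across very small and very large measurement noise.

```python
def scalar_kalman_update(mu_prior, var_prior, z, var_meas):
    """One Kalman update for a scalar state observed directly (H = 1)."""
    gain = var_prior / (var_prior + var_meas)
    mu_post = mu_prior + gain * (z - mu_prior)
    var_post = (1.0 - gain) * var_prior
    return mu_post, var_post, gain

def precision_weighted_update(mu_prior, var_prior, z, var_meas):
    """Gaussian-Gaussian conjugate posterior, written with precisions (1 / variance)."""
    prec_post = 1.0 / var_prior + 1.0 / var_meas
    mu_post = (mu_prior / var_prior + z / var_meas) / prec_post
    return mu_post, 1.0 / prec_post

mu_prior, var_prior, z = 10.0, 4.0, 12.0  # illustrative numbers only
for var_meas in (1e-6, 1.0, 4.0, 1e6):
    mu_k, var_k, gain = scalar_kalman_update(mu_prior, var_prior, z, var_meas)
    mu_b, var_b = precision_weighted_update(mu_prior, var_prior, z, var_meas)
    print(f"R={var_meas:<8g} K={gain:.6f}  "
          f"Kalman mean={mu_k:.4f}  conjugate mean={mu_b:.4f}")
```

The two means agree for every `R`, and the gain moves from nearly 1 (tiny `R`) to nearly 0 (huge `R`). With equal prior and measurement variance the gain is exactly 0.5, so the posterior mean lands halfway between prediction and measurement.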
Online Learning Connection
Bayesian sequential updating is the theoretical foundation for online learning algorithms. The connection:
| Bayesian Concept | Online Learning Equivalent |
|---|---|
| Prior | Regularization / initial hypothesis |
| Posterior update | Model update from single example |
| Posterior mean | Current best parameter estimate |
| Posterior variance | Uncertainty / exploration bonus |
| Conjugate prior update | Closed-form online update rule |
| Sequential Bayesian = Batch Bayesian | Equivalent to full-dataset training |
Gaussian process online regression is a direct extension of sequential Bayesian updating to function spaces - after each observation, the GP posterior is updated exactly, giving a running estimate of the unknown function.
Summary: When to Use Bayesian Sequential Updating
| Scenario | Recommended Approach |
|---|---|
| Bernoulli/binomial stream (CTR, fraud rates) | Beta-Bernoulli with conjugate updates |
| Gaussian stream (sensor readings, latency) | Gaussian-Gaussian conjugate updates |
| Linear dynamical system (tracking, control) | Kalman filter |
| Nonlinear dynamics, continuous state | Extended Kalman Filter (EKF) or particle filter |
| Non-stationary stream (concept drift) | Exponential forgetting + conjugate updates |
| Complex posterior (neural network weights) | Online variational inference (VOGN) |
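The Gaussian-Gaussian row is the one conjugate case in the table without an example above, so here is a minimal sketch. It assumes the observation variance is known and only the mean is unknown; the class name `GaussianMeanStream`, the prior, and the simulated latency readings are illustrative assumptions.

```python
import random

class GaussianMeanStream:
    """Online posterior for an unknown mean with known observation variance."""

    def __init__(self, mu0=0.0, var0=100.0, obs_var=1.0):
        self.mu = mu0          # posterior mean of the unknown mean
        self.var = var0        # posterior variance of the unknown mean
        self.obs_var = obs_var # assumed-known noise variance of each reading

    def update(self, x):
        # Precisions add; the mean is the precision-weighted average.
        precision = 1.0 / self.var + 1.0 / self.obs_var
        self.mu = (self.mu / self.var + x / self.obs_var) / precision
        self.var = 1.0 / precision
        return self

# Assumed data: 200 latency readings around 50 ms with 5 ms noise.
rng = random.Random(7)
readings = [rng.gauss(50.0, 5.0) for _ in range(200)]

stream = GaussianMeanStream(mu0=0.0, var0=1e4, obs_var=25.0)
for x in readings:
    stream.update(x)

# Closed-form batch posterior from the same prior, for comparison.
n = len(readings)
batch_precision = 1.0 / 1e4 + n / 25.0
batch_mu = (0.0 / 1e4 + sum(readings) / 25.0) / batch_precision

print(f"sequential: mean={stream.mu:.4f}, var={stream.var:.5f}")
print(f"batch:      mean={batch_mu:.4f}, var={1.0 / batch_precision:.5f}")
```

As with the Beta-Bernoulli case, the one-at-a-time result matches the batch posterior up to floating-point rounding.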
Interview Questions
Q1: Why is sequential Bayesian updating equivalent to batch Bayesian updating?
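Because of conditional independence. Given $\theta$, observations are independent, so $p(x_{1:n} \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$. Whether you compute the product all at once or one factor at a time doesn't change it. Formally: starting from prior $p(\theta)$, updating with $x_1$ gives $p(\theta \mid x_1) \propto p(x_1 \mid \theta)\, p(\theta)$. Updating this with $x_2$: $p(\theta \mid x_1, x_2) \propto p(x_2 \mid \theta)\, p(\theta \mid x_1) \propto p(x_2 \mid \theta)\, p(x_1 \mid \theta)\, p(\theta)$. This equals the batch posterior. Because multiplication commutes, the order of observations doesn't matter either.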
Q2: What is the Kalman gain and what does it represent geometrically?
The Kalman gain $K$ is the weight assigned to the new measurement in the posterior mean update. Geometrically, it's the ratio of prior state uncertainty (projected into observation space) to total uncertainty (state uncertainty + measurement noise). When measurement noise $R$ is small relative to state uncertainty $\Sigma$, the gain is large - we update strongly toward the measurement. When state uncertainty is small (we already know the state well), the gain is small - we don't update much. It's the optimal Bayesian linear combination of prior prediction and noisy measurement.
Q3: How would you handle concept drift in a Bayesian sequential model?
Several approaches: (1) Exponential forgetting - discount old pseudo-counts by $\lambda$ at each step: $\alpha_t = \lambda\, \alpha_{t-1} + x_t$ and $\beta_t = \lambda\, \beta_{t-1} + (1 - x_t)$. The effective window size is $1/(1-\lambda)$. (2) Sliding window - only count observations from the last $W$ time steps, so the posterior is the original prior updated with just the most recent $W$ observations. (3) Change point detection - explicitly model the possibility of a parameter change using a hierarchical Bayesian model; the Bayesian Online Changepoint Detection (BOCD) algorithm maintains a posterior over when the last change point occurred. (4) Heavy-tailed priors - use Student-t instead of Gaussian process noise to be robust to occasional large jumps. In practice for ML systems, exponential forgetting is simple and effective; BOCD is best when change points are sparse and you need to detect them explicitly.
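The effective window size quoted for exponential forgetting follows from the update itself: every step multiplies the total pseudo-count $\alpha + \beta$ by $\lambda$ and adds exactly one observation, so the total converges to the fixed point $1/(1-\lambda)$. A small sketch (the helper name `discounted_total` and the starting total of 40 are assumptions for illustration):

```python
def discounted_total(lam, steps, total0=0.0):
    """Total pseudo-count alpha + beta after `steps` discounted updates.

    Each update multiplies the running total by `lam` and adds one observation,
    regardless of whether that observation was a success or a failure.
    """
    total = total0
    for _ in range(steps):
        total = lam * total + 1.0
    return total

limits = {}
totals = {}
for lam in (0.9, 0.99, 0.999):
    limits[lam] = 1.0 / (1.0 - lam)
    totals[lam] = discounted_total(lam, steps=20_000, total0=40.0)
    print(f"lambda={lam}: total after 20000 steps = {totals[lam]:.2f}, "
          f"1/(1-lambda) = {limits[lam]:.2f}")
```

Note that the starting total (the prior's pseudo-counts) is forgotten too, so after many steps the posterior behaves as if it had seen only about $1/(1-\lambda)$ observations.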
Q4: What is the Extended Kalman Filter and when would you use it?
The Kalman filter assumes both the state transition and observation model are linear. The Extended Kalman Filter (EKF) handles nonlinear dynamics by linearizing around the current estimate using a first-order Taylor expansion (Jacobian). Specifically, if the transition $x_t = f(x_{t-1}) + w_t$ is nonlinear, the EKF approximates it as $f(x) \approx f(\mu_{t-1}) + F_t\,(x - \mu_{t-1})$ where $F_t = \left.\frac{\partial f}{\partial x}\right|_{x = \mu_{t-1}}$. Use the EKF for robot localization with nonlinear sensor models, vehicle tracking with nonlinear motion models, and GPS with spherical Earth geometry. The Unscented Kalman Filter (UKF) is often preferred over EKF because it doesn't require computing Jacobians and handles strong nonlinearities better.
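A scalar sketch shows where the Jacobian enters. Everything specific here is an assumption for illustration: the transition `f(x) = x + dt * sin(x)` is a made-up nonlinearity, the measurement is taken to observe the state directly ($H = 1$), and the numbers are arbitrary.

```python
import math

def f(x, dt=0.1):
    """Hypothetical nonlinear transition, chosen only for illustration."""
    return x + dt * math.sin(x)

def f_jacobian(x, dt=0.1):
    """Analytic derivative df/dx of the transition above."""
    return 1.0 + dt * math.cos(x)

def ekf_predict(mu, var, process_var, dt=0.1):
    """EKF predict for a scalar state: push the mean through f itself,
    and the variance through the linearization F = df/dx evaluated at mu."""
    F = f_jacobian(mu, dt)
    mu_pred = f(mu, dt)
    var_pred = F * var * F + process_var
    return mu_pred, var_pred

def kalman_update(mu_pred, var_pred, z, meas_var):
    """Ordinary Kalman update with H = 1 (the measurement model is linear here)."""
    gain = var_pred / (var_pred + meas_var)
    return mu_pred + gain * (z - mu_pred), (1.0 - gain) * var_pred

mu, var = 1.0, 0.5
mu_pred, var_pred = ekf_predict(mu, var, process_var=0.01)
mu_post, var_post = kalman_update(mu_pred, var_pred, z=1.2, meas_var=0.25)

# Sanity check: the analytic Jacobian should match a central finite difference.
eps = 1e-6
numeric_F = (f(mu + eps) - f(mu - eps)) / (2 * eps)

print(f"predicted: mean={mu_pred:.4f}, var={var_pred:.4f}")
print(f"updated:   mean={mu_post:.4f}, var={var_post:.4f}")
print(f"Jacobian:  analytic={f_jacobian(mu):.6f}, numeric={numeric_F:.6f}")
```

The only change from the linear filter is in the predict step: the mean uses the true nonlinear function, while the covariance uses its local slope. With a nonlinear measurement model, the same linearization would be applied to the observation function.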
Q5: How does the Beta-Bernoulli model connect to Thompson Sampling for recommendation?
Thompson Sampling is a Bayesian bandit algorithm for exploration vs exploitation. For each item/arm $k$ with unknown click-through rate $\theta_k$, maintain a Beta posterior $\text{Beta}(\alpha_k, \beta_k)$. At each decision: (1) sample one value $\tilde{\theta}_k \sim \text{Beta}(\alpha_k, \beta_k)$ for each arm; (2) choose the arm with the highest sampled value; (3) observe reward (click or no-click); (4) update: $\alpha_k \leftarrow \alpha_k + 1$ if clicked, $\beta_k \leftarrow \beta_k + 1$ if not. This is pure Bayesian sequential updating applied to multi-armed bandits. Thompson Sampling is asymptotically optimal for Bernoulli bandits (its regret matches the Lai-Robbins lower bound) and naturally handles exploration - arms with high uncertainty (wide Beta distributions) still get chosen often, because a wide distribution is more likely to produce the highest sample. Netflix, LinkedIn, and Pinterest use variants of this approach for real-time recommendation.
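The four steps translate almost line for line into code. This is a minimal sketch using only the standard library; the three click-through rates, the uniform Beta(1, 1) priors, and the function name `thompson_sampling` are assumptions for the demo.

```python
import random

def thompson_sampling(true_rates, rounds, seed=0):
    """Run Bernoulli Thompson Sampling with a Beta(1, 1) prior on every arm."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    alpha = [1.0] * n_arms
    beta = [1.0] * n_arms
    pulls = [0] * n_arms
    for _ in range(rounds):
        # (1) sample a plausible rate for each arm from its posterior
        samples = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
        # (2) play the arm whose sample is highest
        arm = max(range(n_arms), key=samples.__getitem__)
        # (3) observe a simulated click / no-click
        reward = 1 if rng.random() < true_rates[arm] else 0
        # (4) conjugate update for the chosen arm only
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return alpha, beta, pulls

true_rates = [0.05, 0.10, 0.30]  # hypothetical click-through rates
alpha, beta, pulls = thompson_sampling(true_rates, rounds=3000)
for k, rate in enumerate(true_rates):
    mean = alpha[k] / (alpha[k] + beta[k])
    print(f"arm {k}: true rate={rate:.2f}, pulls={pulls[k]:4d}, posterior mean={mean:.3f}")
```

Most pulls should concentrate on the best arm, while the weaker arms keep wide posteriors because they are sampled only occasionally.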
