Skip to main content

State Space Models and the Kalman Filter

Reading time: ~45 minutes Interview relevance: High - Kalman filter is standard in robotics, autonomous driving, financial ML, and signal processing interviews; conceptual connection to LSTMs is frequently tested Target roles: ML Engineer, Research Engineer, Robotics/Autonomous Systems, Financial ML

The Real Interview Moment

You're interviewing at an autonomous vehicle startup. The interviewer describes the problem: "Our car has a GPS with 3-meter accuracy, an IMU with drift, and a camera. We need a smooth, accurate position estimate at 100 Hz. How would you approach this?"

The candidate who says "use the GPS reading directly" fails immediately.

The candidate who says "average the sensors" gets a polite nod.

The candidate who says "Kalman filter - it's the optimal linear estimator that fuses noisy sensor measurements with a motion model, giving a minimum variance estimate at each timestep" gets the job.

The Kalman filter is not just a robotics tool. It is:

  • The algorithm behind GPS navigation systems
  • The foundation for financial time series tracking (dynamic linear models)
  • The theoretical ancestor of RNNs and LSTMs (learned state space models)
  • A Bayesian filtering algorithm that does optimal inference in linear Gaussian systems
  • The backbone of structural time series (trend + seasonality + noise decomposition)

State Space Representation

A state space model separates the system into two equations:

  1. Transition equation (state evolution model): xt=Ftxt1+Btut+wt,wtN(0,Qt)\mathbf{x}_t = \mathbf{F}_t \mathbf{x}_{t-1} + \mathbf{B}_t \mathbf{u}_t + \mathbf{w}_t, \quad \mathbf{w}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{Q}_t)

  2. Observation equation (measurement model): yt=Htxt+vt,vtN(0,Rt)\mathbf{y}_t = \mathbf{H}_t \mathbf{x}_t + \mathbf{v}_t, \quad \mathbf{v}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{R}_t)

Variables:

SymbolDimensionMeaning
xt\mathbf{x}_tn×1n \times 1Hidden state (latent)
yt\mathbf{y}_tm×1m \times 1Observation (measured)
Ft\mathbf{F}_tn×nn \times nState transition matrix
Ht\mathbf{H}_tm×nm \times nObservation matrix
Bt\mathbf{B}_tn×rn \times rControl input matrix
ut\mathbf{u}_tr×1r \times 1Control input
Qt\mathbf{Q}_tn×nn \times nProcess noise covariance
Rt\mathbf{R}_tm×mm \times mObservation noise covariance

ARIMA as a state space model

Every ARIMA model can be written in state space form. This is how modern implementations like statsmodels compute the likelihood via the Kalman filter.

ARIMA(2,1,1) in state space form (one of many representations):

xt=(XtXt1Xt2),F=(ϕ1ϕ20100010),H=(100)\mathbf{x}_t = \begin{pmatrix} X_t \\ X_{t-1} \\ X_{t-2} \end{pmatrix}, \quad \mathbf{F} = \begin{pmatrix} \phi_1 & \phi_2 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, \quad \mathbf{H} = \begin{pmatrix} 1 & 0 & 0 \end{pmatrix}

Local level model (random walk + noise)

The simplest useful state space model: xt=xt1+wt,wtN(0,q)x_t = x_{t-1} + w_t, \quad w_t \sim \mathcal{N}(0, q) yt=xt+vt,vtN(0,r)y_t = x_t + v_t, \quad v_t \sim \mathcal{N}(0, r)

The "true level" xtx_t evolves as a random walk; observations yty_t are noisy measurements. The Kalman filter gives the optimal estimate of xtx_t given all observations up to tt.

The Kalman Filter: Optimal Linear Estimation

The Kalman filter solves the problem: given all observations y1:t\mathbf{y}_{1:t}, compute the optimal estimate of the hidden state xt\mathbf{x}_t.

For linear Gaussian systems, the optimal estimate (minimum mean squared error) is the conditional mean x^tt=E[xty1:t]\hat{\mathbf{x}}_{t|t} = \mathbb{E}[\mathbf{x}_t | \mathbf{y}_{1:t}].

The filter proceeds in two alternating steps at each timestep.

Prediction step (time update)

Using the previous estimate, predict the current state:

x^tt1=Ftx^t1t1\hat{\mathbf{x}}_{t|t-1} = \mathbf{F}_t \hat{\mathbf{x}}_{t-1|t-1}

Ptt1=FtPt1t1Ft+Qt\mathbf{P}_{t|t-1} = \mathbf{F}_t \mathbf{P}_{t-1|t-1} \mathbf{F}_t^\top + \mathbf{Q}_t

  • x^tt1\hat{\mathbf{x}}_{t|t-1}: prior state estimate (before seeing yt\mathbf{y}_t)
  • Ptt1\mathbf{P}_{t|t-1}: prior error covariance (uncertainty in prediction)

Update step (measurement update)

When observation yt\mathbf{y}_t arrives, update the estimate:

Innovation (prediction error): y~t=ytHtx^tt1\tilde{\mathbf{y}}_t = \mathbf{y}_t - \mathbf{H}_t \hat{\mathbf{x}}_{t|t-1}

Innovation covariance: St=HtPtt1Ht+Rt\mathbf{S}_t = \mathbf{H}_t \mathbf{P}_{t|t-1} \mathbf{H}_t^\top + \mathbf{R}_t

Kalman gain (how much to trust the measurement vs the prediction): Kt=Ptt1HtSt1\mathbf{K}_t = \mathbf{P}_{t|t-1} \mathbf{H}_t^\top \mathbf{S}_t^{-1}

Posterior state estimate: x^tt=x^tt1+Kty~t\hat{\mathbf{x}}_{t|t} = \hat{\mathbf{x}}_{t|t-1} + \mathbf{K}_t \tilde{\mathbf{y}}_t

Posterior covariance: Ptt=(IKtHt)Ptt1\mathbf{P}_{t|t} = (\mathbf{I} - \mathbf{K}_t \mathbf{H}_t) \mathbf{P}_{t|t-1}

Interpretation of Kalman gain

Kt=prediction uncertaintyprediction uncertainty+measurement uncertainty\mathbf{K}_t = \frac{\text{prediction uncertainty}}{\text{prediction uncertainty} + \text{measurement uncertainty}}

  • KtHt1\mathbf{K}_t \to \mathbf{H}_t^{-1} (trust measurement fully): when measurement noise Rt0\mathbf{R}_t \to 0
  • Kt0\mathbf{K}_t \to 0 (ignore measurement, trust model): when Rt\mathbf{R}_t \to \infty or Ptt10\mathbf{P}_{t|t-1} \to 0

Python Implementation: Full Kalman Filter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class KalmanFilter:
"""
Linear Kalman Filter for state space model:
x_t = F @ x_{t-1} + B @ u_t + w_t, w_t ~ N(0, Q)
y_t = H @ x_t + v_t, v_t ~ N(0, R)

Attributes:
F: State transition matrix (n x n)
H: Observation matrix (m x n)
Q: Process noise covariance (n x n)
R: Observation noise covariance (m x m)
B: Control input matrix (n x r), optional
x0: Initial state estimate (n,)
P0: Initial covariance (n x n)
"""
F: np.ndarray # state transition
H: np.ndarray # observation
Q: np.ndarray # process noise
R: np.ndarray # measurement noise
B: Optional[np.ndarray] = None
x0: Optional[np.ndarray] = None
P0: Optional[np.ndarray] = None

def __post_init__(self):
n = self.F.shape[0]
if self.x0 is None:
self.x0 = np.zeros(n)
if self.P0 is None:
self.P0 = np.eye(n) * 1e3 # large initial uncertainty

def filter(
self,
observations: np.ndarray,
controls: Optional[np.ndarray] = None
) -> dict:
"""
Run Kalman filter over a sequence of observations.

Args:
observations: array of shape (T, m) - T measurements of dimension m
controls: optional array of shape (T, r) - control inputs

Returns:
dict with:
x_pred: prior estimates (T, n)
P_pred: prior covariances (T, n, n)
x_filt: posterior estimates (T, n)
P_filt: posterior covariances (T, n, n)
innovations: (T, m)
K: Kalman gains (T, n, m)
log_likelihood: scalar
"""
T = len(observations)
n = self.F.shape[0]
m = self.H.shape[0]

# Storage
x_pred = np.zeros((T, n))
P_pred = np.zeros((T, n, n))
x_filt = np.zeros((T, n))
P_filt = np.zeros((T, n, n))
innovations = np.zeros((T, m))
K_all = np.zeros((T, n, m))
log_likelihood = 0.0

# Initial state
x = self.x0.copy()
P = self.P0.copy()

for t in range(T):
# ── Prediction step ────────────────────────────────────────────
# Control input
Bu = np.zeros(n)
if self.B is not None and controls is not None:
Bu = self.B @ controls[t]

x_prior = self.F @ x + Bu
P_prior = self.F @ P @ self.F.T + self.Q

x_pred[t] = x_prior
P_pred[t] = P_prior

# ── Update step ────────────────────────────────────────────────
y = observations[t]

# Innovation
innovation = y - self.H @ x_prior
innovations[t] = innovation

# Innovation covariance S = H P H^T + R
S = self.H @ P_prior @ self.H.T + self.R

# Kalman gain K = P H^T S^{-1}
K = P_prior @ self.H.T @ np.linalg.inv(S)
K_all[t] = K

# Posterior state
x = x_prior + K @ innovation
P = (np.eye(n) - K @ self.H) @ P_prior

# Joseph form for numerical stability:
# P = (I - KH) P_prior (I - KH)^T + K R K^T
# (symmetric by construction, always positive semi-definite)
IKH = np.eye(n) - K @ self.H
P = IKH @ P_prior @ IKH.T + K @ self.R @ K.T

x_filt[t] = x
P_filt[t] = P

# Log-likelihood contribution: log N(innovation; 0, S)
sign, logdet = np.linalg.slogdet(S)
log_likelihood -= 0.5 * (m * np.log(2*np.pi) + logdet +
innovation @ np.linalg.inv(S) @ innovation)

return {
'x_pred': x_pred, 'P_pred': P_pred,
'x_filt': x_filt, 'P_filt': P_filt,
'innovations': innovations, 'K': K_all,
'log_likelihood': log_likelihood
}

def smooth(self, filter_output: dict) -> dict:
"""
Rauch-Tung-Striebel (RTS) smoother.

The Kalman filter is a causal estimator: at time t, it uses
observations y_{1:t}. The smoother is non-causal: at time t,
it uses ALL observations y_{1:T}, giving better estimates.

Use case: offline analysis where you have the full series,
but not for real-time filtering.
"""
x_filt = filter_output['x_filt']
P_filt = filter_output['P_filt']
x_pred = filter_output['x_pred']
P_pred = filter_output['P_pred']

T = len(x_filt)
n = x_filt.shape[1]

x_smooth = np.zeros((T, n))
P_smooth = np.zeros((T, n, n))

# Initialize from end
x_smooth[-1] = x_filt[-1]
P_smooth[-1] = P_filt[-1]

for t in range(T - 2, -1, -1):
# Smoother gain: G_t = P_t|t * F^T * P_{t+1|t}^{-1}
G = P_filt[t] @ self.F.T @ np.linalg.inv(P_pred[t+1])

x_smooth[t] = x_filt[t] + G @ (x_smooth[t+1] - x_pred[t+1])
P_smooth[t] = P_filt[t] + G @ (P_smooth[t+1] - P_pred[t+1]) @ G.T

return {'x_smooth': x_smooth, 'P_smooth': P_smooth}

Example 1: 1D Noisy Signal Tracking

# ─── Local level model: x_t = x_{t-1} + w_t, y_t = x_t + v_t ───────────────
np.random.seed(42)
T = 200

# True state: piecewise linear trend
true_state = np.zeros(T)
for t in range(1, T):
if t < 50:
true_state[t] = true_state[t-1] + 0.5 + np.random.normal(0, 0.1)
elif t < 100:
true_state[t] = true_state[t-1] - 0.3 + np.random.normal(0, 0.1)
else:
true_state[t] = true_state[t-1] + 0.2 + np.random.normal(0, 0.1)

# Observations with noise
obs_noise_std = 3.0
observations = true_state + np.random.normal(0, obs_noise_std, T)

# ─── Define Kalman Filter ────────────────────────────────────────────────────
# State: [position, velocity] (local linear trend model)
# x_t = [pos_t, vel_t]
# Transition: pos_t = pos_{t-1} + vel_{t-1}, vel_t = vel_{t-1} (constant velocity)
dt = 1.0
F = np.array([[1, dt],
[0, 1]])

H = np.array([[1, 0]]) # observe position only

# Process noise: velocity changes are stochastic
q = 0.05 # process noise intensity
Q = q * np.array([[dt**3/3, dt**2/2],
[dt**2/2, dt]])

R = np.array([[obs_noise_std**2]]) # measurement noise variance

kf = KalmanFilter(
F=F, H=H, Q=Q, R=R,
x0=np.array([observations[0], 0.0]), # initial position from first obs
P0=np.diag([obs_noise_std**2, 1.0])
)

# Run filter
result = kf.filter(observations.reshape(-1, 1))

# Run smoother
smooth_result = kf.smooth(result)

x_filtered = result['x_filt'][:, 0] # position estimates (filtered)
x_smoothed = smooth_result['x_smooth'][:, 0] # position estimates (smoothed)

# Evaluate
filter_rmse = np.sqrt(np.mean((x_filtered - true_state)**2))
smooth_rmse = np.sqrt(np.mean((x_smoothed - true_state)**2))
noisy_rmse = np.sqrt(np.mean((observations - true_state)**2))

print("Signal Tracking Results:")
print(f" Noisy observations RMSE: {noisy_rmse:.4f}")
print(f" Kalman filter RMSE: {filter_rmse:.4f} ({(1-filter_rmse/noisy_rmse)*100:.1f}% improvement)")
print(f" Kalman smoother RMSE: {smooth_rmse:.4f} ({(1-smooth_rmse/noisy_rmse)*100:.1f}% improvement)")

Output:

Signal Tracking Results:
Noisy observations RMSE: 2.9847
Kalman filter RMSE: 1.2143 (59.3% improvement)
Kalman smoother RMSE: 0.8741 (70.7% improvement)

Example 2: Sensor Fusion

Fusing GPS (accurate but slow) with IMU (fast but drifts) is the quintessential Kalman filter application.

def sensor_fusion_kalman(
gps_measurements: np.ndarray,
imu_measurements: np.ndarray,
gps_frequency: int = 10, # GPS at 10 Hz
imu_frequency: int = 100, # IMU at 100 Hz
gps_std: float = 3.0, # GPS accuracy: 3 meters
imu_std: float = 0.1, # IMU velocity noise: 0.1 m/s
) -> np.ndarray:
"""
Fuse GPS (noisy, slow) and IMU (fast, drifts) for position estimation.

Model:
State: x = [position, velocity]
IMU provides velocity measurement (high freq, low noise)
GPS provides position measurement (low freq, high noise)

At IMU-only timesteps: update with IMU only
At GPS timesteps: update with both GPS and IMU
"""
T_imu = len(imu_measurements)
dt = 1.0 / imu_frequency
gps_ratio = imu_frequency // gps_frequency # update GPS every N IMU steps

# State: [position, velocity]
x = np.array([gps_measurements[0], 0.0])
P = np.diag([gps_std**2, (imu_std*10)**2]) # high velocity uncertainty initially

F = np.array([[1, dt], [0, 1]])
Q = np.diag([0.01 * dt, 0.01]) # small process noise

# Observation models
H_gps = np.array([[1, 0]]) # GPS measures position
H_imu = np.array([[0, 1]]) # IMU measures velocity
R_gps = np.array([[gps_std**2]])
R_imu = np.array([[imu_std**2]])

estimates = np.zeros(T_imu)
gps_idx = 0

for t in range(T_imu):
# Predict
x = F @ x
P = F @ P @ F.T + Q

# IMU update (always available)
K_imu = P @ H_imu.T @ np.linalg.inv(H_imu @ P @ H_imu.T + R_imu)
innov_imu = imu_measurements[t] - H_imu @ x
x = x + (K_imu @ innov_imu).ravel()
P = (np.eye(2) - K_imu @ H_imu) @ P

# GPS update (every gps_ratio steps)
if t % gps_ratio == 0 and gps_idx < len(gps_measurements):
K_gps = P @ H_gps.T @ np.linalg.inv(H_gps @ P @ H_gps.T + R_gps)
innov_gps = gps_measurements[gps_idx] - H_gps @ x
x = x + (K_gps @ innov_gps).ravel()
P = (np.eye(2) - K_gps @ H_gps) @ P
gps_idx += 1

estimates[t] = x[0]

return estimates

# Simulate
np.random.seed(42)
T_total = 1000 # 10 seconds at 100 Hz

# True trajectory: constant acceleration
true_position = np.cumsum(np.cumsum(np.ones(T_total) * 0.1))
true_velocity = np.cumsum(np.ones(T_total) * 0.1)

# GPS: 10 Hz, 3m accuracy
gps_times = np.arange(0, T_total, 10)
gps_obs = true_position[gps_times] + np.random.normal(0, 3.0, len(gps_times))

# IMU: 100 Hz, 0.1 m/s accuracy
imu_obs = true_velocity + np.random.normal(0, 0.1, T_total)

fused = sensor_fusion_kalman(gps_obs, imu_obs)

gps_interp_rmse = np.sqrt(np.mean((np.interp(np.arange(T_total), gps_times, gps_obs) - true_position)**2))
fused_rmse = np.sqrt(np.mean((fused - true_position)**2))
print(f"GPS-only RMSE (interpolated): {gps_interp_rmse:.4f} m")
print(f"Fused (KF) RMSE: {fused_rmse:.4f} m")

Example 3: EM Algorithm for Parameter Learning

In practice, the noise covariances Q\mathbf{Q} and R\mathbf{R} are unknown. The Expectation-Maximization (EM) algorithm (or its Kalman variant, the "EM algorithm for SSMs") can learn them from data.

def kalman_em(
observations: np.ndarray,
F: np.ndarray,
H: np.ndarray,
n_iterations: int = 20,
verbose: bool = True
) -> tuple[np.ndarray, np.ndarray, float]:
"""
EM algorithm for learning Q and R in a linear Gaussian SSM.

E-step: Run Kalman filter + smoother (compute expected sufficient statistics)
M-step: Update Q and R using closed-form updates

Returns:
Q_learned, R_learned, log_likelihood
"""
T, m = observations.shape
n = F.shape[0]

# Initialize Q, R
Q = np.eye(n) * 0.1
R = np.eye(m) * 1.0
x0 = np.zeros(n)
P0 = np.eye(n)

log_likelihoods = []

for iteration in range(n_iterations):
# E-step: Kalman filter + smoother
kf_temp = KalmanFilter(F=F, H=H, Q=Q, R=R, x0=x0, P0=P0)
filt_out = kf_temp.filter(observations)
smooth_out = kf_temp.smooth(filt_out)

x_s = smooth_out['x_smooth'] # (T, n) smoothed states
P_s = smooth_out['P_smooth'] # (T, n, n) smoothed covariances
ll = filt_out['log_likelihood']
log_likelihoods.append(ll)

# Compute E[x_t x_{t-1}^T] (cross-covariance for Q update)
# Using RTS smoother lag-1 cross-covariance formula
x_pred = filt_out['x_pred']
P_pred = filt_out['P_pred']
P_filt = filt_out['P_filt']

# Simplified M-step: closed-form updates for R and Q
# M-step for R: R = (1/T) * sum_t [(y_t - H x_t^s)(y_t - H x_t^s)^T + H P_t^s H^T]
R_new = np.zeros((m, m))
for t in range(T):
resid = observations[t] - H @ x_s[t]
R_new += np.outer(resid, resid) + H @ P_s[t] @ H.T
R_new /= T

# M-step for Q: requires lag-1 smoother cross-covariance (simplified)
Q_new = np.zeros((n, n))
for t in range(1, T):
# Approximate: use |x_t^s - F x_{t-1}^s|^2 + covariance terms
diff = x_s[t] - F @ x_s[t-1]
Q_new += np.outer(diff, diff) + P_s[t] + F @ P_s[t-1] @ F.T
Q_new /= (T - 1)

# Regularize to keep positive definite
Q = Q_new + np.eye(n) * 1e-6
R = R_new + np.eye(m) * 1e-6

if verbose and (iteration % 5 == 0 or iteration == n_iterations - 1):
print(f"Iteration {iteration+1:3d}: log-likelihood = {ll:.4f}")

return Q, R, log_likelihoods

print("EM Parameter Learning:")
obs = observations.reshape(-1, 1)
F_local = np.array([[1, 1.0], [0, 1.0]])
H_local = np.array([[1, 0]])
Q_learned, R_learned, lls = kalman_em(obs, F_local, H_local, n_iterations=20)
print(f"\nLearned R (measurement noise var): {R_learned[0,0]:.4f} (true: {obs_noise_std**2:.4f})")

Connection to RNNs and LSTMs

The deep learning connection is not superficial - LSTMs are learned non-linear Kalman filters.

Linear SSM → non-linear SSM → RNN

The state space model: xt=Fxt1+wt\mathbf{x}_t = \mathbf{F} \mathbf{x}_{t-1} + \mathbf{w}_t yt=Hxt+vt\mathbf{y}_t = \mathbf{H} \mathbf{x}_t + \mathbf{v}_t

Generalizes to a non-linear SSM: xt=f(xt1,ut)+wt\mathbf{x}_t = f(\mathbf{x}_{t-1}, \mathbf{u}_t) + \mathbf{w}_t yt=h(xt)+vt\mathbf{y}_t = h(\mathbf{x}_t) + \mathbf{v}_t

An RNN is exactly this, with ff and hh parameterized by neural networks: ht=tanh(Whht1+Wxxt+b)\mathbf{h}_t = \tanh(\mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{W}_x \mathbf{x}_t + \mathbf{b}) y^t=Woht+bo\hat{\mathbf{y}}_t = \mathbf{W}_o \mathbf{h}_t + \mathbf{b}_o

The hidden state ht\mathbf{h}_t plays the role of x^tt\hat{\mathbf{x}}_{t|t} - the filtered state estimate.

The LSTM as a learned Kalman filter

Kalman FilterLSTM
Prediction: x^tt1=Fx^t1t1\hat{\mathbf{x}}_{t\|t-1} = \mathbf{F}\hat{\mathbf{x}}_{t-1\|t-1}Forget gate: ft=σ(Wf[ht1,xt])\mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{x}_t])
Kalman gain Kt\mathbf{K}_t (how much to trust new data)Input gate: it=σ(Wi[ht1,xt])\mathbf{i}_t = \sigma(\mathbf{W}_i[\mathbf{h}_{t-1}, \mathbf{x}_t])
Update: x^+=Kty~t\hat{\mathbf{x}} \mathrel{+}= \mathbf{K}_t \tilde{\mathbf{y}}_tCell update: ct=ftct1+itc~t\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t
Covariance Pt\mathbf{P}_t (uncertainty)No explicit uncertainty (limitation of LSTMs)
Known F,H,Q,R\mathbf{F}, \mathbf{H}, \mathbf{Q}, \mathbf{R} (or EM-learned)Learned Wf,Wi,Wo\mathbf{W}_f, \mathbf{W}_i, \mathbf{W}_o (gradient descent)

Key insight: LSTM gates learn to imitate Kalman gain computation. The forget gate decides how much of the past state to retain; the input gate decides how much new information to incorporate - exactly what Kt\mathbf{K}_t does in the Kalman filter.

Where Kalman filter wins over LSTM:

  • Linear Gaussian systems: Kalman is provably optimal; LSTM approximates it
  • Small data: Kalman requires no training; LSTM needs thousands of examples
  • Interpretable uncertainty: Kalman provides exact covariance; LSTM doesn't

Where LSTM wins:

  • Non-linear dynamics
  • Unknown system equations (no F\mathbf{F} or H\mathbf{H})
  • Complex distributions (non-Gaussian noise)
import torch
import torch.nn as nn

class KalmanInspiredLSTM(nn.Module):
"""
LSTM with explicit Kalman-filter-inspired structure.
Shows the conceptual connection between the two.
"""

def __init__(self, input_size: int, state_size: int, output_size: int):
super().__init__()
self.state_size = state_size

# Learned state transition (analogous to F)
self.state_transition = nn.Linear(state_size, state_size, bias=False)

# Learned observation matrix (analogous to H)
self.observation = nn.Linear(state_size, output_size)

# Learned Kalman gain network (input → gain)
self.gain_net = nn.Sequential(
nn.Linear(input_size + state_size, state_size),
nn.Sigmoid() # output in [0, 1] - how much to update state
)

def forward(self, x: torch.Tensor, h: torch.Tensor = None) -> tuple:
"""
x: (batch, seq_len, input_size)
Returns: outputs (batch, seq_len, output_size), final state
"""
batch, seq_len, _ = x.shape
if h is None:
h = torch.zeros(batch, self.state_size)

outputs = []
for t in range(seq_len):
x_t = x[:, t, :] # (batch, input_size)

# Prediction (analogous to Kalman predict step)
h_pred = torch.tanh(self.state_transition(h))

# Compute "Kalman gain" from current input and predicted state
gain = self.gain_net(torch.cat([x_t, h_pred], dim=-1))

# Compute innovation (difference between input and expected)
# In full implementation: innovation = y_t - H @ x_pred
# Simplified: use a learned innovation network
innovation = x_t - self.observation(h_pred).detach() # simplified

# Update (analogous to Kalman update step)
h = h_pred + gain * torch.tanh(nn.functional.linear(
torch.cat([x_t, h_pred], dim=-1),
torch.randn(self.state_size, x_t.shape[-1] + self.state_size)
# In practice: a proper learned update network
))

y_t = self.observation(h)
outputs.append(y_t)

return torch.stack(outputs, dim=1), h

Extended Kalman Filter (EKF): Non-Linear Systems

For non-linear state transitions ff and observations hh, linearize via Taylor expansion:

xtf(x^t1t1)+Jf(xt1x^t1t1)+wt\mathbf{x}_t \approx f(\hat{\mathbf{x}}_{t-1|t-1}) + \mathbf{J}_f (\mathbf{x}_{t-1} - \hat{\mathbf{x}}_{t-1|t-1}) + \mathbf{w}_t

where Jf=f/xx^t1t1\mathbf{J}_f = \partial f / \partial \mathbf{x}|_{\hat{\mathbf{x}}_{t-1|t-1}} is the Jacobian.

Replace FtJf\mathbf{F}_t \leftarrow \mathbf{J}_f and HtJh\mathbf{H}_t \leftarrow \mathbf{J}_h in the Kalman equations. This is the Extended Kalman Filter.

Applications: robot SLAM (Simultaneous Localization and Mapping), non-linear econometric models, GPS with non-linear measurement equations (range vs position).

Limitation: EKF linearizes locally - poor for highly non-linear systems (use Unscented KF or particle filters instead).

Structural Time Series Models

State space models provide a principled framework for decomposing time series:

yt=μt+γt+δt+ϵty_t = \mu_t + \gamma_t + \delta_t + \epsilon_t

where:

  • μt\mu_t: trend (local level or local linear trend)
  • γt\gamma_t: seasonal component
  • δt\delta_t: cyclical component
  • ϵtN(0,σϵ2)\epsilon_t \sim \mathcal{N}(0, \sigma^2_\epsilon): irregular

Each component has its own state space representation; combine them into a single SSM and use the Kalman filter for inference.

from statsmodels.tsa.statespace.structural import UnobservedComponents

def fit_structural_time_series(series: pd.Series) -> dict:
"""
Fit a structural time series model using statsmodels.
Decomposes into trend + seasonal + irregular.

Under the hood: state space model + Kalman filter for inference.
"""
# Local linear trend + stochastic seasonal
model = UnobservedComponents(
series,
level='local linear trend', # trend: random walk in level and slope
seasonal=12, # 12-period seasonal (monthly)
stochastic_seasonal=True # seasonal component can evolve
)

result = model.fit(disp=False)

# Extract components
components = result.get_prediction()
component_means = components.predicted_mean

print(f"Structural TSM Log-likelihood: {result.llf:.2f}")
print(f"AIC: {result.aic:.2f}")
print(f"\nNoise parameters:")
print(result.params.to_string())

return {
'result': result,
'trend': result.trend,
'seasonal': result.seasonal,
'irregular': result.irregular
}

# Generate seasonal series for structural decomposition
np.random.seed(0)
n = 144 # 12 years of monthly data
t = np.arange(n)
y_struct = (
200 + 2*t + # trend
30*np.sin(2*np.pi*t/12) + 15*np.sin(2*np.pi*2*t/12) # seasonal
+ np.random.normal(0, 10, n) # irregular
)
series_struct = pd.Series(y_struct,
index=pd.date_range('2010-01', periods=n, freq='ME'))

structural_result = fit_structural_time_series(series_struct)

Interview Questions

Q1: Explain the Kalman filter in one minute. What does it compute and why is it optimal?

The Kalman filter is a recursive algorithm that estimates a hidden state xt\mathbf{x}_t from noisy observations y1:t\mathbf{y}_{1:t}.

It alternates between two steps:

  1. Predict: Use the state transition model to predict x^tt1\hat{\mathbf{x}}_{t|t-1} and its uncertainty Ptt1\mathbf{P}_{t|t-1} from the previous estimate.
  2. Update: When a new observation arrives, compute the Kalman gain Kt\mathbf{K}_t - a matrix that weights how much to trust the measurement vs the prediction - and update: x^tt=x^tt1+Kt(ytHx^tt1)\hat{\mathbf{x}}_{t|t} = \hat{\mathbf{x}}_{t|t-1} + \mathbf{K}_t (y_t - \mathbf{H}\hat{\mathbf{x}}_{t|t-1}).

Why optimal? For linear Gaussian systems, the Kalman filter computes the exact conditional distribution p(xty1:t)=N(x^tt,Ptt)p(\mathbf{x}_t | \mathbf{y}_{1:t}) = \mathcal{N}(\hat{\mathbf{x}}_{t|t}, \mathbf{P}_{t|t}). Among all estimators, it minimizes mean squared error - it is the MMSE (minimum mean squared error) estimator. This follows from the fact that for Gaussian distributions, the conditional mean is the MMSE estimator.

Q2: What is the Kalman gain and what do extreme values (near 0 or near 1) mean?

The Kalman gain is: Kt=Ptt1H(HPtt1H+R)1\mathbf{K}_t = \mathbf{P}_{t|t-1} \mathbf{H}^\top (\mathbf{H}\mathbf{P}_{t|t-1}\mathbf{H}^\top + \mathbf{R})^{-1}

It determines the trade-off between the model prediction and the measurement.

Kt0\mathbf{K}_t \to 0 (small gain): Ignore the measurement. This happens when:

  • R\mathbf{R} \to \infty: measurement noise is very high (GPS in a tunnel)
  • Ptt10\mathbf{P}_{t|t-1} \to 0: the model prediction is highly confident Result: posterior \approx prior prediction

KtH1\mathbf{K}_t \to \mathbf{H}^{-1} (large gain): Trust the measurement fully. This happens when:

  • R0\mathbf{R} \to 0: measurement is very accurate (RTK GPS)
  • Ptt1\mathbf{P}_{t|t-1} \to \infty: the model prediction is very uncertain Result: posterior \approx scaled measurement

Intuition: Kalman gain balances two sources of information. If your GPS is precise and your motion model is uncertain, trust GPS. If your GPS is noisy and your model is well-calibrated, trust the model. The Kalman filter finds the statistically optimal balance.

Q3: How is an LSTM related to a Kalman filter?

Both are sequential state estimators: given a new observation, update a hidden state.

Structural parallels:

  • The LSTM cell state ct\mathbf{c}_t corresponds to the Kalman state estimate x^tt\hat{\mathbf{x}}_{t|t}
  • The LSTM forget gate ft=σ()\mathbf{f}_t = \sigma(\cdot) corresponds to how much of the prior state survives - in the Kalman filter, this is determined by the transition matrix F\mathbf{F} and the Kalman gain
  • The LSTM input gate it\mathbf{i}_t controls how much new information enters - Kalman's Kt\mathbf{K}_t does the same
  • Both compute a weighted combination of prediction and new observation

Key differences:

  • Kalman filter: linear dynamics, Gaussian noise, known model equations → provably optimal
  • LSTM: non-linear dynamics, learned from data, no distributional assumptions → flexible but not provably optimal
  • Uncertainty: Kalman tracks uncertainty explicitly via covariance Pt\mathbf{P}_t; LSTM has no built-in uncertainty quantification
  • Data requirements: Kalman requires model specification; LSTM requires training data

Practical implication: When you have a known dynamical system (physics-based model), use Kalman. When the dynamics are unknown and you have data, use LSTM. Hybrid approaches (differentiable Kalman filters, Kalman-LSTM) learn dynamics while maintaining uncertainty tracking.

Q4: What is the difference between Kalman filtering and Kalman smoothing?

Kalman filtering is causal: at time tt, compute p(xty1:t)p(\mathbf{x}_t | \mathbf{y}_{1:t}) - only past and current observations are used. This is the appropriate algorithm for real-time systems (the filter cannot see the future).

Kalman smoothing (specifically the Rauch-Tung-Striebel smoother) is non-causal: at time tt, compute p(xty1:T)p(\mathbf{x}_t | \mathbf{y}_{1:T}) - all observations including future ones are used. This gives better estimates when you have the full time series available.

When to use each:

  • Filtering: real-time tracking (autonomous vehicles, robot control), streaming forecasting
  • Smoothing: offline analysis (EEG signal processing, time series decomposition, retrospective analysis)

Why smoother is better?: At time tt, future observations yt+1:T\mathbf{y}_{t+1:T} carry information about xt\mathbf{x}_t through the transition model. The smoother uses this information via a backward pass (RTS algorithm), giving lower RMSE than the forward-only filter.

In terms of ML analogies: filtering ≈ autoregressive model (left-to-right); smoothing ≈ bidirectional model (uses both past and future context, like BERT vs GPT).

Q5: How would you use a Kalman filter for anomaly detection?

The Kalman filter provides a natural anomaly detection framework via the innovation sequence y~t=ytHx^tt1\tilde{\mathbf{y}}_t = \mathbf{y}_t - \mathbf{H}\hat{\mathbf{x}}_{t|t-1}.

Under the nominal model, innovations should be i.i.d. Gaussian with zero mean and covariance St=HPtt1H+R\mathbf{S}_t = \mathbf{H}\mathbf{P}_{t|t-1}\mathbf{H}^\top + \mathbf{R}.

Anomaly score: The normalized innovation squared (NIS): NISt=y~tSt1y~tχ2(m)\text{NIS}_t = \tilde{\mathbf{y}}_t^\top \mathbf{S}_t^{-1} \tilde{\mathbf{y}}_t \sim \chi^2(m) under the null hypothesis (no anomaly), where mm is the observation dimension.

Algorithm:

  1. Fit Kalman filter on normal training data (learn Q, R via EM)
  2. At test time, compute NIS at each step
  3. Flag tt as anomaly if NISt>χm,0.992\text{NIS}_t > \chi^2_{m, 0.99} (99th percentile)

Advantages over threshold-based methods:

  • Automatically adapts to the model's current uncertainty (if the model is uncertain, the threshold widens)
  • Distinguishes "large value" from "unexpected value given model" - a large value from a known trend is normal; a deviation from the trend is anomalous
  • Works well for sensor data with known dynamics (temperature, pressure, motor vibration)

This approach is used in spacecraft telemetry monitoring, industrial predictive maintenance, and financial trade surveillance.

Key Takeaways

  • The state space model separates dynamics (transition equation) from observations (measurement equation)
  • The Kalman filter is the optimal linear estimator: predict → compute Kalman gain → update
  • Kalman gain Kt\mathbf{K}_t balances model prediction vs measurement, inversely proportional to measurement noise
  • RTS smoother runs the Kalman filter forward, then backward - uses all data for each estimate (not real-time)
  • LSTMs are learned non-linear Kalman filters: forget gate ≈ forgetting old state, input gate ≈ Kalman gain
  • Structural time series models (trend + seasonal + irregular) use the Kalman filter internally for optimal decomposition
  • For non-linear systems: Extended Kalman Filter (linearize Jacobians), Unscented KF (sigma points), or particle filters

Next: Lesson 06: Cointegration and Granger Causality →

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Kalman Filter demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.