State Space Models and the Kalman Filter

Reading time: ~45 minutes Interview relevance: High - Kalman filter is standard in robotics, autonomous driving, financial ML, and signal processing interviews; conceptual connection to LSTMs is frequently tested Target roles: ML Engineer, Research Engineer, Robotics/Autonomous Systems, Financial ML

The Real Interview Moment

You're interviewing at an autonomous vehicle startup. The interviewer describes the problem: "Our car has a GPS with 3-meter accuracy, an IMU with drift, and a camera. We need a smooth, accurate position estimate at 100 Hz. How would you approach this?"

The candidate who says "use the GPS reading directly" fails immediately.

The candidate who says "average the sensors" gets a polite nod.

The candidate who says "Kalman filter - it's the optimal linear estimator that fuses noisy sensor measurements with a motion model, giving a minimum variance estimate at each timestep" gets the job.

The Kalman filter is not just a robotics tool. It is:

The algorithm behind GPS navigation systems
The foundation for financial time series tracking (dynamic linear models)
The theoretical ancestor of RNNs and LSTMs (learned state space models)
A Bayesian filtering algorithm that does optimal inference in linear Gaussian systems
The backbone of structural time series (trend + seasonality + noise decomposition)

State Space Representation

A state space model separates the system into two equations:

Transition equation (state evolution model): $\mathbf{x}_t = \mathbf{F}_t \mathbf{x}_{t-1} + \mathbf{B}_t \mathbf{u}_t + \mathbf{w}_t, \quad \mathbf{w}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{Q}_t)$
Observation equation (measurement model): $\mathbf{y}_t = \mathbf{H}_t \mathbf{x}_t + \mathbf{v}_t, \quad \mathbf{v}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{R}_t)$

Variables:

Symbol	Dimension	Meaning
$\mathbf{x}_t$	$n \times 1$	Hidden state (latent)
$\mathbf{y}_t$	$m \times 1$	Observation (measured)
$\mathbf{F}_t$	$n \times n$	State transition matrix
$\mathbf{H}_t$	$m \times n$	Observation matrix
$\mathbf{B}_t$	$n \times r$	Control input matrix
$\mathbf{u}_t$	$r \times 1$	Control input
$\mathbf{Q}_t$	$n \times n$	Process noise covariance
$\mathbf{R}_t$	$m \times m$	Observation noise covariance

ARIMA as a state space model

Every ARIMA model can be written in state space form. This is how modern implementations like statsmodels compute the likelihood via the Kalman filter.

ARIMA(2,1,1) in state space form (one of many representations):

$\mathbf{x}_t = \begin{pmatrix} X_t \\ X_{t-1} \\ X_{t-2} \end{pmatrix}, \quad \mathbf{F} = \begin{pmatrix} \phi_1 & \phi_2 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, \quad \mathbf{H} = \begin{pmatrix} 1 & 0 & 0 \end{pmatrix}$

Local level model (random walk + noise)

The simplest useful state space model: $x_t = x_{t-1} + w_t, \quad w_t \sim \mathcal{N}(0, q)$ $y_t = x_t + v_t, \quad v_t \sim \mathcal{N}(0, r)$

The "true level" $x_t$ evolves as a random walk; observations $y_t$ are noisy measurements. The Kalman filter gives the optimal estimate of $x_t$ given all observations up to $t$ .

The Kalman Filter: Optimal Linear Estimation

The Kalman filter solves the problem: given all observations $\mathbf{y}_{1:t}$ , compute the optimal estimate of the hidden state $\mathbf{x}_t$ .

For linear Gaussian systems, the optimal estimate (minimum mean squared error) is the conditional mean $\hat{\mathbf{x}}_{t|t} = \mathbb{E}[\mathbf{x}_t | \mathbf{y}_{1:t}]$ .

The filter proceeds in two alternating steps at each timestep.

Prediction step (time update)

Using the previous estimate, predict the current state:

$\hat{\mathbf{x}}_{t|t-1} = \mathbf{F}_t \hat{\mathbf{x}}_{t-1|t-1}$

$\mathbf{P}_{t|t-1} = \mathbf{F}_t \mathbf{P}_{t-1|t-1} \mathbf{F}_t^\top + \mathbf{Q}_t$

$\hat{\mathbf{x}}_{t|t-1}$ : prior state estimate (before seeing $\mathbf{y}_t$ )
$\mathbf{P}_{t|t-1}$ : prior error covariance (uncertainty in prediction)

Update step (measurement update)

When observation $\mathbf{y}_t$ arrives, update the estimate:

Innovation (prediction error): $\tilde{\mathbf{y}}_t = \mathbf{y}_t - \mathbf{H}_t \hat{\mathbf{x}}_{t|t-1}$

Innovation covariance: $\mathbf{S}_t = \mathbf{H}_t \mathbf{P}_{t|t-1} \mathbf{H}_t^\top + \mathbf{R}_t$

Kalman gain (how much to trust the measurement vs the prediction): $\mathbf{K}_t = \mathbf{P}_{t|t-1} \mathbf{H}_t^\top \mathbf{S}_t^{-1}$

Posterior state estimate: $\hat{\mathbf{x}}_{t|t} = \hat{\mathbf{x}}_{t|t-1} + \mathbf{K}_t \tilde{\mathbf{y}}_t$

Posterior covariance: $\mathbf{P}_{t|t} = (\mathbf{I} - \mathbf{K}_t \mathbf{H}_t) \mathbf{P}_{t|t-1}$

Interpretation of Kalman gain

$\mathbf{K}_t = \frac{\text{prediction uncertainty}}{\text{prediction uncertainty} + \text{measurement uncertainty}}$

$\mathbf{K}_t \to \mathbf{H}_t^{-1}$ (trust measurement fully): when measurement noise $\mathbf{R}_t \to 0$
$\mathbf{K}_t \to 0$ (ignore measurement, trust model): when $\mathbf{R}_t \to \infty$ or $\mathbf{P}_{t|t-1} \to 0$

Python Implementation: Full Kalman Filter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class KalmanFilter:
    """
    Linear Kalman Filter for state space model:
        x_t = F @ x_{t-1} + B @ u_t + w_t,  w_t ~ N(0, Q)
        y_t = H @ x_t + v_t,                 v_t ~ N(0, R)

    Attributes:
        F:  State transition matrix  (n x n)
        H:  Observation matrix       (m x n)
        Q:  Process noise covariance (n x n)
        R:  Observation noise covariance (m x m)
        B:  Control input matrix     (n x r), optional
        x0: Initial state estimate   (n,)
        P0: Initial covariance       (n x n)
    """
    F: np.ndarray   # state transition
    H: np.ndarray   # observation
    Q: np.ndarray   # process noise
    R: np.ndarray   # measurement noise
    B: Optional[np.ndarray] = None
    x0: Optional[np.ndarray] = None
    P0: Optional[np.ndarray] = None

    def __post_init__(self):
        n = self.F.shape[0]
        if self.x0 is None:
            self.x0 = np.zeros(n)
        if self.P0 is None:
            self.P0 = np.eye(n) * 1e3  # large initial uncertainty

    def filter(
        self,
        observations: np.ndarray,
        controls: Optional[np.ndarray] = None
    ) -> dict:
        """
        Run Kalman filter over a sequence of observations.

        Args:
            observations: array of shape (T, m) - T measurements of dimension m
            controls: optional array of shape (T, r) - control inputs

        Returns:
            dict with:
                x_pred:  prior estimates    (T, n)
                P_pred:  prior covariances  (T, n, n)
                x_filt:  posterior estimates (T, n)
                P_filt:  posterior covariances (T, n, n)
                innovations: (T, m)
                K: Kalman gains (T, n, m)
                log_likelihood: scalar
        """
        T = len(observations)
        n = self.F.shape[0]
        m = self.H.shape[0]

        # Storage
        x_pred = np.zeros((T, n))
        P_pred = np.zeros((T, n, n))
        x_filt = np.zeros((T, n))
        P_filt = np.zeros((T, n, n))
        innovations = np.zeros((T, m))
        K_all = np.zeros((T, n, m))
        log_likelihood = 0.0

        # Initial state
        x = self.x0.copy()
        P = self.P0.copy()

        for t in range(T):
            # ── Prediction step ────────────────────────────────────────────
            # Control input
            Bu = np.zeros(n)
            if self.B is not None and controls is not None:
                Bu = self.B @ controls[t]

            x_prior = self.F @ x + Bu
            P_prior = self.F @ P @ self.F.T + self.Q

            x_pred[t] = x_prior
            P_pred[t] = P_prior

            # ── Update step ────────────────────────────────────────────────
            y = observations[t]

            # Innovation
            innovation = y - self.H @ x_prior
            innovations[t] = innovation

            # Innovation covariance S = H P H^T + R
            S = self.H @ P_prior @ self.H.T + self.R

            # Kalman gain K = P H^T S^{-1}
            K = P_prior @ self.H.T @ np.linalg.inv(S)
            K_all[t] = K

            # Posterior state
            x = x_prior + K @ innovation
            P = (np.eye(n) - K @ self.H) @ P_prior

            # Joseph form for numerical stability:
            # P = (I - KH) P_prior (I - KH)^T + K R K^T
            # (symmetric by construction, always positive semi-definite)
            IKH = np.eye(n) - K @ self.H
            P = IKH @ P_prior @ IKH.T + K @ self.R @ K.T

            x_filt[t] = x
            P_filt[t] = P

            # Log-likelihood contribution: log N(innovation; 0, S)
            sign, logdet = np.linalg.slogdet(S)
            log_likelihood -= 0.5 * (m * np.log(2*np.pi) + logdet +
                                     innovation @ np.linalg.inv(S) @ innovation)

        return {
            'x_pred': x_pred, 'P_pred': P_pred,
            'x_filt': x_filt, 'P_filt': P_filt,
            'innovations': innovations, 'K': K_all,
            'log_likelihood': log_likelihood
        }

    def smooth(self, filter_output: dict) -> dict:
        """
        Rauch-Tung-Striebel (RTS) smoother.

        The Kalman filter is a causal estimator: at time t, it uses
        observations y_{1:t}. The smoother is non-causal: at time t,
        it uses ALL observations y_{1:T}, giving better estimates.

        Use case: offline analysis where you have the full series,
        but not for real-time filtering.
        """
        x_filt = filter_output['x_filt']
        P_filt = filter_output['P_filt']
        x_pred = filter_output['x_pred']
        P_pred = filter_output['P_pred']

        T = len(x_filt)
        n = x_filt.shape[1]

        x_smooth = np.zeros((T, n))
        P_smooth = np.zeros((T, n, n))

        # Initialize from end
        x_smooth[-1] = x_filt[-1]
        P_smooth[-1] = P_filt[-1]

        for t in range(T - 2, -1, -1):
            # Smoother gain: G_t = P_t|t * F^T * P_{t+1|t}^{-1}
            G = P_filt[t] @ self.F.T @ np.linalg.inv(P_pred[t+1])

            x_smooth[t] = x_filt[t] + G @ (x_smooth[t+1] - x_pred[t+1])
            P_smooth[t] = P_filt[t] + G @ (P_smooth[t+1] - P_pred[t+1]) @ G.T

        return {'x_smooth': x_smooth, 'P_smooth': P_smooth}

Example 1: 1D Noisy Signal Tracking

# ─── Local level model: x_t = x_{t-1} + w_t, y_t = x_t + v_t ───────────────
np.random.seed(42)
T = 200

# True state: piecewise linear trend
true_state = np.zeros(T)
for t in range(1, T):
    if t < 50:
        true_state[t] = true_state[t-1] + 0.5 + np.random.normal(0, 0.1)
    elif t < 100:
        true_state[t] = true_state[t-1] - 0.3 + np.random.normal(0, 0.1)
    else:
        true_state[t] = true_state[t-1] + 0.2 + np.random.normal(0, 0.1)

# Observations with noise
obs_noise_std = 3.0
observations = true_state + np.random.normal(0, obs_noise_std, T)

# ─── Define Kalman Filter ────────────────────────────────────────────────────
# State: [position, velocity] (local linear trend model)
# x_t = [pos_t, vel_t]
# Transition: pos_t = pos_{t-1} + vel_{t-1}, vel_t = vel_{t-1} (constant velocity)
dt = 1.0
F = np.array([[1, dt],
              [0,  1]])

H = np.array([[1, 0]])  # observe position only

# Process noise: velocity changes are stochastic
q = 0.05  # process noise intensity
Q = q * np.array([[dt**3/3, dt**2/2],
                   [dt**2/2, dt]])

R = np.array([[obs_noise_std**2]])  # measurement noise variance

kf = KalmanFilter(
    F=F, H=H, Q=Q, R=R,
    x0=np.array([observations[0], 0.0]),  # initial position from first obs
    P0=np.diag([obs_noise_std**2, 1.0])
)

# Run filter
result = kf.filter(observations.reshape(-1, 1))

# Run smoother
smooth_result = kf.smooth(result)

x_filtered = result['x_filt'][:, 0]   # position estimates (filtered)
x_smoothed = smooth_result['x_smooth'][:, 0]  # position estimates (smoothed)

# Evaluate
filter_rmse = np.sqrt(np.mean((x_filtered - true_state)**2))
smooth_rmse = np.sqrt(np.mean((x_smoothed - true_state)**2))
noisy_rmse = np.sqrt(np.mean((observations - true_state)**2))

print("Signal Tracking Results:")
print(f"  Noisy observations RMSE: {noisy_rmse:.4f}")
print(f"  Kalman filter RMSE:      {filter_rmse:.4f} ({(1-filter_rmse/noisy_rmse)*100:.1f}% improvement)")
print(f"  Kalman smoother RMSE:    {smooth_rmse:.4f} ({(1-smooth_rmse/noisy_rmse)*100:.1f}% improvement)")

Output:

Signal Tracking Results:
  Noisy observations RMSE: 2.9847
  Kalman filter RMSE:      1.2143 (59.3% improvement)
  Kalman smoother RMSE:    0.8741 (70.7% improvement)

Example 2: Sensor Fusion

Fusing GPS (accurate but slow) with IMU (fast but drifts) is the quintessential Kalman filter application.

def sensor_fusion_kalman(
    gps_measurements: np.ndarray,
    imu_measurements: np.ndarray,
    gps_frequency: int = 10,   # GPS at 10 Hz
    imu_frequency: int = 100,  # IMU at 100 Hz
    gps_std: float = 3.0,      # GPS accuracy: 3 meters
    imu_std: float = 0.1,      # IMU velocity noise: 0.1 m/s
) -> np.ndarray:
    """
    Fuse GPS (noisy, slow) and IMU (fast, drifts) for position estimation.

    Model:
        State: x = [position, velocity]
        IMU provides velocity measurement (high freq, low noise)
        GPS provides position measurement (low freq, high noise)

    At IMU-only timesteps: update with IMU only
    At GPS timesteps: update with both GPS and IMU
    """
    T_imu = len(imu_measurements)
    dt = 1.0 / imu_frequency
    gps_ratio = imu_frequency // gps_frequency  # update GPS every N IMU steps

    # State: [position, velocity]
    x = np.array([gps_measurements[0], 0.0])
    P = np.diag([gps_std**2, (imu_std*10)**2])  # high velocity uncertainty initially

    F = np.array([[1, dt], [0, 1]])
    Q = np.diag([0.01 * dt, 0.01])  # small process noise

    # Observation models
    H_gps = np.array([[1, 0]])   # GPS measures position
    H_imu = np.array([[0, 1]])   # IMU measures velocity
    R_gps = np.array([[gps_std**2]])
    R_imu = np.array([[imu_std**2]])

    estimates = np.zeros(T_imu)
    gps_idx = 0

    for t in range(T_imu):
        # Predict
        x = F @ x
        P = F @ P @ F.T + Q

        # IMU update (always available)
        K_imu = P @ H_imu.T @ np.linalg.inv(H_imu @ P @ H_imu.T + R_imu)
        innov_imu = imu_measurements[t] - H_imu @ x
        x = x + (K_imu @ innov_imu).ravel()
        P = (np.eye(2) - K_imu @ H_imu) @ P

        # GPS update (every gps_ratio steps)
        if t % gps_ratio == 0 and gps_idx < len(gps_measurements):
            K_gps = P @ H_gps.T @ np.linalg.inv(H_gps @ P @ H_gps.T + R_gps)
            innov_gps = gps_measurements[gps_idx] - H_gps @ x
            x = x + (K_gps @ innov_gps).ravel()
            P = (np.eye(2) - K_gps @ H_gps) @ P
            gps_idx += 1

        estimates[t] = x[0]

    return estimates

# Simulate
np.random.seed(42)
T_total = 1000  # 10 seconds at 100 Hz

# True trajectory: constant acceleration
true_position = np.cumsum(np.cumsum(np.ones(T_total) * 0.1))
true_velocity = np.cumsum(np.ones(T_total) * 0.1)

# GPS: 10 Hz, 3m accuracy
gps_times = np.arange(0, T_total, 10)
gps_obs = true_position[gps_times] + np.random.normal(0, 3.0, len(gps_times))

# IMU: 100 Hz, 0.1 m/s accuracy
imu_obs = true_velocity + np.random.normal(0, 0.1, T_total)

fused = sensor_fusion_kalman(gps_obs, imu_obs)

gps_interp_rmse = np.sqrt(np.mean((np.interp(np.arange(T_total), gps_times, gps_obs) - true_position)**2))
fused_rmse = np.sqrt(np.mean((fused - true_position)**2))
print(f"GPS-only RMSE (interpolated): {gps_interp_rmse:.4f} m")
print(f"Fused (KF) RMSE:              {fused_rmse:.4f} m")

Example 3: EM Algorithm for Parameter Learning

In practice, the noise covariances $\mathbf{Q}$ and $\mathbf{R}$ are unknown. The Expectation-Maximization (EM) algorithm (or its Kalman variant, the "EM algorithm for SSMs") can learn them from data.

def kalman_em(
    observations: np.ndarray,
    F: np.ndarray,
    H: np.ndarray,
    n_iterations: int = 20,
    verbose: bool = True
) -> tuple[np.ndarray, np.ndarray, float]:
    """
    EM algorithm for learning Q and R in a linear Gaussian SSM.

    E-step: Run Kalman filter + smoother (compute expected sufficient statistics)
    M-step: Update Q and R using closed-form updates

    Returns:
        Q_learned, R_learned, log_likelihood
    """
    T, m = observations.shape
    n = F.shape[0]

    # Initialize Q, R
    Q = np.eye(n) * 0.1
    R = np.eye(m) * 1.0
    x0 = np.zeros(n)
    P0 = np.eye(n)

    log_likelihoods = []

    for iteration in range(n_iterations):
        # E-step: Kalman filter + smoother
        kf_temp = KalmanFilter(F=F, H=H, Q=Q, R=R, x0=x0, P0=P0)
        filt_out = kf_temp.filter(observations)
        smooth_out = kf_temp.smooth(filt_out)

        x_s = smooth_out['x_smooth']        # (T, n) smoothed states
        P_s = smooth_out['P_smooth']        # (T, n, n) smoothed covariances
        ll = filt_out['log_likelihood']
        log_likelihoods.append(ll)

        # Compute E[x_t x_{t-1}^T] (cross-covariance for Q update)
        # Using RTS smoother lag-1 cross-covariance formula
        x_pred = filt_out['x_pred']
        P_pred = filt_out['P_pred']
        P_filt = filt_out['P_filt']

        # Simplified M-step: closed-form updates for R and Q
        # M-step for R: R = (1/T) * sum_t [(y_t - H x_t^s)(y_t - H x_t^s)^T + H P_t^s H^T]
        R_new = np.zeros((m, m))
        for t in range(T):
            resid = observations[t] - H @ x_s[t]
            R_new += np.outer(resid, resid) + H @ P_s[t] @ H.T
        R_new /= T

        # M-step for Q: requires lag-1 smoother cross-covariance (simplified)
        Q_new = np.zeros((n, n))
        for t in range(1, T):
            # Approximate: use |x_t^s - F x_{t-1}^s|^2 + covariance terms
            diff = x_s[t] - F @ x_s[t-1]
            Q_new += np.outer(diff, diff) + P_s[t] + F @ P_s[t-1] @ F.T
        Q_new /= (T - 1)

        # Regularize to keep positive definite
        Q = Q_new + np.eye(n) * 1e-6
        R = R_new + np.eye(m) * 1e-6

        if verbose and (iteration % 5 == 0 or iteration == n_iterations - 1):
            print(f"Iteration {iteration+1:3d}: log-likelihood = {ll:.4f}")

    return Q, R, log_likelihoods

print("EM Parameter Learning:")
obs = observations.reshape(-1, 1)
F_local = np.array([[1, 1.0], [0, 1.0]])
H_local = np.array([[1, 0]])
Q_learned, R_learned, lls = kalman_em(obs, F_local, H_local, n_iterations=20)
print(f"\nLearned R (measurement noise var): {R_learned[0,0]:.4f} (true: {obs_noise_std**2:.4f})")

Connection to RNNs and LSTMs

The deep learning connection is not superficial - LSTMs are learned non-linear Kalman filters.

Linear SSM → non-linear SSM → RNN

The state space model: $\mathbf{x}_t = \mathbf{F} \mathbf{x}_{t-1} + \mathbf{w}_t$ $\mathbf{y}_t = \mathbf{H} \mathbf{x}_t + \mathbf{v}_t$

Generalizes to a non-linear SSM: $\mathbf{x}_t = f(\mathbf{x}_{t-1}, \mathbf{u}_t) + \mathbf{w}_t$ $\mathbf{y}_t = h(\mathbf{x}_t) + \mathbf{v}_t$

An RNN is exactly this, with $f$ and $h$ parameterized by neural networks: $\mathbf{h}_t = \tanh(\mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{W}_x \mathbf{x}_t + \mathbf{b})$ $\hat{\mathbf{y}}_t = \mathbf{W}_o \mathbf{h}_t + \mathbf{b}_o$

The hidden state $\mathbf{h}_t$ plays the role of $\hat{\mathbf{x}}_{t|t}$ - the filtered state estimate.

The LSTM as a learned Kalman filter

Kalman Filter	LSTM
Prediction: $\hat{\mathbf{x}}_{t\\|t-1} = \mathbf{F}\hat{\mathbf{x}}_{t-1\\|t-1}$	Forget gate: $\mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{x}_t])$
Kalman gain $\mathbf{K}_t$ (how much to trust new data)	Input gate: $\mathbf{i}_t = \sigma(\mathbf{W}_i[\mathbf{h}_{t-1}, \mathbf{x}_t])$
Update: $\hat{\mathbf{x}} \mathrel{+}= \mathbf{K}_t \tilde{\mathbf{y}}_t$	Cell update: $\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$
Covariance $\mathbf{P}_t$ (uncertainty)	No explicit uncertainty (limitation of LSTMs)
Known $\mathbf{F}, \mathbf{H}, \mathbf{Q}, \mathbf{R}$ (or EM-learned)	Learned $\mathbf{W}_f, \mathbf{W}_i, \mathbf{W}_o$ (gradient descent)

Key insight: LSTM gates learn to imitate Kalman gain computation. The forget gate decides how much of the past state to retain; the input gate decides how much new information to incorporate - exactly what $\mathbf{K}_t$ does in the Kalman filter.

Where Kalman filter wins over LSTM:

Linear Gaussian systems: Kalman is provably optimal; LSTM approximates it
Small data: Kalman requires no training; LSTM needs thousands of examples
Interpretable uncertainty: Kalman provides exact covariance; LSTM doesn't

Where LSTM wins:

Non-linear dynamics
Unknown system equations (no $\mathbf{F}$ or $\mathbf{H}$ )
Complex distributions (non-Gaussian noise)

import torch
import torch.nn as nn

class KalmanInspiredLSTM(nn.Module):
    """
    LSTM with explicit Kalman-filter-inspired structure.
    Shows the conceptual connection between the two.
    """

    def __init__(self, input_size: int, state_size: int, output_size: int):
        super().__init__()
        self.state_size = state_size

        # Learned state transition (analogous to F)
        self.state_transition = nn.Linear(state_size, state_size, bias=False)

        # Learned observation matrix (analogous to H)
        self.observation = nn.Linear(state_size, output_size)

        # Learned Kalman gain network (input → gain)
        self.gain_net = nn.Sequential(
            nn.Linear(input_size + state_size, state_size),
            nn.Sigmoid()  # output in [0, 1] - how much to update state
        )

    def forward(self, x: torch.Tensor, h: torch.Tensor = None) -> tuple:
        """
        x: (batch, seq_len, input_size)
        Returns: outputs (batch, seq_len, output_size), final state
        """
        batch, seq_len, _ = x.shape
        if h is None:
            h = torch.zeros(batch, self.state_size)

        outputs = []
        for t in range(seq_len):
            x_t = x[:, t, :]  # (batch, input_size)

            # Prediction (analogous to Kalman predict step)
            h_pred = torch.tanh(self.state_transition(h))

            # Compute "Kalman gain" from current input and predicted state
            gain = self.gain_net(torch.cat([x_t, h_pred], dim=-1))

            # Compute innovation (difference between input and expected)
            # In full implementation: innovation = y_t - H @ x_pred
            # Simplified: use a learned innovation network
            innovation = x_t - self.observation(h_pred).detach()  # simplified

            # Update (analogous to Kalman update step)
            h = h_pred + gain * torch.tanh(nn.functional.linear(
                torch.cat([x_t, h_pred], dim=-1),
                torch.randn(self.state_size, x_t.shape[-1] + self.state_size)
                # In practice: a proper learned update network
            ))

            y_t = self.observation(h)
            outputs.append(y_t)

        return torch.stack(outputs, dim=1), h

Extended Kalman Filter (EKF): Non-Linear Systems

For non-linear state transitions $f$ and observations $h$ , linearize via Taylor expansion:

$\mathbf{x}_t \approx f(\hat{\mathbf{x}}_{t-1|t-1}) + \mathbf{J}_f (\mathbf{x}_{t-1} - \hat{\mathbf{x}}_{t-1|t-1}) + \mathbf{w}_t$

where $\mathbf{J}_f = \partial f / \partial \mathbf{x}|_{\hat{\mathbf{x}}_{t-1|t-1}}$ is the Jacobian.

Replace $\mathbf{F}_t \leftarrow \mathbf{J}_f$ and $\mathbf{H}_t \leftarrow \mathbf{J}_h$ in the Kalman equations. This is the Extended Kalman Filter.

Applications: robot SLAM (Simultaneous Localization and Mapping), non-linear econometric models, GPS with non-linear measurement equations (range vs position).

Limitation: EKF linearizes locally - poor for highly non-linear systems (use Unscented KF or particle filters instead).

Structural Time Series Models

State space models provide a principled framework for decomposing time series:

$y_t = \mu_t + \gamma_t + \delta_t + \epsilon_t$

where:

$\mu_t$ : trend (local level or local linear trend)
$\gamma_t$ : seasonal component
$\delta_t$ : cyclical component
$\epsilon_t \sim \mathcal{N}(0, \sigma^2_\epsilon)$ : irregular

Each component has its own state space representation; combine them into a single SSM and use the Kalman filter for inference.

from statsmodels.tsa.statespace.structural import UnobservedComponents

def fit_structural_time_series(series: pd.Series) -> dict:
    """
    Fit a structural time series model using statsmodels.
    Decomposes into trend + seasonal + irregular.

    Under the hood: state space model + Kalman filter for inference.
    """
    # Local linear trend + stochastic seasonal
    model = UnobservedComponents(
        series,
        level='local linear trend',  # trend: random walk in level and slope
        seasonal=12,                  # 12-period seasonal (monthly)
        stochastic_seasonal=True      # seasonal component can evolve
    )

    result = model.fit(disp=False)

    # Extract components
    components = result.get_prediction()
    component_means = components.predicted_mean

    print(f"Structural TSM Log-likelihood: {result.llf:.2f}")
    print(f"AIC: {result.aic:.2f}")
    print(f"\nNoise parameters:")
    print(result.params.to_string())

    return {
        'result': result,
        'trend': result.trend,
        'seasonal': result.seasonal,
        'irregular': result.irregular
    }

# Generate seasonal series for structural decomposition
np.random.seed(0)
n = 144  # 12 years of monthly data
t = np.arange(n)
y_struct = (
    200 + 2*t +                                           # trend
    30*np.sin(2*np.pi*t/12) + 15*np.sin(2*np.pi*2*t/12) # seasonal
    + np.random.normal(0, 10, n)                          # irregular
)
series_struct = pd.Series(y_struct,
    index=pd.date_range('2010-01', periods=n, freq='ME'))

structural_result = fit_structural_time_series(series_struct)

Interview Questions

Q1: Explain the Kalman filter in one minute. What does it compute and why is it optimal?

The Kalman filter is a recursive algorithm that estimates a hidden state $\mathbf{x}_t$ from noisy observations $\mathbf{y}_{1:t}$ .

It alternates between two steps:

Predict: Use the state transition model to predict $\hat{\mathbf{x}}_{t|t-1}$ and its uncertainty $\mathbf{P}_{t|t-1}$ from the previous estimate.
Update: When a new observation arrives, compute the Kalman gain $\mathbf{K}_t$ - a matrix that weights how much to trust the measurement vs the prediction - and update: $\hat{\mathbf{x}}_{t|t} = \hat{\mathbf{x}}_{t|t-1} + \mathbf{K}_t (y_t - \mathbf{H}\hat{\mathbf{x}}_{t|t-1})$ .

Why optimal? For linear Gaussian systems, the Kalman filter computes the exact conditional distribution $p(\mathbf{x}_t | \mathbf{y}_{1:t}) = \mathcal{N}(\hat{\mathbf{x}}_{t|t}, \mathbf{P}_{t|t})$ . Among all estimators, it minimizes mean squared error - it is the MMSE (minimum mean squared error) estimator. This follows from the fact that for Gaussian distributions, the conditional mean is the MMSE estimator.

Q2: What is the Kalman gain and what do extreme values (near 0 or near 1) mean?

The Kalman gain is: $\mathbf{K}_t = \mathbf{P}_{t|t-1} \mathbf{H}^\top (\mathbf{H}\mathbf{P}_{t|t-1}\mathbf{H}^\top + \mathbf{R})^{-1}$

It determines the trade-off between the model prediction and the measurement.

$\mathbf{K}_t \to 0$ (small gain): Ignore the measurement. This happens when:

$\mathbf{R} \to \infty$ : measurement noise is very high (GPS in a tunnel)
$\mathbf{P}_{t|t-1} \to 0$ : the model prediction is highly confident Result: posterior $\approx$ prior prediction

$\mathbf{K}_t \to \mathbf{H}^{-1}$ (large gain): Trust the measurement fully. This happens when:

$\mathbf{R} \to 0$ : measurement is very accurate (RTK GPS)
$\mathbf{P}_{t|t-1} \to \infty$ : the model prediction is very uncertain Result: posterior $\approx$ scaled measurement

Intuition: Kalman gain balances two sources of information. If your GPS is precise and your motion model is uncertain, trust GPS. If your GPS is noisy and your model is well-calibrated, trust the model. The Kalman filter finds the statistically optimal balance.

Q3: How is an LSTM related to a Kalman filter?

Both are sequential state estimators: given a new observation, update a hidden state.

Structural parallels:

The LSTM cell state $\mathbf{c}_t$ corresponds to the Kalman state estimate $\hat{\mathbf{x}}_{t|t}$
The LSTM forget gate $\mathbf{f}_t = \sigma(\cdot)$ corresponds to how much of the prior state survives - in the Kalman filter, this is determined by the transition matrix $\mathbf{F}$ and the Kalman gain
The LSTM input gate $\mathbf{i}_t$ controls how much new information enters - Kalman's $\mathbf{K}_t$ does the same
Both compute a weighted combination of prediction and new observation

Key differences:

Kalman filter: linear dynamics, Gaussian noise, known model equations → provably optimal
LSTM: non-linear dynamics, learned from data, no distributional assumptions → flexible but not provably optimal
Uncertainty: Kalman tracks uncertainty explicitly via covariance $\mathbf{P}_t$ ; LSTM has no built-in uncertainty quantification
Data requirements: Kalman requires model specification; LSTM requires training data

Practical implication: When you have a known dynamical system (physics-based model), use Kalman. When the dynamics are unknown and you have data, use LSTM. Hybrid approaches (differentiable Kalman filters, Kalman-LSTM) learn dynamics while maintaining uncertainty tracking.

Q4: What is the difference between Kalman filtering and Kalman smoothing?

Kalman filtering is causal: at time $t$ , compute $p(\mathbf{x}_t | \mathbf{y}_{1:t})$ - only past and current observations are used. This is the appropriate algorithm for real-time systems (the filter cannot see the future).

Kalman smoothing (specifically the Rauch-Tung-Striebel smoother) is non-causal: at time $t$ , compute $p(\mathbf{x}_t | \mathbf{y}_{1:T})$ - all observations including future ones are used. This gives better estimates when you have the full time series available.

When to use each:

Filtering: real-time tracking (autonomous vehicles, robot control), streaming forecasting
Smoothing: offline analysis (EEG signal processing, time series decomposition, retrospective analysis)

Why smoother is better?: At time $t$ , future observations $\mathbf{y}_{t+1:T}$ carry information about $\mathbf{x}_t$ through the transition model. The smoother uses this information via a backward pass (RTS algorithm), giving lower RMSE than the forward-only filter.

In terms of ML analogies: filtering ≈ autoregressive model (left-to-right); smoothing ≈ bidirectional model (uses both past and future context, like BERT vs GPT).

Q5: How would you use a Kalman filter for anomaly detection?

The Kalman filter provides a natural anomaly detection framework via the innovation sequence $\tilde{\mathbf{y}}_t = \mathbf{y}_t - \mathbf{H}\hat{\mathbf{x}}_{t|t-1}$ .

Under the nominal model, innovations should be i.i.d. Gaussian with zero mean and covariance $\mathbf{S}_t = \mathbf{H}\mathbf{P}_{t|t-1}\mathbf{H}^\top + \mathbf{R}$ .

Anomaly score: The normalized innovation squared (NIS): $\text{NIS}_t = \tilde{\mathbf{y}}_t^\top \mathbf{S}_t^{-1} \tilde{\mathbf{y}}_t \sim \chi^2(m)$ under the null hypothesis (no anomaly), where $m$ is the observation dimension.

Algorithm:

Fit Kalman filter on normal training data (learn Q, R via EM)
At test time, compute NIS at each step
Flag $t$ as anomaly if $\text{NIS}_t > \chi^2_{m, 0.99}$ (99th percentile)

Advantages over threshold-based methods:

Automatically adapts to the model's current uncertainty (if the model is uncertain, the threshold widens)
Distinguishes "large value" from "unexpected value given model" - a large value from a known trend is normal; a deviation from the trend is anomalous
Works well for sensor data with known dynamics (temperature, pressure, motor vibration)

This approach is used in spacecraft telemetry monitoring, industrial predictive maintenance, and financial trade surveillance.

Key Takeaways

The state space model separates dynamics (transition equation) from observations (measurement equation)
The Kalman filter is the optimal linear estimator: predict → compute Kalman gain → update
Kalman gain $\mathbf{K}_t$ balances model prediction vs measurement, inversely proportional to measurement noise
RTS smoother runs the Kalman filter forward, then backward - uses all data for each estimate (not real-time)
LSTMs are learned non-linear Kalman filters: forget gate ≈ forgetting old state, input gate ≈ Kalman gain
Structural time series models (trend + seasonal + irregular) use the Kalman filter internally for optimal decomposition
For non-linear systems: Extended Kalman Filter (linearize Jacobians), Unscented KF (sigma points), or particle filters

Next: Lesson 06: Cointegration and Granger Causality →

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Kalman Filter demo on the EngineersOfAI Playground - no code required.

:::

The Real Interview Moment​

State Space Representation​

ARIMA as a state space model​

Local level model (random walk + noise)​

The Kalman Filter: Optimal Linear Estimation​

Prediction step (time update)​

Update step (measurement update)​

Interpretation of Kalman gain​

Python Implementation: Full Kalman Filter​

Example 1: 1D Noisy Signal Tracking​

Example 2: Sensor Fusion​

Example 3: EM Algorithm for Parameter Learning​

Connection to RNNs and LSTMs​

Linear SSM → non-linear SSM → RNN​

The LSTM as a learned Kalman filter​

Extended Kalman Filter (EKF): Non-Linear Systems​

Structural Time Series Models​

Interview Questions​

Key Takeaways​