What is state space models?

How control theory's state space models became a competitive sequence modeling architecture - continuous-time SSMs, the S4 paper, HiPPO initialization, and the convolutional/recurrent duality.

How does S4 work in practice?

State Space Model Foundations covers state space models, S4, HiPPO from first principles with code examples. Free lesson at https://engineersofai.com/docs/llms/state-space-models/ssm-foundations

What is the difference between state space models and HiPPO?

See the full breakdown at https://engineersofai.com/docs/llms/state-space-models/ssm-foundations

State Space Model Foundations

Opening Scenario: The Engineer Who Read a Control Theory Textbook

Albert Gu was a PhD student at Stanford in 2020, working on sequence modeling. The standard approach - transformers with attention - had the quadratic scaling problem everyone knew about but worked around. Gu's advisor suggested he look at state space models, a mathematical framework developed by Rudolf Kalman in 1960 for modeling physical dynamical systems like spacecraft trajectories and electrical circuits.

The connection was not obvious. Kalman developed SSMs to track a satellite's position given noisy sensor readings - continuous differential equations describing how physical state evolves over time. Language modeling is a discrete, high-dimensional prediction problem. The idea that the same formalism could work for predicting the next word in a sentence seemed far-fetched.

But Gu noticed something. The SSM framework had a property that made it powerful for long-range dependencies: with the right initialization, the hidden state could preserve information from arbitrarily far in the past. And crucially, he found a way to make SSMs computationally efficient for training - expressing them as convolutional operations that could be parallelized across the sequence. The result was the S4 paper (Efficiently Modeling Long Sequences with Structured State Spaces, Gu et al., 2021, NeurIPS), one of the most influential machine learning papers of that year.

Understanding S4 is understanding the foundation that Mamba builds on. This lesson walks you through the mathematics from first principles, building up from continuous-time systems to the discrete models used in practice, and explaining every design choice along the way.

Why Control Theory? The Intuition

Before the math, here is the intuition for why control theory's state space framework is a good fit for sequence modeling.

A physical dynamical system - say, a pendulum - has a state (angle and angular velocity), receives inputs (a push), and produces outputs (the observed position). The state evolves continuously over time. The engineer's job is to model this evolution: given the current state and input, what is the next state?

A sequence model has an analogous structure. The "state" is everything the model has learned from the tokens it has processed so far. The "input" is the current token. The "output" is the prediction for the next token. The "evolution" is the computation that updates the state given a new input.

The state space framework captures this cleanly, with well-studied mathematical properties for stability, expressivity, and long-range information preservation. The key question Gu asked was: if we borrow this framework and train it with gradient descent on sequence data, can we build something better than RNNs?

Continuous-Time State Space Models

The foundational SSM is defined by two linear differential equations:

$\dot{x}(t) = Ax(t) + Bu(t)$ $y(t) = Cx(t) + Du(t)$

Where:

$x(t) \in \mathbb{R}^N$ is the hidden state (N is the state dimension)
$u(t) \in \mathbb{R}$ is the input signal (one-dimensional for simplicity)
$y(t) \in \mathbb{R}$ is the output
$A \in \mathbb{R}^{N \times N}$ is the state transition matrix (how the state evolves)
$B \in \mathbb{R}^{N \times 1}$ is the input projection (how input affects state)
$C \in \mathbb{R}^{1 \times N}$ is the output projection (how state maps to output)
$D \in \mathbb{R}$ is the skip connection (often set to 0)

The dot notation $\dot{x}(t)$ means the time derivative of $x$ - the rate of change of the state. The first equation says: the rate of change of the hidden state equals the current state transformed by A, plus the current input projected by B. The second equation says: the output is the current state projected by C, plus a direct skip connection from the input.

This is a linear system. The future state depends linearly on the current state and input. There are no nonlinearities inside the SSM itself (nonlinearities come from surrounding components in the neural network, like SiLU activation functions).

The A Matrix: The Key to Long-Range Dependencies

The A matrix is the most important parameter. It governs how information persists in the hidden state over time. If the eigenvalues of A have large negative real parts, the hidden state decays quickly - information from the past is forgotten rapidly. If eigenvalues are near zero, the state persists longer.

For long-range sequence modeling, you want A to be structured so that information from far in the past can still influence the current state. This is the fundamental challenge: naive random initialization of A leads to either exploding states (unstable) or vanishing memory (forgetting everything).

The HiPPO Framework: Principled A Initialization

HiPPO (High-order Polynomial Projection Operators, Gu et al., 2020) is the mathematical framework that defines how A should be structured to optimally compress historical information. The core insight: design A so that the hidden state x(t) at time t is the optimal polynomial approximation of the input history $u(s)$ for $s \leq t$ .

The HiPPO-LegS matrix (Legendre measure) is:

$A_{nk} = -\begin{cases} (2n+1)^{1/2}(2k+1)^{1/2} & \text{if } n > k \\ n+1 & \text{if } n = k \\ 0 & \text{if } n < k \end{cases}$

This is a lower-triangular matrix with a specific structure. The remarkable property: with this initialization, the hidden state of the SSM at time t contains coefficients for an optimal polynomial approximation of the entire input history. Information is not simply decaying - it is being optimally compressed.

You do not need to memorize the HiPPO matrix formula. The key intuition is: the structure of A determines what the hidden state "remembers," and HiPPO gives us a theoretically grounded way to initialize A so the state retains long-range information.

Discretization: From Continuous to Discrete

Neural networks work with discrete sequences (tokens at integer positions), not continuous signals. We need to convert the continuous-time SSM to a discrete recurrence that processes one element at a time.

The most common discretization method is the Zero-Order Hold (ZOH). Given a step size $\Delta$ (the "timescale" parameter), we compute:

$\bar{A} = e^{\Delta A}$ $\bar{B} = (\Delta A)^{-1}(e^{\Delta A} - I) \cdot \Delta B$

The discrete recurrence is then:

$x_k = \bar{A} x_{k-1} + \bar{B} u_k$ $y_k = C x_k + D u_k$

This is now a standard linear recurrence: the new state is a linear function of the previous state plus the current input. The matrices $\bar{A}$ and $\bar{B}$ are learned during training (via their continuous-time counterparts A, B, and the timescale $\Delta$ ).

:::note Why Delta Matters The timescale parameter $\Delta$ controls how fast the discrete model "moves" through continuous time. Small $\Delta$ means each discrete step corresponds to a small increment - the model responds quickly to new inputs. Large $\Delta$ means each step corresponds to a large increment - the model integrates information over longer timescales. In S4 and its successors, $\Delta$ is a learnable parameter, allowing the model to adapt its memory timescale to the data. :::

The Convolutional View: Why SSMs Can Train in Parallel

Here is the key mathematical insight that makes SSMs practical for training. Starting from the discrete recurrence:

$x_k = \bar{A} x_{k-1} + \bar{B} u_k$ $y_k = C x_k$

If we assume zero initial state $x_0 = 0$ , we can unroll the recurrence:

$x_1 = \bar{B} u_1$ $x_2 = \bar{A} \bar{B} u_1 + \bar{B} u_2$ $x_k = \sum_{j=1}^{k} \bar{A}^{k-j} \bar{B} u_j$

The output at position k is:

$y_k = C x_k = \sum_{j=1}^{k} C \bar{A}^{k-j} \bar{B} u_j$

This is a convolution! The output sequence $y$ is the convolution of the input sequence $u$ with the kernel $K$ :

$K = (C\bar{B}, \ C\bar{A}\bar{B}, \ C\bar{A}^2\bar{B}, \ \ldots, \ C\bar{A}^{L-1}\bar{B})$

$y = K * u$

Where $*$ denotes convolution and $L$ is the sequence length. Convolutions can be computed extremely efficiently using the Fast Fourier Transform (FFT), with $O(L \log L)$ complexity - much faster than the $O(L^2)$ that would be needed to compute each position sequentially.

This duality - the same mathematical object computed two ways - is the breakthrough. During training, we use the convolutional view for fast parallel computation. During inference, we use the recurrent view for memory-efficient token-by-token generation.

The S4 Paper: Structured State Spaces

The Efficiently Modeling Long Sequences with Structured State Spaces (S4) paper (Gu, Goel, Re, 2021) introduced the first SSM that was competitive with transformers on sequence modeling tasks.

The S4 paper's core contributions:

1. The HIPPO initialization: Using HiPPO matrices for A ensures long-range memory. This was the key missing piece compared to earlier SSM attempts.

2. Structured parameterization (DPLR): Computing the SSM kernel $K$ requires computing powers of A ( $A, A^2, A^3, \ldots, A^L$ ). For a general N×N matrix, this is expensive. S4 restricts A to the "Diagonal Plus Low Rank" (DPLR) structure, which allows computing the kernel efficiently via the Cauchy kernel - reducing the cost from O(LN²) to O(L log L · N).

3. Empirical results: S4 achieved state-of-the-art on the Long Range Arena benchmark, including tasks with sequences up to 16,000 elements where transformers fail. S4 with 93.6% accuracy on the Path-X task (a 128×128 image sequence task requiring integrating information across 16K steps) compared to 50% for transformers at the time.

The Long Range Arena Benchmark

LRA (Tay et al., 2021) is a benchmark suite designed specifically to stress-test long-range sequence models. Tasks include:

ListOps (2K tokens): Mathematical operations in a hierarchical list
Text (4K tokens): Document sentiment classification
Retrieval (4K tokens): Matching document pairs
Image (1K tokens): Image classification with pixels as a 1D sequence
Pathfinder (1K tokens): Determining if two points are connected by a path in a noisy image
Path-X (16K tokens): Hard version of Pathfinder at higher resolution

Standard transformers score near 50% (random chance) on Path-X because the relevant information is spread across 16K steps - far beyond what attention can practically handle. S4 scored 88.1%, demonstrating genuine long-range modeling capability.

Implementing an SSM from Scratch

Here is a minimal PyTorch implementation of a discrete SSM layer, building intuition for the computation:

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np


def make_hippo_matrix(N: int) -> np.ndarray:
    """
    Construct the HiPPO-LegS matrix for optimal polynomial memory.

    This is the A matrix that ensures the SSM hidden state
    maintains an optimal approximation of input history.
    """
    P = np.sqrt(1 + 2 * np.arange(N))  # shape [N]
    A = P[:, None] * P[None, :]         # outer product, shape [N, N]
    A = np.tril(A) - np.diag(np.arange(N))
    return -A


def discretize_zoh(A: np.ndarray, B: np.ndarray, delta: float):
    """
    Zero-Order Hold discretization of continuous SSM.

    Converts continuous (A, B) to discrete (Ā, B̄) using step size delta.
    """
    N = A.shape[0]
    I = np.eye(N)

    # Matrix exponential for Ā
    A_bar = np.linalg.matrix_power(
        np.eye(N) + delta * A / 2,  # Approximation: e^(Δ*A) ≈ (I + ΔA/2)
        1
    )
    # More accurate: A_bar = scipy.linalg.expm(delta * A)

    # B̄ = Δ * B (simplified approximation)
    B_bar = delta * B

    return A_bar, B_bar


class SimpleSSMLayer(nn.Module):
    """
    A minimal SSM layer demonstrating the core computation.

    This is a simplified version - not the full S4 implementation,
    but illustrative of the key ideas.

    Can run in:
    - recurrent mode: process one token at a time (inference)
    - convolutional mode: process full sequence in parallel (training)
    """
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.d_model = d_model
        self.d_state = d_state

        # Initialize A with HiPPO structure
        A = make_hippo_matrix(d_state)
        self.register_buffer('A_hippo', torch.tensor(A, dtype=torch.float32))

        # Learnable parameters
        # A_log: we parameterize A indirectly for numerical stability
        self.A_log = nn.Parameter(
            torch.log(torch.abs(torch.tensor(A, dtype=torch.float32)) + 1e-6)
        )
        self.B = nn.Parameter(torch.randn(d_state, 1) * 0.01)
        self.C = nn.Parameter(torch.randn(1, d_state) * 0.01)
        self.D = nn.Parameter(torch.ones(1))

        # Input/output projections (for multi-dimensional input)
        self.in_proj = nn.Linear(d_model, 1)
        self.out_proj = nn.Linear(1, d_model)

        # Step size (learnable timescale)
        self.log_delta = nn.Parameter(torch.zeros(1))

    def get_discrete_params(self):
        """Discretize continuous parameters."""
        delta = torch.exp(self.log_delta).clamp(1e-4, 0.1)
        # Simplified discretization
        A_bar = torch.exp(-delta * torch.exp(self.A_log))  # diagonal approx
        B_bar = delta * self.B
        return A_bar, B_bar

    def compute_ssm_kernel(self, L: int) -> torch.Tensor:
        """
        Compute the SSM convolution kernel of length L.
        K[i] = C @ Ā^i @ B̄
        """
        A_bar, B_bar = self.get_discrete_params()

        # For a diagonal approximation, powers are element-wise
        # A_bar^i = A_bar ** i

        kernel = []
        A_power = torch.ones_like(A_bar)  # A^0 = I (diagonal: all 1s)

        for i in range(L):
            # K[i] = C * A_bar^i * B_bar (element-wise for diagonal A)
            k_i = (self.C * A_power * B_bar.T).sum()
            kernel.append(k_i)
            A_power = A_power * A_bar

        return torch.stack(kernel)  # shape [L]

    def forward_convolutional(self, u: torch.Tensor) -> torch.Tensor:
        """
        Training mode: compute output using convolution (parallel over sequence).

        u: shape [batch, seq_len, d_model]
        """
        batch, L, _ = u.shape

        # Project to 1D for this simplified example
        u_1d = self.in_proj(u).squeeze(-1)  # [batch, L]

        # Compute kernel
        K = self.compute_ssm_kernel(L)  # [L]

        # Convolve using FFT
        # For 1D convolution: y = ifft(fft(u) * fft(K))
        u_fft = torch.fft.rfft(u_1d, n=2*L)
        K_fft = torch.fft.rfft(K, n=2*L)
        y_1d = torch.fft.irfft(u_fft * K_fft, n=2*L)[:, :L]  # [batch, L]

        # Add skip connection
        y_1d = y_1d + self.D * u_1d

        # Project back to d_model
        y = self.out_proj(y_1d.unsqueeze(-1))  # [batch, L, d_model]
        return y

    def forward_recurrent(self, u: torch.Tensor, h: torch.Tensor) -> tuple:
        """
        Inference mode: process one token with recurrent update.

        u: shape [batch, d_model] - single token
        h: shape [batch, d_state] - current hidden state
        Returns: (output, new_hidden_state)
        """
        A_bar, B_bar = self.get_discrete_params()

        # Project input to 1D
        u_1d = self.in_proj(u)  # [batch, 1]

        # Recurrent update: h_new = Ā * h + B̄ * u
        h_new = h * A_bar.squeeze() + (B_bar * u_1d).squeeze(-1)

        # Output: y = C * h + D * u
        y_1d = (self.C @ h_new.T).T + self.D * u_1d

        # Project back
        y = self.out_proj(y_1d)  # [batch, d_model]

        return y, h_new

    def forward(self, u: torch.Tensor, mode: str = 'conv') -> torch.Tensor:
        if mode == 'conv':
            return self.forward_convolutional(u)
        else:
            # Recurrent mode for inference (loop over sequence)
            batch, L, _ = u.shape
            h = torch.zeros(batch, self.d_state, device=u.device)
            outputs = []
            for t in range(L):
                y_t, h = self.forward_recurrent(u[:, t, :], h)
                outputs.append(y_t)
            return torch.stack(outputs, dim=1)


# Test both modes produce similar outputs
if __name__ == "__main__":
    torch.manual_seed(42)
    model = SimpleSSMLayer(d_model=64, d_state=16)

    batch, seq_len = 2, 128
    x = torch.randn(batch, seq_len, 64)

    y_conv = model(x, mode='conv')
    y_rec = model(x, mode='recurrent')

    # Should be approximately equal (differences due to simplified discretization)
    print(f"Convolutional output shape: {y_conv.shape}")
    print(f"Recurrent output shape: {y_rec.shape}")
    print(f"Max difference: {(y_conv - y_rec).abs().max().item():.4f}")

S4's Successors: S4D, DSS, and Simplifications

After S4, several papers found simpler parameterizations that achieved similar or better performance:

DSS: Diagonal State Spaces

Diagonal State Spaces (DSS) (Gupta et al., 2022) showed that restricting A to be purely diagonal (not diagonal plus low rank) achieved competitive performance with S4, at significantly lower computational cost. The full DPLR structure of S4 was shown to be largely unnecessary.

S4D: Diagonal Structured State Space

S4D (Gu et al., 2022) formalized the diagonal simplification, showing that a diagonal A with careful initialization (derived from the HiPPO theory) matches S4's performance. S4D is simpler to implement and faster to compute.

The simplified discrete recurrence for S4D (diagonal A means element-wise operations):

$h_k = \bar{a} \odot h_{k-1} + \bar{b} u_k$ $y_k = \text{Re}(c^* \odot h_k)$

Where $\odot$ is element-wise multiplication, $\bar{a} \in \mathbb{C}^N$ and $\bar{b} \in \mathbb{C}^N$ are complex-valued parameters (the diagonal of $\bar{A}$ and $\bar{B}$ ), and $c^* \in \mathbb{C}^N$ is the conjugate of C.

The use of complex numbers is important: complex exponentials ( $e^{i\omega t}$ ) can represent oscillatory memory, capturing periodic patterns in sequences that purely real-valued states would struggle with.

S5: Parallel Scans for Efficiency

S5 (Smith et al., 2022) introduced the parallel scan algorithm (also called the prefix scan or scan operation) for SSMs. Rather than computing the recurrence sequentially or converting to convolution, the parallel scan computes all recurrent steps in $O(\log L)$ parallel steps using a tree reduction. This is computationally equivalent to the FFT-based approach but more numerically stable and easier to implement on modern hardware.

The parallel scan insight became foundational for Mamba's training algorithm.

The S4 Block: Building a Sequence Model

A single SSM maps a 1D input sequence to a 1D output sequence. For high-dimensional sequence modeling (like language), we need multiple channels working in parallel. The S4 block stacks:

Each of the d_model channels runs its own independent SSM with shared structural initialization but separate learned parameters. The channel mixing step (a position-wise MLP or linear layer) allows information to flow between channels.

The Key Insight: What SSMs Cannot Do (Yet)

Understanding S4's success requires understanding its fundamental limitation relative to transformers: the SSM parameters (A, B, C, delta) do not depend on the input.

This means:

The state transition matrix A is fixed once trained
The model applies the same "memory compression" strategy regardless of what it's reading
It cannot decide "this is an important piece of information, I should remember it more carefully"

A transformer with attention can do exactly this: the attention weights are computed dynamically from the actual content of the tokens, so the model can focus on what matters for the current prediction.

The S4 SSM is like a note-taker who always writes in the same format regardless of the meeting's content. Sometimes this works well (the format is general-purpose enough). Sometimes it misses critical details. Mamba, the subject of Lesson 3, fixes exactly this limitation - and the fix requires a fundamentally different design.

Production Notes

Training SSMs in practice: The convolutional mode requires computing the SSM kernel, which involves the matrix exponential and (for S4) Cauchy kernel computations. In practice, you would use optimized implementations from libraries like state-spaces/mamba rather than implementing from scratch.

Numerical precision: SSMs with complex eigenvalues (like S4D) require careful handling of complex arithmetic. The real parts of the complex state are taken as the output. In float16 training, small imaginary components can accumulate floating point errors. Some implementations keep the SSM computation in float32 even when the rest of the model runs in float16.

State space dimension N: Typical values are 16–64. Larger N captures more complex patterns but increases compute. The paper found N=64 sufficient for most tasks.

# Quick benchmark: SSM vs attention memory usage during inference
import torch

def benchmark_inference_memory():
    """Compare memory growth: SSM hidden state vs transformer KV cache."""

    # Model configuration (7B-scale)
    n_layers = 32
    d_model = 4096
    d_state = 64       # SSM state dimension
    n_heads = 32
    head_dim = 128
    dtype = torch.float16
    bytes_per_element = 2

    print("Memory at inference (per batch item):\n")
    print(f"{'Seq Len':>12} | {'Transformer KV (GB)':>22} | {'SSM State (MB)':>18}")
    print("-" * 60)

    for seq_len in [1_000, 10_000, 50_000, 100_000, 500_000]:
        # Transformer KV cache grows linearly with seq_len
        kv_bytes = n_layers * 2 * n_heads * seq_len * head_dim * bytes_per_element
        kv_gb = kv_bytes / 1e9

        # SSM state is constant regardless of seq_len
        ssm_bytes = n_layers * d_model * d_state * bytes_per_element
        ssm_mb = ssm_bytes / 1e6

        print(f"{seq_len:>12,} | {kv_gb:>22.2f} | {ssm_mb:>18.2f}")

benchmark_inference_memory()

Common Mistakes

:::danger Confusing the Convolutional and Recurrent Modes The same SSM computes identical outputs in convolutional and recurrent modes - they are two computations of the same function. The convolutional mode is used during training for efficiency. The recurrent mode is used during inference for memory efficiency. A common implementation bug is using the convolutional mode during inference, which recomputes the full output from scratch for each new token rather than updating the state incrementally. :::

:::warning The A Matrix Must Be Structured Initializing A randomly leads to training instability. The eigenvalues of A determine long-range memory: too negative and the model forgets immediately; too near zero and the gradients explode during training. The HiPPO initialization (or learned diagonal with careful initialization) is essential. The failure mode - training loss that doesn't decrease, or NaN gradients - is often caused by a poorly conditioned A matrix. :::

:::warning SSMs Are Not Recurrent in the RNN Sense Unlike LSTMs and GRUs, the SSM recurrence $x_k = \bar{A} x_{k-1} + \bar{B} u_k$ is linear - there is no nonlinearity inside the recurrence. This makes it trainable (linear systems don't suffer from vanishing/exploding gradients in the same way), but it also limits expressivity. The nonlinearities that make SSMs powerful come from surrounding components: the input/output projections, activation functions between layers, and (in Mamba) the selective gating mechanism. :::

Interview Q&A

Q1: What is the S4 paper, and what problem does it solve that earlier SSMs could not?

S4 (Efficiently Modeling Long Sequences with Structured State Spaces, Gu, Goel, Re, NeurIPS 2021) showed that SSMs could match or beat transformers on long-range sequence tasks, particularly those requiring information retention over thousands of steps. Earlier SSMs and related approaches (LRU, HiPPO) established the theoretical framework but were not computationally efficient at scale. S4's key contributions were: (1) the HiPPO initialization for matrix A, ensuring long-range memory; (2) the Diagonal Plus Low Rank (DPLR) structure for A, enabling efficient kernel computation via Cauchy matrices; and (3) the insight that the SSM can be computed as a convolution during training, enabling parallel computation over sequences.

Q2: Explain the convolutional/recurrent duality of SSMs and why it matters for practical deployment.

An SSM defines a linear recurrence $x_k = \bar{A}x_{k-1} + \bar{B}u_k$ , $y_k = Cx_k$ . When unrolled from zero initial state, the output sequence is the convolution of the input sequence with the kernel $K = (C\bar{B}, C\bar{A}\bar{B}, C\bar{A}^2\bar{B}, \ldots)$ . This means during training, the entire output sequence can be computed in parallel using FFT-based convolution in $O(L \log L)$ operations - as efficient as a transformer's forward pass. During inference, we use the recurrent form: process one token at a time with a constant-size hidden state. This gives O(1) memory at inference regardless of sequence length, unlike transformers where the KV cache grows linearly.

Q3: What is the HiPPO initialization and why is it necessary for long-range dependencies?

HiPPO (High-order Polynomial Projection Operators) defines A matrices with the property that the resulting hidden state $x(t)$ is the optimal polynomial approximation of the entire input history up to time t. Intuitively, HiPPO structures A so that the hidden state compresses past information in a way that minimizes information loss. Without this structure, a randomly initialized A either has eigenvalues with large negative real parts (fast forgetting) or eigenvalues near the unit circle (long memory but unstable training). HiPPO finds the sweet spot: structured matrices that are both stable during training and capable of long-range memory.

Q4: How does S4 achieve O(L log L) training complexity, and what simplifications do S4D and DSS make?

S4's O(L log L) training comes from: (1) expressing the output as a convolution $y = K * u$ , and (2) computing the convolution via FFT in O(L log L). The bottleneck is computing the kernel K, which requires computing $C\bar{A}^k\bar{B}$ for k=0 to L-1. For a general matrix, this requires O(LN²) work, but S4's DPLR structure allows the kernel to be computed as a Cauchy kernel evaluation in O(L log L · N). S4D simplifies by using a purely diagonal A (no low-rank correction), which makes kernel computation straightforward element-wise operations. DSS showed empirically that diagonal structure matches DPLR performance, removing the mathematical complexity of S4 at the cost of a slightly different initialization.

Q5: What is the fundamental expressivity limitation of S4 compared to transformers?

S4's parameters (A, B, C, Delta) are fixed after training - they do not depend on the input content. The model applies the same state transition regardless of what it is reading. This means it cannot selectively allocate memory to important information: it compresses everything according to the same learned strategy. A transformer's attention weights are computed dynamically from the actual query and key contents, letting the model focus on what is relevant for the current prediction. This matters for tasks requiring precise retrieval ("what did the document say about X?") or context-dependent reasoning ("does the word 'bank' here mean a financial institution or a river bank?"). Mamba (Lesson 3) addresses this by making the SSM parameters input-dependent - the key architectural innovation of selective state spaces.

Understanding SSM Expressivity: What Can Linear Recurrences Model?

S4's linear recurrence $x_k = \bar{A}x_{k-1} + \bar{B}u_k$ can express certain computational patterns efficiently and others not at all. Understanding these limits clarifies when SSMs work and when they don't.

What SSMs can express well:

Linear time-invariant systems: Any system where the "rule" for updating state doesn't change with time or input
Frequency analysis: Complex eigenvalues encode oscillatory patterns - SSMs are excellent at modeling periodic signals (audio, physiological signals)
Exponential memory decay: The $e^{\Delta A}$ term naturally encodes geometric forgetting - useful for signals where recent information is more relevant
Polynomial approximation of history: Via HiPPO, the state is an optimal polynomial fit to recent history - excellent for compression

What SSMs cannot express:

Content-dependent operations: "Pay extra attention to this token because it contains the keyword 'IMPORTANT'" - requires the state update to depend on the token's content, not just its position
Exact copying: Reproducing a specific sequence from earlier in the input requires the exact values to survive the compression - challenging for long distances
Associative lookup: Finding the value associated with a specific key ("What did document say was the invoice number?") requires selective retrieval that fixed-parameter SSMs struggle with

This expressivity analysis is precisely why Mamba's input-dependent parameters are such a significant advancement - they address the exact weaknesses of fixed-parameter SSMs.

The LSSL and Earlier Precursors

Before S4, several other works attempted to use state space formulations for deep learning:

Linear State-Space Layers (LSSL) (Gu et al., 2021, earlier work): The paper that introduced using S4-like computations in neural networks. Less efficient than S4 due to not having the DPLR structure, but established the core idea.

Legendre Memory Units (LMU) (Voelker et al., 2019): A recurrent neural network cell based on HiPPO-like polynomials. Showed that principled memory initialization gives strong long-range performance, predating S4.

Neural ODEs and CDE: Continuous-time neural networks as ordinary differential equations. SSMs can be seen as a special case with linear dynamics.

The S4 paper brought together HiPPO initialization, efficient kernel computation, and the convolutional/recurrent duality into a cohesive, practical architecture. The predecessors were the necessary stepping stones.

Benchmarking S4 vs Transformers on Long Range Arena

The Long Range Arena (LRA) results from the S4 paper are the definitive demonstration of SSMs' long-range capability:

# LRA benchmark results (from S4 paper, 2021)
lra_results = {
    "ListOps": {
        "seq_len": 2048,
        "Transformer": 36.4,
        "Longformer": 35.7,
        "S4": 59.6,
        "task": "Mathematical list operations, hierarchical structure"
    },
    "Text": {
        "seq_len": 4096,
        "Transformer": 64.3,
        "Longformer": 62.9,
        "S4": 86.8,
        "task": "Character-level text classification (IMDb)"
    },
    "Retrieval": {
        "seq_len": 4000,
        "Transformer": 57.5,
        "Longformer": 56.9,
        "S4": 90.9,
        "task": "Document pair matching, requires integrating both docs"
    },
    "Image": {
        "seq_len": 1024,
        "Transformer": 42.2,
        "Longformer": 42.2,
        "S4": 88.7,
        "task": "Image classification with pixels as 1D sequence"
    },
    "Pathfinder": {
        "seq_len": 1024,
        "Transformer": 71.4,
        "Longformer": 72.0,
        "S4": 86.1,
        "task": "Path connectivity in noisy image"
    },
    "Path-X": {
        "seq_len": 16384,
        "Transformer": 50.0,  # Random chance - complete failure
        "Longformer": 50.0,  # Also fails - sequence too long
        "S4": 88.1,
        "task": "Hard Path connectivity at 128×128 resolution"
    },
}

print(f"{'Task':<15} | {'Seq Len':>8} | {'Transformer':>12} | {'Longformer':>12} | {'S4':>8}")
print("-" * 70)
for task, data in lra_results.items():
    print(
        f"{task:<15} | {data['seq_len']:>8} | "
        f"{data['Transformer']:>11.1f}% | "
        f"{data['Longformer']:>11.1f}% | "
        f"{data['S4']:>7.1f}%"
    )

print("\nPath-X is the most striking: Transformer = 50% (random chance).")
print("S4 = 88.1% - demonstrating genuine long-range sequence modeling.")

The Path-X result is remarkable: a 16,384-step sequence where the transformer fails completely (performs at chance level) while S4 achieves 88% accuracy. This single result made the ML community take SSMs seriously as a research direction worth pursuing.

Mathematical Connection: SSMs and Recurrent Neural Networks

It is worth establishing how SSMs relate to classic RNNs to avoid confusion:

Standard RNN: $h_t = \sigma(W_h h_{t-1} + W_x x_t + b)$

Nonlinear activation $\sigma$ inside the recurrence
Full matrix multiplication for both $W_h$ and $W_x$
Cannot be parallelized during training (sequential due to nonlinearity)
Can express arbitrary functions but suffers from vanishing/exploding gradients

SSM (S4, Mamba): $x_k = \bar{A}x_{k-1} + \bar{B}u_k$

No nonlinearity inside the recurrence - purely linear
$\bar{A}$ is structured (diagonal or DPLR), not a general matrix
Can be parallelized during training via convolution
Cannot directly express arbitrary functions, but can be stacked with nonlinear layers

The linearity of the SSM recurrence is what enables the convolutional/recurrent duality - and what limits its expressivity. Real SSM-based models compensate for this by surrounding the linear SSM with nonlinear components: SiLU activations, gating mechanisms, and position-wise MLPs. The SSM handles long-range sequence mixing; the nonlinear layers handle content processing.

This design is analogous to how attention handles long-range dependencies in transformers while the MLP handles position-wise feature transformation. Both architectures separate "where to look" (attention/SSM) from "what to do with it" (MLP).

:::tip 🎮 Interactive Playground

Visualize this concept: Try the State Space Model Foundations demo on the EngineersOfAI Playground - no code required.

:::

Opening Scenario: The Engineer Who Read a Control Theory Textbook​

Why Control Theory? The Intuition​

Continuous-Time State Space Models​

The A Matrix: The Key to Long-Range Dependencies​

The HiPPO Framework: Principled A Initialization​

Discretization: From Continuous to Discrete​

The Convolutional View: Why SSMs Can Train in Parallel​

The S4 Paper: Structured State Spaces​

The Long Range Arena Benchmark​

Implementing an SSM from Scratch​

S4's Successors: S4D, DSS, and Simplifications​

DSS: Diagonal State Spaces​

S4D: Diagonal Structured State Space​

S5: Parallel Scans for Efficiency​

The S4 Block: Building a Sequence Model​

The Key Insight: What SSMs Cannot Do (Yet)​

Production Notes​

Common Mistakes​

Interview Q&A​

Understanding SSM Expressivity: What Can Linear Recurrences Model?​

The LSSL and Earlier Precursors​

Benchmarking S4 vs Transformers on Long Range Arena​

Mathematical Connection: SSMs and Recurrent Neural Networks​