Complexity Analysis for ML Engineers
The Day Attention Hit a Wall
It was 2019. The original BERT model had just dropped, and every NLP team was rushing to apply transformer architectures to longer sequences. A team at a major tech company was building a document-level understanding system. They trained on sentences just fine. They tried 512 tokens - still workable. Then they pushed to 4096 tokens per document, and the training job immediately ran out of memory on their 32GB V100s.
The problem was not the model weights. The weights were small. The problem was that self-attention computes a score between every pair of tokens in the sequence. With 4096 tokens, that is $4096^2 \approx 16.8$ million pairs - just for one head, one layer. Multiply by 12 heads, 24 layers, batch size 8, and you have attention matrices that consume hundreds of gigabytes. Training long documents was simply impossible.
This was not a hardware limitation waiting for bigger GPUs. It was a fundamental algorithmic problem. The self-attention mechanism had $O(n^2)$ space complexity baked into its design. No amount of money spent on hardware would fix it. The team needed to understand the complexity of their system before they could reason about solutions.
The fix came in 2022 with FlashAttention - which did not reduce the computational complexity (still $O(n^2 d)$ FLOPs) but reduced the memory complexity to $O(n)$ by restructuring the computation to be cache-aware. Understanding the difference between FLOPs and memory complexity was what unlocked the solution.
If you are building ML systems - training infrastructure, inference pipelines, feature computation at scale - complexity analysis is not academic background knowledge. It is the primary lens through which you diagnose why things are slow, predict whether a system will scale, and evaluate algorithmic alternatives before spending weeks implementing them.
This lesson builds that lens from the ground up, with every concept grounded in ML reality.
Why This Exists
Before formal complexity analysis, engineers optimized by intuition and benchmarking. They would implement something, run it, find it was slow, and guess at improvements. This works for small systems. It catastrophically fails for ML systems where inputs range from 100 samples during development to 100 million in production.
The core problem is that algorithmic behavior changes non-linearly with scale. A function that takes 1 second on 1,000 samples might take 1,000 seconds on 1 million samples if it scales linearly - or roughly a million seconds (over 11 days) if it scales quadratically. Without complexity analysis, you cannot predict this from a small benchmark.
Complexity analysis gives you a mathematical model of how an algorithm's resource consumption grows as a function of input size. It tells you whether your solution will work at 10x, 100x, or 1000x the data you tested on. In ML, where we routinely scale from prototype to production by orders of magnitude, this is not optional knowledge.
Historical Context
Complexity theory emerged in the 1960s as computer scientists began building large systems and realized that "fast on my machine" was not a useful property. Donald Knuth formalized asymptotic notation in his monumental "The Art of Computer Programming" (1968), drawing on earlier mathematical work by Paul Bachmann (who introduced O-notation in the 1890s) and Edmund Landau (who popularized it in the early 1900s).
The key insight Knuth captured was that for large enough inputs, the leading term of a runtime expression dominates everything else: $3n^2 + 5n + 100$ grows like $n^2$ for large $n$ - the constants and lower-order terms stop mattering. This simplification lets us reason about algorithm families rather than specific implementations on specific hardware.
Stephen Cook, Richard Karp, Michael Sipser, and others formalized complexity classes (P, NP, PSPACE) in the 1970s and 1980s, establishing the theoretical foundations. But for ML engineers, the practical version - Big-O notation as a tool for predicting scaling behavior - is what matters day-to-day.
The application of complexity theory to ML infrastructure became critical around 2017-2022 as model sizes grew from millions to billions of parameters and datasets grew from millions to trillions of tokens. At that scale, a wrong algorithmic choice - picking $O(n^2)$ where $O(n \log n)$ was possible - could mean the difference between a viable system and one that is fundamentally unusable.
Core Concepts
Big-O Notation: What It Actually Means
Big-O notation describes an upper bound on how a function grows. When we write $f(n) = O(g(n))$, we mean there exists some constant $c > 0$ and threshold $n_0$ such that for all $n \ge n_0$: $f(n) \le c \cdot g(n)$.
In plain English: beyond some input size, $f(n)$ is bounded above by a constant multiple of $g(n)$. We do not care about small inputs, and we absorb constant factors into $c$.
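A quick worked instance of the definition: to show $3n^2 + 5n = O(n^2)$, take $c = 4$ and $n_0 = 5$:

$$
3n^2 + 5n \;\le\; 3n^2 + n^2 \;=\; 4n^2 \quad \text{for all } n \ge 5,
$$

since $5n \le n^2$ whenever $n \ge 5$.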
There are three related notations, and knowing all three makes you much more precise:
- Big-O ($O$) - upper bound. $f(n) = O(g(n))$ means $f$ grows no faster than $g$. Like saying "at most this fast."
- Big-Omega ($\Omega$) - lower bound. $f(n) = \Omega(g(n))$ means $f$ grows at least as fast as $g$. Like saying "at minimum this fast."
- Big-Theta ($\Theta$) - tight bound. $f(n) = \Theta(g(n))$ means $f$ and $g$ grow at the same rate - both the upper and lower bounds match.
When we say "matrix multiply is " we mean the naive algorithm - it is actually because both upper and lower bounds are cubic. When we say "sorting is " for comparison-based sorts in the average case, we should really say .
ML engineers typically use Big-O loosely to mean "tight bound" (i.e., Big-Theta). This is fine in practice, but be aware of the distinction when someone says "this is $O(n^2)$ at worst" - they mean upper bound, not tight bound.
Common Complexity Classes in ML
| Complexity | Name | ML Example |
|---|---|---|
| $O(1)$ | Constant | Hash table lookup, dict access |
| $O(\log n)$ | Logarithmic | Binary search in sorted array, HNSW ANN search |
| $O(n)$ | Linear | Softmax over vocabulary, single pass over dataset |
| $O(n \log n)$ | Linearithmic | Sorting tokens by length, FFT-based convolution |
| $O(n^2)$ | Quadratic | Naive self-attention, naive k-NN search |
| $O(n^3)$ | Cubic | Naive dense matrix multiply |
| $O(2^n)$ | Exponential | Exhaustive NAS search over architecture space |
The jump from $O(n)$ to $O(n^2)$ is where ML systems most often break. At $n = 10^6$, linear is $10^6$ operations, quadratic is $10^{12}$ operations - a million-fold difference. This is why going from a 512-token BERT to a 4096-token document model is not "8x harder" - it is $8^2 = 64\times$ harder for the attention computation.
Time vs Space Complexity
Every algorithm has both time complexity (how many operations) and space complexity (how much memory). They are often in tension, and ML adds a third dimension: memory bandwidth complexity (how much data is moved between cache levels, or RAM and GPU high-bandwidth memory).
The classic tradeoff: dynamic programming trades space for time. You store intermediate results ($O(n)$ or $O(n^2)$ space) to avoid recomputing them, often turning exponential recomputation into polynomial time. The right side of the tradeoff depends on your bottleneck.
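A minimal illustration of the tradeoff, using Fibonacci as a stand-in for any overlapping-subproblem computation: the memoized version spends $O(n)$ space on a cache to collapse exponential recomputation into $O(n)$ time.

```python
from functools import lru_cache

def fib_naive(n: int) -> int:
    """Exponential time: recomputes the same subproblems over and over."""
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

@lru_cache(maxsize=None)          # O(n) extra space for the memo table
def fib_memo(n: int) -> int:
    """Linear time: each subproblem is computed once and cached."""
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)

# fib_naive(35) takes seconds of pure recomputation; fib_memo(35) is effectively instant.
print(fib_memo(35))  # 9227465
```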
For GPU ML workloads, the dominant bottleneck is often memory bandwidth, not FLOPs. A GPU can perform far more arithmetic than it can move data. This is why FlashAttention's contribution was not fewer FLOPs - the FLOPs were identical to standard attention - but fewer round-trips to GPU HBM memory. The attention matrix was being written to and read from HBM repeatedly. FlashAttention tiles the computation to stay in SRAM, cutting memory traffic dramatically.
Amortized Analysis: Dynamic Arrays
Amortized analysis asks: what is the average cost per operation over a long sequence of operations, even if individual operations are sometimes expensive?
The canonical example is Python's list (a dynamic array). Appending to a list is usually $O(1)$ - just write the value to the next pre-allocated slot. But occasionally, the backing array is full and must be resized: a new, larger array (2x the current size in the textbook analysis; CPython actually over-allocates by a smaller geometric factor) is allocated, and all existing elements are copied over. This resize operation costs $O(n)$.
How expensive is appending $n$ elements in total?
Without amortized analysis, you might say "each append is $O(n)$ in the worst case, so $n$ appends cost $O(n^2)$ total." But that is wrong. With doubling, resizes happen at sizes 1, 2, 4, 8, ..., up to $n$, so the total copy work is: $1 + 2 + 4 + \cdots + n < 2n = O(n)$.
Amortized over $n$ appends, each append costs $O(1)$.
In ML this matters for: gradient accumulation buffers, replay buffers in reinforcement learning, online learning with streaming data, and any system that builds up data structures incrementally. When you see occasional latency spikes during training, it is worth checking whether an amortized cost is becoming visible.
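A small sketch that makes the resizing visible: CPython over-allocates list storage geometrically (its growth factor is closer to 1.125x than the textbook 2x, but the amortized argument is the same), so the allocated size jumps only occasionally while most appends reuse existing capacity.

```python
import sys

lst = []
prev_size = sys.getsizeof(lst)
for i in range(10_000):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != prev_size:
        # A resize happened: capacity grew geometrically, costing an O(len) copy
        print(f"append #{i:>5}: allocation grew {prev_size} -> {size} bytes")
        prev_size = size
# Only a few dozen resizes occur across 10,000 appends,
# so the average (amortized) cost per append stays O(1).
```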
The Master Theorem for Divide-and-Conquer
Many ML algorithms use divide-and-conquer: split the problem in half, solve recursively, combine results. The Master Theorem gives you the time complexity of such algorithms without solving the recurrence manually.
For a recurrence of the form $T(n) = a \, T(n/b) + f(n)$, where:
- $a$ = number of subproblems
- $b$ = factor by which the input size shrinks per level
- $f(n)$ = cost of dividing and combining
The solution depends on how $f(n)$ compares to $n^{\log_b a}$:
- If $f(n) = O(n^{\log_b a - \epsilon})$ for some $\epsilon > 0$ (subproblems dominate): $T(n) = \Theta(n^{\log_b a})$
- If $f(n) = \Theta(n^{\log_b a})$ (balanced): $T(n) = \Theta(n^{\log_b a} \log n)$
- If $f(n) = \Omega(n^{\log_b a + \epsilon})$ and a regularity condition holds (combining dominates): $T(n) = \Theta(f(n))$
Practical examples in ML:
- Merge sort: $T(n) = 2T(n/2) + \Theta(n)$. Here $a = 2$, $b = 2$, $f(n) = \Theta(n)$, $n^{\log_b a} = n$. The equal case applies: $T(n) = \Theta(n \log n)$.
- Strassen matrix multiply: $T(n) = 7T(n/2) + \Theta(n^2)$. Here $n^{\log_2 7} \approx n^{2.81}$. Subproblems dominate: $T(n) = \Theta(n^{2.81})$ - faster than naive $\Theta(n^3)$.
- FFT-based convolution: the FFT satisfies $T(n) = 2T(n/2) + \Theta(n)$. Same as merge sort: $\Theta(n \log n)$ - which is why FFT convolution beats direct convolution for large kernels.
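A small helper that applies the three cases for the common situation where $f(n) = \Theta(n^c)$ (a sketch with an illustrative function name; it ignores the regularity condition and log factors in $f$):

```python
import math

def master_theorem(a: int, b: int, c: float) -> str:
    """Solve T(n) = a*T(n/b) + Theta(n^c) by comparing c to log_b(a)."""
    crit = math.log(a, b)                      # critical exponent log_b(a)
    if c < crit:
        return f"Theta(n^{crit:.2f})"          # case 1: subproblems dominate
    if abs(c - crit) < 1e-9:
        return f"Theta(n^{c:g} log n)"         # case 2: balanced
    return f"Theta(n^{c:g})"                   # case 3: combine step dominates

print(master_theorem(2, 2, 1))   # merge sort / FFT -> Theta(n^1 log n)
print(master_theorem(7, 2, 2))   # Strassen         -> Theta(n^2.81)
print(master_theorem(4, 2, 2))   # 4 subproblems, n^2 combine -> Theta(n^2 log n)
```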
Transformer Attention Complexity in Depth
Self-attention takes a sequence of $n$ tokens, each represented as a $d$-dimensional vector, and produces for each token a weighted sum over all other tokens.
Step by step:
- Compute Q, K, V projections: three matrix multiplies of the $(n \times d)$ input by a $(d \times d)$ weight = $O(nd^2)$ each, $O(nd^2)$ total
- Compute raw attention scores $QK^\top$: $O(n^2 d)$ operations, produces an $n \times n$ matrix
- Scale and softmax over the score matrix: $O(n^2)$
- Weighted sum of the values, $\text{softmax}(QK^\top/\sqrt{d})\,V$: $O(n^2 d)$ operations
The dominant terms are $O(n^2 d)$ time and $O(n^2)$ space (storing the attention matrix in memory).
For BERT-base with $n = 512$: $512^2 \approx 262{,}000$ attention weights per head. Fine. For $n = 4096$: $4096^2 \approx 16.8$ million weights per head. With 16 heads and float16 (2 bytes per value): $16.8\text{M} \times 16 \times 2 \text{ bytes} \approx 0.5$ GB of attention weights per layer.
With 24 layers: about 12 GB just for forward-pass attention matrices at batch size 1. The backward pass needs to store those same matrices for gradient computation, so double it to roughly 24 GB. Already beyond a 16 GB V100 and most of a 32 GB one before counting batch size, activations, or optimizer state.
FlashAttention (Dao et al., 2022) achieves $O(n)$ memory by computing attention in tiles that fit in SRAM, never writing the full $n \times n$ matrix to HBM. The FLOPs are still $O(n^2 d)$ - it has to visit every pair of tokens - but the memory is $O(n)$.
Standard attention: O(n^2 d) FLOPs, O(n^2) memory <- breaks at n > ~2048
FlashAttention: O(n^2 d) FLOPs, O(n) memory <- unlocks n = 100k+
k-NN Complexity: Exact vs Approximate
Exact k-nearest-neighbor search with $n$ training points in $d$ dimensions:
- Query time: $O(nd)$ - compute the distance from the query to every stored point
- Space: $O(nd)$ - store all points
At $n = 10^8$ (a product catalog of 100M items) and $d = 768$ (typical BERT embedding), each query requires roughly $7.7 \times 10^{10}$ floating point operations. On hardware delivering $10^{12}$ FLOP/s, that is 77ms per query - before any memory bottleneck. With a sequential read from RAM at roughly 50 GB/s, reading the ~300 GB of float32 vectors takes about 6 seconds. Unusable.
FAISS HNSW reduces query complexity to approximately $O(d \log n)$ amortized by building a hierarchical graph where shortcuts across long distances are stored at high layers and local connections at the bottom layer. You navigate from coarse to fine - like using a highway system, not walking every street.
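A hedged sketch of the two regimes using FAISS (assuming the faiss-cpu package is installed; the index parameters, such as the HNSW connectivity value 32, are illustrative rather than tuned):

```python
import numpy as np
import faiss  # assumption: faiss-cpu is available in your environment

d, n_db, n_query, k = 128, 100_000, 10, 5
xb = np.random.randn(n_db, d).astype(np.float32)
xq = np.random.randn(n_query, d).astype(np.float32)

# Exact search: O(n * d) distance computations per query
flat = faiss.IndexFlatL2(d)
flat.add(xb)
D_exact, I_exact = flat.search(xq, k)

# HNSW: approximately O(d * log n) distance computations per query
hnsw = faiss.IndexHNSWFlat(d, 32)  # 32 = graph connectivity (M)
hnsw.add(xb)
D_ann, I_ann = hnsw.search(xq, k)

recall = np.mean([len(set(a) & set(b)) / k for a, b in zip(I_exact, I_ann)])
print(f"recall@{k} of HNSW vs exact search: {recall:.2f}")
```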
Gradient Descent Iteration Complexity
One epoch of SGD over $n$ training samples:
- Forward pass: $O(n \cdot C)$, where $C$ is the model's per-sample computational cost
- Backward pass: approximately $2\times$ the forward pass, still $O(n \cdot C)$ (by reverse-mode autodiff theory, the backward pass costs a small constant multiple of the forward pass)
- Parameter update: $O(p)$ per optimizer step, where $p$ is the number of parameters
For convergence: convex objectives converge in $O(1/\epsilon^2)$ gradient steps for SGD or $O(1/\epsilon)$ for full-batch gradient descent with a constant step size, where $\epsilon$ is the target optimality gap. Non-convex neural networks have no such clean bound, but empirically the relationship between compute budget and loss follows a power law (Chinchilla scaling laws). The number of steps is not a clean complexity formula - it depends on the loss landscape, learning rate schedule, and architecture.
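A rough back-of-envelope estimator for one epoch's FLOP budget, under the assumption (stated above) that the backward pass costs about twice the forward pass; the function name and the example numbers are illustrative, not measurements.

```python
def epoch_flops(n_samples: int, forward_flops_per_sample: float,
                n_params: int, batch_size: int,
                backward_multiplier: float = 2.0) -> dict:
    """Rough FLOP budget for one epoch of minibatch SGD (constant factors omitted)."""
    forward = n_samples * forward_flops_per_sample        # O(n * C)
    backward = backward_multiplier * forward               # ~2x forward (reverse-mode AD)
    n_steps = n_samples // batch_size
    update = n_steps * n_params                             # ~O(p) work per optimizer step
    return {"forward": forward, "backward": backward,
            "update": update, "total": forward + backward + update}

# Example: 1M samples, 1 GFLOP per-sample forward pass, 100M parameters, batch 256
budget = epoch_flops(1_000_000, 1e9, 100_000_000, 256)
print(f"total ~ {budget['total']:.2e} FLOPs per epoch")  # dominated by forward + backward
```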
Cache-Aware Complexity
Modern CPUs have a deep memory hierarchy: registers (1 cycle), L1 cache (4 cycles), L2 cache (12 cycles), L3 cache (40 cycles), RAM (200 cycles). A cache miss - accessing data not in the CPU's fast cache - costs 50 to 200 times more than a cache hit.
Cache-oblivious algorithms are designed to work well regardless of cache size by naturally traversing data in cache-friendly patterns. Naive matrix multiply in row-major order repeatedly accesses columns of one matrix, which are non-contiguous in memory, causing cache thrashing. Blocked (tiled) matrix multiply processes submatrices that fit in cache, reducing cache misses significantly.
This is why BLAS libraries (and PyTorch's underlying kernels like cuBLAS) dramatically outperform naive implementations even though both are nominally : the constant factor hidden inside that is 100-1000x smaller in optimized BLAS due to cache efficiency, SIMD vectorization, and thread-level parallelism.
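A small demonstration of access-pattern cost on a C-ordered (row-major) NumPy array: reading rows is contiguous, reading columns is strided, and the strided traversal is measurably slower even though both perform the same number of additions (the exact ratio depends on your CPU's cache hierarchy).

```python
import time
import numpy as np

n = 4000
A = np.random.randn(n, n)  # C-ordered (row-major) by default, ~128 MB

start = time.perf_counter()
row_total = sum(A[i, :].sum() for i in range(n))   # contiguous reads
t_rows = time.perf_counter() - start

start = time.perf_counter()
col_total = sum(A[:, j].sum() for j in range(n))   # strided reads (stride = n * 8 bytes)
t_cols = time.perf_counter() - start

print(f"row traversal: {t_rows:.3f}s, column traversal: {t_cols:.3f}s "
      f"({t_cols / t_rows:.1f}x slower)")
```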
GPU Parallelism and Complexity
On a GPU, many operations that are $O(n)$ sequentially take roughly $O(n/p)$ (or $O(\log n)$ for reductions) wall-clock time because the GPU executes thousands of threads ($p$) simultaneously. But the theoretical complexity does not change - we just have a smaller constant.
The key metric for GPU utilization is arithmetic intensity: FLOPs divided by bytes moved from memory.
- Matrix multiply: $O(n^3)$ FLOPs, $O(n^2)$ memory reads. Intensity = $O(n)$. Gets more compute-bound as $n$ grows - great for GPUs.
- Elementwise ops (ReLU, GELU): $O(n)$ FLOPs, $O(n)$ memory. Intensity = $O(1)$. Always memory-bound. More CUDA cores cannot help.
- Softmax: $O(n)$ FLOPs, $O(n)$ memory. Memory-bound. Flash-style online softmax improves this by reducing memory passes.
- Layer normalization: $O(n)$ FLOPs, $O(n)$ memory. Memory-bound.
Knowing this, you can predict: the bottleneck in a transformer training step is not the matrix multiplications (those hit near-peak GPU utilization) but the elementwise operations and normalization layers (memory-bound, low arithmetic intensity).
Code: Complexity Measurement Framework
The most reliable way to confirm your complexity analysis is to measure it empirically. The approach: time the algorithm at multiple input sizes, plot on log-log axes, and measure the slope. A slope of 1 means $O(n)$, slope of 2 means $O(n^2)$, slope of 3 means $O(n^3)$.
import time
import math
import numpy as np
from typing import Callable, List, Tuple
def measure_complexity(
func: Callable,
input_sizes: List[int],
generator: Callable,
n_trials: int = 3
) -> Tuple[List[int], List[float]]:
"""
Measure empirical runtime at multiple input sizes.
Args:
func: the function to benchmark
input_sizes: list of n values to test
generator: function that takes n and returns input for func
n_trials: number of timing trials per size (take median)
Returns:
(sizes, median_times_in_seconds)
"""
times = []
for n in input_sizes:
inp = generator(n)
trial_times = []
for _ in range(n_trials):
start = time.perf_counter()
func(inp)
end = time.perf_counter()
trial_times.append(end - start)
times.append(float(np.median(trial_times)))
return input_sizes, times
def estimate_exponent(sizes: List[int], times: List[float]) -> float:
"""
Estimate the complexity exponent from log-log regression.
slope = 1 -> O(n)
slope = 2 -> O(n^2)
slope = 3 -> O(n^3)
"""
log_n = [math.log(n) for n in sizes]
log_t = [math.log(t) for t in times]
n_pts = len(log_n)
mean_n = sum(log_n) / n_pts
mean_t = sum(log_t) / n_pts
numerator = sum(
(log_n[i] - mean_n) * (log_t[i] - mean_t) for i in range(n_pts)
)
denominator = sum((log_n[i] - mean_n) ** 2 for i in range(n_pts))
return numerator / denominator
def classify_complexity(exponent: float) -> str:
"""Map numeric exponent to a Big-O label."""
if exponent < 0.1:
return "O(1)"
elif exponent < 0.6:
return "O(log n)"
elif exponent < 1.15:
return "O(n)"
elif exponent < 1.5:
return "O(n log n)"
elif exponent < 2.15:
return "O(n^2)"
elif exponent < 3.15:
return "O(n^3)"
else:
return f"O(n^{exponent:.1f})"
# --- Benchmark 1: Matrix multiply ---
def numpy_matmul(args):
A, B = args
return A @ B
sizes, times = measure_complexity(
numpy_matmul,
    input_sizes=[128, 256, 512, 1024, 2048],
generator=lambda n: (
np.random.randn(n, n).astype(np.float32),
np.random.randn(n, n).astype(np.float32),
)
)
exp = estimate_exponent(sizes, times)
print(f"Matrix multiply: exponent={exp:.2f} -> {classify_complexity(exp)}")
# Expected: ~2.7 to 3.0 (approaches cubic at large n)
# --- Benchmark 2: Softmax - expect O(n) ---
def softmax(x):
e_x = np.exp(x - x.max())
return e_x / e_x.sum()
sizes, times = measure_complexity(
softmax,
    input_sizes=[1_000, 10_000, 100_000, 500_000, 1_000_000],
generator=lambda n: np.random.randn(n).astype(np.float32),
n_trials=10
)
exp = estimate_exponent(sizes, times)
print(f"Softmax: exponent={exp:.2f} -> {classify_complexity(exp)}")
# Expected: ~1.0 (linear)
# --- Benchmark 3: Self-attention - expect O(n^2) ---
def naive_attention(inputs):
Q, K, V = inputs
n, d = Q.shape
scores = Q @ K.T / (d ** 0.5) # (n, n)
scores -= scores.max(axis=-1, keepdims=True) # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
return weights @ V # (n, d)
sizes, times = measure_complexity(
naive_attention,
    input_sizes=[64, 128, 256, 512, 1024, 2048],
generator=lambda n: (
np.random.randn(n, 64).astype(np.float32),
np.random.randn(n, 64).astype(np.float32),
np.random.randn(n, 64).astype(np.float32),
),
n_trials=5
)
exp = estimate_exponent(sizes, times)
print(f"Self-attention: exponent={exp:.2f} -> {classify_complexity(exp)}")
# Expected: ~2.0 (quadratic in sequence length)
# --- Benchmark 4: Sorting - expect O(n log n) ---
sizes, times = measure_complexity(
lambda x: np.sort(x),
    input_sizes=[10_000, 100_000, 500_000, 1_000_000, 5_000_000],
generator=lambda n: np.random.randn(n).astype(np.float32),
n_trials=5
)
exp = estimate_exponent(sizes, times)
print(f"numpy.sort: exponent={exp:.2f} -> {classify_complexity(exp)}")
# Expected: ~1.0 to 1.1 (n log n looks close to linear for moderate n)
Code: Attention Memory Demo - Standard vs FlashAttention
This demonstrates the memory savings, not the FLOPs. FlashAttention never allocates the full attention matrix.
def standard_attention_memory_mb(n: int, d: int, dtype_bytes: int = 2) -> dict:
"""
Compute memory usage (MB) for standard attention.
dtype_bytes=2 for float16, =4 for float32.
"""
# Q, K, V matrices stored: 3 * n * d elements
qkv_mb = 3 * n * d * dtype_bytes / (1024 ** 2)
# Attention score matrix (n x n) written to HBM - the quadratic term
scores_mb = n * n * dtype_bytes / (1024 ** 2)
# Softmax output (n x n) - also written to HBM
weights_mb = n * n * dtype_bytes / (1024 ** 2)
# Output (n x d)
output_mb = n * d * dtype_bytes / (1024 ** 2)
total_mb = qkv_mb + scores_mb + weights_mb + output_mb
return {
"qkv_mb": round(qkv_mb, 2),
"attention_matrix_mb": round(scores_mb, 2), # This is O(n^2)
"weights_mb": round(weights_mb, 2),
"output_mb": round(output_mb, 2),
"total_mb": round(total_mb, 2),
}
def flash_attention_memory_mb(n: int, d: int,
block_size: int = 64,
dtype_bytes: int = 2) -> dict:
"""
FlashAttention: never materializes the full n x n matrix.
Processes one (block_size x block_size) tile at a time in SRAM.
"""
# Q, K, V stored in HBM: 3 * n * d
qkv_mb = 3 * n * d * dtype_bytes / (1024 ** 2)
# Only one block of scores in SRAM: block_size x block_size
sram_block_mb = block_size * block_size * dtype_bytes / (1024 ** 2)
# Running softmax statistics (max and sum per row): 2 * n
stats_mb = 2 * n * dtype_bytes / (1024 ** 2)
# Output: n * d
output_mb = n * d * dtype_bytes / (1024 ** 2)
# SRAM block is a constant - does NOT grow with n
total_hbm_mb = qkv_mb + stats_mb + output_mb
return {
"qkv_mb": round(qkv_mb, 2),
"sram_block_mb": round(sram_block_mb, 4), # constant regardless of n
"stats_mb": round(stats_mb, 4),
"output_mb": round(output_mb, 2),
"total_hbm_mb": round(total_hbm_mb, 2),
}
print(f"{'n':<8} {'Standard (MB)':<18} {'FlashAttn (MB)':<18} {'Savings'}")
print("-" * 56)
for n in [512, 1024, 2048, 4096, 8192, 16384, 32768]:
d = 64
std = standard_attention_memory_mb(n, d)
flash = flash_attention_memory_mb(n, d)
savings = std["total_mb"] / flash["total_hbm_mb"]
print(f"{n:<8} {std['total_mb']:<18.1f} {flash['total_hbm_mb']:<18.1f} {savings:.1f}x")
# Approximate output (float16, d=64):
# n        Standard (MB)      FlashAttn (MB)     Savings
# --------------------------------------------------------
# 512      1.2                0.2                5.0x
# 1024     4.5                0.5                9.0x
# 2048     17.0               1.0                16.8x
# 4096     66.0               2.0                32.7x
# 8192     260.0              4.0                64.5x
# 16384    1032.0             8.1                128.0x
# 32768    4112.0             16.1               255.1x
#
# The memory advantage grows linearly with n (the ratio is roughly n / 2d).
# At n=32768, standard attention needs roughly 250x more memory.
# This is why GPT-4 class models can run 128k context with FlashAttention-style
# kernels but would OOM trying standard attention at that length.
Code: Matrix Multiply Complexity Profiling
import numpy as np
import time
def profile_matmul_at_scales():
"""
Profile matrix multiply at multiple sizes.
Demonstrates that BLAS approaches O(n^3) but with small constants.
"""
sizes = [64, 128, 256, 512, 1024, 2048, 4096]
results = []
for n in sizes:
A = np.random.randn(n, n).astype(np.float32)
B = np.random.randn(n, n).astype(np.float32)
# Warmup
_ = A @ B
# Time 5 runs
times_ms = []
for _ in range(5):
start = time.perf_counter()
C = A @ B
end = time.perf_counter()
times_ms.append((end - start) * 1000)
median_ms = np.median(times_ms)
# Theoretical FLOPs: 2 * n^3 (multiply-add pairs)
flops = 2 * n ** 3
gflops = flops / (median_ms / 1000) / 1e9
results.append((n, median_ms, gflops))
print(f"n={n:5d}: {median_ms:8.2f}ms {gflops:8.1f} GFLOP/s")
return results
results = profile_matmul_at_scales()
# Analysis: check how time scales with n
# If time(2n) / time(n) ~ 8, that confirms O(n^3)
print("\nScaling factors (should be ~8 for O(n^3)):")
for i in range(1, len(results)):
n1, t1, _ = results[i-1]
n2, t2, _ = results[i]
ratio = t2 / t1
print(f" {n1} -> {n2}: {ratio:.2f}x (expected {(n2/n1)**3:.1f}x for O(n^3))")
Production Engineering Notes
1. Profile before optimizing. Complexity analysis tells you the asymptotic behavior. Profiling tells you where time is actually being spent in your specific workload. A theoretically $O(n^2)$ operation on small $n$ may be faster than a theoretically $O(n \log n)$ operation with large constants. Use torch.profiler, nsys (NVIDIA Nsight Systems), or py-spy before making any algorithmic changes.
2. Constants dominate at production scale too. An $O(n)$ algorithm with a constant factor of $10^6$ is slower than an $O(n^2)$ algorithm with a constant factor of 1 for all $n < 10^6$. Constant factors come from: memory access patterns, SIMD vectorization, Python interpreter overhead vs compiled kernels, and cache efficiency. Always benchmark with realistic $n$, not just analyze asymptotically.
3. Memory complexity is often the binding constraint in ML. A model that is compute-bound on a single GPU may become memory-bound when you try to increase batch size. The GPU out-of-memory error you see during training is almost always a space complexity problem - an $O(n^2)$ or $O(\text{batch} \times n^2)$ term materializing in memory. Know your model's space complexity before scaling.
4. Batch dimension changes the analysis. When analyzing transformer complexity, the batch dimension $b$ is usually independent of sequence length $n$: $O(n^2 d)$ attention per sequence $= O(b \, n^2 d)$ total. If you scale batch size and sequence length together, you see effectively cubic behavior. Understand which dimensions you are scaling before predicting runtime.
5. Distributed training changes what "input size" means. In data parallelism, each GPU processes $n/g$ samples (where $g$ is the GPU count). Per-GPU complexity stays the same; you add $O(p)$ communication overhead per step for gradient synchronization (an all-reduce over the $p$ gradients). In tensor parallelism, the hidden dimension - and therefore each weight matrix - is split across GPUs. Always specify which dimension you mean when discussing distributed complexity.
6. GPU wall-clock complexity is not the same as operation complexity. A GPU with 10,000 CUDA cores can execute on the order of 10,000 operations per clock cycle, so compute-bound elementwise operations take roughly $n/10{,}000$ wall-clock cycles. But memory-bound operations still take time proportional to the $O(n)$ bytes moved, because the bottleneck is memory bandwidth, not compute. Roofline analysis is the right framework here.
:::danger Common Mistake: Ignoring Space Complexity
Engineers who only analyze time complexity are blindsided by OOM errors. A function that is $O(n^2)$ in both time and space (like standard attention) will blow up memory before it becomes noticeably slow. Always analyze both dimensions. The single most common production ML failure is a model that works at batch size 1 but OOMs at batch size 32 - this is almost always a space complexity problem, specifically an intermediate tensor growing as $O(\text{batch} \times n^2)$.
:::
:::warning The Premature Optimization Trap
Complexity analysis is a pre-implementation tool for ruling out bad approaches, and a post-profiling tool for understanding bottlenecks. It is NOT a substitute for measuring. Many ML engineers waste days "optimizing" code that runs in 2% of total training time. Measure first with a profiler. The hotspot is almost always in the data loader, the attention layer, or the optimizer step - and only profiling tells you which one.
:::
:::tip The Doubling Experiment
When unsure of your implementation's actual complexity, run the doubling experiment: time it at $n$, $2n$, $4n$. If doubling $n$ doubles the time: $O(n)$. If it quadruples: $O(n^2)$. If it octuples: $O(n^3)$. This is the fastest empirical complexity test and takes under 10 lines of code.
:::
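A minimal sketch of the doubling experiment, applied here to NumPy matrix multiplication (swap in any function of $n$; the helper name is illustrative):

```python
import time
import numpy as np

def time_at(n: int) -> float:
    """Median wall-clock time for an n x n matrix multiply."""
    A, B = np.random.randn(n, n), np.random.randn(n, n)
    trials = []
    for _ in range(3):
        start = time.perf_counter()
        A @ B
        trials.append(time.perf_counter() - start)
    return float(np.median(trials))

base = 512
t1, t2, t4 = time_at(base), time_at(2 * base), time_at(4 * base)
print(f"n -> 2n: {t2 / t1:.1f}x   2n -> 4n: {t4 / t2:.1f}x")
# Ratios near 2 suggest O(n), near 4 suggest O(n^2), near 8 suggest O(n^3).
```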
Interview Q&A
Q1: Why is self-attention $O(n^2)$ and not $O(n)$? Can it ever be made $O(n)$?
Self-attention computes a score between every pair of tokens: token $i$ attends to token $j$ for all pairs $(i, j)$. This is $n^2$ pairs, requiring $O(n^2 d)$ operations and $O(n^2)$ memory for the score matrix. Full self-attention fundamentally requires knowing the relationship between every pair, which cannot be done in less than $O(n^2)$ work without approximation.
Approximations can achieve subquadratic complexity. Linformer projects K and V to a fixed rank $k$, giving $O(nk)$ attention. Longformer uses local window attention plus global tokens, giving $O(nw)$ where $w$ is the window size. Performer uses random feature approximations to the attention kernel for $O(n)$ scaling in sequence length. These all trade recall quality for speed. FlashAttention achieves $O(n)$ memory while keeping $O(n^2 d)$ FLOPs by computing in tiles that stay in fast SRAM - it does not reduce FLOPs but eliminates the $O(n^2)$ memory allocation.
Q2: What is amortized $O(1)$ and why does it matter for ML training loops?
Amortized $O(1)$ means the average cost per operation over a long sequence is constant, even though individual operations may be expensive. Python's list append is the example: most appends are $O(1)$ (write to the next slot), but occasional resizes are $O(n)$ (copy the entire array). Over $n$ appends the total work is $O(n)$, giving $O(1)$ amortized.
In training this matters for: replay buffers in RL that occasionally compact, gradient accumulation buffers that flush periodically, and dataset index structures that grow as you stream data. If you see irregular latency spikes in your training step timer every few thousand steps, it may be an amortized cost becoming visible. Profile with torch.profiler and look for infrequent but expensive operations.
Q3: How do you empirically measure the time complexity of a model component?
Run the component at 5-8 input sizes spanning 2 or more orders of magnitude. Take the median timing across several trials (to reduce noise from cache effects, OS scheduling, etc.). For GPU operations, always call torch.cuda.synchronize() before reading the timer - GPU execution is asynchronous and without synchronization you are measuring the time to launch the kernel, not the time to complete it. Plot sizes vs times on log-log axes; the slope of the best-fit line is the complexity exponent. Slope 1 = $O(n)$, slope 2 = $O(n^2)$. Use torch.utils.benchmark.Timer for GPU benchmarking as it handles synchronization automatically.
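A hedged sketch of such a measurement with torch.utils.benchmark.Timer (assumes PyTorch is installed; falls back to CPU if no GPU is available):

```python
import torch
from torch.utils import benchmark

device = "cuda" if torch.cuda.is_available() else "cpu"

for n in [256, 512, 1024, 2048]:
    x = torch.randn(n, n, device=device)
    y = torch.randn(n, n, device=device)
    t = benchmark.Timer(stmt="x @ y", globals={"x": x, "y": y})
    m = t.timeit(50)  # 50 timed runs; CUDA synchronization handled by Timer
    print(f"n={n:5d}: {m.mean * 1e3:8.3f} ms")
# Fit the slope of log(time) vs log(n) to estimate the exponent, as above.
```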
Q4: Why is BLAS matrix multiply faster than a naive Python triple loop if both are $O(n^3)$?
Same asymptotic complexity, radically different constants. The naive Python loop has: Python interpreter overhead per iteration (100-1000x vs compiled C), poor cache utilization (column-major access causes cache misses in row-major stored arrays), no SIMD vectorization, no multi-threading. BLAS and cuBLAS use: compiled C/Fortran/CUDA, memory-tiled access patterns for cache efficiency, SIMD (AVX-512 on modern CPUs does 16 float32 multiply-adds per cycle), multi-threading, and on GPU, Tensor Cores that execute 4x4 matrix multiplications in a single instruction. Same $O(n^3)$, but the constant factor is roughly 10,000x smaller.
Q5: What is the complexity of the Adam optimizer step and why does it matter at scale?
Adam maintains two momentum vectors - first moment (mean) and second moment (variance) - per parameter. An Adam step costs $O(p)$ time and requires $O(p)$ space (parameters plus two moment vectors) where $p$ is the number of parameters. For GPT-3 at 175 billion parameters in float32, the optimizer state alone is roughly 1.4 TB - far beyond any single GPU's memory. This is why large model training requires ZeRO optimizer state sharding (distributing the state across GPUs), or bfloat16 parameters with float32 optimizer state, or AdaFactor which reduces second moment storage from $O(nm)$ to $O(n + m)$ for an $n \times m$ weight matrix by factoring the variance estimate.
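The arithmetic behind that number, as a quick sanity check (175B parameters is the commonly cited GPT-3 size; float32 state is assumed):

```python
params = 175e9                 # GPT-3 scale parameter count
bytes_fp32 = 4
moments = 2                    # Adam keeps first and second moment per parameter

optimizer_state_tb = params * moments * bytes_fp32 / 1e12
weights_tb = params * bytes_fp32 / 1e12
print(f"Adam moment state: {optimizer_state_tb:.1f} TB, fp32 weights: {weights_tb:.1f} TB")
# ~1.4 TB of optimizer state plus ~0.7 TB of weights - far beyond a single GPU.
```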
Q6: What is the Master Theorem and give an ML example where it applies?
The Master Theorem solves recurrences of the form $T(n) = a\,T(n/b) + f(n)$: split the problem into $a$ subproblems each of size $n/b$, with $f(n)$ work to divide/combine. Compare $f(n)$ to $n^{\log_b a}$: if subproblems dominate (case 1), the answer is $\Theta(n^{\log_b a})$; if they are equal (case 2), the answer is $\Theta(n^{\log_b a} \log n)$; if combining dominates (case 3), the answer is $\Theta(f(n))$.
ML example: FFT-based convolution uses an FFT whose recurrence is $T(n) = 2T(n/2) + O(n)$ (merge-sort style). $n^{\log_2 2} = n$, which matches $f(n)$. Case 2 applies: $T(n) = \Theta(n \log n)$. This is why 1D convolution with a large kernel uses FFT rather than direct convolution - direct is $O(nk)$ where $k$ is the kernel size, while FFT convolution is $O(n \log n)$ regardless of kernel size. For large kernels, $n \log n \ll nk$.
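A quick empirical check of that claim, comparing direct and FFT-based 1D convolution (assumes SciPy is available; sizes are chosen only to make the gap visible):

```python
import time
import numpy as np
from scipy.signal import fftconvolve

signal = np.random.randn(200_000)
kernel = np.random.randn(5_000)

start = time.perf_counter()
direct = np.convolve(signal, kernel, mode="full")        # O(n * k)
t_direct = time.perf_counter() - start

start = time.perf_counter()
via_fft = fftconvolve(signal, kernel, mode="full")        # O((n + k) log(n + k))
t_fft = time.perf_counter() - start

print(f"direct: {t_direct:.2f}s, fft: {t_fft:.3f}s, "
      f"max abs diff: {np.max(np.abs(direct - via_fft)):.2e}")
```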
Code: Roofline Model Analysis
The roofline model is the practical tool for understanding whether a GPU kernel is compute-bound or memory-bound. It maps arithmetic intensity (FLOPs per byte) to achievable performance, bounded by two limits: peak compute (TFLOP/s) and peak memory bandwidth (GB/s).
import numpy as np
from dataclasses import dataclass
from typing import List
@dataclass
class HardwareSpec:
"""Hardware performance characteristics."""
name: str
peak_tflops: float # Peak floating-point throughput (TFLOP/s)
peak_bandwidth_gbs: float # Peak memory bandwidth (GB/s)
@property
def ridge_point(self) -> float:
"""
Arithmetic intensity at which compute and bandwidth limits intersect.
Above this: compute-bound. Below: memory-bound.
Units: FLOP/byte
"""
return self.peak_tflops * 1e12 / (self.peak_bandwidth_gbs * 1e9)
@dataclass
class KernelProfile:
"""Profile of a GPU kernel or operation."""
name: str
flops: float # total floating point operations
bytes_accessed: float # total bytes read from / written to memory
measured_tflops: float = None # actually measured throughput (optional)
@property
def arithmetic_intensity(self) -> float:
"""FLOP per byte of memory access."""
return self.flops / self.bytes_accessed
def roofline_bound(self, hw: HardwareSpec) -> str:
"""Is this kernel compute-bound or memory-bound on this hardware?"""
if self.arithmetic_intensity >= hw.ridge_point:
return "compute-bound"
return "memory-bound"
def peak_achievable_tflops(self, hw: HardwareSpec) -> float:
"""Maximum achievable TFLOP/s on this hardware."""
compute_limit = hw.peak_tflops
bandwidth_limit = (self.arithmetic_intensity * hw.peak_bandwidth_gbs * 1e9) / 1e12
return min(compute_limit, bandwidth_limit)
def efficiency(self, hw: HardwareSpec) -> float:
"""Fraction of achievable peak actually reached (0-1)."""
if self.measured_tflops is None:
return None
return self.measured_tflops / self.peak_achievable_tflops(hw)
# Hardware specs (approximate values)
A100_fp16 = HardwareSpec(
name="A100 (fp16 with sparsity)",
peak_tflops=312.0,
peak_bandwidth_gbs=2000.0
)
A100_fp32 = HardwareSpec(
name="A100 (fp32)",
peak_tflops=19.5,
peak_bandwidth_gbs=2000.0
)
V100_fp16 = HardwareSpec(
name="V100 (fp16)",
peak_tflops=125.0,
peak_bandwidth_gbs=900.0
)
# Ridge points
for hw in [A100_fp16, A100_fp32, V100_fp16]:
print(f"{hw.name}: ridge point = {hw.ridge_point:.1f} FLOP/byte")
# Common ML kernels
# For a matrix multiply M x K x K x N with batch B:
# FLOPs = 2 * B * M * K * N (multiply-add pairs)
# Bytes = (B*M*K + B*K*N + B*M*N) * bytes_per_element
def matmul_kernel(B, M, K, N, dtype_bytes=2):
flops = 2 * B * M * K * N
bytes_read = (B * M * K + B * K * N) * dtype_bytes
bytes_write = B * M * N * dtype_bytes
return KernelProfile(
name=f"matmul({B}x{M}x{K} @ {B}x{K}x{N})",
flops=flops,
bytes_accessed=bytes_read + bytes_write
)
def elementwise_kernel(n_elements, flops_per_element=1, dtype_bytes=2):
return KernelProfile(
name=f"elementwise(n={n_elements})",
flops=n_elements * flops_per_element,
bytes_accessed=n_elements * dtype_bytes * 2 # read + write
)
# Example: transformer layer at different sequence lengths
print("\nTransformer Layer Roofline Analysis (A100 fp16):")
print(f"{'Operation':<40} {'AI (F/B)':<12} {'Bound':<18} {'Peak (TFLOP/s)'}")
print("-" * 85)
hw = A100_fp16
batch, seq, d_model = 32, 512, 1024
d_head = 64
n_heads = 16
kernels = [
# QKV projection: (B * seq * d_model) @ (d_model * 3*d_model)
matmul_kernel(batch, seq, d_model, 3 * d_model, dtype_bytes=2),
# Attention scores: (B * n_heads * seq * d_head) @ (B * n_heads * d_head * seq)
matmul_kernel(batch * n_heads, seq, d_head, seq, dtype_bytes=2),
# Softmax (elementwise, ~5 ops/element: exp, sum, divide)
elementwise_kernel(batch * n_heads * seq * seq, flops_per_element=5, dtype_bytes=2),
# Output projection
matmul_kernel(batch, seq, d_model, d_model, dtype_bytes=2),
# Layer norm (~10 ops/element)
elementwise_kernel(batch * seq * d_model, flops_per_element=10, dtype_bytes=2),
    # FFN up-projection (first of the two FFN matmuls, 4x expansion)
matmul_kernel(batch, seq, d_model, 4 * d_model, dtype_bytes=2),
]
for k in kernels:
bound = k.roofline_bound(hw)
peak = k.peak_achievable_tflops(hw)
print(f"{k.name:<40} {k.arithmetic_intensity:<12.1f} {bound:<18} {peak:.1f}")
Summary
Complexity analysis is the language of scalable ML systems. The framework:
- Big-O gives asymptotic upper bounds; Big-Theta is tight; Big-Omega is lower bound. Use them precisely.
- Time complexity and space complexity are both critical - GPU OOM errors are almost always a space complexity problem.
- Self-attention is $O(n^2 d)$ time and $O(n^2)$ space - FlashAttention reduces the space to $O(n)$ without changing FLOPs.
- Amortized analysis explains why dynamic data structures have $O(1)$ average cost per operation despite occasional expensive operations.
- The Master Theorem solves divide-and-conquer recurrences: merge sort and FFT are $\Theta(n \log n)$.
- Cache-aware complexity explains why BLAS beats naive loops by 10,000x with the same Big-O.
- Arithmetic intensity (FLOPs per byte) determines whether a GPU operation is compute-bound (matrix multiply) or memory-bound (softmax, layer norm).
- Always verify complexity empirically with the doubling experiment or log-log plot - theory and practice diverge due to constants and cache effects.
