Complexity Analysis for ML Engineers
The Day Attention Hit a Wall
It was 2019. The original BERT model had just dropped, and every NLP team was rushing to apply transformer architectures to longer sequences. A team at a major tech company was building a document-level understanding system. They trained on sentences just fine. They tried 512 tokens - still workable. Then they pushed to 4096 tokens per document, and the training job immediately ran out of memory on their 32GB V100s.
The problem was not the model weights. The weights were small. The problem was that self-attention computes a score between every pair of tokens in the sequence. With 4096 tokens, that is $4096^2 \approx 16.8$ million pairs - just for one head, one layer. Multiply by 12 heads, 24 layers, batch size 8, and you have attention matrices that consume hundreds of gigabytes. Training long documents was simply impossible.
This was not a hardware limitation waiting for bigger GPUs. It was a fundamental algorithmic problem. The self-attention mechanism had $O(n^2)$ space complexity baked into its design. No amount of money spent on hardware would fix it. The team needed to understand the complexity of their system before they could reason about solutions.
The fix came in 2022 with FlashAttention - which did not reduce the computational complexity (still $O(n^2 d)$ FLOPs) but reduced the memory complexity to $O(n)$ by restructuring the computation to be cache-aware. Understanding the difference between FLOPs and memory complexity was what unlocked the solution.
If you are building ML systems - training infrastructure, inference pipelines, feature computation at scale - complexity analysis is not academic background knowledge. It is the primary lens through which you diagnose why things are slow, predict whether a system will scale, and evaluate algorithmic alternatives before spending weeks implementing them.
This lesson builds that lens from the ground up, with every concept grounded in ML reality.
Why This Exists
Before formal complexity analysis, engineers optimized by intuition and benchmarking. They would implement something, run it, find it was slow, and guess at improvements. This works for small systems. It catastrophically fails for ML systems where inputs range from 100 samples during development to 100 million in production.
The core problem is that algorithmic behavior changes non-linearly with scale. A function that takes 1 second on 1,000 samples might take 1,000 seconds on 1 million samples if it scales linearly - or roughly a million seconds (over 11 days) if it scales quadratically. Without complexity analysis, you cannot predict this from a small benchmark.
Complexity analysis gives you a mathematical model of how an algorithm's resource consumption grows as a function of input size. It tells you whether your solution will work at 10x, 100x, or 1000x the data you tested on. In ML, where we routinely scale from prototype to production by orders of magnitude, this is not optional knowledge.
Historical Context
Complexity theory emerged in the 1960s as computer scientists began building large systems and realized that "fast on my machine" was not a useful property. Donald Knuth formalized asymptotic notation in his monumental "The Art of Computer Programming" (1968), drawing on earlier mathematical work by Paul Bachmann (who introduced O-notation in the 1890s) and Edmund Landau (who popularized it in the early 1900s).
The key insight Knuth captured was that for large enough inputs, the leading term of a runtime expression dominates everything else: $3n^2 + 5n + 100$ grows like $n^2$ for large $n$ - the constants and lower-order terms stop mattering. This simplification lets us reason about algorithm families rather than specific implementations on specific hardware.
Stephen Cook, Richard Karp, Michael Sipser, and others formalized complexity classes (P, NP, PSPACE) in the 1970s and 1980s, establishing the theoretical foundations. But for ML engineers, the practical version - Big-O notation as a tool for predicting scaling behavior - is what matters day-to-day.
The application of complexity theory to ML infrastructure became critical around 2017-2022 as model sizes grew from millions to billions of parameters and datasets grew from millions to trillions of tokens. At that scale, a wrong algorithmic choice - picking $O(n^2)$ where $O(n \log n)$ was possible - could mean the difference between a viable system and one that is fundamentally unusable.
Core Concepts
Big-O Notation: What It Actually Means
Big-O notation describes an upper bound on how a function grows. When we write $f(n) = O(g(n))$, we mean there exists some constant $c > 0$ and threshold $n_0$ such that for all $n \ge n_0$: $f(n) \le c \cdot g(n)$.
In plain English: beyond some input size, $f(n)$ is bounded above by a constant multiple of $g(n)$. We do not care about small inputs, and we absorb constant factors into $c$.
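A quick worked instance of the definition: to show $3n^2 + 5n = O(n^2)$, take $c = 4$ and $n_0 = 5$:

$$
3n^2 + 5n \;\le\; 3n^2 + n^2 \;=\; 4n^2 \quad \text{for all } n \ge 5,
$$

since $5n \le n^2$ whenever $n \ge 5$.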
There are three related notations, and knowing all three makes you much more precise:
- Big-O ($O$) - upper bound. $f(n) = O(g(n))$ means $f$ grows no faster than $g$. Like saying "at most this fast."
- Big-Omega ($\Omega$) - lower bound. $f(n) = \Omega(g(n))$ means $f$ grows at least as fast as $g$. Like saying "at minimum this fast."
- Big-Theta ($\Theta$) - tight bound. $f(n) = \Theta(g(n))$ means $f$ and $g$ grow at the same rate - both the upper and lower bounds match.
When we say "matrix multiply is " we mean the naive algorithm - it is actually because both upper and lower bounds are cubic. When we say "sorting is " for comparison-based sorts in the average case, we should really say .
ML engineers typically use Big-O loosely to mean "tight bound" (i.e., Big-Theta). This is fine in practice, but be aware of the distinction when someone says "this is $O(n^2)$ at worst" - they mean upper bound, not tight bound.
Common Complexity Classes in ML
| Complexity | Name | ML Example |
|---|---|---|
| $O(1)$ | Constant | Hash table lookup, dict access |
| $O(\log n)$ | Logarithmic | Binary search in sorted array, HNSW ANN search |
| $O(n)$ | Linear | Softmax over vocabulary, single pass over dataset |
| $O(n \log n)$ | Linearithmic | Sorting tokens by length, FFT-based convolution |
| $O(n^2)$ | Quadratic | Naive self-attention, naive k-NN search |
| $O(n^3)$ | Cubic | Naive dense matrix multiply |
| $O(2^n)$ | Exponential | Exhaustive NAS search over architecture space |
The jump from $O(n)$ to $O(n^2)$ is where ML systems most often break. At $n = 10^6$, linear is $10^6$ operations, quadratic is $10^{12}$ operations - a million-fold difference. This is why going from a 512-token BERT to a 4096-token document model is not "8x harder" - it is $8^2 = 64\times$ harder for the attention computation.
Time vs Space Complexity
Every algorithm has both time complexity (how many operations) and space complexity (how much memory). They are often in tension, and ML adds a third dimension: memory bandwidth complexity (how much data is moved between cache levels, or RAM and GPU high-bandwidth memory).
The classic tradeoff: dynamic programming trades space for time. You store intermediate results ($O(n)$ or $O(n^2)$ space) to avoid recomputing them, often turning exponential recomputation into polynomial time. The right side of the tradeoff depends on your bottleneck.
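A minimal illustration of the tradeoff, using Fibonacci as a stand-in for any overlapping-subproblem computation: the memoized version spends $O(n)$ space on a cache to collapse exponential recomputation into $O(n)$ time.

```python
from functools import lru_cache

def fib_naive(n: int) -> int:
    """Exponential time: recomputes the same subproblems over and over."""
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

@lru_cache(maxsize=None)          # O(n) extra space for the memo table
def fib_memo(n: int) -> int:
    """Linear time: each subproblem is computed once and cached."""
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)

# fib_naive(35) takes seconds of pure recomputation; fib_memo(35) is effectively instant.
print(fib_memo(35))  # 9227465
```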
For GPU ML workloads, the dominant bottleneck is often memory bandwidth, not FLOPs. A GPU can perform far more arithmetic than it can move data. This is why FlashAttention's contribution was not fewer FLOPs - the FLOPs were identical to standard attention - but fewer round-trips to GPU HBM memory. The attention matrix was being written to and read from HBM repeatedly. FlashAttention tiles the computation to stay in SRAM, cutting memory traffic dramatically.
Amortized Analysis: Dynamic Arrays
Amortized analysis asks: what is the average cost per operation over a long sequence of operations, even if individual operations are sometimes expensive?
The canonical example is Python's list (a dynamic array). Appending to a list is usually $O(1)$ - just write the value to the next pre-allocated slot. But occasionally, the backing array is full and must be resized: a new, larger array (2x the current size in the textbook analysis; CPython actually over-allocates by a smaller geometric factor) is allocated, and all existing elements are copied over. This resize operation costs $O(n)$.
How expensive is appending $n$ elements in total?
Without amortized analysis, you might say "each append is $O(n)$ in the worst case, so $n$ appends cost $O(n^2)$ total." But that is wrong. With doubling, resizes happen at sizes 1, 2, 4, 8, ..., up to $n$, so the total copy work is: $1 + 2 + 4 + \cdots + n < 2n = O(n)$.
Amortized over $n$ appends, each append costs $O(1)$.
In ML this matters for: gradient accumulation buffers, replay buffers in reinforcement learning, online learning with streaming data, and any system that builds up data structures incrementally. When you see occasional latency spikes during training, it is worth checking whether an amortized cost is becoming visible.
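A small sketch that makes the resizing visible: CPython over-allocates list storage geometrically (its growth factor is closer to 1.125x than the textbook 2x, but the amortized argument is the same), so the allocated size jumps only occasionally while most appends reuse existing capacity.

```python
import sys

lst = []
prev_size = sys.getsizeof(lst)
for i in range(10_000):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != prev_size:
        # A resize happened: capacity grew geometrically, costing an O(len) copy
        print(f"append #{i:>5}: allocation grew {prev_size} -> {size} bytes")
        prev_size = size
# Only a few dozen resizes occur across 10,000 appends,
# so the average (amortized) cost per append stays O(1).
```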
The Master Theorem for Divide-and-Conquer
Many ML algorithms use divide-and-conquer: split the problem in half, solve recursively, combine results. The Master Theorem gives you the time complexity of such algorithms without solving the recurrence manually.
For a recurrence of the form $T(n) = a \, T(n/b) + f(n)$, where:
- $a$ = number of subproblems
- $b$ = factor by which the input size shrinks per level
- $f(n)$ = cost of dividing and combining
The solution depends on how $f(n)$ compares to $n^{\log_b a}$:
- If $f(n) = O(n^{\log_b a - \epsilon})$ for some $\epsilon > 0$ (subproblems dominate): $T(n) = \Theta(n^{\log_b a})$
- If $f(n) = \Theta(n^{\log_b a})$ (balanced): $T(n) = \Theta(n^{\log_b a} \log n)$
- If $f(n) = \Omega(n^{\log_b a + \epsilon})$ and a regularity condition holds (combining dominates): $T(n) = \Theta(f(n))$
Practical examples in ML:
- Merge sort: $T(n) = 2T(n/2) + \Theta(n)$. Here $a = 2$, $b = 2$, $f(n) = \Theta(n)$, $n^{\log_b a} = n$. The equal case applies: $T(n) = \Theta(n \log n)$.
- Strassen matrix multiply: $T(n) = 7T(n/2) + \Theta(n^2)$. Here $n^{\log_2 7} \approx n^{2.81}$. Subproblems dominate: $T(n) = \Theta(n^{2.81})$ - faster than naive $\Theta(n^3)$.
- FFT-based convolution: the FFT satisfies $T(n) = 2T(n/2) + \Theta(n)$. Same as merge sort: $\Theta(n \log n)$ - which is why FFT convolution beats direct convolution for large kernels.
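A small helper that applies the three cases for the common situation where $f(n) = \Theta(n^c)$ (a sketch with an illustrative function name; it ignores the regularity condition and log factors in $f$):

```python
import math

def master_theorem(a: int, b: int, c: float) -> str:
    """Solve T(n) = a*T(n/b) + Theta(n^c) by comparing c to log_b(a)."""
    crit = math.log(a, b)                      # critical exponent log_b(a)
    if c < crit:
        return f"Theta(n^{crit:.2f})"          # case 1: subproblems dominate
    if abs(c - crit) < 1e-9:
        return f"Theta(n^{c:g} log n)"         # case 2: balanced
    return f"Theta(n^{c:g})"                   # case 3: combine step dominates

print(master_theorem(2, 2, 1))   # merge sort / FFT -> Theta(n^1 log n)
print(master_theorem(7, 2, 2))   # Strassen         -> Theta(n^2.81)
print(master_theorem(4, 2, 2))   # 4 subproblems, n^2 combine -> Theta(n^2 log n)
```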
Transformer Attention Complexity in Depth
Self-attention takes a sequence of $n$ tokens, each represented as a $d$-dimensional vector, and produces for each token a weighted sum over all other tokens.
Step by step:
- Compute Q, K, V projections: three matrix multiplies of the $(n \times d)$ input by a $(d \times d)$ weight = $O(nd^2)$ each, $O(nd^2)$ total
- Compute raw attention scores $QK^\top$: $O(n^2 d)$ operations, produces an $n \times n$ matrix
- Scale and softmax over the score matrix: $O(n^2)$
- Weighted sum of the values, $\text{softmax}(QK^\top/\sqrt{d})\,V$: $O(n^2 d)$ operations
The dominant terms are $O(n^2 d)$ time and $O(n^2)$ space (storing the attention matrix in memory).
For BERT-base with $n = 512$: $512^2 \approx 262{,}000$ attention weights per head. Fine. For $n = 4096$: $4096^2 \approx 16.8$ million weights per head. With 16 heads and float16 (2 bytes per value): $16.8\text{M} \times 16 \times 2 \text{ bytes} \approx 0.5$ GB of attention weights per layer.
With 24 layers: about 12 GB just for forward-pass attention matrices at batch size 1. The backward pass needs to store those same matrices for gradient computation, so double it to roughly 24 GB. Already beyond a 16 GB V100 and most of a 32 GB one before counting batch size, activations, or optimizer state.
FlashAttention (Dao et al., 2022) achieves $O(n)$ memory by computing attention in tiles that fit in SRAM, never writing the full $n \times n$ matrix to HBM. The FLOPs are still $O(n^2 d)$ - it has to visit every pair of tokens - but the memory is $O(n)$.
Standard attention: O(n^2 d) FLOPs, O(n^2) memory <- breaks at n > ~2048
FlashAttention: O(n^2 d) FLOPs, O(n) memory <- unlocks n = 100k+
k-NN Complexity: Exact vs Approximate
Exact k-nearest-neighbor search with $n$ training points in $d$ dimensions:
- Query time: $O(nd)$ - compute the distance from the query to every stored point
- Space: $O(nd)$ - store all points
At $n = 10^8$ (a product catalog of 100M items) and $d = 768$ (typical BERT embedding), each query requires roughly $7.7 \times 10^{10}$ floating point operations. On hardware delivering $10^{12}$ FLOP/s, that is 77ms per query - before any memory bottleneck. With a sequential read from RAM at roughly 50 GB/s, reading the ~300 GB of float32 vectors takes about 6 seconds. Unusable.
FAISS HNSW reduces query complexity to approximately $O(d \log n)$ amortized by building a hierarchical graph where shortcuts across long distances are stored at high layers and local connections at the bottom layer. You navigate from coarse to fine - like using a highway system, not walking every street.
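A hedged sketch of the two regimes using FAISS (assuming the faiss-cpu package is installed; the index parameters, such as the HNSW connectivity value 32, are illustrative rather than tuned):

```python
import numpy as np
import faiss  # assumption: faiss-cpu is available in your environment

d, n_db, n_query, k = 128, 100_000, 10, 5
xb = np.random.randn(n_db, d).astype(np.float32)
xq = np.random.randn(n_query, d).astype(np.float32)

# Exact search: O(n * d) distance computations per query
flat = faiss.IndexFlatL2(d)
flat.add(xb)
D_exact, I_exact = flat.search(xq, k)

# HNSW: approximately O(d * log n) distance computations per query
hnsw = faiss.IndexHNSWFlat(d, 32)  # 32 = graph connectivity (M)
hnsw.add(xb)
D_ann, I_ann = hnsw.search(xq, k)

recall = np.mean([len(set(a) & set(b)) / k for a, b in zip(I_exact, I_ann)])
print(f"recall@{k} of HNSW vs exact search: {recall:.2f}")
```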
Gradient Descent Iteration Complexity
One epoch of SGD over $n$ training samples:
- Forward pass: $O(n \cdot C)$, where $C$ is the model's per-sample computational cost
- Backward pass: approximately $2\times$ the forward pass, still $O(n \cdot C)$ (by reverse-mode autodiff theory, the backward pass costs a small constant multiple of the forward pass)
- Parameter update: $O(p)$ per optimizer step, where $p$ is the number of parameters
For convergence: convex objectives converge in $O(1/\epsilon^2)$ gradient steps for SGD or $O(1/\epsilon)$ for full-batch gradient descent with a constant step size, where $\epsilon$ is the target optimality gap. Non-convex neural networks have no such clean bound, but empirically the relationship between compute budget and loss follows a power law (Chinchilla scaling laws). The number of steps is not a clean complexity formula - it depends on the loss landscape, learning rate schedule, and architecture.
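A rough back-of-envelope estimator for one epoch's FLOP budget, under the assumption (stated above) that the backward pass costs about twice the forward pass; the function name and the example numbers are illustrative, not measurements.

```python
def epoch_flops(n_samples: int, forward_flops_per_sample: float,
                n_params: int, batch_size: int,
                backward_multiplier: float = 2.0) -> dict:
    """Rough FLOP budget for one epoch of minibatch SGD (constant factors omitted)."""
    forward = n_samples * forward_flops_per_sample        # O(n * C)
    backward = backward_multiplier * forward               # ~2x forward (reverse-mode AD)
    n_steps = n_samples // batch_size
    update = n_steps * n_params                             # ~O(p) work per optimizer step
    return {"forward": forward, "backward": backward,
            "update": update, "total": forward + backward + update}

# Example: 1M samples, 1 GFLOP per-sample forward pass, 100M parameters, batch 256
budget = epoch_flops(1_000_000, 1e9, 100_000_000, 256)
print(f"total ~ {budget['total']:.2e} FLOPs per epoch")  # dominated by forward + backward
```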
Cache-Aware Complexity
Modern CPUs have a deep memory hierarchy: registers (1 cycle), L1 cache (4 cycles), L2 cache (12 cycles), L3 cache (40 cycles), RAM (200 cycles). A cache miss - accessing data not in the CPU's fast cache - costs 50 to 200 times more than a cache hit.
Cache-oblivious algorithms are designed to work well regardless of cache size by naturally traversing data in cache-friendly patterns. Naive matrix multiply in row-major order repeatedly accesses columns of one matrix, which are non-contiguous in memory, causing cache thrashing. Blocked (tiled) matrix multiply processes submatrices that fit in cache, reducing cache misses significantly.
This is why BLAS libraries (and PyTorch's underlying kernels like cuBLAS) dramatically outperform naive implementations even though both are nominally : the constant factor hidden inside that is 100-1000x smaller in optimized BLAS due to cache efficiency, SIMD vectorization, and thread-level parallelism.
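A small demonstration of access-pattern cost on a C-ordered (row-major) NumPy array: reading rows is contiguous, reading columns is strided, and the strided traversal is measurably slower even though both perform the same number of additions (the exact ratio depends on your CPU's cache hierarchy).

```python
import time
import numpy as np

n = 4000
A = np.random.randn(n, n)  # C-ordered (row-major) by default, ~128 MB

start = time.perf_counter()
row_total = sum(A[i, :].sum() for i in range(n))   # contiguous reads
t_rows = time.perf_counter() - start

start = time.perf_counter()
col_total = sum(A[:, j].sum() for j in range(n))   # strided reads (stride = n * 8 bytes)
t_cols = time.perf_counter() - start

print(f"row traversal: {t_rows:.3f}s, column traversal: {t_cols:.3f}s "
      f"({t_cols / t_rows:.1f}x slower)")
```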
GPU Parallelism and Complexity
On a GPU, many operations that are $O(n)$ sequentially take roughly $O(n/p)$ (or $O(\log n)$ for reductions) wall-clock time because the GPU executes thousands of threads ($p$) simultaneously. But the theoretical complexity does not change - we just have a smaller constant.
The key metric for GPU utilization is arithmetic intensity: FLOPs divided by bytes moved from memory.
- Matrix multiply: $O(n^3)$ FLOPs, $O(n^2)$ memory reads. Intensity = $O(n)$. Gets more compute-bound as $n$ grows - great for GPUs.
- Elementwise ops (ReLU, GELU): $O(n)$ FLOPs, $O(n)$ memory. Intensity = $O(1)$. Always memory-bound. More CUDA cores cannot help.
- Softmax: $O(n)$ FLOPs, $O(n)$ memory. Memory-bound. Flash-style online softmax improves this by reducing memory passes.
- Layer normalization: $O(n)$ FLOPs, $O(n)$ memory. Memory-bound.
Knowing this, you can predict: the bottleneck in a transformer training step is not the matrix multiplications (those hit near-peak GPU utilization) but the elementwise operations and normalization layers (memory-bound, low arithmetic intensity).
Code: Complexity Measurement Framework
The most reliable way to confirm your complexity analysis is to measure it empirically. The approach: time the algorithm at multiple input sizes, plot on log-log axes, and measure the slope. A slope of 1 means $O(n)$, slope of 2 means $O(n^2)$, slope of 3 means $O(n^3)$.
import time
import math
import numpy as np
from typing import Callable, List, Tuple
def measure_complexity(
func: Callable,
input_sizes: List[int],
generator: Callable,
n_trials: int = 3
) -> Tuple[List[int], List[float]]:
"""
Measure empirical runtime at multiple input sizes.
Args:
func: the function to benchmark
input_sizes: list of n values to test
generator: function that takes n and returns input for func
n_trials: number of timing trials per size (take median)
Returns:
(sizes, median_times_in_seconds)
"""
times = []
for n in input_sizes:
inp = generator(n)
trial_times = []
for _ in range(n_trials):
start = time.perf_counter()
func(inp)
end = time.perf_counter()
trial_times.append(end - start)
times.append(float(np.median(trial_times)))
return input_sizes, times
def estimate_exponent(sizes: List[int], times: List[float]) -> float:
"""
Estimate the complexity exponent from log-log regression.
slope = 1 -> O(n)
slope = 2 -> O(n^2)
slope = 3 -> O(n^3)
"""
log_n = [math.log(n) for n in sizes]
log_t = [math.log(t) for t in times]
n_pts = len(log_n)
mean_n = sum(log_n) / n_pts
mean_t = sum(log_t) / n_pts
numerator = sum(
(log_n[i] - mean_n) * (log_t[i] - mean_t) for i in range(n_pts)
)
denominator = sum((log_n[i] - mean_n) ** 2 for i in range(n_pts))
return numerator / denominator
def classify_complexity(exponent: float) -> str:
"""Map numeric exponent to a Big-O label."""
if exponent < 0.1:
return "O(1)"
elif exponent < 0.6:
return "O(log n)"
elif exponent < 1.15:
return "O(n)"
elif exponent < 1.5:
return "O(n log n)"
elif exponent < 2.15:
return "O(n^2)"
elif exponent < 3.15:
return "O(n^3)"
else:
return f"O(n^{exponent:.1f})"
# --- Benchmark 1: Matrix multiply ---
def numpy_matmul(args):
A, B = args
return A @ B
sizes, times = measure_complexity(
numpy_matmul,
    input_sizes=[128, 256, 512, 1024, 2048],
generator=lambda n: (
np.random.randn(n, n).astype(np.float32),
np.random.randn(n, n).astype(np.float32),
)
)
exp = estimate_exponent(sizes, times)
print(f"Matrix multiply: exponent={exp:.2f} -> {classify_complexity(exp)}")
# Expected: ~2.7 to 3.0 (approaches cubic at large n)
# --- Benchmark 2: Softmax - expect O(n) ---
def softmax(x):
e_x = np.exp(x - x.max())
return e_x / e_x.sum()
sizes, times = measure_complexity(
softmax,
    input_sizes=[1_000, 10_000, 100_000, 500_000, 1_000_000],
generator=lambda n: np.random.randn(n).astype(np.float32),
n_trials=10
)
exp = estimate_exponent(sizes, times)
print(f"Softmax: exponent={exp:.2f} -> {classify_complexity(exp)}")
# Expected: ~1.0 (linear)
# --- Benchmark 3: Self-attention - expect O(n^2) ---
def naive_attention(inputs):
Q, K, V = inputs
n, d = Q.shape
scores = Q @ K.T / (d ** 0.5) # (n, n)
scores -= scores.max(axis=-1, keepdims=True) # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
return weights @ V # (n, d)
sizes, times = measure_complexity(
naive_attention,
    input_sizes=[64, 128, 256, 512, 1024, 2048],
generator=lambda n: (
np.random.randn(n, 64).astype(np.float32),
np.random.randn(n, 64).astype(np.float32),
np.random.randn(n, 64).astype(np.float32),
),
n_trials=5
)
exp = estimate_exponent(sizes, times)
print(f"Self-attention: exponent={exp:.2f} -> {classify_complexity(exp)}")
# Expected: ~2.0 (quadratic in sequence length)
# --- Benchmark 4: Sorting - expect O(n log n) ---
sizes, times = measure_complexity(
lambda x: np.sort(x),
    input_sizes=[10_000, 100_000, 500_000, 1_000_000, 5_000_000],
generator=lambda n: np.random.randn(n).astype(np.float32),
n_trials=5
)
exp = estimate_exponent(sizes, times)
print(f"numpy.sort: exponent={exp:.2f} -> {classify_complexity(exp)}")
# Expected: ~1.0 to 1.1 (n log n looks close to linear for moderate n)
Code: Attention Memory Demo - Standard vs FlashAttention
This demonstrates the memory savings, not the FLOPs. FlashAttention never allocates the full attention matrix.
def standard_attention_memory_mb(n: int, d: int, dtype_bytes: int = 2) -> dict:
"""
Compute memory usage (MB) for standard attention.
dtype_bytes=2 for float16, =4 for float32.
"""
# Q, K, V matrices stored: 3 * n * d elements
qkv_mb = 3 * n * d * dtype_bytes / (1024 ** 2)
# Attention score matrix (n x n) written to HBM - the quadratic term
scores_mb = n * n * dtype_bytes / (1024 ** 2)
# Softmax output (n x n) - also written to HBM
weights_mb = n * n * dtype_bytes / (1024 ** 2)
# Output (n x d)
output_mb = n * d * dtype_bytes / (1024 ** 2)
total_mb = qkv_mb + scores_mb + weights_mb + output_mb
return {
"qkv_mb": round(qkv_mb, 2),
"attention_matrix_mb": round(scores_mb, 2), # This is O(n^2)
"weights_mb": round(weights_mb, 2),
"output_mb": round(output_mb, 2),
"total_mb": round(total_mb, 2),
}
def flash_attention_memory_mb(n: int, d: int,
block_size: int = 64,
dtype_bytes: int = 2) -> dict:
"""
FlashAttention: never materializes the full n x n matrix.
Processes one (block_size x block_size) tile at a time in SRAM.
"""
# Q, K, V stored in HBM: 3 * n * d
qkv_mb = 3 * n * d * dtype_bytes / (1024 ** 2)
# Only one block of scores in SRAM: block_size x block_size
sram_block_mb = block_size * block_size * dtype_bytes / (1024 ** 2)
# Running softmax statistics (max and sum per row): 2 * n
stats_mb = 2 * n * dtype_bytes / (1024 ** 2)
# Output: n * d
output_mb = n * d * dtype_bytes / (1024 ** 2)
# SRAM block is a constant - does NOT grow with n
total_hbm_mb = qkv_mb + stats_mb + output_mb
return {
"qkv_mb": round(qkv_mb, 2),
"sram_block_mb": round(sram_block_mb, 4), # constant regardless of n
"stats_mb": round(stats_mb, 4),
"output_mb": round(output_mb, 2),
"total_hbm_mb": round(total_hbm_mb, 2),
}
print(f"{'n':<8} {'Standard (MB)':<18} {'FlashAttn (MB)':<18} {'Savings'}")
print("-" * 56)
for n in [512, 1024, 2048, 4096, 8192, 16384, 32768]:
d = 64
std = standard_attention_memory_mb(n, d)
flash = flash_attention_memory_mb(n, d)
savings = std["total_mb"] / flash["total_hbm_mb"]
print(f"{n:<8} {std['total_mb']:<18.1f} {flash['total_hbm_mb']:<18.1f} {savings:.1f}x")
# Approximate output (float16, d=64):
# n        Standard (MB)      FlashAttn (MB)     Savings
# --------------------------------------------------------
# 512      1.2                0.2                5.0x
# 1024     4.5                0.5                9.0x
# 2048     17.0               1.0                16.8x
# 4096     66.0               2.0                32.7x
# 8192     260.0              4.0                64.5x
# 16384    1032.0             8.1                128.0x
# 32768    4112.0             16.1               255.1x
#
# The memory advantage grows linearly with n (the ratio is roughly n / 2d).
# At n=32768, standard attention needs roughly 250x more memory.
# This is why GPT-4 class models can run 128k context with FlashAttention-style
# kernels but would OOM trying standard attention at that length.
Code: Matrix Multiply Complexity Profiling
import numpy as np
import time
def profile_matmul_at_scales():
"""
Profile matrix multiply at multiple sizes.
Demonstrates that BLAS approaches O(n^3) but with small constants.
"""
sizes = [64, 128, 256, 512, 1024, 2048, 4096]
results = []
for n in sizes:
A = np.random.randn(n, n).astype(np.float32)
B = np.random.randn(n, n).astype(np.float32)
# Warmup
_ = A @ B
# Time 5 runs
times_ms = []
for _ in range(5):
start = time.perf_counter()
C = A @ B
end = time.perf_counter()
times_ms.append((end - start) * 1000)
median_ms = np.median(times_ms)
# Theoretical FLOPs: 2 * n^3 (multiply-add pairs)
flops = 2 * n ** 3
gflops = flops / (median_ms / 1000) / 1e9
results.append((n, median_ms, gflops))
print(f"n={n:5d}: {median_ms:8.2f}ms {gflops:8.1f} GFLOP/s")
return results
results = profile_matmul_at_scales()
# Analysis: check how time scales with n
# If time(2n) / time(n) ~ 8, that confirms O(n^3)
print("\nScaling factors (should be ~8 for O(n^3)):")
for i in range(1, len(results)):
n1, t1, _ = results[i-1]
n2, t2, _ = results[i]
ratio = t2 / t1
print(f" {n1} -> {n2}: {ratio:.2f}x (expected {(n2/n1)**3:.1f}x for O(n^3))")
Production Engineering Notes
1. Profile before optimizing. Complexity analysis tells you the asymptotic behavior. Profiling tells you where time is actually being spent in your specific workload. A theoretically $O(n^2)$ operation on small $n$ may be faster than a theoretically $O(n \log n)$ operation with large constants. Use torch.profiler, nsys (NVIDIA Nsight Systems), or py-spy before making any algorithmic changes.
2. Constants dominate at production scale too. An $O(n)$ algorithm with a constant factor of $10^6$ is slower than an $O(n^2)$ algorithm with a constant factor of 1 for all $n < 10^6$. Constant factors come from: memory access patterns, SIMD vectorization, Python interpreter overhead vs compiled kernels, and cache efficiency. Always benchmark with realistic $n$, not just analyze asymptotically.
3. Memory complexity is often the binding constraint in ML. A model that is compute-bound on a single GPU may become memory-bound when you try to increase batch size. The GPU out-of-memory error you see during training is almost always a space complexity problem - an $O(n^2)$ or $O(\text{batch} \times n^2)$ term materializing in memory. Know your model's space complexity before scaling.
4. Batch dimension changes the analysis. When analyzing transformer complexity, the batch dimension $b$ is usually independent of sequence length $n$: $O(n^2 d)$ attention per sequence $= O(b \, n^2 d)$ total. If you scale batch size and sequence length together, you see effectively cubic behavior. Understand which dimensions you are scaling before predicting runtime.
5. Distributed training changes what "input size" means. In data parallelism, each GPU processes $n/g$ samples (where $g$ is the GPU count). Per-GPU complexity stays the same; you add $O(p)$ communication overhead per step for gradient synchronization (an all-reduce over the $p$ gradients). In tensor parallelism, the hidden dimension - and therefore each weight matrix - is split across GPUs. Always specify which dimension you mean when discussing distributed complexity.
6. GPU wall-clock complexity is not the same as operation complexity. A GPU with 10,000 CUDA cores can execute on the order of 10,000 operations per clock cycle, so compute-bound elementwise operations take roughly $n/10{,}000$ wall-clock cycles. But memory-bound operations still take time proportional to the $O(n)$ bytes moved, because the bottleneck is memory bandwidth, not compute. Roofline analysis is the right framework here.
:::danger Common Mistake: Ignoring Space Complexity
Engineers who only analyze time complexity are blindsided by OOM errors. A function that is $O(n^2)$ in both time and space (like standard attention) will blow up memory before it becomes noticeably slow. Always analyze both dimensions. The single most common production ML failure is a model that works at batch size 1 but OOMs at batch size 32 - this is almost always a space complexity problem, specifically an intermediate tensor growing as $O(\text{batch} \times n^2)$.
:::
:::warning The Premature Optimization Trap
Complexity analysis is a pre-implementation tool for ruling out bad approaches, and a post-profiling tool for understanding bottlenecks. It is NOT a substitute for measuring. Many ML engineers waste days "optimizing" code that runs in 2% of total training time. Measure first with a profiler. The hotspot is almost always in the data loader, the attention layer, or the optimizer step - and only profiling tells you which one.
:::
:::tip The Doubling Experiment
When unsure of your implementation's actual complexity, run the doubling experiment: time it at $n$, $2n$, $4n$. If doubling $n$ doubles the time: $O(n)$. If it quadruples: $O(n^2)$. If it octuples: $O(n^3)$. This is the fastest empirical complexity test and takes under 10 lines of code.
:::
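A minimal sketch of the doubling experiment, applied here to NumPy matrix multiplication (swap in any function of $n$; the helper name is illustrative):

```python
import time
import numpy as np

def time_at(n: int) -> float:
    """Median wall-clock time for an n x n matrix multiply."""
    A, B = np.random.randn(n, n), np.random.randn(n, n)
    trials = []
    for _ in range(3):
        start = time.perf_counter()
        A @ B
        trials.append(time.perf_counter() - start)
    return float(np.median(trials))

base = 512
t1, t2, t4 = time_at(base), time_at(2 * base), time_at(4 * base)
print(f"n -> 2n: {t2 / t1:.1f}x   2n -> 4n: {t4 / t2:.1f}x")
# Ratios near 2 suggest O(n), near 4 suggest O(n^2), near 8 suggest O(n^3).
```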
Interview Q&A
Q1: Why is self-attention $O(n^2)$ and not $O(n)$? Can it ever be made $O(n)$?
Self-attention computes a score between every pair of tokens: token $i$ attends to token $j$ for all pairs $(i, j)$. This is $n^2$ pairs, requiring $O(n^2 d)$ operations and $O(n^2)$ memory for the score matrix. Full self-attention fundamentally requires knowing the relationship between every pair, which cannot be done in less than $O(n^2)$ work without approximation.
Approximations can achieve subquadratic complexity. Linformer projects K and V to a fixed rank $k$, giving $O(nk)$ attention. Longformer uses local window attention plus global tokens, giving $O(nw)$ where $w$ is the window size. Performer uses random feature approximations to the attention kernel for $O(n)$ scaling in sequence length. These all trade recall quality for speed. FlashAttention achieves $O(n)$ memory while keeping $O(n^2 d)$ FLOPs by computing in tiles that stay in fast SRAM - it does not reduce FLOPs but eliminates the $O(n^2)$ memory allocation.
Q2: What is amortized $O(1)$ and why does it matter for ML training loops?
Amortized $O(1)$ means the average cost per operation over a long sequence is constant, even though individual operations may be expensive. Python's list append is the example: most appends are $O(1)$ (write to the next slot), but occasional resizes are $O(n)$ (copy the entire array). Over $n$ appends the total work is $O(n)$, giving $O(1)$ amortized.
In training this matters for: replay buffers in RL that occasionally compact, gradient accumulation buffers that flush periodically, and dataset index structures that grow as you stream data. If you see irregular latency spikes in your training step timer every few thousand steps, it may be an amortized cost becoming visible. Profile with torch.profiler and look for infrequent but expensive operations.
Q3: How do you empirically measure the time complexity of a model component?
Run the component at 5-8 input sizes spanning 2 or more orders of magnitude. Take the median timing across several trials (to reduce noise from cache effects, OS scheduling, etc.). For GPU operations, always call torch.cuda.synchronize() before reading the timer - GPU execution is asynchronous and without synchronization you are measuring the time to launch the kernel, not the time to complete it. Plot sizes vs times on log-log axes; the slope of the best-fit line is the complexity exponent. Slope 1 = $O(n)$, slope 2 = $O(n^2)$. Use torch.utils.benchmark.Timer for GPU benchmarking as it handles synchronization automatically.
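A hedged sketch of such a measurement with torch.utils.benchmark.Timer (assumes PyTorch is installed; falls back to CPU if no GPU is available):

```python
import torch
from torch.utils import benchmark

device = "cuda" if torch.cuda.is_available() else "cpu"

for n in [256, 512, 1024, 2048]:
    x = torch.randn(n, n, device=device)
    y = torch.randn(n, n, device=device)
    t = benchmark.Timer(stmt="x @ y", globals={"x": x, "y": y})
    m = t.timeit(50)  # 50 timed runs; CUDA synchronization handled by Timer
    print(f"n={n:5d}: {m.mean * 1e3:8.3f} ms")
# Fit the slope of log(time) vs log(n) to estimate the exponent, as above.
```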
Q4: Why is BLAS matrix multiply faster than a naive Python triple loop if both are $O(n^3)$?
Same asymptotic complexity, radically different constants. The naive Python loop has: Python interpreter overhead per iteration (100-1000x vs compiled C), poor cache utilization (column-major access causes cache misses in row-major stored arrays), no SIMD vectorization, no multi-threading. BLAS and cuBLAS use: compiled C/Fortran/CUDA, memory-tiled access patterns for cache efficiency, SIMD (AVX-512 on modern CPUs does 16 float32 multiply-adds per cycle), multi-threading, and on GPU, Tensor Cores that execute 4x4 matrix multiplications in a single instruction. Same $O(n^3)$, but the constant factor is roughly 10,000x smaller.
Q5: What is the complexity of the Adam optimizer step and why does it matter at scale?
Adam maintains two momentum vectors - first moment (mean) and second moment (variance) - per parameter. An Adam step costs $O(p)$ time and requires $O(p)$ space (parameters plus two moment vectors) where $p$ is the number of parameters. For GPT-3 at 175 billion parameters in float32, the optimizer state alone is roughly 1.4 TB - far beyond any single GPU's memory. This is why large model training requires ZeRO optimizer state sharding (distributing the state across GPUs), or bfloat16 parameters with float32 optimizer state, or AdaFactor which reduces second moment storage from $O(nm)$ to $O(n + m)$ for an $n \times m$ weight matrix by factoring the variance estimate.
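The arithmetic behind that number, as a quick sanity check (175B parameters is the commonly cited GPT-3 size; float32 state is assumed):

```python
params = 175e9                 # GPT-3 scale parameter count
bytes_fp32 = 4
moments = 2                    # Adam keeps first and second moment per parameter

optimizer_state_tb = params * moments * bytes_fp32 / 1e12
weights_tb = params * bytes_fp32 / 1e12
print(f"Adam moment state: {optimizer_state_tb:.1f} TB, fp32 weights: {weights_tb:.1f} TB")
# ~1.4 TB of optimizer state plus ~0.7 TB of weights - far beyond a single GPU.
```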
Q6: What is the Master Theorem and give an ML example where it applies?
The Master Theorem solves recurrences of the form $T(n) = a\,T(n/b) + f(n)$: split the problem into $a$ subproblems each of size $n/b$, with $f(n)$ work to divide/combine. Compare $f(n)$ to $n^{\log_b a}$: if subproblems dominate (case 1), the answer is $\Theta(n^{\log_b a})$; if they are equal (case 2), the answer is $\Theta(n^{\log_b a} \log n)$; if combining dominates (case 3), the answer is $\Theta(f(n))$.
ML example: FFT-based convolution uses an FFT whose recurrence is $T(n) = 2T(n/2) + O(n)$ (merge-sort style). $n^{\log_2 2} = n$, which matches $f(n)$. Case 2 applies: $T(n) = \Theta(n \log n)$. This is why 1D convolution with a large kernel uses FFT rather than direct convolution - direct is $O(nk)$ where $k$ is the kernel size, while FFT convolution is $O(n \log n)$ regardless of kernel size. For large kernels, $n \log n \ll nk$.
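A quick empirical check of that claim, comparing direct and FFT-based 1D convolution (assumes SciPy is available; sizes are chosen only to make the gap visible):

```python
import time
import numpy as np
from scipy.signal import fftconvolve

signal = np.random.randn(200_000)
kernel = np.random.randn(5_000)

start = time.perf_counter()
direct = np.convolve(signal, kernel, mode="full")        # O(n * k)
t_direct = time.perf_counter() - start

start = time.perf_counter()
via_fft = fftconvolve(signal, kernel, mode="full")        # O((n + k) log(n + k))
t_fft = time.perf_counter() - start

print(f"direct: {t_direct:.2f}s, fft: {t_fft:.3f}s, "
      f"max abs diff: {np.max(np.abs(direct - via_fft)):.2e}")
```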
Code: Roofline Model Analysis
The roofline model is the practical tool for understanding whether a GPU kernel is compute-bound or memory-bound. It maps arithmetic intensity (FLOPs per byte) to achievable performance, bounded by two limits: peak compute (TFLOP/s) and peak memory bandwidth (GB/s).
import numpy as np
from dataclasses import dataclass
from typing import List
@dataclass
class HardwareSpec:
"""Hardware performance characteristics."""
name: str
peak_tflops: float # Peak floating-point throughput (TFLOP/s)
peak_bandwidth_gbs: float # Peak memory bandwidth (GB/s)
@property
def ridge_point(self) -> float:
"""
Arithmetic intensity at which compute and bandwidth limits intersect.
Above this: compute-bound. Below: memory-bound.
Units: FLOP/byte
"""
return self.peak_tflops * 1e12 / (self.peak_bandwidth_gbs * 1e9)
@dataclass
class KernelProfile:
"""Profile of a GPU kernel or operation."""
name: str
flops: float # total floating point operations
bytes_accessed: float # total bytes read from / written to memory
measured_tflops: float = None # actually measured throughput (optional)
@property
def arithmetic_intensity(self) -> float:
"""FLOP per byte of memory access."""
return self.flops / self.bytes_accessed
def roofline_bound(self, hw: HardwareSpec) -> str:
"""Is this kernel compute-bound or memory-bound on this hardware?"""
if self.arithmetic_intensity >= hw.ridge_point:
return "compute-bound"
return "memory-bound"
def peak_achievable_tflops(self, hw: HardwareSpec) -> float:
"""Maximum achievable TFLOP/s on this hardware."""
compute_limit = hw.peak_tflops
bandwidth_limit = (self.arithmetic_intensity * hw.peak_bandwidth_gbs * 1e9) / 1e12
return min(compute_limit, bandwidth_limit)
def efficiency(self, hw: HardwareSpec) -> float:
"""Fraction of achievable peak actually reached (0-1)."""
if self.measured_tflops is None:
return None
return self.measured_tflops / self.peak_achievable_tflops(hw)
# Hardware specs (approximate values)
A100_fp16 = HardwareSpec(
name="A100 (fp16 with sparsity)",
peak_tflops=312.0,
peak_bandwidth_gbs=2000.0
)
A100_fp32 = HardwareSpec(
name="A100 (fp32)",
peak_tflops=19.5,
peak_bandwidth_gbs=2000.0
)
V100_fp16 = HardwareSpec(
name="V100 (fp16)",
peak_tflops=125.0,
peak_bandwidth_gbs=900.0
)
# Ridge points
for hw in [A100_fp16, A100_fp32, V100_fp16]:
print(f"{hw.name}: ridge point = {hw.ridge_point:.1f} FLOP/byte")
# Common ML kernels
# For a matrix multiply M x K x K x N with batch B:
# FLOPs = 2 * B * M * K * N (multiply-add pairs)
# Bytes = (B*M*K + B*K*N + B*M*N) * bytes_per_element
def matmul_kernel(B, M, K, N, dtype_bytes=2):
flops = 2 * B * M * K * N
bytes_read = (B * M * K + B * K * N) * dtype_bytes
bytes_write = B * M * N * dtype_bytes
return KernelProfile(
name=f"matmul({B}x{M}x{K} @ {B}x{K}x{N})",
flops=flops,
bytes_accessed=bytes_read + bytes_write
)
def elementwise_kernel(n_elements, flops_per_element=1, dtype_bytes=2):
return KernelProfile(
name=f"elementwise(n={n_elements})",
flops=n_elements * flops_per_element,
bytes_accessed=n_elements * dtype_bytes * 2 # read + write
)
# Example: transformer layer at different sequence lengths
print("\nTransformer Layer Roofline Analysis (A100 fp16):")
print(f"{'Operation':<40} {'AI (F/B)':<12} {'Bound':<18} {'Peak (TFLOP/s)'}")
print("-" * 85)
hw = A100_fp16
batch, seq, d_model = 32, 512, 1024
d_head = 64
n_heads = 16
kernels = [
# QKV projection: (B * seq * d_model) @ (d_model * 3*d_model)
matmul_kernel(batch, seq, d_model, 3 * d_model, dtype_bytes=2),
# Attention scores: (B * n_heads * seq * d_head) @ (B * n_heads * d_head * seq)
matmul_kernel(batch * n_heads, seq, d_head, seq, dtype_bytes=2),
# Softmax (elementwise, ~5 ops/element: exp, sum, divide)
elementwise_kernel(batch * n_heads * seq * seq, flops_per_element=5, dtype_bytes=2),
# Output projection
matmul_kernel(batch, seq, d_model, d_model, dtype_bytes=2),
# Layer norm (~10 ops/element)
elementwise_kernel(batch * seq * d_model, flops_per_element=10, dtype_bytes=2),
    # FFN up-projection (first of the two FFN matmuls, 4x expansion)
matmul_kernel(batch, seq, d_model, 4 * d_model, dtype_bytes=2),
]
for k in kernels:
bound = k.roofline_bound(hw)
peak = k.peak_achievable_tflops(hw)
print(f"{k.name:<40} {k.arithmetic_intensity:<12.1f} {bound:<18} {peak:.1f}")
Summary
Complexity analysis is the language of scalable ML systems. The framework:
- Big-O gives asymptotic upper bounds; Big-Theta is tight; Big-Omega is lower bound. Use them precisely.
- Time complexity and space complexity are both critical - GPU OOM errors are almost always a space complexity problem.
- Self-attention is $O(n^2 d)$ time and $O(n^2)$ space - FlashAttention reduces the space to $O(n)$ without changing FLOPs.
- Amortized analysis explains why dynamic data structures have $O(1)$ average cost per operation despite occasional expensive operations.
- The Master Theorem solves divide-and-conquer recurrences: merge sort and FFT are $\Theta(n \log n)$.
- Cache-aware complexity explains why BLAS beats naive loops by 10,000x with the same Big-O.
- Arithmetic intensity (FLOPs per byte) determines whether a GPU operation is compute-bound (matrix multiply) or memory-bound (softmax, layer norm).
- Always verify complexity empirically with the doubling experiment or log-log plot - theory and practice diverge due to constants and cache effects.
