Complexity Analysis - Computational Thinking for ML Engineers
Reading time: ~40 min | Interview relevance: High | Roles: MLE, AI Eng, Research Engineer, Applied Scientist
The Real Interview Moment
You are in a Google L5 MLE interview. You have just implemented a function that computes pairwise cosine similarities between all embeddings in a dataset. The interviewer asks: "What is the time complexity? What about space? Now, what if you have 10 million embeddings with 768 dimensions - how long will this take, and what would you do instead?"
You wrote similarities = embeddings @ embeddings.T / (norms[:, None] * norms[None, :]). You know the matrix multiplication is O(n^2 * d), and the result is an n x n matrix using O(n^2) memory. For n = 10M and d = 768, that is 10^14 operations and 800 TB of memory for the similarity matrix. You immediately propose approximate nearest neighbor search - FAISS with IVF or HNSW - which gives approximate results in O(n * log(n)) time and O(n * d) memory.
This is what complexity analysis in ML interviews looks like. It is not about reciting Big-O definitions - it is about connecting algorithmic complexity to real-world resource constraints (time, memory, cost) and knowing when an exact algorithm is infeasible and what approximation to use instead.
Candidates who say "it is O(n^2)" and stop get a "lean hire." Candidates who compute the actual wall-clock time, identify the bottleneck, and propose a scalable alternative get a "strong hire."
What You Will Master
- Define Big-O, Big-Omega, and Big-Theta with mathematical precision
- Analyze time and space complexity of common data structures and algorithms
- Apply amortized analysis to dynamic arrays and hash tables
- Compute complexity of ML-specific operations: training, inference, attention, convolution
- Estimate real-world resource requirements from complexity analysis
- Identify common interview traps in complexity questions
- Propose scalable alternatives when exact algorithms are infeasible
Self-Assessment: Where Are You Now?
| Skill | 1 - Cannot | 2 - Vaguely | 3 - Can Analyze | 4 - Can Estimate | 5 - Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Define Big-O formally | ___ | |||||
| Analyze loops and recursion | ___ | |||||
| Know complexity of common data structures | ___ | |||||
| Amortized analysis | ___ | |||||
| ML training complexity | ___ | |||||
| ML inference complexity | ___ | |||||
| Translate O(.) to wall-clock estimates | ___ | |||||
| Propose scalable alternatives | ___ |
Target: All 4s and 5s before your interview.
Part 1 - Formal Definitions
Big-O, Big-Omega, Big-Theta
Formal definitions:
- O(g(n)): f(n) = O(g(n)) if there exist constants c > 0 and n_0 such that f(n) <= c * g(n) for all n >= n_0
- Omega(g(n)): f(n) = Omega(g(n)) if there exist constants c > 0 and n_0 such that f(n) >= c * g(n) for all n >= n_0
- Theta(g(n)): f(n) = Theta(g(n)) if f(n) = O(g(n)) AND f(n) = Omega(g(n))
"Big-O describes the upper bound on growth rate - how an algorithm scales in the worst case as input size grows. O(n) means linear scaling, O(n^2) means quadratic. But Big-O alone is incomplete: it tells you the ceiling, not the floor. Big-Omega is the lower bound, and Big-Theta is the tight bound. In interviews, we usually say Big-O but mean Big-Theta - we describe the typical growth rate, not just the worst case. The distinction matters for algorithms like quicksort, which is O(n^2) worst case but Theta(n log n) average case."
Common Growth Rates
| Notation | Name | Example | n = 10^6 Operations |
|---|---|---|---|
| O(1) | Constant | Hash table lookup | 1 |
| O(log n) | Logarithmic | Binary search | 20 |
| O(n) | Linear | Array scan | 10^6 |
| O(n log n) | Linearithmic | Merge sort | 2 * 10^7 |
| O(n^2) | Quadratic | Nested loops, attention | 10^12 |
| O(n^3) | Cubic | Matrix multiplication (naive) | 10^18 |
| O(2^n) | Exponential | All subsets | Infeasible |
| O(n!) | Factorial | All permutations | Infeasible |
Rules for Combining
# Rule 1: Drop constants
# O(3n + 5) = O(n)
# Rule 2: Drop lower-order terms
# O(n^2 + n + 1) = O(n^2)
# Rule 3: Sequential operations add
# O(n) + O(m) = O(n + m)
# Rule 4: Nested operations multiply
# for i in range(n): # O(n)
# for j in range(m): # O(m)
# ... # O(1)
# Total: O(n * m)
# Rule 5: Different inputs = different variables
# Do NOT simplify O(n + m) to O(n) unless you know m <= n
Do NOT say "O(n + m) = O(2n) = O(n)" unless m is bounded by n. If n is the number of users and m is the number of products, they are independent variables. Interviewers specifically test whether you distinguish between different input dimensions. In ML: n = samples, d = features, k = clusters, c = classes - all independent.
Part 2 - Data Structure Complexity
Essential Data Structures
| Data Structure | Access | Search | Insert | Delete | Space | ML Use Case |
|---|---|---|---|---|---|---|
| Array | O(1) | O(n) | O(n) | O(n) | O(n) | Feature vectors, predictions |
| Dynamic Array | O(1) | O(n) | O(1)* | O(n) | O(n) | Building datasets |
| Linked List | O(n) | O(n) | O(1) | O(1) | O(n) | Rarely used in ML |
| Hash Map | O(1)* | O(1)* | O(1)* | O(1)* | O(n) | Vocabulary, feature maps |
| Binary Search Tree | O(log n) | O(log n) | O(log n) | O(log n) | O(n) | Sorted access patterns |
| Heap (Min/Max) | O(1) top | O(n) | O(log n) | O(log n) | O(n) | Top-k selection, priority queues |
| Trie | O(L) | O(L) | O(L) | O(L) | O(N*L) | Tokenization, prefix matching |
*Amortized or average case
Amortized Analysis
# Dynamic array (Python list): append is O(1) amortized
# How? When the array is full, it doubles in capacity.
# Doubling copies all n elements: O(n).
# But this happens only once every n insertions.
# Cost for n insertions: n * O(1) + O(1+2+4+...+n) = n + 2n = O(n)
# Amortized cost per insertion: O(n) / n = O(1)
# Hash table: lookup is O(1) amortized (with good hash function)
# Worst case: O(n) if all keys hash to same bucket
# Average case with load factor < 0.75: O(1)
# Resizing: same amortized argument as dynamic arrays
Part 3 - Sorting and Searching Complexity
| Algorithm | Best | Average | Worst | Space | Stable? | ML Relevance |
|---|---|---|---|---|---|---|
| Quicksort | O(n log n) | O(n log n) | O(n^2) | O(log n) | No | numpy.sort default |
| Mergesort | O(n log n) | O(n log n) | O(n log n) | O(n) | Yes | Stable sort for ties |
| Heapsort | O(n log n) | O(n log n) | O(n log n) | O(1) | No | In-place sorting |
| Counting Sort | O(n+k) | O(n+k) | O(n+k) | O(k) | Yes | Label histograms |
| Binary Search | O(1) | O(log n) | O(log n) | O(1) | N/A | Threshold search |
| Argpartition | O(n) | O(n) | O(n^2) | O(1) | No | Top-k selection (NumPy) |
Google tests sorting/searching complexity directly. They may ask: "Why is np.argpartition O(n) for top-k while np.argsort is O(n log n)?" (Answer: argpartition uses Introselect, a variant of quickselect, which only partially sorts.) Meta and Amazon test this indirectly through system design: "Your recommendation system needs to return top 100 items from 10M candidates - what is the complexity?" (Answer: O(n) with argpartition vs O(n log n) with full sort.)
Part 4 - ML Algorithm Complexity
Training Complexity
| Algorithm | Time (Training) | Space (Training) | Key Variables |
|---|---|---|---|
| Linear Regression (OLS) | O(nd^2 + d^3) | O(d^2) | n=samples, d=features |
| Linear Regression (GD) | O(ndt) | O(d) | t=iterations |
| Logistic Regression (GD) | O(ndt) | O(d) | Same as linear GD |
| Decision Tree | O(n * d * depth) | O(nodes) | Greedy split search |
| Random Forest | O(T * n * sqrt(d) * depth) | O(T * nodes) | T=trees |
| Gradient Boosting | O(T * n * d * depth) | O(T * nodes) | Sequential trees |
| K-NN | O(1) or O(n log n) | O(nd) | Lazy learner or KD-tree |
| K-Means | O(nkdt) | O(nd + kd) | k=clusters, t=iterations |
| SVM (kernel) | O(n^2 * d) to O(n^3) | O(n^2) | Kernel matrix |
| PCA | O(nd^2 + d^3) or O(n^2 d + n^3) | O(d^2) or O(n^2) | Depends on n vs d |
| Neural Network | O(B * L * d^2) per epoch | O(L * d^2 + B * d) | B=batch, L=layers, d=width |
Inference Complexity
| Algorithm | Time (Inference, single sample) | Space | Notes |
|---|---|---|---|
| Linear/Logistic Regression | O(d) | O(d) | Just a dot product |
| Decision Tree | O(depth) | O(nodes) | Follow one path |
| Random Forest | O(T * depth) | O(T * nodes) | T independent trees |
| K-NN | O(nd) | O(nd) | Must store all training data |
| Neural Network | O(L * d^2) | O(L * d^2) | Matrix multiplications |
| Transformer (self-attention) | O(n^2 * d) | O(n^2 + nd) | n=sequence length |
Part 5 - Deep Learning Complexity
Transformer Self-Attention: The O(n^2) Problem
# Standard self-attention
# Q, K, V each have shape (batch, seq_len, d_model)
# Step 1: QK^T - the bottleneck
# (batch, seq_len, d_model) @ (batch, d_model, seq_len)
# = (batch, seq_len, seq_len)
# Time: O(n^2 * d)
# Space: O(n^2) for the attention matrix
scores = Q @ K.transpose(0, 2, 1) # O(n^2 * d)
# Step 2: Softmax - along last axis
# Time: O(n^2)
weights = softmax(scores / sqrt(d)) # O(n^2)
# Step 3: Weighted sum with V
# (batch, seq_len, seq_len) @ (batch, seq_len, d_model)
# Time: O(n^2 * d)
output = weights @ V # O(n^2 * d)
# Total: O(n^2 * d) time, O(n^2) space
# For GPT-4 with n=128K and d=12288:
# n^2 = 1.6 * 10^10 - that is 16 billion attention entries per layer
Approaches to O(n^2) Attention
| Method | Time | Space | Tradeoff |
|---|---|---|---|
| Standard attention | O(n^2 d) | O(n^2) | Exact but quadratic |
| Flash Attention | O(n^2 d) | O(n) | Same time, O(n) memory via tiling |
| Sparse attention | O(n * sqrt(n) * d) | O(n * sqrt(n)) | Attends to fixed patterns |
| Linear attention | O(n * d^2) | O(d^2) | Approximate, kernel trick |
| Sliding window | O(n * w * d) | O(n * w) | Local attention only |
| Ring attention | O(n^2 d / P) | O(n^2 / P) per device | Distributed across P devices |
Never say "attention is O(n^2)" and leave it at that. You must specify what n is (sequence length), what the full complexity is (O(n^2 * d)), and whether you are talking about time or space. Then follow up with what this means practically: "For a 128K context window with d=12288, the attention matrix alone is 128K^2 * 4 bytes = 64 GB per layer, which is why we need Flash Attention or sparse methods." This demonstrates systems awareness.
Convolution Complexity
# 2D Convolution: input (batch, C_in, H, W), kernel (C_out, C_in, kH, kW)
# Output: (batch, C_out, H_out, W_out)
# where H_out = (H - kH + 2*pad) / stride + 1
# Time: O(batch * C_out * H_out * W_out * C_in * kH * kW)
# Space: O(batch * C_out * H_out * W_out) for output
# For a typical conv layer:
# batch=32, C_in=64, C_out=128, H=W=56, k=3
# FLOPs = 32 * 128 * 56 * 56 * 64 * 3 * 3 ≈ 7.3 * 10^9
Batch Size Effects
| Aspect | Small Batch (32) | Large Batch (4096) | Effect on Complexity |
|---|---|---|---|
| Gradient computation | O(B * model_params) | O(B * model_params) | Linear in B |
| Weight update | Once per batch | Once per batch | Independent of B |
| Total epochs | More updates/epoch | Fewer updates/epoch | Same total training |
| Memory | Low | High | Linear in B |
| GPU utilization | Low (underutilized) | High (saturated) | Throughput scales with B (up to a point) |
| Convergence | More noise, may generalize better | Less noise, may need LR warmup | Not purely a complexity question |
Part 6 - Space Complexity for ML
Memory Estimation
# Model parameters
# Linear layer: W (d_in, d_out) + b (d_out)
# Params = d_in * d_out + d_out
# Transformer block:
# Self-attention: W_Q, W_K, W_V, W_O - each (d, d) = 4 * d^2
# FFN: W1 (d, 4d) + W2 (4d, d) = 8 * d^2
# Total per block: ~12 * d^2
# GPT-3 (175B parameters):
# d = 12288, L = 96 layers, vocab = 50257
# Per layer: 12 * 12288^2 = 1.8 * 10^9
# All layers: 96 * 1.8 * 10^9 = 1.7 * 10^11
# + Embeddings: 50257 * 12288 ≈ 6 * 10^8
# Total: ~175 * 10^9 parameters
# Memory for parameters (float16):
# 175 * 10^9 * 2 bytes = 350 GB
# Memory for Adam optimizer states (float32):
# 175 * 10^9 * (4 + 4 + 4) bytes = 2.1 TB (params + momentum + variance)
Memory During Training
| Component | Memory | Depends On |
|---|---|---|
| Model parameters | O(P) | P = total parameters |
| Gradients | O(P) | Same size as parameters |
| Optimizer state (Adam) | O(3P) | Momentum + variance + master copy |
| Activations | O(B * L * d) | Batch * layers * hidden size |
| Workspace (GEMM, temp) | O(B * d^2) | Temporary buffers |
When I ask "how much memory does training this model require?", I am testing three things: (1) Can you enumerate all the memory components (not just parameters)? (2) Do you know that optimizer states often dominate? (3) Can you estimate actual numbers? A candidate who says "175B params * 2 bytes = 350 GB" is incomplete. A candidate who says "350 GB for params + 350 GB for gradients + 700 GB for Adam states + activations = ~2 TB minimum" gets strong marks.
Part 7 - Common Complexity Analysis Patterns
Pattern 1: Nested Loops
# O(n^2)
for i in range(n):
for j in range(n):
# O(1) work
# O(n^2) but NOT n^2 iterations (upper triangle)
for i in range(n):
for j in range(i, n):
# n + (n-1) + ... + 1 = n(n+1)/2 = O(n^2)
# O(n * m) - two different inputs
for i in range(n):
for j in range(m):
# O(1) work
Pattern 2: Logarithmic
# O(log n) - halving the problem
def binary_search(arr, target):
lo, hi = 0, len(arr) - 1
while lo <= hi: # Runs O(log n) times
mid = (lo + hi) // 2
if arr[mid] == target:
return mid
elif arr[mid] < target:
lo = mid + 1
else:
hi = mid - 1
return -1
# O(n log n) - divide and conquer
def merge_sort(arr):
if len(arr) <= 1:
return arr
mid = len(arr) // 2
left = merge_sort(arr[:mid]) # T(n/2)
right = merge_sort(arr[mid:]) # T(n/2)
return merge(left, right) # O(n)
# T(n) = 2T(n/2) + O(n) = O(n log n)
Pattern 3: Matrix Operations
# Matrix multiply: (n, d) @ (d, m) = (n, m)
# Time: O(n * d * m) - three nested loops
# Space: O(n * m) for the result
# Key insight for ML: most operations are matrix multiplies
# Linear layer: X @ W = (batch, d_in) @ (d_in, d_out) = O(batch * d_in * d_out)
# Attention QK^T: (batch, seq, d) @ (batch, d, seq) = O(batch * seq^2 * d)
Pattern 4: Recursion and the Master Theorem
# Recurrence: T(n) = a * T(n/b) + O(n^c)
# a = number of subproblems
# b = size reduction factor
# c = exponent of work per level
# Case 1: c < log_b(a) → T(n) = O(n^(log_b(a)))
# Case 2: c = log_b(a) → T(n) = O(n^c * log n)
# Case 3: c > log_b(a) → T(n) = O(n^c)
# Examples:
# Merge sort: T(n) = 2T(n/2) + O(n) → a=2, b=2, c=1
# log_2(2) = 1 = c → Case 2 → O(n log n)
# Binary search: T(n) = T(n/2) + O(1) → a=1, b=2, c=0
# log_2(1) = 0 = c → Case 2 → O(log n)
# Strassen: T(n) = 7T(n/2) + O(n^2) → a=7, b=2, c=2
# log_2(7) ≈ 2.81 > 2 → Case 1 → O(n^2.81)
Part 8 - Real-World Estimation
Translating Big-O to Wall-Clock Time
# Rule of thumb: modern hardware does ~10^9 simple operations per second
# O(n) with n = 10^6: ~1 ms (fast)
# O(n log n) with n = 10^6: ~20 ms (fast)
# O(n^2) with n = 10^6: ~10^12 ops ÷ 10^9 = ~1000 seconds (too slow)
# O(n^2) with n = 10^4: ~10^8 ops ÷ 10^9 = ~0.1 seconds (okay)
# For matrix operations (BLAS-optimized on GPU):
# GPU: ~10^13 FLOPS (A100)
# Matrix multiply (n, n) @ (n, n):
# n = 1000: 10^9 ops → ~0.1 ms on GPU
# n = 10000: 10^12 ops → ~100 ms on GPU
# n = 100000: 10^15 ops → ~100 seconds on GPU
The "Will It Fit?" Framework
| Scenario | n | d | Algorithm | Time | Memory | Feasible? |
|---|---|---|---|---|---|---|
| Pairwise distances | 10K | 100 | O(n^2 d) | ~1s | ~800 MB | Yes |
| Pairwise distances | 1M | 100 | O(n^2 d) | ~28 hours | ~8 TB | No |
| KNN search | 1M | 100 | Brute force O(nd) | ~0.1s per query | ~800 MB | Marginal |
| KNN search | 1M | 100 | HNSW O(d log n) | ~0.001s per query | ~1 GB | Yes |
| Attention | 1K seq | 768 | O(n^2 d) | ~0.001s | ~4 MB | Yes |
| Attention | 128K seq | 768 | O(n^2 d) | ~17s | ~64 GB | Need FlashAttention |
Practice Problems
Problem 1: Complexity of Feature Engineering Pipeline
Analyze the time and space complexity of this feature engineering pipeline:
def create_features(df, user_col, event_col, value_col):
# Step 1: Group by user
grouped = df.groupby(user_col) # O(?)
# Step 2: Aggregate per user
features = grouped[value_col].agg(['mean', 'std', 'max', 'count']) # O(?)
# Step 3: Pairwise interaction features
for col_a in feature_cols:
for col_b in feature_cols:
if col_a != col_b:
df[f'{col_a}_x_{col_b}'] = df[col_a] * df[col_b] # O(?)
# Step 4: Sort and compute lag features
df = df.sort_values([user_col, event_col]) # O(?)
df['lag_1'] = df.groupby(user_col)[value_col].shift(1) # O(?)
return df, features
Hint 1 - Direction
Analyze each step independently, then identify the bottleneck. Pay attention to the nested loop in Step 3 - how many feature pairs are there?
Hint 2 - Variables
Let n = number of rows, u = number of unique users, f = number of feature columns. Each step's complexity depends on different variables.
Hint 3 - Full Solution
| Step | Operation | Time | Space | Notes |
|---|---|---|---|---|
| 1 | GroupBy | O(n) | O(n) | Hash-based grouping |
| 2 | Aggregate | O(n) | O(u) | Single pass over grouped data |
| 3 | Pairwise interactions | O(f^2 * n) | O(f^2 * n) | f^2 new columns, each O(n) |
| 4a | Sort | O(n log n) | O(n) | Comparison sort |
| 4b | Lag per group | O(n) | O(n) | Single pass |
Bottleneck: Step 3 with O(f^2 * n). If f = 100 and n = 1M, this creates 10,000 new columns with 10^10 values - 80 GB at float64. The space is the real problem, not the time.
Optimization: Use feature selection to reduce f before creating interactions. Or use sparse representation. Or compute interactions only for the top-k most important features.
Scoring:
- Strong Hire: Correctly identifies the bottleneck, computes actual memory requirements, proposes optimization.
- Lean Hire: Correct analysis but does not estimate actual numbers.
- No Hire: Says "it is O(n)" without analyzing the nested loop.
Problem 2: Training vs Inference Complexity
A transformer model has L=12 layers, d=768, 8 attention heads, and a vocabulary of 50,000. For a sequence length of 512:
(a) What is the inference time complexity for generating one token? (b) What is the inference time for generating a 100-token response? (c) How does the KV cache help?
Hint 1 - Direction
For autoregressive generation, each new token requires attending to all previous tokens. Without KV cache, you recompute attention for the entire sequence. With KV cache, you only compute the new token's query against cached keys/values.
Hint 2 - Key insight
Without KV cache: generating token t requires O(t * d) attention per layer. Generating n tokens requires sum from t=1 to n of O(t * d * L) = O(n^2 * d * L). With KV cache: generating token t requires O(t * d) attention (same), but you skip re-computing K and V for previous tokens, saving O((t-1) * d^2) per layer.
Hint 3 - Full Solution
(a) Single token inference (at position t):
Per layer:
- Self-attention: Q for new token is O(d^2). K, V if not cached: O(d^2). Attention scores: O(t * d). Weighted sum: O(t * d). Total: O(d^2 + t * d).
- FFN: O(d * 4d) = O(d^2).
- Per layer total: O(d^2 + t * d).
All layers: O(L * (d^2 + t * d)).
For t=512, L=12, d=768: ~12 * (768^2 + 512*768) = ~12 * (590K + 393K) = ~12M ops.
(b) 100-token generation:
Without KV cache: for each token t, recompute the entire forward pass for the full sequence up to t. Total: sum from t=1 to 100 of O(L * t^2 * d) = O(L * d * 100^3 / 3) -- this is a loose upper bound. More precisely: O(L * n * d^2 + L * n^2 * d) where n = sequence length at each step.
With KV cache: for each token t, only compute the new Q and attend to cached K/V. Total: sum from t=start to start+100 of O(L * (d^2 + t * d)) = approximately O(100 * L * d^2 + L * d * sum(t)).
(c) KV cache benefit:
Without cache: must recompute K and V projections for ALL previous tokens at each generation step. Cost: O(t * d^2) per layer per token.
With cache: store K and V from previous steps. Only compute K, V for the NEW token. Cost: O(d^2) per layer per token (not O(t * d^2)). The attention computation itself is still O(t * d) - that cannot be cached.
Memory cost of KV cache: O(2 * L * t * d) - storing K and V for each layer for all previous positions. For L=12, t=512, d=768, float16: 2 * 12 * 512 * 768 * 2 bytes = ~18 MB per sequence.
Scoring:
- Strong Hire: Distinguishes between cached and uncached, gives concrete memory estimates, mentions that attention computation is still O(t*d) even with cache.
- Lean Hire: Understands the general idea but cannot give precise complexity.
- No Hire: Does not know what KV cache is.
Problem 3: Scaling Laws Back-of-Envelope
An experiment trains a 1B parameter model on 20B tokens in 24 hours on 8 A100 GPUs. Estimate: (a) how long to train a 10B parameter model on 200B tokens, (b) how many GPUs you need to keep it under 24 hours.
Hint 1 - Direction
Training FLOPs scale approximately as 6 * N * D where N = parameters and D = tokens (the Chinchilla approximation). Wall-clock time = FLOPs / (num_GPUs * GPU_throughput * MFU).
Hint 2 - Compute the ratio
FLOPs_1 = 6 * 1B * 20B = 1.2 * 10^20. FLOPs_2 = 6 * 10B * 200B = 1.2 * 10^22. Ratio = 100x more compute. If hardware is the same, time = 100x * 24 hours on 8 GPUs.
Hint 3 - Full Solution
(a) Time estimate:
FLOPs for 1B model: 6 * 10^9 * 2 * 10^10 = 1.2 * 10^20 FLOPs for 10B model: 6 * 10^10 * 2 * 10^11 = 1.2 * 10^22
Ratio: 100x more compute.
Assuming linear scaling (same GPU throughput): 100 * 24 = 2,400 hours = 100 days on 8 A100s.
(b) GPUs for 24 hours:
Need 100x speedup over the 8-GPU baseline. Assuming perfect linear scaling: 8 * 100 = 800 A100 GPUs.
In practice, communication overhead reduces efficiency. With ~70% scaling efficiency at 800 GPUs: need ~1,150 GPUs. Communication overhead increases with GPU count due to all-reduce operations for gradient synchronization.
Real-world considerations:
- 10B parameters in float16 = 20 GB. Fits on a single A100 (80 GB).
- But training requires: params (20 GB) + gradients (20 GB) + Adam states (60 GB) + activations (~40 GB for batch 32) = ~140 GB. Need at least 2 GPUs with model parallelism.
- At 800+ GPUs, data parallelism is sufficient. Use ZeRO-2 to shard optimizer states.
- MFU (Model FLOPS Utilization) typically 40-55% for large-scale training.
Scoring:
- Strong Hire: Correct FLOP estimation, accounts for scaling efficiency, mentions memory requirements and parallelism strategy.
- Lean Hire: Correct FLOP estimation but assumes perfect scaling.
- No Hire: Cannot estimate FLOPs for a training run.
Problem 4: Identify the Complexity
What is the time and space complexity of each operation?
| # | Code | Your Answer: Time | Your Answer: Space |
|---|---|---|---|
| 1 | np.dot(a, b) where a, b are (n,) | ||
| 2 | A @ B where A is (m, n), B is (n, p) | ||
| 3 | np.linalg.svd(X) where X is (m, n) | ||
| 4 | sorted(lst) where lst has n elements | ||
| 5 | set(lst) where lst has n elements | ||
| 6 | dict_lookup[key] | ||
| 7 | np.unique(arr) where arr has n elements | ||
| 8 | Attention for (batch, seq, dim) | ||
| 9 | df.groupby('col').mean() with n rows | ||
| 10 | Training 1 epoch of GD with n samples, d features |
Full Answers
| # | Code | Time | Space |
|---|---|---|---|
| 1 | Dot product of two vectors | O(n) | O(1) |
| 2 | Matrix multiplication | O(m * n * p) | O(m * p) |
| 3 | SVD | O(min(mn^2, m^2 n)) | O(m * n) |
| 4 | Sorting (Timsort) | O(n log n) | O(n) |
| 5 | Set construction | O(n) amortized | O(n) |
| 6 | Hash map lookup | O(1) amortized | O(1) |
| 7 | Unique (sort + dedup) | O(n log n) | O(n) |
| 8 | Self-attention | O(batch * seq^2 * dim) | O(batch * seq^2) |
| 9 | GroupBy mean | O(n) | O(groups) |
| 10 | Gradient descent epoch | O(n * d) | O(d) |
Interview Cheat Sheet
| Concept | Key Insight | One-Liner | Red Flag |
|---|---|---|---|
| Big-O | Upper bound on growth rate | Drop constants and lower terms | Saying "O(2n)" instead of "O(n)" |
| Big-Theta | Tight bound (both upper and lower) | The "real" complexity | Confusing Big-O with exact count |
| Amortized | Average over a sequence of operations | Dynamic array append is O(1) amortized | Saying "dynamic array insert is O(n)" |
| Matrix multiply | O(n * m * p) for (n,m) x (m,p) | Foundation of all ML compute | Not knowing this is the bottleneck |
| Attention | O(n^2 * d) time, O(n^2) space | Quadratic in sequence length | Saying "attention is O(n)" |
| Training complexity | 6 * N * D FLOPs (Chinchilla) | Parameters x tokens x 6 | Not accounting for optimizer memory |
| Inference | O(L * d^2) per token with KV cache | Cache K/V, compute only new Q | Not knowing KV cache |
| Space vs time | Sometimes trade one for the other | DP uses space to save time, streaming saves space | Analyzing only time, forgetting space |
| Back-of-envelope | 10^9 ops/sec CPU, 10^13 FLOPS GPU | Convert Big-O to wall clock | Cannot estimate actual running time |
| When O(n^2) breaks | n > 10^5 on CPU, n > 10^6 on GPU | Need approximate/sublinear methods | Proposing O(n^2) for 10M data points |
Spaced Repetition Checkpoints
Day 0 - Initial Learning
- Read this entire page
- Write the formal definition of Big-O from memory
- Fill in the "Identify the Complexity" table without looking at answers
- Complete the self-assessment
Day 3 - First Recall
- State the time complexity of: hash table ops, binary search, merge sort, matrix multiply
- Explain why attention is O(n^2 * d) and what n and d represent
- Estimate the memory required to train a 7B parameter model
Day 7 - Connections
- Solve Problem 1 (feature pipeline complexity) without hints
- Explain training vs inference complexity for a transformer
- Convert O(n^2) with n=100K to wall-clock time on CPU and GPU
Day 14 - Application
- Solve Problem 3 (scaling laws) under timed conditions (10 min)
- Explain KV cache and its effect on autoregressive generation complexity
- Given a system design problem, identify the computational bottleneck
Day 21 - Mock Interview
- Have someone give you code - analyze its complexity in real time
- Estimate training time for a given model/dataset/hardware configuration
- Propose scalable alternatives for O(n^2) operations on large datasets
Key Takeaways
-
Complexity analysis in ML is about feasibility, not theory. The question is never "what is the Big-O?" in isolation. It is always "is this feasible at the scale we need, and if not, what is the alternative?" Always connect complexity to real numbers.
-
Attention's O(n^2) is the defining constraint of modern AI. Understanding this quadratic bottleneck - and the zoo of methods to address it (Flash Attention, sparse attention, linear attention) - is essential for any transformer-related role.
-
Training is dominated by memory, inference by latency. During training, optimizer states often consume 4-12x the model parameter memory. During inference, the KV cache and autoregressive generation dominate. Know both dimensions.
-
Back-of-envelope estimation is a senior skill. Being able to say "this will take ~2 hours on 8 A100s" from first principles (FLOPs, throughput, MFU) is what separates senior engineers from junior ones. Practice this.
Next Steps
Continue to Mock Coding Problems for 15-20 full interview problems that combine everything from this section - DSA, NumPy, Pandas, SQL, ML implementation, and complexity analysis.
