Skip to main content

Complexity Analysis - Computational Thinking for ML Engineers

Reading time: ~40 min | Interview relevance: High | Roles: MLE, AI Eng, Research Engineer, Applied Scientist

The Real Interview Moment

You are in a Google L5 MLE interview. You have just implemented a function that computes pairwise cosine similarities between all embeddings in a dataset. The interviewer asks: "What is the time complexity? What about space? Now, what if you have 10 million embeddings with 768 dimensions - how long will this take, and what would you do instead?"

You wrote similarities = embeddings @ embeddings.T / (norms[:, None] * norms[None, :]). You know the matrix multiplication is O(n^2 * d), and the result is an n x n matrix using O(n^2) memory. For n = 10M and d = 768, that is 10^14 operations and 800 TB of memory for the similarity matrix. You immediately propose approximate nearest neighbor search - FAISS with IVF or HNSW - which gives approximate results in O(n * log(n)) time and O(n * d) memory.

This is what complexity analysis in ML interviews looks like. It is not about reciting Big-O definitions - it is about connecting algorithmic complexity to real-world resource constraints (time, memory, cost) and knowing when an exact algorithm is infeasible and what approximation to use instead.

Candidates who say "it is O(n^2)" and stop get a "lean hire." Candidates who compute the actual wall-clock time, identify the bottleneck, and propose a scalable alternative get a "strong hire."

What You Will Master

  • Define Big-O, Big-Omega, and Big-Theta with mathematical precision
  • Analyze time and space complexity of common data structures and algorithms
  • Apply amortized analysis to dynamic arrays and hash tables
  • Compute complexity of ML-specific operations: training, inference, attention, convolution
  • Estimate real-world resource requirements from complexity analysis
  • Identify common interview traps in complexity questions
  • Propose scalable alternatives when exact algorithms are infeasible

Self-Assessment: Where Are You Now?

Skill1 - Cannot2 - Vaguely3 - Can Analyze4 - Can Estimate5 - Can TeachYour Score
Define Big-O formally___
Analyze loops and recursion___
Know complexity of common data structures___
Amortized analysis___
ML training complexity___
ML inference complexity___
Translate O(.) to wall-clock estimates___
Propose scalable alternatives___

Target: All 4s and 5s before your interview.

Part 1 - Formal Definitions

Big-O, Big-Omega, Big-Theta

Asymptotic Notation - Big-O, Omega, Theta

Formal definitions:

  • O(g(n)): f(n) = O(g(n)) if there exist constants c > 0 and n_0 such that f(n) <= c * g(n) for all n >= n_0
  • Omega(g(n)): f(n) = Omega(g(n)) if there exist constants c > 0 and n_0 such that f(n) >= c * g(n) for all n >= n_0
  • Theta(g(n)): f(n) = Theta(g(n)) if f(n) = O(g(n)) AND f(n) = Omega(g(n))
60-Second Answer

"Big-O describes the upper bound on growth rate - how an algorithm scales in the worst case as input size grows. O(n) means linear scaling, O(n^2) means quadratic. But Big-O alone is incomplete: it tells you the ceiling, not the floor. Big-Omega is the lower bound, and Big-Theta is the tight bound. In interviews, we usually say Big-O but mean Big-Theta - we describe the typical growth rate, not just the worst case. The distinction matters for algorithms like quicksort, which is O(n^2) worst case but Theta(n log n) average case."

Common Growth Rates

NotationNameExamplen = 10^6 Operations
O(1)ConstantHash table lookup1
O(log n)LogarithmicBinary search20
O(n)LinearArray scan10^6
O(n log n)LinearithmicMerge sort2 * 10^7
O(n^2)QuadraticNested loops, attention10^12
O(n^3)CubicMatrix multiplication (naive)10^18
O(2^n)ExponentialAll subsetsInfeasible
O(n!)FactorialAll permutationsInfeasible

Rules for Combining

# Rule 1: Drop constants
# O(3n + 5) = O(n)

# Rule 2: Drop lower-order terms
# O(n^2 + n + 1) = O(n^2)

# Rule 3: Sequential operations add
# O(n) + O(m) = O(n + m)

# Rule 4: Nested operations multiply
# for i in range(n): # O(n)
# for j in range(m): # O(m)
# ... # O(1)
# Total: O(n * m)

# Rule 5: Different inputs = different variables
# Do NOT simplify O(n + m) to O(n) unless you know m <= n
Common Trap

Do NOT say "O(n + m) = O(2n) = O(n)" unless m is bounded by n. If n is the number of users and m is the number of products, they are independent variables. Interviewers specifically test whether you distinguish between different input dimensions. In ML: n = samples, d = features, k = clusters, c = classes - all independent.

Part 2 - Data Structure Complexity

Essential Data Structures

Data StructureAccessSearchInsertDeleteSpaceML Use Case
ArrayO(1)O(n)O(n)O(n)O(n)Feature vectors, predictions
Dynamic ArrayO(1)O(n)O(1)*O(n)O(n)Building datasets
Linked ListO(n)O(n)O(1)O(1)O(n)Rarely used in ML
Hash MapO(1)*O(1)*O(1)*O(1)*O(n)Vocabulary, feature maps
Binary Search TreeO(log n)O(log n)O(log n)O(log n)O(n)Sorted access patterns
Heap (Min/Max)O(1) topO(n)O(log n)O(log n)O(n)Top-k selection, priority queues
TrieO(L)O(L)O(L)O(L)O(N*L)Tokenization, prefix matching

*Amortized or average case

Amortized Analysis

# Dynamic array (Python list): append is O(1) amortized
# How? When the array is full, it doubles in capacity.
# Doubling copies all n elements: O(n).
# But this happens only once every n insertions.
# Cost for n insertions: n * O(1) + O(1+2+4+...+n) = n + 2n = O(n)
# Amortized cost per insertion: O(n) / n = O(1)

# Hash table: lookup is O(1) amortized (with good hash function)
# Worst case: O(n) if all keys hash to same bucket
# Average case with load factor < 0.75: O(1)
# Resizing: same amortized argument as dynamic arrays

Part 3 - Sorting and Searching Complexity

AlgorithmBestAverageWorstSpaceStable?ML Relevance
QuicksortO(n log n)O(n log n)O(n^2)O(log n)Nonumpy.sort default
MergesortO(n log n)O(n log n)O(n log n)O(n)YesStable sort for ties
HeapsortO(n log n)O(n log n)O(n log n)O(1)NoIn-place sorting
Counting SortO(n+k)O(n+k)O(n+k)O(k)YesLabel histograms
Binary SearchO(1)O(log n)O(log n)O(1)N/AThreshold search
ArgpartitionO(n)O(n)O(n^2)O(1)NoTop-k selection (NumPy)
Company Variation

Google tests sorting/searching complexity directly. They may ask: "Why is np.argpartition O(n) for top-k while np.argsort is O(n log n)?" (Answer: argpartition uses Introselect, a variant of quickselect, which only partially sorts.) Meta and Amazon test this indirectly through system design: "Your recommendation system needs to return top 100 items from 10M candidates - what is the complexity?" (Answer: O(n) with argpartition vs O(n log n) with full sort.)

Part 4 - ML Algorithm Complexity

Training Complexity

AlgorithmTime (Training)Space (Training)Key Variables
Linear Regression (OLS)O(nd^2 + d^3)O(d^2)n=samples, d=features
Linear Regression (GD)O(ndt)O(d)t=iterations
Logistic Regression (GD)O(ndt)O(d)Same as linear GD
Decision TreeO(n * d * depth)O(nodes)Greedy split search
Random ForestO(T * n * sqrt(d) * depth)O(T * nodes)T=trees
Gradient BoostingO(T * n * d * depth)O(T * nodes)Sequential trees
K-NNO(1) or O(n log n)O(nd)Lazy learner or KD-tree
K-MeansO(nkdt)O(nd + kd)k=clusters, t=iterations
SVM (kernel)O(n^2 * d) to O(n^3)O(n^2)Kernel matrix
PCAO(nd^2 + d^3) or O(n^2 d + n^3)O(d^2) or O(n^2)Depends on n vs d
Neural NetworkO(B * L * d^2) per epochO(L * d^2 + B * d)B=batch, L=layers, d=width

Inference Complexity

AlgorithmTime (Inference, single sample)SpaceNotes
Linear/Logistic RegressionO(d)O(d)Just a dot product
Decision TreeO(depth)O(nodes)Follow one path
Random ForestO(T * depth)O(T * nodes)T independent trees
K-NNO(nd)O(nd)Must store all training data
Neural NetworkO(L * d^2)O(L * d^2)Matrix multiplications
Transformer (self-attention)O(n^2 * d)O(n^2 + nd)n=sequence length

ML Model Scaling - Complexity Analysis

Part 5 - Deep Learning Complexity

Transformer Self-Attention: The O(n^2) Problem

# Standard self-attention
# Q, K, V each have shape (batch, seq_len, d_model)

# Step 1: QK^T - the bottleneck
# (batch, seq_len, d_model) @ (batch, d_model, seq_len)
# = (batch, seq_len, seq_len)
# Time: O(n^2 * d)
# Space: O(n^2) for the attention matrix
scores = Q @ K.transpose(0, 2, 1) # O(n^2 * d)

# Step 2: Softmax - along last axis
# Time: O(n^2)
weights = softmax(scores / sqrt(d)) # O(n^2)

# Step 3: Weighted sum with V
# (batch, seq_len, seq_len) @ (batch, seq_len, d_model)
# Time: O(n^2 * d)
output = weights @ V # O(n^2 * d)

# Total: O(n^2 * d) time, O(n^2) space
# For GPT-4 with n=128K and d=12288:
# n^2 = 1.6 * 10^10 - that is 16 billion attention entries per layer

Approaches to O(n^2) Attention

MethodTimeSpaceTradeoff
Standard attentionO(n^2 d)O(n^2)Exact but quadratic
Flash AttentionO(n^2 d)O(n)Same time, O(n) memory via tiling
Sparse attentionO(n * sqrt(n) * d)O(n * sqrt(n))Attends to fixed patterns
Linear attentionO(n * d^2)O(d^2)Approximate, kernel trick
Sliding windowO(n * w * d)O(n * w)Local attention only
Ring attentionO(n^2 d / P)O(n^2 / P) per deviceDistributed across P devices
Instant Rejection

Never say "attention is O(n^2)" and leave it at that. You must specify what n is (sequence length), what the full complexity is (O(n^2 * d)), and whether you are talking about time or space. Then follow up with what this means practically: "For a 128K context window with d=12288, the attention matrix alone is 128K^2 * 4 bytes = 64 GB per layer, which is why we need Flash Attention or sparse methods." This demonstrates systems awareness.

Convolution Complexity

# 2D Convolution: input (batch, C_in, H, W), kernel (C_out, C_in, kH, kW)
# Output: (batch, C_out, H_out, W_out)
# where H_out = (H - kH + 2*pad) / stride + 1

# Time: O(batch * C_out * H_out * W_out * C_in * kH * kW)
# Space: O(batch * C_out * H_out * W_out) for output

# For a typical conv layer:
# batch=32, C_in=64, C_out=128, H=W=56, k=3
# FLOPs = 32 * 128 * 56 * 56 * 64 * 3 * 3 ≈ 7.3 * 10^9

Batch Size Effects

AspectSmall Batch (32)Large Batch (4096)Effect on Complexity
Gradient computationO(B * model_params)O(B * model_params)Linear in B
Weight updateOnce per batchOnce per batchIndependent of B
Total epochsMore updates/epochFewer updates/epochSame total training
MemoryLowHighLinear in B
GPU utilizationLow (underutilized)High (saturated)Throughput scales with B (up to a point)
ConvergenceMore noise, may generalize betterLess noise, may need LR warmupNot purely a complexity question

Part 6 - Space Complexity for ML

Memory Estimation

# Model parameters
# Linear layer: W (d_in, d_out) + b (d_out)
# Params = d_in * d_out + d_out

# Transformer block:
# Self-attention: W_Q, W_K, W_V, W_O - each (d, d) = 4 * d^2
# FFN: W1 (d, 4d) + W2 (4d, d) = 8 * d^2
# Total per block: ~12 * d^2

# GPT-3 (175B parameters):
# d = 12288, L = 96 layers, vocab = 50257
# Per layer: 12 * 12288^2 = 1.8 * 10^9
# All layers: 96 * 1.8 * 10^9 = 1.7 * 10^11
# + Embeddings: 50257 * 12288 ≈ 6 * 10^8
# Total: ~175 * 10^9 parameters

# Memory for parameters (float16):
# 175 * 10^9 * 2 bytes = 350 GB
# Memory for Adam optimizer states (float32):
# 175 * 10^9 * (4 + 4 + 4) bytes = 2.1 TB (params + momentum + variance)

Memory During Training

ComponentMemoryDepends On
Model parametersO(P)P = total parameters
GradientsO(P)Same size as parameters
Optimizer state (Adam)O(3P)Momentum + variance + master copy
ActivationsO(B * L * d)Batch * layers * hidden size
Workspace (GEMM, temp)O(B * d^2)Temporary buffers
Interviewer's Perspective

When I ask "how much memory does training this model require?", I am testing three things: (1) Can you enumerate all the memory components (not just parameters)? (2) Do you know that optimizer states often dominate? (3) Can you estimate actual numbers? A candidate who says "175B params * 2 bytes = 350 GB" is incomplete. A candidate who says "350 GB for params + 350 GB for gradients + 700 GB for Adam states + activations = ~2 TB minimum" gets strong marks.

Part 7 - Common Complexity Analysis Patterns

Pattern 1: Nested Loops

# O(n^2)
for i in range(n):
for j in range(n):
# O(1) work

# O(n^2) but NOT n^2 iterations (upper triangle)
for i in range(n):
for j in range(i, n):
# n + (n-1) + ... + 1 = n(n+1)/2 = O(n^2)

# O(n * m) - two different inputs
for i in range(n):
for j in range(m):
# O(1) work

Pattern 2: Logarithmic

# O(log n) - halving the problem
def binary_search(arr, target):
lo, hi = 0, len(arr) - 1
while lo <= hi: # Runs O(log n) times
mid = (lo + hi) // 2
if arr[mid] == target:
return mid
elif arr[mid] < target:
lo = mid + 1
else:
hi = mid - 1
return -1

# O(n log n) - divide and conquer
def merge_sort(arr):
if len(arr) <= 1:
return arr
mid = len(arr) // 2
left = merge_sort(arr[:mid]) # T(n/2)
right = merge_sort(arr[mid:]) # T(n/2)
return merge(left, right) # O(n)
# T(n) = 2T(n/2) + O(n) = O(n log n)

Pattern 3: Matrix Operations

# Matrix multiply: (n, d) @ (d, m) = (n, m)
# Time: O(n * d * m) - three nested loops
# Space: O(n * m) for the result

# Key insight for ML: most operations are matrix multiplies
# Linear layer: X @ W = (batch, d_in) @ (d_in, d_out) = O(batch * d_in * d_out)
# Attention QK^T: (batch, seq, d) @ (batch, d, seq) = O(batch * seq^2 * d)

Pattern 4: Recursion and the Master Theorem

# Recurrence: T(n) = a * T(n/b) + O(n^c)
# a = number of subproblems
# b = size reduction factor
# c = exponent of work per level

# Case 1: c < log_b(a) → T(n) = O(n^(log_b(a)))
# Case 2: c = log_b(a) → T(n) = O(n^c * log n)
# Case 3: c > log_b(a) → T(n) = O(n^c)

# Examples:
# Merge sort: T(n) = 2T(n/2) + O(n) → a=2, b=2, c=1
# log_2(2) = 1 = c → Case 2 → O(n log n)

# Binary search: T(n) = T(n/2) + O(1) → a=1, b=2, c=0
# log_2(1) = 0 = c → Case 2 → O(log n)

# Strassen: T(n) = 7T(n/2) + O(n^2) → a=7, b=2, c=2
# log_2(7) ≈ 2.81 > 2 → Case 1 → O(n^2.81)

Part 8 - Real-World Estimation

Translating Big-O to Wall-Clock Time

# Rule of thumb: modern hardware does ~10^9 simple operations per second

# O(n) with n = 10^6: ~1 ms (fast)
# O(n log n) with n = 10^6: ~20 ms (fast)
# O(n^2) with n = 10^6: ~10^12 ops ÷ 10^9 = ~1000 seconds (too slow)
# O(n^2) with n = 10^4: ~10^8 ops ÷ 10^9 = ~0.1 seconds (okay)

# For matrix operations (BLAS-optimized on GPU):
# GPU: ~10^13 FLOPS (A100)
# Matrix multiply (n, n) @ (n, n):
# n = 1000: 10^9 ops → ~0.1 ms on GPU
# n = 10000: 10^12 ops → ~100 ms on GPU
# n = 100000: 10^15 ops → ~100 seconds on GPU

The "Will It Fit?" Framework

ML Operation Feasibility Check

ScenariondAlgorithmTimeMemoryFeasible?
Pairwise distances10K100O(n^2 d)~1s~800 MBYes
Pairwise distances1M100O(n^2 d)~28 hours~8 TBNo
KNN search1M100Brute force O(nd)~0.1s per query~800 MBMarginal
KNN search1M100HNSW O(d log n)~0.001s per query~1 GBYes
Attention1K seq768O(n^2 d)~0.001s~4 MBYes
Attention128K seq768O(n^2 d)~17s~64 GBNeed FlashAttention

Practice Problems

Problem 1: Complexity of Feature Engineering Pipeline

Analyze the time and space complexity of this feature engineering pipeline:

def create_features(df, user_col, event_col, value_col):
# Step 1: Group by user
grouped = df.groupby(user_col) # O(?)

# Step 2: Aggregate per user
features = grouped[value_col].agg(['mean', 'std', 'max', 'count']) # O(?)

# Step 3: Pairwise interaction features
for col_a in feature_cols:
for col_b in feature_cols:
if col_a != col_b:
df[f'{col_a}_x_{col_b}'] = df[col_a] * df[col_b] # O(?)

# Step 4: Sort and compute lag features
df = df.sort_values([user_col, event_col]) # O(?)
df['lag_1'] = df.groupby(user_col)[value_col].shift(1) # O(?)

return df, features
Hint 1 - Direction

Analyze each step independently, then identify the bottleneck. Pay attention to the nested loop in Step 3 - how many feature pairs are there?

Hint 2 - Variables

Let n = number of rows, u = number of unique users, f = number of feature columns. Each step's complexity depends on different variables.

Hint 3 - Full Solution
StepOperationTimeSpaceNotes
1GroupByO(n)O(n)Hash-based grouping
2AggregateO(n)O(u)Single pass over grouped data
3Pairwise interactionsO(f^2 * n)O(f^2 * n)f^2 new columns, each O(n)
4aSortO(n log n)O(n)Comparison sort
4bLag per groupO(n)O(n)Single pass

Bottleneck: Step 3 with O(f^2 * n). If f = 100 and n = 1M, this creates 10,000 new columns with 10^10 values - 80 GB at float64. The space is the real problem, not the time.

Optimization: Use feature selection to reduce f before creating interactions. Or use sparse representation. Or compute interactions only for the top-k most important features.

Scoring:

  • Strong Hire: Correctly identifies the bottleneck, computes actual memory requirements, proposes optimization.
  • Lean Hire: Correct analysis but does not estimate actual numbers.
  • No Hire: Says "it is O(n)" without analyzing the nested loop.

Problem 2: Training vs Inference Complexity

A transformer model has L=12 layers, d=768, 8 attention heads, and a vocabulary of 50,000. For a sequence length of 512:

(a) What is the inference time complexity for generating one token? (b) What is the inference time for generating a 100-token response? (c) How does the KV cache help?

Hint 1 - Direction

For autoregressive generation, each new token requires attending to all previous tokens. Without KV cache, you recompute attention for the entire sequence. With KV cache, you only compute the new token's query against cached keys/values.

Hint 2 - Key insight

Without KV cache: generating token t requires O(t * d) attention per layer. Generating n tokens requires sum from t=1 to n of O(t * d * L) = O(n^2 * d * L). With KV cache: generating token t requires O(t * d) attention (same), but you skip re-computing K and V for previous tokens, saving O((t-1) * d^2) per layer.

Hint 3 - Full Solution

(a) Single token inference (at position t):

Per layer:

  • Self-attention: Q for new token is O(d^2). K, V if not cached: O(d^2). Attention scores: O(t * d). Weighted sum: O(t * d). Total: O(d^2 + t * d).
  • FFN: O(d * 4d) = O(d^2).
  • Per layer total: O(d^2 + t * d).

All layers: O(L * (d^2 + t * d)).

For t=512, L=12, d=768: ~12 * (768^2 + 512*768) = ~12 * (590K + 393K) = ~12M ops.

(b) 100-token generation:

Without KV cache: for each token t, recompute the entire forward pass for the full sequence up to t. Total: sum from t=1 to 100 of O(L * t^2 * d) = O(L * d * 100^3 / 3) -- this is a loose upper bound. More precisely: O(L * n * d^2 + L * n^2 * d) where n = sequence length at each step.

With KV cache: for each token t, only compute the new Q and attend to cached K/V. Total: sum from t=start to start+100 of O(L * (d^2 + t * d)) = approximately O(100 * L * d^2 + L * d * sum(t)).

(c) KV cache benefit:

Without cache: must recompute K and V projections for ALL previous tokens at each generation step. Cost: O(t * d^2) per layer per token.

With cache: store K and V from previous steps. Only compute K, V for the NEW token. Cost: O(d^2) per layer per token (not O(t * d^2)). The attention computation itself is still O(t * d) - that cannot be cached.

Memory cost of KV cache: O(2 * L * t * d) - storing K and V for each layer for all previous positions. For L=12, t=512, d=768, float16: 2 * 12 * 512 * 768 * 2 bytes = ~18 MB per sequence.

Scoring:

  • Strong Hire: Distinguishes between cached and uncached, gives concrete memory estimates, mentions that attention computation is still O(t*d) even with cache.
  • Lean Hire: Understands the general idea but cannot give precise complexity.
  • No Hire: Does not know what KV cache is.

Problem 3: Scaling Laws Back-of-Envelope

An experiment trains a 1B parameter model on 20B tokens in 24 hours on 8 A100 GPUs. Estimate: (a) how long to train a 10B parameter model on 200B tokens, (b) how many GPUs you need to keep it under 24 hours.

Hint 1 - Direction

Training FLOPs scale approximately as 6 * N * D where N = parameters and D = tokens (the Chinchilla approximation). Wall-clock time = FLOPs / (num_GPUs * GPU_throughput * MFU).

Hint 2 - Compute the ratio

FLOPs_1 = 6 * 1B * 20B = 1.2 * 10^20. FLOPs_2 = 6 * 10B * 200B = 1.2 * 10^22. Ratio = 100x more compute. If hardware is the same, time = 100x * 24 hours on 8 GPUs.

Hint 3 - Full Solution

(a) Time estimate:

FLOPs for 1B model: 6 * 10^9 * 2 * 10^10 = 1.2 * 10^20 FLOPs for 10B model: 6 * 10^10 * 2 * 10^11 = 1.2 * 10^22

Ratio: 100x more compute.

Assuming linear scaling (same GPU throughput): 100 * 24 = 2,400 hours = 100 days on 8 A100s.

(b) GPUs for 24 hours:

Need 100x speedup over the 8-GPU baseline. Assuming perfect linear scaling: 8 * 100 = 800 A100 GPUs.

In practice, communication overhead reduces efficiency. With ~70% scaling efficiency at 800 GPUs: need ~1,150 GPUs. Communication overhead increases with GPU count due to all-reduce operations for gradient synchronization.

Real-world considerations:

  • 10B parameters in float16 = 20 GB. Fits on a single A100 (80 GB).
  • But training requires: params (20 GB) + gradients (20 GB) + Adam states (60 GB) + activations (~40 GB for batch 32) = ~140 GB. Need at least 2 GPUs with model parallelism.
  • At 800+ GPUs, data parallelism is sufficient. Use ZeRO-2 to shard optimizer states.
  • MFU (Model FLOPS Utilization) typically 40-55% for large-scale training.

Scoring:

  • Strong Hire: Correct FLOP estimation, accounts for scaling efficiency, mentions memory requirements and parallelism strategy.
  • Lean Hire: Correct FLOP estimation but assumes perfect scaling.
  • No Hire: Cannot estimate FLOPs for a training run.

Problem 4: Identify the Complexity

What is the time and space complexity of each operation?

#CodeYour Answer: TimeYour Answer: Space
1np.dot(a, b) where a, b are (n,)
2A @ B where A is (m, n), B is (n, p)
3np.linalg.svd(X) where X is (m, n)
4sorted(lst) where lst has n elements
5set(lst) where lst has n elements
6dict_lookup[key]
7np.unique(arr) where arr has n elements
8Attention for (batch, seq, dim)
9df.groupby('col').mean() with n rows
10Training 1 epoch of GD with n samples, d features
Full Answers
#CodeTimeSpace
1Dot product of two vectorsO(n)O(1)
2Matrix multiplicationO(m * n * p)O(m * p)
3SVDO(min(mn^2, m^2 n))O(m * n)
4Sorting (Timsort)O(n log n)O(n)
5Set constructionO(n) amortizedO(n)
6Hash map lookupO(1) amortizedO(1)
7Unique (sort + dedup)O(n log n)O(n)
8Self-attentionO(batch * seq^2 * dim)O(batch * seq^2)
9GroupBy meanO(n)O(groups)
10Gradient descent epochO(n * d)O(d)

Interview Cheat Sheet

ConceptKey InsightOne-LinerRed Flag
Big-OUpper bound on growth rateDrop constants and lower termsSaying "O(2n)" instead of "O(n)"
Big-ThetaTight bound (both upper and lower)The "real" complexityConfusing Big-O with exact count
AmortizedAverage over a sequence of operationsDynamic array append is O(1) amortizedSaying "dynamic array insert is O(n)"
Matrix multiplyO(n * m * p) for (n,m) x (m,p)Foundation of all ML computeNot knowing this is the bottleneck
AttentionO(n^2 * d) time, O(n^2) spaceQuadratic in sequence lengthSaying "attention is O(n)"
Training complexity6 * N * D FLOPs (Chinchilla)Parameters x tokens x 6Not accounting for optimizer memory
InferenceO(L * d^2) per token with KV cacheCache K/V, compute only new QNot knowing KV cache
Space vs timeSometimes trade one for the otherDP uses space to save time, streaming saves spaceAnalyzing only time, forgetting space
Back-of-envelope10^9 ops/sec CPU, 10^13 FLOPS GPUConvert Big-O to wall clockCannot estimate actual running time
When O(n^2) breaksn > 10^5 on CPU, n > 10^6 on GPUNeed approximate/sublinear methodsProposing O(n^2) for 10M data points

Spaced Repetition Checkpoints

Day 0 - Initial Learning

  • Read this entire page
  • Write the formal definition of Big-O from memory
  • Fill in the "Identify the Complexity" table without looking at answers
  • Complete the self-assessment

Day 3 - First Recall

  • State the time complexity of: hash table ops, binary search, merge sort, matrix multiply
  • Explain why attention is O(n^2 * d) and what n and d represent
  • Estimate the memory required to train a 7B parameter model

Day 7 - Connections

  • Solve Problem 1 (feature pipeline complexity) without hints
  • Explain training vs inference complexity for a transformer
  • Convert O(n^2) with n=100K to wall-clock time on CPU and GPU

Day 14 - Application

  • Solve Problem 3 (scaling laws) under timed conditions (10 min)
  • Explain KV cache and its effect on autoregressive generation complexity
  • Given a system design problem, identify the computational bottleneck

Day 21 - Mock Interview

  • Have someone give you code - analyze its complexity in real time
  • Estimate training time for a given model/dataset/hardware configuration
  • Propose scalable alternatives for O(n^2) operations on large datasets

Key Takeaways

  1. Complexity analysis in ML is about feasibility, not theory. The question is never "what is the Big-O?" in isolation. It is always "is this feasible at the scale we need, and if not, what is the alternative?" Always connect complexity to real numbers.

  2. Attention's O(n^2) is the defining constraint of modern AI. Understanding this quadratic bottleneck - and the zoo of methods to address it (Flash Attention, sparse attention, linear attention) - is essential for any transformer-related role.

  3. Training is dominated by memory, inference by latency. During training, optimizer states often consume 4-12x the model parameter memory. During inference, the KV cache and autoregressive generation dominate. Know both dimensions.

  4. Back-of-envelope estimation is a senior skill. Being able to say "this will take ~2 hours on 8 A100s" from first principles (FLOPs, throughput, MFU) is what separates senior engineers from junior ones. Practice this.

Next Steps

Continue to Mock Coding Problems for 15-20 full interview problems that combine everything from this section - DSA, NumPy, Pandas, SQL, ML implementation, and complexity analysis.

© 2026 EngineersOfAI. All rights reserved.