Complexity Analysis - Computational Thinking for ML Engineers

Reading time: ~40 min | Interview relevance: High | Roles: MLE, AI Eng, Research Engineer, Applied Scientist

The Real Interview Moment

You are in a Google L5 MLE interview. You have just implemented a function that computes pairwise cosine similarities between all embeddings in a dataset. The interviewer asks: "What is the time complexity? What about space? Now, what if you have 10 million embeddings with 768 dimensions - how long will this take, and what would you do instead?"

You wrote similarities = embeddings @ embeddings.T / (norms[:, None] * norms[None, :]). You know the matrix multiplication is O(n^2 * d), and the result is an n x n matrix using O(n^2) memory. For n = 10M and d = 768, that is 10^14 operations and 800 TB of memory for the similarity matrix. You immediately propose approximate nearest neighbor search - FAISS with IVF or HNSW - which gives approximate results in O(n * log(n)) time and O(n * d) memory.

This is what complexity analysis in ML interviews looks like. It is not about reciting Big-O definitions - it is about connecting algorithmic complexity to real-world resource constraints (time, memory, cost) and knowing when an exact algorithm is infeasible and what approximation to use instead.

Candidates who say "it is O(n^2)" and stop get a "lean hire." Candidates who compute the actual wall-clock time, identify the bottleneck, and propose a scalable alternative get a "strong hire."

What You Will Master

Define Big-O, Big-Omega, and Big-Theta with mathematical precision
Analyze time and space complexity of common data structures and algorithms
Apply amortized analysis to dynamic arrays and hash tables
Compute complexity of ML-specific operations: training, inference, attention, convolution
Estimate real-world resource requirements from complexity analysis
Identify common interview traps in complexity questions
Propose scalable alternatives when exact algorithms are infeasible

Self-Assessment: Where Are You Now?

Skill	1 - Cannot	2 - Vaguely	3 - Can Analyze	4 - Can Estimate	5 - Can Teach	Your Score
Define Big-O formally						___
Analyze loops and recursion						___
Know complexity of common data structures						___
Amortized analysis						___
ML training complexity						___
ML inference complexity						___
Translate O(.) to wall-clock estimates						___
Propose scalable alternatives						___

Target: All 4s and 5s before your interview.

Part 1 - Formal Definitions

Big-O, Big-Omega, Big-Theta

Asymptotic Notation - Big-O, Omega, Theta

Formal definitions:

O(g(n)): f(n) = O(g(n)) if there exist constants c > 0 and n_0 such that f(n) <= c * g(n) for all n >= n_0
Omega(g(n)): f(n) = Omega(g(n)) if there exist constants c > 0 and n_0 such that f(n) >= c * g(n) for all n >= n_0
Theta(g(n)): f(n) = Theta(g(n)) if f(n) = O(g(n)) AND f(n) = Omega(g(n))

60-Second Answer

"Big-O describes the upper bound on growth rate - how an algorithm scales in the worst case as input size grows. O(n) means linear scaling, O(n^2) means quadratic. But Big-O alone is incomplete: it tells you the ceiling, not the floor. Big-Omega is the lower bound, and Big-Theta is the tight bound. In interviews, we usually say Big-O but mean Big-Theta - we describe the typical growth rate, not just the worst case. The distinction matters for algorithms like quicksort, which is O(n^2) worst case but Theta(n log n) average case."

Common Growth Rates

Notation	Name	Example	n = 10^6 Operations
O(1)	Constant	Hash table lookup	1
O(log n)	Logarithmic	Binary search	20
O(n)	Linear	Array scan	10^6
O(n log n)	Linearithmic	Merge sort	2 * 10^7
O(n^2)	Quadratic	Nested loops, attention	10^12
O(n^3)	Cubic	Matrix multiplication (naive)	10^18
O(2^n)	Exponential	All subsets	Infeasible
O(n!)	Factorial	All permutations	Infeasible

Rules for Combining

# Rule 1: Drop constants
# O(3n + 5) = O(n)

# Rule 2: Drop lower-order terms
# O(n^2 + n + 1) = O(n^2)

# Rule 3: Sequential operations add
# O(n) + O(m) = O(n + m)

# Rule 4: Nested operations multiply
# for i in range(n):        # O(n)
#     for j in range(m):    # O(m)
#         ...               # O(1)
# Total: O(n * m)

# Rule 5: Different inputs = different variables
# Do NOT simplify O(n + m) to O(n) unless you know m <= n

Common Trap

Do NOT say "O(n + m) = O(2n) = O(n)" unless m is bounded by n. If n is the number of users and m is the number of products, they are independent variables. Interviewers specifically test whether you distinguish between different input dimensions. In ML: n = samples, d = features, k = clusters, c = classes - all independent.

Part 2 - Data Structure Complexity

Essential Data Structures

Data Structure	Access	Search	Insert	Delete	Space	ML Use Case
Array	O(1)	O(n)	O(n)	O(n)	O(n)	Feature vectors, predictions
Dynamic Array	O(1)	O(n)	O(1)*	O(n)	O(n)	Building datasets
Linked List	O(n)	O(n)	O(1)	O(1)	O(n)	Rarely used in ML
Hash Map	O(1)*	O(1)*	O(1)*	O(1)*	O(n)	Vocabulary, feature maps
Binary Search Tree	O(log n)	O(log n)	O(log n)	O(log n)	O(n)	Sorted access patterns
Heap (Min/Max)	O(1) top	O(n)	O(log n)	O(log n)	O(n)	Top-k selection, priority queues
Trie	O(L)	O(L)	O(L)	O(L)	O(N*L)	Tokenization, prefix matching

*Amortized or average case

Amortized Analysis

# Dynamic array (Python list): append is O(1) amortized
# How? When the array is full, it doubles in capacity.
# Doubling copies all n elements: O(n).
# But this happens only once every n insertions.
# Cost for n insertions: n * O(1) + O(1+2+4+...+n) = n + 2n = O(n)
# Amortized cost per insertion: O(n) / n = O(1)

# Hash table: lookup is O(1) amortized (with good hash function)
# Worst case: O(n) if all keys hash to same bucket
# Average case with load factor < 0.75: O(1)
# Resizing: same amortized argument as dynamic arrays

Part 3 - Sorting and Searching Complexity

Algorithm	Best	Average	Worst	Space	Stable?	ML Relevance
Quicksort	O(n log n)	O(n log n)	O(n^2)	O(log n)	No	numpy.sort default
Mergesort	O(n log n)	O(n log n)	O(n log n)	O(n)	Yes	Stable sort for ties
Heapsort	O(n log n)	O(n log n)	O(n log n)	O(1)	No	In-place sorting
Counting Sort	O(n+k)	O(n+k)	O(n+k)	O(k)	Yes	Label histograms
Binary Search	O(1)	O(log n)	O(log n)	O(1)	N/A	Threshold search
Argpartition	O(n)	O(n)	O(n^2)	O(1)	No	Top-k selection (NumPy)

Company Variation

Google tests sorting/searching complexity directly. They may ask: "Why is np.argpartition O(n) for top-k while np.argsort is O(n log n)?" (Answer: argpartition uses Introselect, a variant of quickselect, which only partially sorts.) Meta and Amazon test this indirectly through system design: "Your recommendation system needs to return top 100 items from 10M candidates - what is the complexity?" (Answer: O(n) with argpartition vs O(n log n) with full sort.)

Part 4 - ML Algorithm Complexity

Training Complexity

Algorithm	Time (Training)	Space (Training)	Key Variables
Linear Regression (OLS)	O(nd^2 + d^3)	O(d^2)	n=samples, d=features
Linear Regression (GD)	O(ndt)	O(d)	t=iterations
Logistic Regression (GD)	O(ndt)	O(d)	Same as linear GD
Decision Tree	O(n * d * depth)	O(nodes)	Greedy split search
Random Forest	O(T * n * sqrt(d) * depth)	O(T * nodes)	T=trees
Gradient Boosting	O(T * n * d * depth)	O(T * nodes)	Sequential trees
K-NN	O(1) or O(n log n)	O(nd)	Lazy learner or KD-tree
K-Means	O(nkdt)	O(nd + kd)	k=clusters, t=iterations
SVM (kernel)	O(n^2 * d) to O(n^3)	O(n^2)	Kernel matrix
PCA	O(nd^2 + d^3) or O(n^2 d + n^3)	O(d^2) or O(n^2)	Depends on n vs d
Neural Network	O(B * L * d^2) per epoch	O(L * d^2 + B * d)	B=batch, L=layers, d=width

Inference Complexity

Algorithm	Time (Inference, single sample)	Space	Notes
Linear/Logistic Regression	O(d)	O(d)	Just a dot product
Decision Tree	O(depth)	O(nodes)	Follow one path
Random Forest	O(T * depth)	O(T * nodes)	T independent trees
K-NN	O(nd)	O(nd)	Must store all training data
Neural Network	O(L * d^2)	O(L * d^2)	Matrix multiplications
Transformer (self-attention)	O(n^2 * d)	O(n^2 + nd)	n=sequence length

ML Model Scaling - Complexity Analysis

Part 5 - Deep Learning Complexity

Transformer Self-Attention: The O(n^2) Problem

# Standard self-attention
# Q, K, V each have shape (batch, seq_len, d_model)

# Step 1: QK^T - the bottleneck
# (batch, seq_len, d_model) @ (batch, d_model, seq_len)
# = (batch, seq_len, seq_len)
# Time: O(n^2 * d)
# Space: O(n^2) for the attention matrix
scores = Q @ K.transpose(0, 2, 1)  # O(n^2 * d)

# Step 2: Softmax - along last axis
# Time: O(n^2)
weights = softmax(scores / sqrt(d))  # O(n^2)

# Step 3: Weighted sum with V
# (batch, seq_len, seq_len) @ (batch, seq_len, d_model)
# Time: O(n^2 * d)
output = weights @ V  # O(n^2 * d)

# Total: O(n^2 * d) time, O(n^2) space
# For GPT-4 with n=128K and d=12288:
# n^2 = 1.6 * 10^10 - that is 16 billion attention entries per layer

Approaches to O(n^2) Attention

Method	Time	Space	Tradeoff
Standard attention	O(n^2 d)	O(n^2)	Exact but quadratic
Flash Attention	O(n^2 d)	O(n)	Same time, O(n) memory via tiling
Sparse attention	O(n * sqrt(n) * d)	O(n * sqrt(n))	Attends to fixed patterns
Linear attention	O(n * d^2)	O(d^2)	Approximate, kernel trick
Sliding window	O(n * w * d)	O(n * w)	Local attention only
Ring attention	O(n^2 d / P)	O(n^2 / P) per device	Distributed across P devices

Instant Rejection

Never say "attention is O(n^2)" and leave it at that. You must specify what n is (sequence length), what the full complexity is (O(n^2 * d)), and whether you are talking about time or space. Then follow up with what this means practically: "For a 128K context window with d=12288, the attention matrix alone is 128K^2 * 4 bytes = 64 GB per layer, which is why we need Flash Attention or sparse methods." This demonstrates systems awareness.

Convolution Complexity

# 2D Convolution: input (batch, C_in, H, W), kernel (C_out, C_in, kH, kW)
# Output: (batch, C_out, H_out, W_out)
# where H_out = (H - kH + 2*pad) / stride + 1

# Time: O(batch * C_out * H_out * W_out * C_in * kH * kW)
# Space: O(batch * C_out * H_out * W_out) for output

# For a typical conv layer:
# batch=32, C_in=64, C_out=128, H=W=56, k=3
# FLOPs = 32 * 128 * 56 * 56 * 64 * 3 * 3 ≈ 7.3 * 10^9

Batch Size Effects

Aspect	Small Batch (32)	Large Batch (4096)	Effect on Complexity
Gradient computation	O(B * model_params)	O(B * model_params)	Linear in B
Weight update	Once per batch	Once per batch	Independent of B
Total epochs	More updates/epoch	Fewer updates/epoch	Same total training
Memory	Low	High	Linear in B
GPU utilization	Low (underutilized)	High (saturated)	Throughput scales with B (up to a point)
Convergence	More noise, may generalize better	Less noise, may need LR warmup	Not purely a complexity question

Part 6 - Space Complexity for ML

Memory Estimation

# Model parameters
# Linear layer: W (d_in, d_out) + b (d_out)
# Params = d_in * d_out + d_out

# Transformer block:
# Self-attention: W_Q, W_K, W_V, W_O - each (d, d) = 4 * d^2
# FFN: W1 (d, 4d) + W2 (4d, d) = 8 * d^2
# Total per block: ~12 * d^2

# GPT-3 (175B parameters):
# d = 12288, L = 96 layers, vocab = 50257
# Per layer: 12 * 12288^2 = 1.8 * 10^9
# All layers: 96 * 1.8 * 10^9 = 1.7 * 10^11
# + Embeddings: 50257 * 12288 ≈ 6 * 10^8
# Total: ~175 * 10^9 parameters

# Memory for parameters (float16):
# 175 * 10^9 * 2 bytes = 350 GB
# Memory for Adam optimizer states (float32):
# 175 * 10^9 * (4 + 4 + 4) bytes = 2.1 TB (params + momentum + variance)

Memory During Training

Component	Memory	Depends On
Model parameters	O(P)	P = total parameters
Gradients	O(P)	Same size as parameters
Optimizer state (Adam)	O(3P)	Momentum + variance + master copy
Activations	O(B * L * d)	Batch * layers * hidden size
Workspace (GEMM, temp)	O(B * d^2)	Temporary buffers

Interviewer's Perspective

When I ask "how much memory does training this model require?", I am testing three things: (1) Can you enumerate all the memory components (not just parameters)? (2) Do you know that optimizer states often dominate? (3) Can you estimate actual numbers? A candidate who says "175B params * 2 bytes = 350 GB" is incomplete. A candidate who says "350 GB for params + 350 GB for gradients + 700 GB for Adam states + activations = ~2 TB minimum" gets strong marks.

Part 7 - Common Complexity Analysis Patterns

Pattern 1: Nested Loops

# O(n^2)
for i in range(n):
    for j in range(n):
        # O(1) work

# O(n^2) but NOT n^2 iterations (upper triangle)
for i in range(n):
    for j in range(i, n):
        # n + (n-1) + ... + 1 = n(n+1)/2 = O(n^2)

# O(n * m) - two different inputs
for i in range(n):
    for j in range(m):
        # O(1) work

Pattern 2: Logarithmic

# O(log n) - halving the problem
def binary_search(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:        # Runs O(log n) times
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

# O(n log n) - divide and conquer
def merge_sort(arr):
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    left = merge_sort(arr[:mid])     # T(n/2)
    right = merge_sort(arr[mid:])    # T(n/2)
    return merge(left, right)         # O(n)
    # T(n) = 2T(n/2) + O(n) = O(n log n)

Pattern 3: Matrix Operations

# Matrix multiply: (n, d) @ (d, m) = (n, m)
# Time: O(n * d * m) - three nested loops
# Space: O(n * m) for the result

# Key insight for ML: most operations are matrix multiplies
# Linear layer: X @ W = (batch, d_in) @ (d_in, d_out) = O(batch * d_in * d_out)
# Attention QK^T: (batch, seq, d) @ (batch, d, seq) = O(batch * seq^2 * d)

Pattern 4: Recursion and the Master Theorem

# Recurrence: T(n) = a * T(n/b) + O(n^c)
# a = number of subproblems
# b = size reduction factor
# c = exponent of work per level

# Case 1: c < log_b(a) → T(n) = O(n^(log_b(a)))
# Case 2: c = log_b(a) → T(n) = O(n^c * log n)
# Case 3: c > log_b(a) → T(n) = O(n^c)

# Examples:
# Merge sort: T(n) = 2T(n/2) + O(n) → a=2, b=2, c=1
#   log_2(2) = 1 = c → Case 2 → O(n log n)

# Binary search: T(n) = T(n/2) + O(1) → a=1, b=2, c=0
#   log_2(1) = 0 = c → Case 2 → O(log n)

# Strassen: T(n) = 7T(n/2) + O(n^2) → a=7, b=2, c=2
#   log_2(7) ≈ 2.81 > 2 → Case 1 → O(n^2.81)

Part 8 - Real-World Estimation

Translating Big-O to Wall-Clock Time

# Rule of thumb: modern hardware does ~10^9 simple operations per second

# O(n) with n = 10^6: ~1 ms (fast)
# O(n log n) with n = 10^6: ~20 ms (fast)
# O(n^2) with n = 10^6: ~10^12 ops ÷ 10^9 = ~1000 seconds (too slow)
# O(n^2) with n = 10^4: ~10^8 ops ÷ 10^9 = ~0.1 seconds (okay)

# For matrix operations (BLAS-optimized on GPU):
# GPU: ~10^13 FLOPS (A100)
# Matrix multiply (n, n) @ (n, n):
#   n = 1000: 10^9 ops → ~0.1 ms on GPU
#   n = 10000: 10^12 ops → ~100 ms on GPU
#   n = 100000: 10^15 ops → ~100 seconds on GPU

The "Will It Fit?" Framework

ML Operation Feasibility Check

Scenario	n	d	Algorithm	Time	Memory	Feasible?
Pairwise distances	10K	100	O(n^2 d)	~1s	~800 MB	Yes
Pairwise distances	1M	100	O(n^2 d)	~28 hours	~8 TB	No
KNN search	1M	100	Brute force O(nd)	~0.1s per query	~800 MB	Marginal
KNN search	1M	100	HNSW O(d log n)	~0.001s per query	~1 GB	Yes
Attention	1K seq	768	O(n^2 d)	~0.001s	~4 MB	Yes
Attention	128K seq	768	O(n^2 d)	~17s	~64 GB	Need FlashAttention

Practice Problems

Problem 1: Complexity of Feature Engineering Pipeline

Analyze the time and space complexity of this feature engineering pipeline:

def create_features(df, user_col, event_col, value_col):
    # Step 1: Group by user
    grouped = df.groupby(user_col)  # O(?)

    # Step 2: Aggregate per user
    features = grouped[value_col].agg(['mean', 'std', 'max', 'count'])  # O(?)

    # Step 3: Pairwise interaction features
    for col_a in feature_cols:
        for col_b in feature_cols:
            if col_a != col_b:
                df[f'{col_a}_x_{col_b}'] = df[col_a] * df[col_b]  # O(?)

    # Step 4: Sort and compute lag features
    df = df.sort_values([user_col, event_col])  # O(?)
    df['lag_1'] = df.groupby(user_col)[value_col].shift(1)  # O(?)

    return df, features

Hint 1 - Direction

Analyze each step independently, then identify the bottleneck. Pay attention to the nested loop in Step 3 - how many feature pairs are there?

Hint 2 - Variables

Let n = number of rows, u = number of unique users, f = number of feature columns. Each step's complexity depends on different variables.

Hint 3 - Full Solution

Step	Operation	Time	Space	Notes
1	GroupBy	O(n)	O(n)	Hash-based grouping
2	Aggregate	O(n)	O(u)	Single pass over grouped data
3	Pairwise interactions	O(f^2 * n)	O(f^2 * n)	f^2 new columns, each O(n)
4a	Sort	O(n log n)	O(n)	Comparison sort
4b	Lag per group	O(n)	O(n)	Single pass

Bottleneck: Step 3 with O(f^2 * n). If f = 100 and n = 1M, this creates 10,000 new columns with 10^10 values - 80 GB at float64. The space is the real problem, not the time.

Optimization: Use feature selection to reduce f before creating interactions. Or use sparse representation. Or compute interactions only for the top-k most important features.

Scoring:

Strong Hire: Correctly identifies the bottleneck, computes actual memory requirements, proposes optimization.
Lean Hire: Correct analysis but does not estimate actual numbers.
No Hire: Says "it is O(n)" without analyzing the nested loop.

Problem 2: Training vs Inference Complexity

A transformer model has L=12 layers, d=768, 8 attention heads, and a vocabulary of 50,000. For a sequence length of 512:

(a) What is the inference time complexity for generating one token? (b) What is the inference time for generating a 100-token response? (c) How does the KV cache help?

Hint 1 - Direction

For autoregressive generation, each new token requires attending to all previous tokens. Without KV cache, you recompute attention for the entire sequence. With KV cache, you only compute the new token's query against cached keys/values.

Hint 2 - Key insight

Without KV cache: generating token t requires O(t * d) attention per layer. Generating n tokens requires sum from t=1 to n of O(t * d * L) = O(n^2 * d * L). With KV cache: generating token t requires O(t * d) attention (same), but you skip re-computing K and V for previous tokens, saving O((t-1) * d^2) per layer.

Hint 3 - Full Solution

(a) Single token inference (at position t):

Per layer:

Self-attention: Q for new token is O(d^2). K, V if not cached: O(d^2). Attention scores: O(t * d). Weighted sum: O(t * d). Total: O(d^2 + t * d).
FFN: O(d * 4d) = O(d^2).
Per layer total: O(d^2 + t * d).

All layers: O(L * (d^2 + t * d)).

For t=512, L=12, d=768: ~12 * (768^2 + 512*768) = ~12 * (590K + 393K) = ~12M ops.

(b) 100-token generation:

Without KV cache: for each token t, recompute the entire forward pass for the full sequence up to t. Total: sum from t=1 to 100 of O(L * t^2 * d) = O(L * d * 100^3 / 3) -- this is a loose upper bound. More precisely: O(L * n * d^2 + L * n^2 * d) where n = sequence length at each step.

With KV cache: for each token t, only compute the new Q and attend to cached K/V. Total: sum from t=start to start+100 of O(L * (d^2 + t * d)) = approximately O(100 * L * d^2 + L * d * sum(t)).

(c) KV cache benefit:

Without cache: must recompute K and V projections for ALL previous tokens at each generation step. Cost: O(t * d^2) per layer per token.

With cache: store K and V from previous steps. Only compute K, V for the NEW token. Cost: O(d^2) per layer per token (not O(t * d^2)). The attention computation itself is still O(t * d) - that cannot be cached.

Memory cost of KV cache: O(2 * L * t * d) - storing K and V for each layer for all previous positions. For L=12, t=512, d=768, float16: 2 * 12 * 512 * 768 * 2 bytes = ~18 MB per sequence.

Scoring:

Strong Hire: Distinguishes between cached and uncached, gives concrete memory estimates, mentions that attention computation is still O(t*d) even with cache.
Lean Hire: Understands the general idea but cannot give precise complexity.
No Hire: Does not know what KV cache is.

Problem 3: Scaling Laws Back-of-Envelope

An experiment trains a 1B parameter model on 20B tokens in 24 hours on 8 A100 GPUs. Estimate: (a) how long to train a 10B parameter model on 200B tokens, (b) how many GPUs you need to keep it under 24 hours.

Hint 1 - Direction

Training FLOPs scale approximately as 6 * N * D where N = parameters and D = tokens (the Chinchilla approximation). Wall-clock time = FLOPs / (num_GPUs * GPU_throughput * MFU).

Hint 2 - Compute the ratio

FLOPs_1 = 6 * 1B * 20B = 1.2 * 10^20. FLOPs_2 = 6 * 10B * 200B = 1.2 * 10^22. Ratio = 100x more compute. If hardware is the same, time = 100x * 24 hours on 8 GPUs.

Hint 3 - Full Solution

(a) Time estimate:

FLOPs for 1B model: 6 * 10^9 * 2 * 10^10 = 1.2 * 10^20 FLOPs for 10B model: 6 * 10^10 * 2 * 10^11 = 1.2 * 10^22

Ratio: 100x more compute.

Assuming linear scaling (same GPU throughput): 100 * 24 = 2,400 hours = 100 days on 8 A100s.

(b) GPUs for 24 hours:

Need 100x speedup over the 8-GPU baseline. Assuming perfect linear scaling: 8 * 100 = 800 A100 GPUs.

In practice, communication overhead reduces efficiency. With ~70% scaling efficiency at 800 GPUs: need ~1,150 GPUs. Communication overhead increases with GPU count due to all-reduce operations for gradient synchronization.

Real-world considerations:

10B parameters in float16 = 20 GB. Fits on a single A100 (80 GB).
But training requires: params (20 GB) + gradients (20 GB) + Adam states (60 GB) + activations (~40 GB for batch 32) = ~140 GB. Need at least 2 GPUs with model parallelism.
At 800+ GPUs, data parallelism is sufficient. Use ZeRO-2 to shard optimizer states.
MFU (Model FLOPS Utilization) typically 40-55% for large-scale training.

Scoring:

Strong Hire: Correct FLOP estimation, accounts for scaling efficiency, mentions memory requirements and parallelism strategy.
Lean Hire: Correct FLOP estimation but assumes perfect scaling.
No Hire: Cannot estimate FLOPs for a training run.

Problem 4: Identify the Complexity

What is the time and space complexity of each operation?

#	Code	Your Answer: Time	Your Answer: Space
1	`np.dot(a, b)` where a, b are (n,)
2	`A @ B` where A is (m, n), B is (n, p)
3	`np.linalg.svd(X)` where X is (m, n)
4	`sorted(lst)` where lst has n elements
5	`set(lst)` where lst has n elements
6	`dict_lookup[key]`
7	`np.unique(arr)` where arr has n elements
8	Attention for (batch, seq, dim)
9	`df.groupby('col').mean()` with n rows
10	Training 1 epoch of GD with n samples, d features

Full Answers

#	Code	Time	Space
1	Dot product of two vectors	O(n)	O(1)
2	Matrix multiplication	O(m * n * p)	O(m * p)
3	SVD	O(min(mn^2, m^2 n))	O(m * n)
4	Sorting (Timsort)	O(n log n)	O(n)
5	Set construction	O(n) amortized	O(n)
6	Hash map lookup	O(1) amortized	O(1)
7	Unique (sort + dedup)	O(n log n)	O(n)
8	Self-attention	O(batch * seq^2 * dim)	O(batch * seq^2)
9	GroupBy mean	O(n)	O(groups)
10	Gradient descent epoch	O(n * d)	O(d)

Interview Cheat Sheet

Concept	Key Insight	One-Liner	Red Flag
Big-O	Upper bound on growth rate	Drop constants and lower terms	Saying "O(2n)" instead of "O(n)"
Big-Theta	Tight bound (both upper and lower)	The "real" complexity	Confusing Big-O with exact count
Amortized	Average over a sequence of operations	Dynamic array append is O(1) amortized	Saying "dynamic array insert is O(n)"
Matrix multiply	O(n * m * p) for (n,m) x (m,p)	Foundation of all ML compute	Not knowing this is the bottleneck
Attention	O(n^2 * d) time, O(n^2) space	Quadratic in sequence length	Saying "attention is O(n)"
Training complexity	6 * N * D FLOPs (Chinchilla)	Parameters x tokens x 6	Not accounting for optimizer memory
Inference	O(L * d^2) per token with KV cache	Cache K/V, compute only new Q	Not knowing KV cache
Space vs time	Sometimes trade one for the other	DP uses space to save time, streaming saves space	Analyzing only time, forgetting space
Back-of-envelope	10^9 ops/sec CPU, 10^13 FLOPS GPU	Convert Big-O to wall clock	Cannot estimate actual running time
When O(n^2) breaks	n > 10^5 on CPU, n > 10^6 on GPU	Need approximate/sublinear methods	Proposing O(n^2) for 10M data points

Spaced Repetition Checkpoints

Day 0 - Initial Learning

Read this entire page
Write the formal definition of Big-O from memory
Fill in the "Identify the Complexity" table without looking at answers
Complete the self-assessment

Day 3 - First Recall

State the time complexity of: hash table ops, binary search, merge sort, matrix multiply
Explain why attention is O(n^2 * d) and what n and d represent
Estimate the memory required to train a 7B parameter model

Day 7 - Connections

Solve Problem 1 (feature pipeline complexity) without hints
Explain training vs inference complexity for a transformer
Convert O(n^2) with n=100K to wall-clock time on CPU and GPU

Day 14 - Application

Solve Problem 3 (scaling laws) under timed conditions (10 min)
Explain KV cache and its effect on autoregressive generation complexity
Given a system design problem, identify the computational bottleneck

Day 21 - Mock Interview

Have someone give you code - analyze its complexity in real time
Estimate training time for a given model/dataset/hardware configuration
Propose scalable alternatives for O(n^2) operations on large datasets

Key Takeaways

Complexity analysis in ML is about feasibility, not theory. The question is never "what is the Big-O?" in isolation. It is always "is this feasible at the scale we need, and if not, what is the alternative?" Always connect complexity to real numbers.
Attention's O(n^2) is the defining constraint of modern AI. Understanding this quadratic bottleneck - and the zoo of methods to address it (Flash Attention, sparse attention, linear attention) - is essential for any transformer-related role.
Training is dominated by memory, inference by latency. During training, optimizer states often consume 4-12x the model parameter memory. During inference, the KV cache and autoregressive generation dominate. Know both dimensions.
Back-of-envelope estimation is a senior skill. Being able to say "this will take ~2 hours on 8 A100s" from first principles (FLOPs, throughput, MFU) is what separates senior engineers from junior ones. Practice this.

Next Steps

Continue to Mock Coding Problems for 15-20 full interview problems that combine everything from this section - DSA, NumPy, Pandas, SQL, ML implementation, and complexity analysis.

The Real Interview Moment​

What You Will Master​

Self-Assessment: Where Are You Now?​

Part 1 - Formal Definitions​

Big-O, Big-Omega, Big-Theta​

Common Growth Rates​

Rules for Combining​

Part 2 - Data Structure Complexity​

Essential Data Structures​

Amortized Analysis​

Part 3 - Sorting and Searching Complexity​

Part 4 - ML Algorithm Complexity​

Training Complexity​

Inference Complexity​

Part 5 - Deep Learning Complexity​

Transformer Self-Attention: The O(n^2) Problem​

Approaches to O(n^2) Attention​

Convolution Complexity​

Batch Size Effects​

Part 6 - Space Complexity for ML​

Memory Estimation​

Memory During Training​

Part 7 - Common Complexity Analysis Patterns​

Pattern 1: Nested Loops​

Pattern 2: Logarithmic​

Pattern 3: Matrix Operations​

Pattern 4: Recursion and the Master Theorem​

Part 8 - Real-World Estimation​

Translating Big-O to Wall-Clock Time​

The "Will It Fit?" Framework​

Practice Problems​

Problem 1: Complexity of Feature Engineering Pipeline​

Problem 2: Training vs Inference Complexity​

Problem 3: Scaling Laws Back-of-Envelope​

Problem 4: Identify the Complexity​

Interview Cheat Sheet​

Spaced Repetition Checkpoints​

Day 0 - Initial Learning​

Day 3 - First Recall​

Day 7 - Connections​

Day 14 - Application​

Day 21 - Mock Interview​

Key Takeaways​

Next Steps​

The Real Interview Moment

What You Will Master

Self-Assessment: Where Are You Now?

Part 1 - Formal Definitions

Big-O, Big-Omega, Big-Theta

Common Growth Rates

Rules for Combining

Part 2 - Data Structure Complexity

Essential Data Structures

Amortized Analysis

Part 3 - Sorting and Searching Complexity

Part 4 - ML Algorithm Complexity

Training Complexity

Inference Complexity

Part 5 - Deep Learning Complexity

Transformer Self-Attention: The O(n^2) Problem

Approaches to O(n^2) Attention

Convolution Complexity

Batch Size Effects

Part 6 - Space Complexity for ML

Memory Estimation

Memory During Training

Part 7 - Common Complexity Analysis Patterns

Pattern 1: Nested Loops

Pattern 2: Logarithmic

Pattern 3: Matrix Operations

Pattern 4: Recursion and the Master Theorem

Part 8 - Real-World Estimation

Translating Big-O to Wall-Clock Time

The "Will It Fit?" Framework

Practice Problems

Problem 1: Complexity of Feature Engineering Pipeline

Problem 2: Training vs Inference Complexity

Problem 3: Scaling Laws Back-of-Envelope

Problem 4: Identify the Complexity

Interview Cheat Sheet

Spaced Repetition Checkpoints

Day 0 - Initial Learning

Day 3 - First Recall

Day 7 - Connections

Day 14 - Application

Day 21 - Mock Interview

Key Takeaways

Next Steps