NumPy for ML
The pull request sits in the review queue. The title says "optimize neural network forward pass." You open the diff expecting to see architectural changes - maybe a smarter batching strategy, maybe fused layers. Instead, it is a 47-line change that touches nothing but the array operations. Pure Python loops replaced with NumPy expressions. The benchmark comment at the bottom of the PR reads: "0.3 seconds per batch → 2 milliseconds per batch. 150x speedup. No algorithm change."
You stare at that number. One hundred and fifty times faster. The same mathematical computation. The same hardware. The only difference is how the numbers move through memory and which code executes the arithmetic. This is not a corner case - this is NumPy's typical behavior when you go from interpreted Python loops to vectorized C operations backed by SIMD instructions.
The engineer who wrote the original loop was not careless. They were thinking in Python - iterate over rows, compute each element, accumulate the result. That mental model produces correct code. It produces slow code because Python's interpreter adds roughly 50–100 nanoseconds of overhead per operation: bytecode dispatch, reference counting, dynamic type lookup. For one million float additions, that overhead adds up to 50–100 milliseconds of pure interpreter tax before a single floating-point unit does any real work.
NumPy bypasses all of that. It stores arrays in contiguous blocks of typed memory - a C array beneath a Python interface. When you call X @ W, NumPy drops into a C function that calls into an optimized BLAS library (OpenBLAS, MKL, or BLIS depending on your installation). That library issues AVX-512 SIMD instructions that process 16 float32 values per CPU clock cycle. The Python interpreter is not involved at all during the computation. It just waits.
This lesson covers NumPy the way ML engineers actually use it: not as a convenience wrapper around lists, but as a performance primitive that forms the computational foundation for every ML framework. You will learn broadcasting, memory layout, advanced indexing, einsum, SVD, and profiling - everything you need to write fast, correct array code.
What NumPy Actually Is
NumPy stores arrays in contiguous blocks of typed memory - a C array underneath a Python interface. Operations on NumPy arrays are implemented in C or Fortran, skip Python's interpreter overhead entirely, and often use SIMD instructions (AVX, SSE) that process multiple values in a single CPU clock cycle.
A Python loop over a million floats runs in roughly 100–200ms. The equivalent NumPy operation runs in 1–2ms. The 100x speedup is not an exaggeration - it is typical.
Array Creation
import numpy as np
# From Python lists
a = np.array([1.0, 2.0, 3.0]) # 1D, float64
b = np.array([[1, 2], [3, 4]]) # 2D, int64
# Factory functions
zeros = np.zeros((3, 4), dtype=np.float32)
ones = np.ones((100,), dtype=np.bool_)
eye = np.eye(5) # identity matrix
rng = np.random.default_rng(seed=42) # reproducible RNG
X = rng.standard_normal((1000, 20)) # 1000 samples, 20 features
# Ranges
idx = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
lin = np.linspace(0, 1, 101) # 101 points from 0 to 1
# Inspect shape and dtype immediately
print(X.shape) # (1000, 20)
print(X.dtype) # float64
print(X.nbytes) # 160000 - 160KB for 1000x20 float64
Choosing dtypes
| dtype | Bytes | When to use |
|---|---|---|
float64 | 8 | Default, full precision |
float32 | 4 | GPU training, halves memory |
float16 | 2 | Mixed-precision inference |
int64 | 8 | Large integer indices |
int32 | 4 | Typical label arrays |
bool | 1 | Masks and selections |
For ML training on GPU, float32 is almost always correct. float64 uses twice the memory and GPU fp64 throughput is 32x slower than fp32 on most consumer hardware.
Indexing and Slicing
X = np.random.default_rng(0).standard_normal((100, 10))
# Basic slicing - returns a view (no copy)
first_50 = X[:50] # rows 0–49
feature_3 = X[:, 3] # column 3, all rows
block = X[10:20, 2:5] # sub-matrix
# Boolean indexing - returns a copy
mask = X[:, 0] > 0 # boolean array of shape (100,)
positive_rows = X[mask] # rows where feature 0 is positive
# Integer indexing (fancy indexing) - returns a copy
idx = np.array([0, 5, 10, 15])
selected = X[idx] # 4 specific rows
# Setting values
X[X < -3] = -3 # clip extreme negatives in place
:::warning View vs copy
Slicing returns a view - modifying it modifies the original array. Fancy indexing (with an integer array or boolean mask) returns a copy. This distinction causes subtle bugs. When in doubt, call .copy() explicitly.
:::
Broadcasting: The Rules
Broadcasting lets you operate on arrays of different shapes without explicit loops or replication. It is one of NumPy's most powerful features and one of the most common sources of shape bugs.
The four rules, precisely stated:
- Arrays are compared element-wise starting from the trailing dimension and working backward.
- Two dimensions are compatible if they are equal, or one of them is exactly 1.
- If one array has fewer dimensions than the other, its shape is padded with 1s on the left until both shapes have the same length.
- If dimensions are incompatible (both greater than 1 and not equal), NumPy raises a
ValueError.
The broadcasting algorithm step by step:
Array A: shape (100, 10)
Array B: shape (10,)
Step 1: Pad B on the left → B becomes (1, 10)
Step 2: Compare trailing dim: 10 == 10 → compatible
Step 3: Compare leading dim: 100 vs 1 → stretch B to 100
Result shape: (100, 10)
# Example 1: add a bias vector to every row
# Shapes: (100, 10) + (10,) → (100, 10)
X = np.ones((100, 10))
bias = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) # shape (10,)
result = X + bias # each row gets bias added - no loop needed
# Example 2: normalize each feature (subtract mean, divide by std)
mean = X.mean(axis=0) # shape (10,) - column means
std = X.std(axis=0) # shape (10,)
X_norm = (X - mean) / std # broadcasting handles the (100,10) - (10,) case
# Example 3: column vector broadcast
# shape (100, 1) + (10,) → (100, 10)
row_sum = X.sum(axis=1, keepdims=True) # (100, 1)
X_rowscaled = X / row_sum # each row divided by its sum
# Example 4: outer product via broadcasting
u = np.array([1, 2, 3]) # shape (3,)
v = np.array([10, 20, 30, 40]) # shape (4,)
outer = u[:, np.newaxis] * v # (3,1) * (4,) → (3,4)
# Result:
# [[10, 20, 30, 40],
# [20, 40, 60, 80],
# [30, 60, 90, 120]]
Broadcasting quiz: predict the output shape
| A shape | B shape | Result shape | Compatible? |
|---|---|---|---|
(3, 4) | (4,) | (3, 4) | Yes |
(3, 4) | (3, 1) | (3, 4) | Yes |
(3, 4) | (1, 4) | (3, 4) | Yes |
(3, 4) | (3,) | - | No - trailing dim 3 != 4 |
(5, 1, 4) | (3, 4) | (5, 3, 4) | Yes |
(3, 4) | (4, 3) | - | No - shapes cannot align |
Common Broadcasting Gotcha
a = np.ones((3, 4))
b = np.ones((4, 3))
# a + b → ERROR: shapes (3,4) and (4,3) are not compatible
# This is NOT broadcast-compatible. Transpose b first:
a + b.T # (3,4) + (3,4) → fine
# Another gotcha: keepdims matters
X = np.random.randn(100, 10)
mean_wrong = X.mean(axis=0) # shape (10,) - fine for row-wise ops
row_mean = X.mean(axis=1) # shape (100,) - CANNOT subtract from (100,10) directly
row_mean_k = X.mean(axis=1, keepdims=True) # shape (100,1) - works!
X_centered = X - row_mean_k # (100,10) - (100,1) → (100,10) ✓
A reliable debugging strategy: print .shape on every intermediate result when a broadcast fails.
Vectorization vs Python Loops
The 150x speedup in practice
import time
import numpy as np
rng = np.random.default_rng(42)
X = rng.standard_normal((100_000, 50))
y = rng.integers(0, 2, size=100_000)
# Python loop: compute mean of positive class per feature
start = time.perf_counter()
means_loop = []
for j in range(X.shape[1]):
col = [X[i, j] for i in range(X.shape[0]) if y[i] == 1]
means_loop.append(sum(col) / len(col))
loop_time = time.perf_counter() - start
print(f"Loop: {loop_time:.3f}s")
# NumPy vectorized: same computation
start = time.perf_counter()
mask = y == 1
means_vec = X[mask].mean(axis=0)
vec_time = time.perf_counter() - start
print(f"NumPy: {vec_time:.4f}s")
print(f"Speedup: {loop_time / vec_time:.0f}x")
# Typical output: Loop: 8.2s | NumPy: 0.04s | Speedup: ~200x
Practical vectorization patterns
# Euclidean distance between each pair of rows (N x N matrix)
# Vectorized: use broadcasting
def pairwise_euclidean(X):
# ||x_i - x_j||^2 = ||x_i||^2 - 2*x_i*x_j + ||x_j||^2
sq_norms = (X ** 2).sum(axis=1) # (N,)
dot = X @ X.T # (N, N)
D_sq = sq_norms[:, None] - 2 * dot + sq_norms[None, :]
D_sq = np.maximum(D_sq, 0) # numerical safety
return np.sqrt(D_sq)
X = rng.standard_normal((500, 128))
D = pairwise_euclidean(X) # 500x500 distance matrix, milliseconds
# Softmax - numerically stable
def softmax(logits):
logits = logits - logits.max(axis=-1, keepdims=True) # subtract max
e = np.exp(logits)
return e / e.sum(axis=-1, keepdims=True)
# One-hot encode
def one_hot(y, num_classes):
N = y.shape[0]
out = np.zeros((N, num_classes), dtype=np.float32)
out[np.arange(N), y] = 1.0
return out
Advanced Indexing: Boolean, Fancy, np.where, np.nonzero
NumPy has three fundamentally different indexing modes after basic slicing, each with different performance characteristics and use cases.
Boolean indexing
Boolean indexing selects elements where a condition is True. It always returns a copy.
X = rng.standard_normal((1000, 20))
labels = rng.integers(0, 3, size=1000)
# Select all rows from class 2
class2 = X[labels == 2] # copy - shape depends on how many rows match
# Multi-condition: rows where feature 0 > 0 AND feature 1 < 0
mask = (X[:, 0] > 0) & (X[:, 1] < 0) # use & not 'and' for element-wise
filtered = X[mask]
# Set values in-place using boolean mask
X[X > 3.0] = 3.0 # clip at 3.0 (modifies original)
X[X < -3.0] = -3.0
# Count matches without extracting
n_outliers = (np.abs(X) > 3.0).sum()
Fancy indexing (integer array indexing)
Fancy indexing uses integer arrays to select elements. It returns a copy and is the basis for mini-batch sampling.
# Mini-batch sampling - core operation in training loops
N = 10_000
batch_size = 256
idx = rng.choice(N, size=batch_size, replace=False) # random batch
X_batch = X[idx] # shape (256, 20) - copy
# Gather operation: select specific (row, col) pairs
rows = np.array([0, 1, 2, 3])
cols = np.array([0, 5, 10, 15])
values = X[rows, cols] # shape (4,) - diagonal-like selection
# Reorder rows to sort by a feature
sort_idx = np.argsort(X[:, 0]) # indices that would sort column 0
X_sorted = X[sort_idx] # rows reordered
# Scatter: write to multiple positions
output = np.zeros((100, 20), dtype=np.float32)
target_rows = np.array([10, 20, 30])
output[target_rows] = X[:3] # write first 3 rows to positions 10, 20, 30
np.where - conditional element selection
# np.where(condition, x, y): choose from x where True, y where False
X_clipped = np.where(X > 0, X, 0.0) # ReLU: max(x, 0)
# Same as np.maximum(X, 0) but more general
# Replace NaNs with a sentinel value
data = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
data_clean = np.where(np.isnan(data), -1.0, data)
# Multi-condition: three-way assignment
result = np.where(X > 1.0, 1.0, np.where(X < -1.0, -1.0, X)) # hard tanh
# Return indices of True elements (without extracting values)
idx_2d = np.where(X > 3.0) # returns tuple of (row_indices, col_indices)
np.nonzero - index extraction
# np.nonzero returns indices of non-zero (truthy) elements
# Equivalent to np.where with a single condition argument
arr = np.array([0, 1, 0, 3, 0, 5])
nonzero_idx = np.nonzero(arr) # (array([1, 3, 5]),)
values = arr[np.nonzero(arr)] # [1, 3, 5]
# For 2D: returns tuple of arrays for each dimension
mask_2d = X > 2.5
row_idx, col_idx = np.nonzero(mask_2d)
# Performance comparison for selecting extreme values:
# np.where - fastest when you need the values
# np.nonzero - fastest when you only need the indices
# Boolean indexing - most readable, returns copy of selected rows
:::tip When is each fastest?
- Boolean indexing: use when selecting rows based on a condition - most readable
- Fancy indexing: use for batching, gathering, and permutations - the workhorse of training loops
- np.where: use for element-wise conditional assignment - cleaner than if/else on arrays
- np.nonzero: use when you need the index positions themselves (e.g., sparse operations) :::
Memory Layout: C-Order vs F-Order, Strides, and Cache Performance
This is the section most NumPy tutorials skip. It explains why some operations are 5x faster than equivalent operations on the same data.
C-order vs Fortran-order
NumPy arrays are stored as a flat block of bytes in memory. The strides tell NumPy how many bytes to skip to move one step in each dimension.
# Default: C-order (row-major) - rows are stored contiguously
A = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float64, order='C')
print(A.strides) # (24, 8) - move 24 bytes to next row, 8 bytes to next column
# Fortran-order (column-major) - columns are stored contiguously
B = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float64, order='F')
print(B.strides) # (8, 16) - move 8 bytes to next row, 16 bytes to next column
# Check contiguity
print(A.flags['C_CONTIGUOUS']) # True
print(A.flags['F_CONTIGUOUS']) # False
print(B.flags['C_CONTIGUOUS']) # False
print(B.flags['F_CONTIGUOUS']) # True
In memory, a C-order 2x3 array looks like:
Elements in memory order: [1, 2, 3, 4, 5, 6]
Row 0: elements 1, 2, 3 are adjacent
Row 1: elements 4, 5, 6 are adjacent
→ Reading a whole row is sequential (cache-friendly)
Fortran-order stores the same data as:
Elements in memory order: [1, 4, 2, 5, 3, 6]
Column 0: elements 1, 4 are adjacent
Column 1: elements 2, 5 are adjacent
→ Reading a whole column is sequential (cache-friendly)
Why this matters for performance
Modern CPUs load data in cache lines (typically 64 bytes = 8 float64 values). When you access element [0, 0], the CPU loads elements [0, 0] through [0, 7] into cache. If your next access is [0, 1] (next column in C-order), it is already in cache. If your next access is [1, 0] (next row in Fortran-order for a C-layout array), it is not - causing a cache miss.
import numpy as np
import time
big = np.random.randn(5000, 5000).astype(np.float64)
# Row-wise sum (fast for C-order - reads contiguous memory per row)
start = time.perf_counter()
for _ in range(10):
_ = big.sum(axis=1)
print(f"Row sum: {(time.perf_counter() - start)*1000:.1f} ms")
# Column-wise sum (reads with large strides - less cache-friendly for C-order)
start = time.perf_counter()
for _ in range(10):
_ = big.sum(axis=0)
print(f"Col sum: {(time.perf_counter() - start)*1000:.1f} ms")
# Row sum is typically 2-4x faster due to cache behavior
# The fix for matmul: make arrays contiguous before heavy computation
W = rng.standard_normal((512, 256))
W_T = W.T # view with swapped strides - NOT contiguous
W_T_cont = np.ascontiguousarray(W.T) # new array, C-contiguous
X_batch = rng.standard_normal((64, 512))
# Time the difference
t1 = time.perf_counter()
for _ in range(1000):
_ = X_batch @ W_T
print(f"Non-contiguous W.T: {(time.perf_counter()-t1)*1000:.1f} ms")
t2 = time.perf_counter()
for _ in range(1000):
_ = X_batch @ W_T_cont
print(f"Contiguous W.T: {(time.perf_counter()-t2)*1000:.1f} ms")
# ascontiguousarray version is often 20-40% faster for repeated matmuls
Strides and reshape: when copies happen
a = np.arange(12, dtype=np.float64) # shape (12,)
# Reshape - returns a view if possible (strides adjusted)
b = a.reshape(3, 4)
print(b.base is a) # True - view, same memory
# Transpose - view with swapped strides
c = b.T
print(c.flags['C_CONTIGUOUS']) # False - strides are non-sequential
print(c.strides) # (8, 32) - non-standard
# After transpose, if you reshape, NumPy must copy (cannot express as strides)
d = c.reshape(12) # triggers a copy
print(d.base is c) # None - owned memory, c was not contiguous
# Best practice: call ascontiguousarray before repeated access patterns
# that require sequential memory
W_ready = np.ascontiguousarray(W.T) # pay the copy cost once
NumPy Memory Layout and Broadcasting - Architecture Diagram
Common ML Operations
Dot products and matrix multiply
rng = np.random.default_rng(42)
# Weights W: (out_dim, in_dim), input x: (batch, in_dim)
W = rng.standard_normal((64, 128))
x = rng.standard_normal((32, 128))
# Linear layer forward pass
out = x @ W.T # (32, 64) - batch matrix multiply
# Equivalent with np.dot, np.matmul - prefer @ operator for clarity
assert np.allclose(out, np.matmul(x, W.T))
Norms
v = np.array([3.0, 4.0])
np.linalg.norm(v) # 5.0 - L2 norm
np.linalg.norm(v, ord=1) # 7.0 - L1 norm
np.linalg.norm(v, ord=np.inf) # 4.0 - L∞ norm
# Row-wise L2 norms for a matrix
X = rng.standard_normal((100, 20))
row_norms = np.linalg.norm(X, axis=1) # shape (100,)
X_unit = X / row_norms[:, np.newaxis] # normalize each row
Covariance matrix
# X: (N_samples, N_features) - zero-mean assumed
X_centered = X - X.mean(axis=0)
cov = (X_centered.T @ X_centered) / (X.shape[0] - 1)
# shape: (N_features, N_features)
# Or use np.cov (handles centering internally)
cov_np = np.cov(X.T) # np.cov takes (features, samples)
Linear Algebra for ML: SVD, Eigenvalues, PCA from Scratch
Linear algebra is the language of ML. Understanding how NumPy implements these operations - and what they mean geometrically - separates ML engineers who use libraries from those who understand them.
Eigenvalues and Eigenvectors
For a square matrix , the eigenvector satisfies . The scalar is the eigenvalue - it tells you how much stretches or compresses in the direction of .
# Symmetric matrix - eigenvalues are real, eigenvectors are orthogonal
A = np.array([[4.0, 2.0], [2.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eigh(A) # eigh for symmetric/Hermitian
# eigenvalues: ascending order
# eigenvectors: columns are the eigenvectors
print("Eigenvalues:", eigenvalues) # e.g., [1.76, 5.24]
print("Eigenvectors:\n", eigenvectors) # orthonormal columns
# Verify: A @ v = λ * v for each eigenvector
for i in range(len(eigenvalues)):
v = eigenvectors[:, i]
lam = eigenvalues[i]
assert np.allclose(A @ v, lam * v), f"Eigenvector {i} failed"
Singular Value Decomposition (SVD) - the backbone of PCA
Every matrix of shape can be decomposed as , where:
- is - left singular vectors (output space basis)
- is diagonal - singular values (importance weights)
- is - right singular vectors (input space basis)
SVD is the most important factorization in data science. It underlies PCA, matrix completion, recommendation systems, and image compression.
X = rng.standard_normal((200, 50))
X_centered = X - X.mean(axis=0)
# Economy SVD (truncated at min(m,n))
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
# U: (200, 50), s: (50,), Vt: (50, 50)
# Singular values measure variance along each principal direction
print("Top 5 singular values:", s[:5])
print("Condition number:", s[0] / s[-1]) # ratio of largest to smallest
# Project to k dimensions (PCA)
k = 10
X_pca = X_centered @ Vt[:k].T # (200, 10)
# Explained variance ratio
explained_var = (s ** 2) / (s ** 2).sum()
print(f"Top-{k} components explain {explained_var[:k].sum():.1%} of variance")
# Reconstruct with truncated SVD
X_reconstructed = U[:, :k] * s[:k] @ Vt[:k]
reconstruction_error = np.linalg.norm(X_centered - X_reconstructed, 'fro')
print(f"Frobenius reconstruction error: {reconstruction_error:.3f}")
PCA from scratch - complete implementation
def pca_from_scratch(X, n_components):
"""
Principal Component Analysis using SVD.
Numerically stable - avoids forming the covariance matrix explicitly.
This is exactly what sklearn's PCA does internally.
"""
# Step 1: center the data
mean = X.mean(axis=0)
X_centered = X - mean
# Step 2: SVD on centered data
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
# Step 3: principal components are the rows of Vt
components = Vt[:n_components] # shape (n_components, n_features)
# Step 4: project
X_reduced = X_centered @ components.T # shape (n_samples, n_components)
# Step 5: explained variance ratio
total_var = (s ** 2).sum()
evr = (s[:n_components] ** 2) / total_var
return X_reduced, components, evr, mean
# Use it
X = rng.standard_normal((500, 100))
X_reduced, components, evr, mean = pca_from_scratch(X, n_components=20)
print(f"Shape after PCA: {X_reduced.shape}")
print(f"Variance explained: {evr.sum():.1%}")
Solving linear systems
# Solve Ax = b without computing A inverse (never use np.linalg.inv in practice)
A = rng.standard_normal((50, 50))
A = A.T @ A + 50 * np.eye(50) # make symmetric positive definite
b = rng.standard_normal(50)
# Numerically stable solution via LU factorization
x = np.linalg.solve(A, b)
print("Residual:", np.linalg.norm(A @ x - b)) # should be near machine epsilon
# Least squares (overdetermined system)
# Minimize ||Ax - b||^2 when A is (m, n) with m > n
A_rect = rng.standard_normal((200, 10))
b_rect = rng.standard_normal(200)
x_lstsq, residuals, rank, sv = np.linalg.lstsq(A_rect, b_rect, rcond=None)
print(f"Solution shape: {x_lstsq.shape}, rank: {rank}")
np.einsum - Expressive Tensor Contractions
einsum uses Einstein summation notation to express complex tensor operations without loops or intermediate allocations. The notation: each letter is an index, repeated letters are summed over, output indices are listed after the arrow.
Einsum notation reference table
| Operation | einsum string | Equivalent |
|---|---|---|
| Matrix multiply | 'ij,jk->ik' | A @ B |
| Batch matrix multiply | 'bij,bjk->bik' | loop over batch |
| Dot product | 'i,i->' | np.dot(a, b) |
| Outer product | 'i,j->ij' | np.outer(a, b) |
| Trace | 'ii->' | np.trace(A) |
| Transpose | 'ij->ji' | A.T |
| Row-wise dot products | 'ij,ij->i' | (A * B).sum(axis=1) |
| Batch outer product | 'bi,bj->bij' | loop over batch |
| Tensor contraction | 'ijk,jk->ij' | no direct NumPy op |
rng = np.random.default_rng(0)
# Matrix multiply: (M, K) x (K, N) → (M, N)
A = rng.standard_normal((32, 64))
B = rng.standard_normal((64, 128))
C = np.einsum('mk,kn->mn', A, B) # same as A @ B
# Batch matrix multiply: (B, M, K) x (B, K, N) → (B, M, N)
Abatch = rng.standard_normal((16, 32, 64))
Bbatch = rng.standard_normal((16, 64, 128))
Cbatch = np.einsum('bmk,bkn->bmn', Abatch, Bbatch)
# Trace (sum of diagonal)
sq = rng.standard_normal((10, 10))
trace = np.einsum('ii->', sq) # same as np.trace(sq)
# Outer product: (M,) x (N,) → (M, N)
u = rng.standard_normal((5,))
v = rng.standard_normal((8,))
outer = np.einsum('i,j->ij', u, v) # same as np.outer(u, v)
# Multi-head attention scores: Q(B,H,T,D) x K(B,H,T,D) → scores(B,H,T,T)
B_size, H, T, D = 8, 4, 64, 32
Q = rng.standard_normal((B_size, H, T, D))
K = rng.standard_normal((B_size, H, T, D))
scores = np.einsum('bhid,bhjd->bhij', Q, K) / np.sqrt(D)
# shape: (8, 4, 64, 64) - attention weight matrix per head
einsum performance: when to use optimize=True
# For chains of three or more matrices, optimize=True finds the best contraction order
# This can be orders of magnitude faster
A = rng.standard_normal((100, 200))
B = rng.standard_normal((200, 300))
C_mat = rng.standard_normal((300, 100))
# Without optimize: contracts left to right, O(100*200*300 + 100*300*100)
result_slow = np.einsum('ij,jk,kl->il', A, B, C_mat)
# With optimize: finds optimal contraction order automatically
result_fast = np.einsum('ij,jk,kl->il', A, B, C_mat, optimize=True)
# For simple two-tensor contractions, use @ directly - it is faster than einsum
# einsum overhead: index parsing, path planning (~10 microseconds)
# @ uses BLAS directly: sub-microsecond dispatch
# Rule of thumb:
# 2 arrays → use @, np.dot, or np.tensordot
# 3+ arrays or non-standard contractions → use einsum with optimize=True
:::warning When einsum is slower than matmul
For simple 2D matrix multiply, np.einsum('ij,jk->ik', A, B) is slower than A @ B because @ dispatches directly to BLAS while einsum parses the index string, determines the operation, and then calls into C code. The overhead is small in absolute terms (microseconds) but matters in tight loops. Always use @ for 2D matrix multiply and save einsum for cases where it provides genuine expressiveness that @ cannot.
:::
Random Number Generation for Reproducibility
# Modern API: use default_rng with a seed
rng = np.random.default_rng(seed=42)
# Draw from distributions
X_train = rng.standard_normal((1000, 20))
y_labels = rng.integers(0, 10, size=1000)
noise = rng.uniform(-0.1, 0.1, size=(1000,))
# Shuffle indices reproducibly
idx = rng.permutation(1000)
X_shuffled = X_train[idx]
# Why NOT the legacy API
# np.random.seed(42) # global state - breaks with multiple threads
# np.random.randn(...) # uses the global seed
# Sampling strategies for ML
# 1. Bootstrap sampling (with replacement)
bootstrap_idx = rng.choice(1000, size=1000, replace=True)
# 2. Stratified sampling (preserve class distribution)
def stratified_sample(X, y, n_samples, rng):
classes = np.unique(y)
n_per_class = n_samples // len(classes)
indices = []
for cls in classes:
cls_idx = np.where(y == cls)[0]
sampled = rng.choice(cls_idx, size=n_per_class, replace=False)
indices.append(sampled)
return np.concatenate(indices)
# 3. Reproducible train/test splits
def reproducible_split(X, y, test_size=0.2, seed=42):
rng = np.random.default_rng(seed)
n = len(X)
idx = rng.permutation(n)
split = int(n * (1 - test_size))
train_idx, test_idx = idx[:split], idx[split:]
return X[train_idx], X[test_idx], y[train_idx], y[test_idx]
# 4. Different rngs for different subsystems - avoids seed coupling
data_rng = np.random.default_rng(0)
model_rng = np.random.default_rng(1)
aug_rng = np.random.default_rng(2)
:::tip Seed management in ML
Never use np.random.seed() in production ML code. It sets a global state that is not thread-safe and creates hidden coupling between modules. Instead, create a default_rng instance per subsystem (data loading, model initialization, augmentation) with independent seeds. This makes each subsystem independently reproducible and debuggable.
:::
Performance Profiling: Timeit, np.vectorize, and Numba
Using timeit correctly
import numpy as np
import timeit
rng = np.random.default_rng(42)
X = rng.standard_normal((1000, 50))
W = rng.standard_normal((50, 100))
# timeit - always use number > 1 to average out noise
time_matmul = timeit.timeit(lambda: X @ W, number=10_000)
print(f"matmul: {time_matmul / 10_000 * 1e6:.2f} µs per call")
# In Jupyter: use %timeit magic (automatically selects number of repeats)
# %timeit X @ W
# Output: 45.2 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Profile by component to find the bottleneck
def forward_pass(X, W, b):
z = X @ W # linear
z = z + b # bias
return np.maximum(z, 0) # ReLU
b = np.zeros(100)
time_full = timeit.timeit(lambda: forward_pass(X, W, b), number=10_000)
time_matmul_only = timeit.timeit(lambda: X @ W, number=10_000)
print(f"Matmul is {time_matmul_only / time_full * 100:.0f}% of forward pass time")
np.vectorize - and why it is slow
# np.vectorize makes a scalar function work on arrays
# It looks like it should be fast - it is NOT
def my_sigmoid(x):
return 1.0 / (1.0 + np.exp(-x))
v_sigmoid = np.vectorize(my_sigmoid)
X_flat = rng.standard_normal(100_000)
# These all compute the same thing:
t1 = timeit.timeit(lambda: v_sigmoid(X_flat), number=100)
t2 = timeit.timeit(lambda: 1.0 / (1.0 + np.exp(-X_flat)), number=100)
print(f"np.vectorize: {t1/100*1000:.2f} ms") # ~80ms
print(f"Vectorized: {t2/100*1000:.2f} ms") # ~1.5ms
# np.vectorize is syntactic sugar for a Python loop. Under the hood it calls
# your Python function once per element. The only advantage is broadcasting
# support for functions that do not accept array arguments natively.
:::danger np.vectorize is not a performance tool
np.vectorize is documentation in the NumPy source: "The vectorized function evaluates pyfunc over successive tuples of the input arrays like the python map function, except it uses the broadcasting rules of numpy." It is a Python loop with broadcasting. Using it in a hot path will destroy performance. Write vectorized NumPy expressions directly.
:::
Numba JIT - when NumPy is not enough
# When you genuinely cannot vectorize (custom recurrences, complex conditionals):
# Numba compiles Python to LLVM machine code
from numba import njit
import numpy as np
import timeit
@njit(cache=True)
def compute_recurrence(arr):
"""Exponentially weighted moving average - inherently sequential."""
result = np.empty_like(arr)
result[0] = arr[0]
alpha = 0.3
for i in range(1, len(arr)):
result[i] = alpha * arr[i] + (1 - alpha) * result[i - 1]
return result
data = np.random.randn(1_000_000)
# Warm up JIT compilation
_ = compute_recurrence(data[:100])
t_numba = timeit.timeit(lambda: compute_recurrence(data), number=100)
print(f"Numba EWMA: {t_numba/100*1000:.2f} ms")
# Comparable to C speed - cannot be done faster with pure NumPy
Structured Arrays and Recarrays for Mixed-Type Data
Most ML data is heterogeneous - integers, floats, and strings in the same row. NumPy's structured arrays handle this without converting everything to object dtype.
# Define a structured dtype
dtype = np.dtype([
('user_id', np.int32),
('age', np.float32),
('score', np.float64),
('label', np.int8),
('country', 'U3'), # 3-character Unicode string
])
# Create structured array
data = np.array([
(1001, 25.0, 0.872, 1, 'US'),
(1002, 31.5, 0.341, 0, 'UK'),
(1003, 28.0, 0.615, 1, 'DE'),
], dtype=dtype)
# Access fields by name
print(data['user_id']) # [1001 1002 1003]
print(data['score']) # [0.872 0.341 0.615]
# Boolean indexing still works
high_scorers = data[data['score'] > 0.5]
# Convert numeric fields to a 2D float array for ML
features = np.stack([
data['age'].astype(np.float32),
data['score'].astype(np.float32),
data['label'].astype(np.float32),
], axis=1)
# shape: (3, 3) - ML-ready feature matrix
# Memory comparison: structured vs object array
struct_mem = data.nbytes
print(f"Structured array: {struct_mem} bytes")
# object arrays store Python pointers - much more overhead
:::note When to use structured arrays Structured arrays are most useful for: (1) reading binary data files with known schema (e.g., HDF5, binary logs), (2) passing heterogeneous data between C extensions and Python, (3) intermediate data representations that are too large for Pandas but need named columns. For typical tabular ML workflows, Pandas DataFrames are more ergonomic. For pure numeric pipelines, use regular NumPy arrays. :::
YouTube Resources
| Video | Channel | Description |
|---|---|---|
| NumPy Tutorial for Beginners | freeCodeCamp | Comprehensive NumPy crash course covering arrays, indexing, and operations |
| Broadcasting in NumPy | Sentdex | Visual step-by-step broadcasting explanation with practical examples |
| NumPy Internals | PyData | How NumPy works under the hood - strides, memory layout, BLAS |
| Scientific Python Lecture Notes | SciPy | NumPy for scientific computing - SVD, linear algebra, and performance |
| Advanced NumPy | EuroSciPy | Structured arrays, advanced indexing, and memory-efficient patterns |
Performance Tips Summary
# 1. Avoid Python loops - use vectorized operations
# Bad:
result = [np.dot(X[i], w) for i in range(len(X))]
# Good:
result = X @ w
# 2. Use in-place operations when memory is tight
X -= X.mean(axis=0) # in-place, no copy
X /= X.std(axis=0)
# 3. Pre-allocate output arrays
out = np.empty((N, D), dtype=np.float32)
for i, chunk in enumerate(chunks):
out[i * chunk_size : (i + 1) * chunk_size] = process(chunk)
# 4. Use float32 for GPU pipelines
X = X.astype(np.float32)
# 5. Profile with timeit before optimizing
import timeit
timeit.timeit(lambda: X @ W, number=1000)
# 6. Make arrays contiguous before repeated matmul
W_cont = np.ascontiguousarray(W.T)
# 7. Use np.einsum with optimize=True for three-or-more tensor contractions
result = np.einsum('ij,jk,kl->il', A, B, C, optimize=True)
# 8. Never use np.vectorize in a hot path - write vectorized expressions directly
# 9. Use keepdims=True to avoid reshape errors in broadcasting
# 10. Use np.random.default_rng(seed) - never np.random.seed() for production code
Interview Q&A
Q1: Walk me through the four broadcasting rules and give an example where each one applies.
Broadcasting aligns shapes from the trailing dimension leftward. Rule 1: trailing dimensions must be compared first - shapes (100, 10) and (10,) compare their rightmost dimensions: 10 == 10, compatible. Rule 2: a dimension of size 1 is stretched to match the other - (100, 1) and (100, 10) stretch the 1 to 10. Rule 3: if one array has fewer dimensions, it is padded with 1s on the left - (10,) becomes (1, 10) when paired with a 2D array. Rule 4: if two dimensions are both greater than 1 and unequal, NumPy raises ValueError. A concrete ML example: (X - mean) / std where X is (N, D) and mean, std are (D,) - the mean and std are padded to (1, D) and stretched to (N, D) without any data copy.
Q2: What is the difference between C-order and Fortran-order memory layout, and when does it affect performance?
C-order (row-major) stores rows contiguously - element [i, j] is adjacent to [i, j+1]. Fortran-order (column-major) stores columns contiguously - element [i, j] is adjacent to [i+1, j]. NumPy defaults to C-order. Performance impact: the CPU loads data in 64-byte cache lines. In a C-order array, reading across a row is sequential (cache-friendly); reading down a column requires jumping by n_cols * element_size bytes (cache-unfriendly). The most common performance issue is repeated matrix multiply with W.T where W is C-contiguous - W.T has non-sequential strides. Calling np.ascontiguousarray(W.T) pays a one-time copy cost and enables BLAS to use more efficient memory access patterns in the subsequent multiplications.
Q3: When is einsum slower than matmul, and when is it faster?
For a simple 2D matrix multiply, np.einsum('ij,jk->ik', A, B) is slower than A @ B because @ dispatches directly to BLAS (OpenBLAS/MKL) with near-zero overhead, while einsum parses the index string, determines the contraction strategy, and then calls C code. The overhead is small (microseconds) but measurable in tight loops. einsum becomes advantageous when: (1) you have three or more tensor contractions - use optimize=True to find the optimal contraction order, which can be orders of magnitude faster than sequential matmul; (2) you need batch outer products, per-example dot products, or contractions that have no direct NumPy equivalent; (3) expressing multi-head attention scores as 'bhid,bhjd->bhij' is cleaner and avoids creating intermediate reshaped tensors.
Q4: How would you implement scaled dot-product attention using only NumPy?
def scaled_dot_product_attention(Q, K, V, mask=None):
# Q, K, V: (batch, heads, seq_len, d_k)
d_k = Q.shape[-1]
# Compute attention scores
scores = np.einsum('bhid,bhjd->bhij', Q, K) / np.sqrt(d_k)
# Apply mask (e.g., causal mask for decoder)
if mask is not None:
scores = np.where(mask, scores, -1e9)
# Softmax over key dimension
scores -= scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
# Weighted sum of values
output = np.einsum('bhij,bhjd->bhid', weights, V)
return output, weights
The key insights: use einsum for the QK^T computation to handle batch and head dimensions naturally, subtract the max before softmax for numerical stability, and use einsum again for the weighted sum of values.
Q5: How would you vectorize a cross-entropy loss function for a batch of predictions?
def cross_entropy_loss(logits, labels):
# logits: (N, C), labels: (N,) integer class indices
# Numerically stable: subtract max before exp
logits_stable = logits - logits.max(axis=1, keepdims=True)
log_sum_exp = np.log(np.exp(logits_stable).sum(axis=1)) # (N,)
# Gather the log-probability for the correct class
correct_logits = logits_stable[np.arange(len(labels)), labels] # fancy indexing
# Cross-entropy = -log_softmax(correct_class)
loss = -correct_logits + log_sum_exp
return loss.mean()
# The key NumPy patterns used:
# - max subtraction for numerical stability (broadcasting)
# - fancy indexing with np.arange to select correct class logits
# - fully vectorized - no Python loop over samples
Q6: What are the memory layout effects when you transpose a matrix and then multiply it repeatedly? How do you fix it?
When you call W.T, NumPy returns a view with swapped strides - no data is copied. The transposed array is Fortran-contiguous (columns are sequential in memory) but not C-contiguous. When BLAS receives a non-contiguous matrix, it must use a strided access pattern that defeats CPU cache prefetching. For a single matmul, this overhead is small. But in a training loop with 1000+ iterations, it adds up. The fix: W_cont = np.ascontiguousarray(W.T) creates a new C-contiguous array - pay the copy cost once, outside the loop, and then use W_cont for all multiplications. Speedup is typically 20–40% for large matrices in repeated operations.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Tensor Viewer demo on the EngineersOfAI Playground - no code required.
:::
