Skip to main content

NumPy for Interviews - Vectorized Thinking for ML Engineers

Reading time: ~45 min | Interview relevance: Critical | Roles: MLE, AI Eng, Data Scientist, Research Engineer

The Real Interview Moment

You are thirty minutes into a Meta MLE phone screen. The interviewer shares a Colab notebook and says: "Implement batch normalization using only NumPy. No loops." You stare at the cell. You know the formula - subtract mean, divide by standard deviation, scale and shift. But the input is a 4D tensor of shape (batch, channels, height, width), and you need to normalize per channel. Your fingers hover over the keyboard. Do you reach for a for-loop over channels, or do you see the axis parameter, the keepdims trick, the broadcasting shape?

This is the moment that separates candidates who use NumPy from candidates who think in NumPy. Every AI interview involving numerical computation will test whether you can express mathematical operations as vectorized array manipulations - without loops, without copies, without wasted memory. The interviewer is not testing whether you memorized the API. They are testing whether you understand how data lives in memory and how to transform it efficiently.

Candidates who write for i in range(batch_size) get a polite "we'll get back to you." Candidates who write (x - x.mean(axis=(0,2,3), keepdims=True)) / (x.std(axis=(0,2,3), keepdims=True) + eps) in one line get a "strong hire." This page gives you everything you need to be in the second group.

What You Will Master

  • Create arrays efficiently using arange, linspace, zeros, ones, eye, and random generators
  • Manipulate shapes with reshape, transpose, expand_dims, squeeze, and stacking operations
  • Apply broadcasting rules to eliminate loops from any element-wise computation
  • Vectorize loop-based code into array operations with 100x speedups
  • Perform linear algebra operations: dot products, matrix multiplication, eigendecomposition, SVD
  • Use advanced indexing: boolean masks, fancy indexing, np.where, np.take_along_axis
  • Understand memory layout: C-order vs Fortran-order, strides, views vs copies
  • Solve 10 interview-grade NumPy problems from scratch

Self-Assessment: Where Are You Now?

Skill1 - Cannot2 - Vaguely3 - Can Do4 - Can Optimize5 - Can TeachYour Score
Create arrays (zeros, eye, random)___
Reshape and transpose correctly___
Apply broadcasting rules___
Vectorize a loop-based computation___
Matrix multiplication (@ operator)___
Boolean and fancy indexing___
Explain memory layout and strides___
Implement ML operations without loops___

Target: All 4s and 5s before your interview.

Part 1 - Array Creation and Manipulation

Essential Array Creation

import numpy as np

# From data
a = np.array([1, 2, 3]) # 1D array
b = np.array([[1, 2], [3, 4]]) # 2D array (2x2 matrix)

# Filled arrays
zeros = np.zeros((3, 4)) # 3x4 of zeros
ones = np.ones((2, 3, 4)) # 2x3x4 of ones
full = np.full((3, 3), 7.0) # 3x3 filled with 7.0
eye = np.eye(4) # 4x4 identity matrix

# Sequences
seq = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
lin = np.linspace(0, 1, 5) # [0.0, 0.25, 0.5, 0.75, 1.0]
log = np.logspace(0, 3, 4) # [1, 10, 100, 1000]

# Random (modern API)
rng = np.random.default_rng(42)
uniform = rng.uniform(0, 1, size=(3, 3)) # Uniform [0, 1)
normal = rng.standard_normal((3, 3)) # Standard normal
integers = rng.integers(0, 10, size=(5,)) # Random ints [0, 10)
60-Second Answer

"NumPy arrays are fixed-type, contiguous memory blocks - unlike Python lists, which are arrays of pointers to scattered heap objects. This is why NumPy is fast: operations map to BLAS/LAPACK calls on contiguous memory, enabling SIMD vectorization. I always use the modern default_rng() API for reproducible random generation, and I choose dtype explicitly when memory matters - float32 for GPU-bound work, float64 for numerical stability."

Shape Manipulation

x = np.arange(24)

# Reshape: total elements must match
a = x.reshape(4, 6) # Shape: (4, 6)
b = x.reshape(2, 3, 4) # Shape: (2, 3, 4)
c = x.reshape(2, -1) # Shape: (2, 12) - infer last dim

# Add/remove dimensions
d = x.reshape(24, 1) # Explicit column vector
e = np.expand_dims(x, axis=0) # Shape: (1, 24) - add batch dim
f = np.expand_dims(x, axis=1) # Shape: (24, 1) - add feature dim
g = np.squeeze(e) # Shape: (24,) - remove size-1 dims

# Transpose
mat = np.arange(6).reshape(2, 3)
mat_T = mat.T # Shape: (3, 2)
# For higher dimensions: use transpose with axis order
tensor = np.arange(24).reshape(2, 3, 4)
tensor_t = tensor.transpose(0, 2, 1) # Shape: (2, 4, 3)

# Stacking
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
np.stack([a, b], axis=0) # Shape: (2, 3) - stack as rows
np.stack([a, b], axis=1) # Shape: (3, 2) - stack as columns
np.concatenate([a, b]) # Shape: (6,) - join along existing axis
Common Trap

reshape returns a view when possible (contiguous memory), but returns a copy when the data is not contiguous. After a transpose, the data is typically not contiguous, so transposed_array.reshape(...) creates a copy. Use .copy() explicitly when you need to guarantee contiguous memory for performance-critical code. Interviewers will ask: "Does reshape always return a view?"

Part 2 - Broadcasting Rules

Broadcasting is the single most important NumPy concept for interviews. It determines how arrays of different shapes interact in element-wise operations.

The Three Rules

NumPy Broadcasting Three Rules

Broadcasting Examples

# Example 1: Scalar + Array
a = np.array([1, 2, 3]) # Shape: (3,)
b = 5 # Shape: ()
# Rule 1: b becomes shape (1,), then (3,) by Rule 3
c = a + b # [6, 7, 8], shape: (3,)

# Example 2: Column + Row = Matrix
col = np.array([[1], [2], [3]]) # Shape: (3, 1)
row = np.array([10, 20, 30]) # Shape: (3,) -> padded to (1, 3)
# Rule 3: col stretches to (3, 3), row stretches to (3, 3)
result = col + row
# [[11, 21, 31],
# [12, 22, 32],
# [13, 23, 33]]

# Example 3: Subtract mean per feature (common ML operation)
X = np.random.randn(100, 5) # Shape: (100, 5) - 100 samples, 5 features
mean = X.mean(axis=0) # Shape: (5,)
X_centered = X - mean # Broadcasting: (100, 5) - (5,) -> (100, 5)

# Example 4: Normalize per channel in an image batch
images = np.random.randn(32, 3, 224, 224) # (batch, channels, H, W)
channel_mean = images.mean(axis=(0, 2, 3)) # Shape: (3,)
# Need to reshape for broadcasting: (3,) -> (1, 3, 1, 1)
channel_mean = channel_mean.reshape(1, 3, 1, 1)
normalized = images - channel_mean # Shape: (32, 3, 224, 224)
Instant Rejection

Never write for i in range(n): result[i] = X[i] - mean when result = X - mean does the same thing via broadcasting. Interviewers interpret Python loops over array elements as a signal that you do not understand vectorized computation. In production ML code, such loops are 100-1000x slower and indicate you will write slow data pipelines.

Broadcasting Compatibility Table

Shape AShape BCompatible?Result ShapeExplanation
(3,)(3,)Yes(3,)Same shape
(3, 4)(4,)Yes(3, 4)(4,) padded to (1, 4), stretched
(3, 4)(3, 1)Yes(3, 4)1 stretched to 4
(3, 4)(3,)NoErrorTrailing dims 4 vs 3, neither is 1
(2, 3, 4)(3, 4)Yes(2, 3, 4)Padded to (1, 3, 4)
(2, 3, 4)(2, 1, 4)Yes(2, 3, 4)Middle dim 1 stretched to 3
(5, 1, 3)(1, 4, 3)Yes(5, 4, 3)Both size-1 dims stretch
(5, 3)(4, 3)NoError5 vs 4, neither is 1

Part 3 - Vectorization vs Loops

The Performance Gap

import time

n = 1_000_000

# Slow: Python loop
def dot_product_loop(a, b):
result = 0.0
for i in range(len(a)):
result += a[i] * b[i]
return result

# Fast: Vectorized
def dot_product_vectorized(a, b):
return np.dot(a, b)

a = np.random.randn(n)
b = np.random.randn(n)

# Loop: ~300ms | Vectorized: ~1ms - 300x faster

Common Vectorization Patterns

# Pattern 1: Conditional assignment
# Slow
for i in range(len(x)):
if x[i] > 0:
result[i] = x[i]
else:
result[i] = 0
# Fast (ReLU activation)
result = np.maximum(x, 0)
# Also: np.where(x > 0, x, 0)

# Pattern 2: Pairwise distances
# Slow: O(n^2) loop
n_samples = 1000
X = np.random.randn(n_samples, 50)
distances = np.zeros((n_samples, n_samples))
# for i in range(n_samples):
# for j in range(n_samples):
# distances[i, j] = np.sqrt(np.sum((X[i] - X[j])**2))

# Fast: Vectorized using broadcasting
# ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a . b
sq_norms = np.sum(X**2, axis=1) # Shape: (n,)
# Broadcasting: (n, 1) + (1, n) - 2 * (n, n)
distances = np.sqrt(sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T)

# Pattern 3: Accumulation with grouping
labels = np.array([0, 1, 0, 2, 1, 0])
values = np.array([10, 20, 30, 40, 50, 60])
# Sum per group using np.bincount
group_sums = np.bincount(labels, weights=values) # [100, 70, 40]

# Pattern 4: Rolling/sliding window
def rolling_mean(x, window):
# Use cumsum trick for O(n) rolling mean
cumsum = np.cumsum(x)
cumsum[window:] = cumsum[window:] - cumsum[:-window]
return cumsum[window - 1:] / window
Company Variation

Google and Meta heavily test vectorization skills because their ML pipelines process billions of examples. They expect you to recognize that a nested Python loop over arrays is unacceptable. Startups are more forgiving - they care more about correctness than performance. Research labs (DeepMind, FAIR) test whether you can express complex mathematical operations (attention, convolution) as matrix operations.

Part 4 - Linear Algebra Operations

Essential Operations for ML

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
v = np.array([1, 2])

# Matrix multiplication (three equivalent ways)
C = A @ B # Preferred syntax (Python 3.5+)
C = np.matmul(A, B) # Function form
C = np.dot(A, B) # Works for 2D, but @ is clearer

# Dot product
d = np.dot(v, v) # Scalar: 5
d = v @ v # Same thing

# Batch matrix multiplication
# Shape: (batch, m, k) @ (batch, k, n) -> (batch, m, n)
batch_A = np.random.randn(32, 4, 3)
batch_B = np.random.randn(32, 3, 5)
batch_C = batch_A @ batch_B # Shape: (32, 4, 5)

# Transpose
A_T = A.T

# Inverse (use sparingly - prefer solve)
A_inv = np.linalg.inv(A)

# Solve Ax = b (numerically stable)
b = np.array([5, 11])
x = np.linalg.solve(A, b) # Much better than A_inv @ b

# Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eigh(A.T @ A) # For symmetric matrices

# SVD
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Norms
l2_norm = np.linalg.norm(v) # L2 norm (default)
l1_norm = np.linalg.norm(v, ord=1) # L1 norm
frob_norm = np.linalg.norm(A, 'fro') # Frobenius norm for matrices

# Determinant and rank
det = np.linalg.det(A)
rank = np.linalg.matrix_rank(A)
Common Trap

Never use np.linalg.inv(A) @ b to solve a linear system. Use np.linalg.solve(A, b) instead. The inverse is numerically unstable and O(n^3) - solve uses LU decomposition which is both faster and more stable. Interviewers specifically check for this. Similarly, never compute A.T @ A then invert it for least squares - use np.linalg.lstsq(A, b).

Part 5 - Advanced Indexing

Boolean Indexing

X = np.array([[1, -2, 3], [4, -5, 6], [-7, 8, -9]])

# Select elements matching a condition
positives = X[X > 0] # [1, 3, 4, 6, 8] - flattened

# Set elements matching a condition
X_clipped = X.copy()
X_clipped[X_clipped < 0] = 0 # ReLU equivalent

# Combine conditions (use & for AND, | for OR, ~ for NOT)
mask = (X > 0) & (X < 5)
selected = X[mask] # [1, 3, 4]

# np.where: conditional element selection
result = np.where(X > 0, X, 0) # ReLU
result = np.where(X > 0, X, -X) # Absolute value

Fancy Indexing

# Index with arrays of indices
arr = np.array([10, 20, 30, 40, 50])
indices = np.array([0, 3, 4])
arr[indices] # [10, 40, 50]

# 2D fancy indexing
mat = np.arange(12).reshape(3, 4)
rows = np.array([0, 2])
cols = np.array([1, 3])
mat[rows, cols] # [1, 11] - elements (0,1) and (2,3)

# Select specific rows
mat[rows] # Shape: (2, 4) - rows 0 and 2

# np.take_along_axis: gather along an axis
# Common in top-k selection, beam search
scores = np.array([[0.1, 0.9, 0.3],
[0.8, 0.1, 0.5]])
top_k_indices = np.argsort(scores, axis=1)[:, -2:] # Top-2 per row
top_k_values = np.take_along_axis(scores, top_k_indices, axis=1)

# np.put_along_axis: scatter along an axis
# Useful for one-hot encoding
n_classes = 5
labels = np.array([0, 3, 1, 4])
one_hot = np.zeros((len(labels), n_classes))
np.put_along_axis(one_hot, labels.reshape(-1, 1), 1, axis=1)

Part 6 - Memory Layout: C vs Fortran Order

How Arrays Are Stored

a = np.arange(6).reshape(2, 3)
# [[0, 1, 2],
# [3, 4, 5]]

# C-order (row-major, default): elements in a row are contiguous
# Memory: [0, 1, 2, 3, 4, 5]
a_c = np.ascontiguousarray(a) # Ensure C-order

# Fortran-order (column-major): elements in a column are contiguous
# Memory: [0, 3, 1, 4, 2, 5]
a_f = np.asfortranarray(a) # Convert to Fortran-order

# Check order
a.flags['C_CONTIGUOUS'] # True (default)
a.flags['F_CONTIGUOUS'] # False

# Strides: bytes to step in each dimension
a.strides # (24, 8) for float64 C-order: 3*8=24 to next row, 8 to next col

Why Memory Layout Matters

n = 10000

# Row-major access (fast for C-order)
a = np.random.randn(n, n)
# Summing along axis=1 (across columns within a row) is fast
# because elements in the same row are contiguous in memory
row_sum = a.sum(axis=1) # Fast: sequential memory access

# Column-major access (slow for C-order)
col_sum = a.sum(axis=0) # Slower: strided memory access (cache misses)
Interviewer's Perspective

When I ask about memory layout, I am testing whether the candidate understands performance beyond algorithmic complexity. A candidate who knows that sum(axis=1) is faster than sum(axis=0) for row-major arrays - and can explain why in terms of cache lines - demonstrates systems-level understanding. This matters for production ML code where data loading and preprocessing are often the bottleneck.

Views vs Copies

a = np.arange(10)

# Views: share the same memory (modifications propagate)
b = a[2:5] # Slice -> view
b[0] = 99 # a[2] is now 99
c = a.reshape(2, 5) # Reshape -> view (if contiguous)

# Copies: independent memory
d = a.copy() # Explicit copy
e = a[[0, 1, 2]] # Fancy indexing -> always a copy
f = a[a > 5] # Boolean indexing -> always a copy

# How to check
np.shares_memory(a, b) # True (view)
np.shares_memory(a, d) # False (copy)

Part 7 - Random Sampling for ML

rng = np.random.default_rng(42)

# Common distributions for ML
normal = rng.normal(loc=0, scale=1, size=(100, 10)) # Weight init
uniform = rng.uniform(0, 1, size=(100,)) # Dropout masks
bernoulli = rng.binomial(1, p=0.5, size=(100,)) # Binary masks

# Sampling without replacement (train/test split)
n_samples = 1000
indices = np.arange(n_samples)
rng.shuffle(indices)
train_idx = indices[:800]
test_idx = indices[800:]

# Or use choice
train_idx = rng.choice(n_samples, size=800, replace=False)

# Xavier/Glorot initialization
fan_in, fan_out = 256, 128
std = np.sqrt(2.0 / (fan_in + fan_out))
weights = rng.normal(0, std, size=(fan_in, fan_out))

# He initialization (for ReLU networks)
std = np.sqrt(2.0 / fan_in)
weights = rng.normal(0, std, size=(fan_in, fan_out))

Practice Problems

Problem 1: Implement Softmax

Implement the softmax function for a 2D input (batch of logits). Handle numerical stability.

Input: logits with shape (batch_size, num_classes) Output: probabilities with shape (batch_size, num_classes), rows sum to 1

Hint 1 - Direction

The naive formula is exp(x) / sum(exp(x)), but this overflows for large values. Subtract the maximum value first - this does not change the result mathematically but prevents overflow.

Hint 2 - Broadcasting

Use keepdims=True when computing the max and sum so that broadcasting works correctly along the class axis.

Hint 3 - Full Solution
def softmax(logits):
"""
Numerically stable softmax for a batch of logits.

Args:
logits: np.ndarray of shape (batch_size, num_classes)
Returns:
np.ndarray of shape (batch_size, num_classes), rows sum to 1
"""
# Subtract max for numerical stability (prevents exp overflow)
# keepdims=True ensures shape (batch_size, 1) for broadcasting
shifted = logits - logits.max(axis=1, keepdims=True)

# Exponentiate
exp_shifted = np.exp(shifted)

# Normalize: divide by sum along class axis
return exp_shifted / exp_shifted.sum(axis=1, keepdims=True)

# Test
logits = np.array([[1.0, 2.0, 3.0],
[1000.0, 1001.0, 1002.0]]) # Large values
probs = softmax(logits)
print(probs)
# [[0.0900, 0.2447, 0.6652],
# [0.0900, 0.2447, 0.6652]]
print(probs.sum(axis=1)) # [1.0, 1.0]

Why the max trick works: softmax(x) = softmax(x - c) for any constant c. Subtracting the max ensures all exponents are <= 0, preventing overflow. The largest exponent is exp(0) = 1.

Scoring:

  • Strong Hire: Implements with numerical stability, correct axis handling, and keepdims. Explains why the max trick works mathematically.
  • Lean Hire: Correct formula but forgets numerical stability or has axis errors.
  • No Hire: Uses a loop over the batch dimension.

Problem 2: Implement Batch Normalization

Implement batch normalization for a 2D input (batch of feature vectors).

Input: x shape (batch_size, features), gamma shape (features,), beta shape (features,), eps=1e-5 Output: normalized, scaled, and shifted output of same shape

Hint 1 - Direction

Batch norm: normalize each feature across the batch (axis=0), then scale by gamma and shift by beta.

Hint 2 - Formula

Formula: y = gamma * (x - mean) / sqrt(var + eps) + beta where mean and var are computed per feature across the batch.

Hint 3 - Full Solution
def batch_norm(x, gamma, beta, eps=1e-5):
"""
Batch normalization for 2D input.

Args:
x: np.ndarray of shape (batch_size, features)
gamma: np.ndarray of shape (features,) - scale parameter
beta: np.ndarray of shape (features,) - shift parameter
eps: small constant for numerical stability
Returns:
Normalized output of shape (batch_size, features)
Cache tuple for backward pass: (x_norm, std, x_centered)
"""
# Step 1: Compute per-feature mean across batch
mean = x.mean(axis=0) # Shape: (features,)

# Step 2: Center the data
x_centered = x - mean # Broadcasting: (batch, features) - (features,)

# Step 3: Compute per-feature variance
var = (x_centered ** 2).mean(axis=0) # Shape: (features,)

# Step 4: Normalize
std = np.sqrt(var + eps)
x_norm = x_centered / std # Broadcasting: (batch, features) / (features,)

# Step 5: Scale and shift
out = gamma * x_norm + beta # Broadcasting with (features,)

return out, (x_norm, std, x_centered)

# Test
batch_size, features = 64, 128
x = np.random.randn(batch_size, features) * 5 + 3 # Arbitrary scale and shift
gamma = np.ones(features)
beta = np.zeros(features)

out, cache = batch_norm(x, gamma, beta)
print(f"Output mean per feature (should be ~0): {out.mean(axis=0)[:5]}")
print(f"Output var per feature (should be ~1): {out.var(axis=0)[:5]}")

Scoring:

  • Strong Hire: Correct formula, proper axis usage, returns cache for backward pass. Can extend to 4D (conv layers).
  • Lean Hire: Correct formula but computes mean/var along wrong axis.
  • No Hire: Uses loops or cannot handle the broadcasting.

Problem 3: Cosine Similarity Matrix

Compute the pairwise cosine similarity between all pairs of vectors in a matrix.

Input: X with shape (n_samples, n_features) Output: similarity with shape (n_samples, n_samples)

Hint 1 - Direction

Cosine similarity = dot(a, b) / (norm(a) * norm(b)). For all pairs, the numerator is X @ X.T.

Hint 2 - Norms

Compute norms as a column vector, then use broadcasting for the denominator: norms[:, None] * norms[None, :].

Hint 3 - Full Solution
def cosine_similarity_matrix(X):
"""
Compute pairwise cosine similarity.

Args:
X: np.ndarray of shape (n_samples, n_features)
Returns:
np.ndarray of shape (n_samples, n_samples) with similarities in [-1, 1]
"""
# Step 1: Compute all pairwise dot products
dot_products = X @ X.T # Shape: (n, n)

# Step 2: Compute L2 norms
norms = np.linalg.norm(X, axis=1) # Shape: (n,)

# Step 3: Outer product of norms for denominator
# Broadcasting: (n, 1) * (1, n) -> (n, n)
norm_products = norms[:, None] * norms[None, :]

# Step 4: Divide (handle zero norms)
similarity = np.divide(dot_products, norm_products,
out=np.zeros_like(dot_products),
where=norm_products != 0)

return similarity

# Test
X = np.array([[1, 0, 0],
[0, 1, 0],
[1, 1, 0]])
sim = cosine_similarity_matrix(X)
print(sim)
# [[1. , 0. , 0.707],
# [0. , 1. , 0.707],
# [0.707, 0.707, 1. ]]

Scoring:

  • Strong Hire: Vectorized solution with zero-norm handling. Can discuss when to use cosine vs Euclidean distance.
  • Lean Hire: Correct but uses a loop over pairs.
  • No Hire: Cannot vectorize the computation.

Problem 4: Implement Cross-Entropy Loss

Compute the categorical cross-entropy loss for a batch of predictions.

Input: y_pred shape (batch_size, num_classes) - predicted probabilities, y_true shape (batch_size,) - integer labels Output: scalar loss value

Hint 1 - Direction

Cross-entropy: -log(p_correct_class). Use fancy indexing to select the predicted probability for the true class.

Hint 2 - Numerical stability

Clip predictions to avoid log(0). Use np.clip(y_pred, eps, 1 - eps).

Hint 3 - Full Solution
def cross_entropy_loss(y_pred, y_true, eps=1e-12):
"""
Categorical cross-entropy loss.

Args:
y_pred: np.ndarray of shape (batch_size, num_classes) - probabilities
y_true: np.ndarray of shape (batch_size,) - integer class labels
Returns:
Scalar loss value
"""
batch_size = y_pred.shape[0]

# Clip to avoid log(0)
y_pred_clipped = np.clip(y_pred, eps, 1 - eps)

# Select the predicted probability for the true class
# Using advanced indexing: for each sample i, get y_pred[i, y_true[i]]
correct_probs = y_pred_clipped[np.arange(batch_size), y_true]

# Compute negative log likelihood and average over batch
loss = -np.mean(np.log(correct_probs))

return loss

# Test
y_pred = np.array([[0.7, 0.2, 0.1],
[0.1, 0.8, 0.1],
[0.2, 0.3, 0.5]])
y_true = np.array([0, 1, 2])
loss = cross_entropy_loss(y_pred, y_true)
print(f"Loss: {loss:.4f}") # ~0.3567

Key insight: The line y_pred_clipped[np.arange(batch_size), y_true] is fancy indexing - it selects one element per row using the label as the column index. No loops needed.

Scoring:

  • Strong Hire: Correct with clipping, fancy indexing, and can derive the gradient.
  • Lean Hire: Correct but uses one-hot encoding instead of fancy indexing (wasteful).
  • No Hire: Uses a loop over the batch.

Problem 5: Pairwise Euclidean Distance

Compute all pairwise Euclidean distances between two sets of points without any loops.

Input: X shape (n, d), Y shape (m, d) Output: distances shape (n, m)

Hint 1 - Direction

Use the identity: ||a - b||^2 = ||a||^2 + ||b||^2 - 2ab. This avoids the O(nmd) explicit difference.

Hint 2 - Broadcasting

||X||^2 has shape (n,), ||Y||^2 has shape (m,). Reshape to (n, 1) and (1, m) for broadcasting, then add the cross term X @ Y.T.

Hint 3 - Full Solution
def pairwise_euclidean(X, Y):
"""
Compute pairwise Euclidean distances between X and Y.

Args:
X: np.ndarray of shape (n, d)
Y: np.ndarray of shape (m, d)
Returns:
np.ndarray of shape (n, m) with distances
"""
# ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * x . y
X_sq = np.sum(X ** 2, axis=1)[:, None] # Shape: (n, 1)
Y_sq = np.sum(Y ** 2, axis=1)[None, :] # Shape: (1, m)
cross = X @ Y.T # Shape: (n, m)

# Clamp to avoid negative values from floating point errors
sq_dist = np.maximum(X_sq + Y_sq - 2 * cross, 0.0)

return np.sqrt(sq_dist)

# Test
X = np.array([[0, 0], [1, 0], [0, 1]])
Y = np.array([[1, 1], [2, 2]])
print(pairwise_euclidean(X, Y))
# [[1.414, 2.828],
# [1.000, 2.236],
# [1.000, 2.236]]

Why clamp? Floating point arithmetic can produce small negative values in X_sq + Y_sq - 2 * cross when points are very close. Without clamping, np.sqrt would produce NaN.

Scoring:

  • Strong Hire: Uses the expansion trick, handles numerical issues, discusses complexity O(nm + nd + m*d).
  • Lean Hire: Correct but uses explicit subtraction with broadcasting: np.sqrt(((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)) - this works but creates an O(nmd) intermediate array.
  • No Hire: Uses loops.

Problem 6: Implement k-Nearest Neighbors Prediction

Given training data, labels, and a query point, predict the label using k-NN. Use NumPy only.

Hint 1 - Direction

Compute distances from the query to all training points, find the k smallest, and take a majority vote.

Hint 2 - Useful functions

Use np.argpartition for O(n) partial sort instead of O(n log n) full sort. Use np.bincount for majority vote.

Hint 3 - Full Solution
def knn_predict(X_train, y_train, X_query, k=5):
"""
k-NN prediction for multiple query points.

Args:
X_train: np.ndarray of shape (n_train, d)
y_train: np.ndarray of shape (n_train,) - integer labels
X_query: np.ndarray of shape (n_query, d)
k: number of neighbors
Returns:
predictions: np.ndarray of shape (n_query,)
"""
# Step 1: Compute all pairwise distances
# Using the expansion trick from Problem 5
X_sq = np.sum(X_train ** 2, axis=1)[None, :] # (1, n_train)
Q_sq = np.sum(X_query ** 2, axis=1)[:, None] # (n_query, 1)
cross = X_query @ X_train.T # (n_query, n_train)
distances = np.sqrt(np.maximum(Q_sq + X_sq - 2 * cross, 0.0))

# Step 2: Find k nearest neighbors (O(n) with argpartition)
# argpartition gives indices of k smallest elements (unordered)
knn_indices = np.argpartition(distances, k, axis=1)[:, :k]

# Step 3: Get labels of k nearest neighbors
knn_labels = y_train[knn_indices] # Shape: (n_query, k)

# Step 4: Majority vote per query
n_classes = y_train.max() + 1
predictions = np.zeros(X_query.shape[0], dtype=int)
for i in range(X_query.shape[0]):
counts = np.bincount(knn_labels[i], minlength=n_classes)
predictions[i] = np.argmax(counts)

return predictions

# Test
X_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_query = np.array([[0.5, 0.5], [5.5, 5.5]])
print(knn_predict(X_train, y_train, X_query, k=3)) # [0, 1]

Note: The loop in Step 4 is acceptable because it is over queries (often small), not over features or training samples. A fully vectorized version would use np.apply_along_axis or a sparse matrix trick, but the loop here is pragmatic.

Scoring:

  • Strong Hire: Uses argpartition instead of argsort. Handles ties. Discusses time complexity.
  • Lean Hire: Correct but uses full sort (argsort).
  • No Hire: Cannot compute distances without loops.

Problem 7: Implement Matrix Factorization (Truncated SVD)

Implement dimensionality reduction using SVD. Given a data matrix, return the k-dimensional representation.

Hint 1 - Direction

Center the data, compute SVD, and project onto the top-k right singular vectors.

Hint 2 - The projection

After SVD: X = U @ diag(S) @ Vt. The k-dimensional projection is X @ Vt[:k].T, or equivalently U[:, :k] @ diag(S[:k]).

Hint 3 - Full Solution
def truncated_svd(X, k):
"""
Dimensionality reduction via truncated SVD.

Args:
X: np.ndarray of shape (n_samples, n_features)
k: number of components to keep
Returns:
X_reduced: np.ndarray of shape (n_samples, k)
explained_variance_ratio: np.ndarray of shape (k,)
components: np.ndarray of shape (k, n_features) - the projection directions
"""
# Step 1: Center the data
mean = X.mean(axis=0)
X_centered = X - mean

# Step 2: Compute SVD
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Step 3: Truncate to k components
U_k = U[:, :k] # Shape: (n, k)
S_k = S[:k] # Shape: (k,)
Vt_k = Vt[:k, :] # Shape: (k, d)

# Step 4: Project (two equivalent ways)
X_reduced = U_k * S_k # Shape: (n, k) - element-wise multiply
# Equivalent: X_reduced = X_centered @ Vt_k.T

# Step 5: Explained variance ratio
total_variance = np.sum(S ** 2) / (X.shape[0] - 1)
explained_variance = S_k ** 2 / (X.shape[0] - 1)
explained_variance_ratio = explained_variance / total_variance

return X_reduced, explained_variance_ratio, Vt_k

# Test
X = np.random.randn(100, 10)
X_reduced, evr, components = truncated_svd(X, k=3)
print(f"Reduced shape: {X_reduced.shape}") # (100, 3)
print(f"Explained variance ratios: {evr}")
print(f"Total explained: {evr.sum():.2%}")

Scoring:

  • Strong Hire: Centers data first, computes explained variance ratio, discusses connection to PCA. Mentions that full_matrices=False is important for memory when n >> d.
  • Lean Hire: Correct SVD but forgets to center data.
  • No Hire: Cannot implement without sklearn.

Problem 8: Implement Attention Scores

Compute scaled dot-product attention scores for a batch of queries and keys.

Input: Q shape (batch, seq_len, d_model), K shape (batch, seq_len, d_model) Output: attention_weights shape (batch, seq_len, seq_len)

Hint 1 - Direction

Attention(Q, K) = softmax(Q @ K^T / sqrt(d_model)). Use batch matrix multiplication.

Hint 2 - Masking

For causal attention, add a mask of -inf to the upper triangle before softmax.

Hint 3 - Full Solution
def scaled_dot_product_attention(Q, K, V, mask=None):
"""
Scaled dot-product attention.

Args:
Q: np.ndarray of shape (batch, seq_len_q, d_k)
K: np.ndarray of shape (batch, seq_len_k, d_k)
V: np.ndarray of shape (batch, seq_len_k, d_v)
mask: optional bool array, True = masked (set to -inf)
Returns:
output: np.ndarray of shape (batch, seq_len_q, d_v)
weights: np.ndarray of shape (batch, seq_len_q, seq_len_k)
"""
d_k = Q.shape[-1]

# Step 1: Compute raw scores
# Batch matmul: (batch, seq_q, d_k) @ (batch, d_k, seq_k) -> (batch, seq_q, seq_k)
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)

# Step 2: Apply mask (for causal/padding attention)
if mask is not None:
scores = np.where(mask, -1e9, scores)

# Step 3: Softmax along last axis (key dimension)
shifted = scores - scores.max(axis=-1, keepdims=True)
exp_scores = np.exp(shifted)
weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

# Step 4: Weighted sum of values
output = weights @ V # (batch, seq_q, seq_k) @ (batch, seq_k, d_v)

return output, weights

# Test with causal mask
batch, seq_len, d_k, d_v = 2, 4, 8, 8
Q = np.random.randn(batch, seq_len, d_k)
K = np.random.randn(batch, seq_len, d_k)
V = np.random.randn(batch, seq_len, d_v)

# Causal mask: prevent attending to future positions
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
output, weights = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(f"Output shape: {output.shape}") # (2, 4, 8)
print(f"Weights shape: {weights.shape}") # (2, 4, 4)
print(f"Weights row sums: {weights.sum(axis=-1)}") # All 1.0

Scoring:

  • Strong Hire: Implements with scaling, masking, numerical stability. Explains why scaling by sqrt(d_k) prevents softmax saturation. Can extend to multi-head attention.
  • Lean Hire: Correct basic implementation but misses scaling or masking.
  • No Hire: Cannot express attention as matrix operations.

Problem 9: Implement Batch Matrix Operations for Multi-Head Attention

Split an input into multiple attention heads, process each, and concatenate results.

Hint 1 - Direction

Reshape (batch, seq, d_model) into (batch, n_heads, seq, d_head), then use the attention function from Problem 8.

Hint 2 - Reshape trick

x.reshape(batch, seq, n_heads, d_head).transpose(0, 2, 1, 3) gives shape (batch, n_heads, seq, d_head).

Hint 3 - Full Solution
def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads, mask=None):
"""
Multi-head attention.

Args:
X: np.ndarray of shape (batch, seq_len, d_model)
W_Q, W_K, W_V: np.ndarray of shape (d_model, d_model)
W_O: np.ndarray of shape (d_model, d_model) - output projection
n_heads: int
mask: optional
Returns:
output: np.ndarray of shape (batch, seq_len, d_model)
"""
batch, seq_len, d_model = X.shape
d_head = d_model // n_heads

# Step 1: Project to Q, K, V
Q = X @ W_Q # (batch, seq, d_model)
K = X @ W_K
V = X @ W_V

# Step 2: Reshape into heads
# (batch, seq, d_model) -> (batch, seq, n_heads, d_head) -> (batch, n_heads, seq, d_head)
Q = Q.reshape(batch, seq_len, n_heads, d_head).transpose(0, 2, 1, 3)
K = K.reshape(batch, seq_len, n_heads, d_head).transpose(0, 2, 1, 3)
V = V.reshape(batch, seq_len, n_heads, d_head).transpose(0, 2, 1, 3)

# Merge batch and head dims for attention: (batch * n_heads, seq, d_head)
Q = Q.reshape(batch * n_heads, seq_len, d_head)
K = K.reshape(batch * n_heads, seq_len, d_head)
V = V.reshape(batch * n_heads, seq_len, d_head)

# Step 3: Apply scaled dot-product attention
attn_output, attn_weights = scaled_dot_product_attention(Q, K, V, mask)

# Step 4: Reshape back: (batch * n_heads, seq, d_head) -> (batch, seq, d_model)
attn_output = attn_output.reshape(batch, n_heads, seq_len, d_head)
attn_output = attn_output.transpose(0, 2, 1, 3) # (batch, seq, n_heads, d_head)
attn_output = attn_output.reshape(batch, seq_len, d_model) # Concatenate heads

# Step 5: Output projection
output = attn_output @ W_O

return output

# Test
batch, seq_len, d_model, n_heads = 2, 8, 64, 8
X = np.random.randn(batch, seq_len, d_model)
W_Q = np.random.randn(d_model, d_model) * 0.02
W_K = np.random.randn(d_model, d_model) * 0.02
W_V = np.random.randn(d_model, d_model) * 0.02
W_O = np.random.randn(d_model, d_model) * 0.02

output = multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads)
print(f"Output shape: {output.shape}") # (2, 8, 64)

Scoring:

  • Strong Hire: Correct reshape/transpose sequence, can explain why we split into heads (different representation subspaces), discusses parameter count.
  • Lean Hire: Correct but uses loops over heads.
  • No Hire: Cannot manage the 4D tensor manipulations.

Problem 10: Implement Conv2D Forward Pass

Implement a basic 2D convolution forward pass using NumPy (im2col approach).

Hint 1 - Direction

The im2col trick converts convolution to matrix multiplication: extract all patches into columns, then multiply by the flattened kernel.

Hint 2 - Patch extraction

Use stride tricks or nested indexing to extract patches. The output spatial size is (H - kH + 2*pad) // stride + 1.

Hint 3 - Full Solution
def conv2d_forward(x, kernel, bias=None, stride=1, padding=0):
"""
2D convolution forward pass using im2col.

Args:
x: np.ndarray of shape (batch, in_channels, H, W)
kernel: np.ndarray of shape (out_channels, in_channels, kH, kW)
bias: optional np.ndarray of shape (out_channels,)
stride: int
padding: int
Returns:
output: np.ndarray of shape (batch, out_channels, H_out, W_out)
"""
batch, in_ch, H, W = x.shape
out_ch, _, kH, kW = kernel.shape

# Step 1: Pad input if needed
if padding > 0:
x = np.pad(x, ((0,0), (0,0), (padding,padding), (padding,padding)))
_, _, H, W = x.shape

# Step 2: Compute output dimensions
H_out = (H - kH) // stride + 1
W_out = (W - kW) // stride + 1

# Step 3: im2col - extract patches as columns
# Each patch is (in_ch * kH * kW) elements
# Total patches per image: H_out * W_out
cols = np.zeros((batch, in_ch * kH * kW, H_out * W_out))

col_idx = 0
for i in range(H_out):
for j in range(W_out):
h_start = i * stride
w_start = j * stride
patch = x[:, :, h_start:h_start+kH, w_start:w_start+kW]
cols[:, :, col_idx] = patch.reshape(batch, -1)
col_idx += 1

# Step 4: Reshape kernel: (out_ch, in_ch*kH*kW)
kernel_flat = kernel.reshape(out_ch, -1)

# Step 5: Matrix multiply: (out_ch, in_ch*kH*kW) @ (batch, in_ch*kH*kW, patches)
# Use einsum for batch matmul
output = np.einsum('oi,bio->bo', kernel_flat, cols)
# Reshape: actually we need per-sample:
output = np.zeros((batch, out_ch, H_out * W_out))
for b_idx in range(batch):
output[b_idx] = kernel_flat @ cols[b_idx]

output = output.reshape(batch, out_ch, H_out, W_out)

# Step 6: Add bias
if bias is not None:
output += bias.reshape(1, -1, 1, 1) # Broadcasting

return output

# Test
batch, in_ch, H, W = 2, 3, 8, 8
out_ch, kH, kW = 16, 3, 3
x = np.random.randn(batch, in_ch, H, W)
kernel = np.random.randn(out_ch, in_ch, kH, kW) * 0.01
bias = np.zeros(out_ch)

output = conv2d_forward(x, kernel, bias, stride=1, padding=1)
print(f"Output shape: {output.shape}") # (2, 16, 8, 8) with padding=1

Note: The im2col approach trades memory for speed - the column matrix can be large. Production implementations use optimized GEMM libraries. The loops here are over spatial positions (small), not over batch or channels.

Scoring:

  • Strong Hire: Implements im2col correctly, discusses the memory-speed tradeoff, mentions that cuDNN uses different algorithms (Winograd, FFT).
  • Lean Hire: Correct but uses 6 nested loops (batch, out_ch, H_out, W_out, kH, kW).
  • No Hire: Cannot implement convolution at all.

Interview Cheat Sheet

ConceptKey PatternOne-LinerRed Flag
Array creationnp.zeros, np.eye, default_rng()Fixed-type contiguous memory blocksUsing Python lists for numerical computation
BroadcastingShapes align from the right, size-1 stretchesEliminates loops for element-wise opsCannot state the three rules
VectorizationReplace loops with array operations100-1000x speedup over Python loopsWriting for i in range(n) over array elements
Matrix multiplyA @ B or np.matmulUse @ operator, not np.dot for clarityUsing np.dot for batch matmul
Solving Ax=bnp.linalg.solve(A, b)Never invert A explicitlyWriting np.linalg.inv(A) @ b
Fancy indexinga[indices], a[np.arange(n), labels]Select arbitrary elements without loopsNot knowing that fancy indexing copies
Boolean indexinga[a > 0], np.where(cond, x, y)Conditional selection/assignmentUsing loops for conditional operations
Memory layoutC-order (row-major) vs F-order (col-major)Determines which axis iterations are fastNot knowing views vs copies
Numerical stabilitySubtract max before exp, clip before logPrevent overflow/underflow in softmax/CEImplementing naive softmax that overflows
Partial sortnp.argpartition(a, k)O(n) for top-k instead of O(n log n) sortUsing full sort for top-k selection

Spaced Repetition Checkpoints

Day 0 - Initial Learning

  • Read this entire page and run all code examples
  • Implement softmax and batch norm from memory
  • State the three broadcasting rules without looking
  • Complete the self-assessment

Day 3 - First Recall

  • Implement cosine similarity matrix and cross-entropy loss from scratch
  • Explain views vs copies and when each occurs
  • Solve Problem 5 (pairwise distance) without hints

Day 7 - Connections

  • Implement k-NN prediction using vectorized distance computation
  • Implement truncated SVD and explain its connection to PCA
  • Write scaled dot-product attention from memory

Day 14 - Application

  • Implement multi-head attention from scratch under timed conditions (20 min)
  • Implement a conv2d forward pass from memory
  • Explain memory layout implications for ML data loading

Day 21 - Mock Interview

  • Have someone pick 3 problems randomly - solve each in under 10 minutes
  • Explain broadcasting rules, then demonstrate with 3 examples on a whiteboard
  • Implement batch normalization for 4D tensors (batch, channels, H, W) from memory

Key Takeaways

  1. Broadcasting is the foundation. Every vectorized ML operation - normalization, distance computation, attention - relies on broadcasting. Master the three rules and you can vectorize anything.

  2. Vectorization is not optional. In AI interviews, a loop over array elements is treated as a correctness error, not just a style issue. Interviewers interpret it as a signal that you will write slow production code.

  3. Numerical stability is expected. Softmax, cross-entropy, and distance computations all have well-known numerical failure modes. Production-ready implementations handle them. Interview implementations should too.

  4. Memory awareness separates senior from junior. Understanding views vs copies, C-order vs Fortran-order, and the memory implications of operations like im2col demonstrates systems-level thinking that senior ML roles require.

Next Steps

Continue to Pandas for Interviews to master data manipulation and feature engineering - the other half of the numerical computing interview.

© 2026 EngineersOfAI. All rights reserved.