NumPy for Interviews - Vectorized Thinking for ML Engineers

Reading time: ~45 min | Interview relevance: Critical | Roles: MLE, AI Eng, Data Scientist, Research Engineer

The Real Interview Moment

You are thirty minutes into a Meta MLE phone screen. The interviewer shares a Colab notebook and says: "Implement batch normalization using only NumPy. No loops." You stare at the cell. You know the formula - subtract mean, divide by standard deviation, scale and shift. But the input is a 4D tensor of shape (batch, channels, height, width), and you need to normalize per channel. Your fingers hover over the keyboard. Do you reach for a for-loop over channels, or do you see the axis parameter, the keepdims trick, the broadcasting shape?

This is the moment that separates candidates who use NumPy from candidates who think in NumPy. Every AI interview involving numerical computation will test whether you can express mathematical operations as vectorized array manipulations - without loops, without copies, without wasted memory. The interviewer is not testing whether you memorized the API. They are testing whether you understand how data lives in memory and how to transform it efficiently.

Candidates who write for i in range(batch_size) get a polite "we'll get back to you." Candidates who write (x - x.mean(axis=(0,2,3), keepdims=True)) / (x.std(axis=(0,2,3), keepdims=True) + eps) in one line get a "strong hire." This page gives you everything you need to be in the second group.

What You Will Master

Create arrays efficiently using arange, linspace, zeros, ones, eye, and random generators
Manipulate shapes with reshape, transpose, expand_dims, squeeze, and stacking operations
Apply broadcasting rules to eliminate loops from any element-wise computation
Vectorize loop-based code into array operations with 100x speedups
Perform linear algebra operations: dot products, matrix multiplication, eigendecomposition, SVD
Use advanced indexing: boolean masks, fancy indexing, np.where, np.take_along_axis
Understand memory layout: C-order vs Fortran-order, strides, views vs copies
Solve 10 interview-grade NumPy problems from scratch

Self-Assessment: Where Are You Now?

Skill	1 - Cannot	2 - Vaguely	3 - Can Do	4 - Can Optimize	5 - Can Teach	Your Score
Create arrays (zeros, eye, random)						___
Reshape and transpose correctly						___
Apply broadcasting rules						___
Vectorize a loop-based computation						___
Matrix multiplication (@ operator)						___
Boolean and fancy indexing						___
Explain memory layout and strides						___
Implement ML operations without loops						___

Target: All 4s and 5s before your interview.

Part 1 - Array Creation and Manipulation

Essential Array Creation

import numpy as np

# From data
a = np.array([1, 2, 3])                    # 1D array
b = np.array([[1, 2], [3, 4]])              # 2D array (2x2 matrix)

# Filled arrays
zeros = np.zeros((3, 4))                    # 3x4 of zeros
ones = np.ones((2, 3, 4))                   # 2x3x4 of ones
full = np.full((3, 3), 7.0)                 # 3x3 filled with 7.0
eye = np.eye(4)                             # 4x4 identity matrix

# Sequences
seq = np.arange(0, 10, 2)                   # [0, 2, 4, 6, 8]
lin = np.linspace(0, 1, 5)                  # [0.0, 0.25, 0.5, 0.75, 1.0]
log = np.logspace(0, 3, 4)                  # [1, 10, 100, 1000]

# Random (modern API)
rng = np.random.default_rng(42)
uniform = rng.uniform(0, 1, size=(3, 3))    # Uniform [0, 1)
normal = rng.standard_normal((3, 3))        # Standard normal
integers = rng.integers(0, 10, size=(5,))   # Random ints [0, 10)

60-Second Answer

"NumPy arrays are fixed-type, contiguous memory blocks - unlike Python lists, which are arrays of pointers to scattered heap objects. This is why NumPy is fast: operations map to BLAS/LAPACK calls on contiguous memory, enabling SIMD vectorization. I always use the modern default_rng() API for reproducible random generation, and I choose dtype explicitly when memory matters - float32 for GPU-bound work, float64 for numerical stability."

Shape Manipulation

x = np.arange(24)

# Reshape: total elements must match
a = x.reshape(4, 6)        # Shape: (4, 6)
b = x.reshape(2, 3, 4)     # Shape: (2, 3, 4)
c = x.reshape(2, -1)       # Shape: (2, 12) - infer last dim

# Add/remove dimensions
d = x.reshape(24, 1)             # Explicit column vector
e = np.expand_dims(x, axis=0)    # Shape: (1, 24) - add batch dim
f = np.expand_dims(x, axis=1)    # Shape: (24, 1) - add feature dim
g = np.squeeze(e)                # Shape: (24,) - remove size-1 dims

# Transpose
mat = np.arange(6).reshape(2, 3)
mat_T = mat.T                    # Shape: (3, 2)
# For higher dimensions: use transpose with axis order
tensor = np.arange(24).reshape(2, 3, 4)
tensor_t = tensor.transpose(0, 2, 1)  # Shape: (2, 4, 3)

# Stacking
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
np.stack([a, b], axis=0)     # Shape: (2, 3) - stack as rows
np.stack([a, b], axis=1)     # Shape: (3, 2) - stack as columns
np.concatenate([a, b])       # Shape: (6,) - join along existing axis

Common Trap

reshape returns a view when possible (contiguous memory), but returns a copy when the data is not contiguous. After a transpose, the data is typically not contiguous, so transposed_array.reshape(...) creates a copy. Use .copy() explicitly when you need to guarantee contiguous memory for performance-critical code. Interviewers will ask: "Does reshape always return a view?"

Part 2 - Broadcasting Rules

Broadcasting is the single most important NumPy concept for interviews. It determines how arrays of different shapes interact in element-wise operations.

The Three Rules

NumPy Broadcasting Three Rules

Broadcasting Examples

# Example 1: Scalar + Array
a = np.array([1, 2, 3])       # Shape: (3,)
b = 5                          # Shape: ()
# Rule 1: b becomes shape (1,), then (3,) by Rule 3
c = a + b                     # [6, 7, 8], shape: (3,)

# Example 2: Column + Row = Matrix
col = np.array([[1], [2], [3]])   # Shape: (3, 1)
row = np.array([10, 20, 30])      # Shape: (3,) -> padded to (1, 3)
# Rule 3: col stretches to (3, 3), row stretches to (3, 3)
result = col + row
# [[11, 21, 31],
#  [12, 22, 32],
#  [13, 23, 33]]

# Example 3: Subtract mean per feature (common ML operation)
X = np.random.randn(100, 5)      # Shape: (100, 5) - 100 samples, 5 features
mean = X.mean(axis=0)             # Shape: (5,)
X_centered = X - mean             # Broadcasting: (100, 5) - (5,) -> (100, 5)

# Example 4: Normalize per channel in an image batch
images = np.random.randn(32, 3, 224, 224)   # (batch, channels, H, W)
channel_mean = images.mean(axis=(0, 2, 3))   # Shape: (3,)
# Need to reshape for broadcasting: (3,) -> (1, 3, 1, 1)
channel_mean = channel_mean.reshape(1, 3, 1, 1)
normalized = images - channel_mean            # Shape: (32, 3, 224, 224)

Instant Rejection

Never write for i in range(n): result[i] = X[i] - mean when result = X - mean does the same thing via broadcasting. Interviewers interpret Python loops over array elements as a signal that you do not understand vectorized computation. In production ML code, such loops are 100-1000x slower and indicate you will write slow data pipelines.

Broadcasting Compatibility Table

Shape A	Shape B	Compatible?	Result Shape	Explanation
(3,)	(3,)	Yes	(3,)	Same shape
(3, 4)	(4,)	Yes	(3, 4)	(4,) padded to (1, 4), stretched
(3, 4)	(3, 1)	Yes	(3, 4)	1 stretched to 4
(3, 4)	(3,)	No	Error	Trailing dims 4 vs 3, neither is 1
(2, 3, 4)	(3, 4)	Yes	(2, 3, 4)	Padded to (1, 3, 4)
(2, 3, 4)	(2, 1, 4)	Yes	(2, 3, 4)	Middle dim 1 stretched to 3
(5, 1, 3)	(1, 4, 3)	Yes	(5, 4, 3)	Both size-1 dims stretch
(5, 3)	(4, 3)	No	Error	5 vs 4, neither is 1

Part 3 - Vectorization vs Loops

The Performance Gap

import time

n = 1_000_000

# Slow: Python loop
def dot_product_loop(a, b):
    result = 0.0
    for i in range(len(a)):
        result += a[i] * b[i]
    return result

# Fast: Vectorized
def dot_product_vectorized(a, b):
    return np.dot(a, b)

a = np.random.randn(n)
b = np.random.randn(n)

# Loop: ~300ms | Vectorized: ~1ms - 300x faster

Common Vectorization Patterns

# Pattern 1: Conditional assignment
# Slow
for i in range(len(x)):
    if x[i] > 0:
        result[i] = x[i]
    else:
        result[i] = 0
# Fast (ReLU activation)
result = np.maximum(x, 0)
# Also: np.where(x > 0, x, 0)

# Pattern 2: Pairwise distances
# Slow: O(n^2) loop
n_samples = 1000
X = np.random.randn(n_samples, 50)
distances = np.zeros((n_samples, n_samples))
# for i in range(n_samples):
#     for j in range(n_samples):
#         distances[i, j] = np.sqrt(np.sum((X[i] - X[j])**2))

# Fast: Vectorized using broadcasting
# ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a . b
sq_norms = np.sum(X**2, axis=1)  # Shape: (n,)
# Broadcasting: (n, 1) + (1, n) - 2 * (n, n)
distances = np.sqrt(sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T)

# Pattern 3: Accumulation with grouping
labels = np.array([0, 1, 0, 2, 1, 0])
values = np.array([10, 20, 30, 40, 50, 60])
# Sum per group using np.bincount
group_sums = np.bincount(labels, weights=values)  # [100, 70, 40]

# Pattern 4: Rolling/sliding window
def rolling_mean(x, window):
    # Use cumsum trick for O(n) rolling mean
    cumsum = np.cumsum(x)
    cumsum[window:] = cumsum[window:] - cumsum[:-window]
    return cumsum[window - 1:] / window

Company Variation

Google and Meta heavily test vectorization skills because their ML pipelines process billions of examples. They expect you to recognize that a nested Python loop over arrays is unacceptable. Startups are more forgiving - they care more about correctness than performance. Research labs (DeepMind, FAIR) test whether you can express complex mathematical operations (attention, convolution) as matrix operations.

Part 4 - Linear Algebra Operations

Essential Operations for ML

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
v = np.array([1, 2])

# Matrix multiplication (three equivalent ways)
C = A @ B                    # Preferred syntax (Python 3.5+)
C = np.matmul(A, B)          # Function form
C = np.dot(A, B)             # Works for 2D, but @ is clearer

# Dot product
d = np.dot(v, v)             # Scalar: 5
d = v @ v                    # Same thing

# Batch matrix multiplication
# Shape: (batch, m, k) @ (batch, k, n) -> (batch, m, n)
batch_A = np.random.randn(32, 4, 3)
batch_B = np.random.randn(32, 3, 5)
batch_C = batch_A @ batch_B  # Shape: (32, 4, 5)

# Transpose
A_T = A.T

# Inverse (use sparingly - prefer solve)
A_inv = np.linalg.inv(A)

# Solve Ax = b (numerically stable)
b = np.array([5, 11])
x = np.linalg.solve(A, b)   # Much better than A_inv @ b

# Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eigh(A.T @ A)  # For symmetric matrices

# SVD
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Norms
l2_norm = np.linalg.norm(v)            # L2 norm (default)
l1_norm = np.linalg.norm(v, ord=1)     # L1 norm
frob_norm = np.linalg.norm(A, 'fro')   # Frobenius norm for matrices

# Determinant and rank
det = np.linalg.det(A)
rank = np.linalg.matrix_rank(A)

Common Trap

Never use np.linalg.inv(A) @ b to solve a linear system. Use np.linalg.solve(A, b) instead. The inverse is numerically unstable and O(n^3) - solve uses LU decomposition which is both faster and more stable. Interviewers specifically check for this. Similarly, never compute A.T @ A then invert it for least squares - use np.linalg.lstsq(A, b).

Part 5 - Advanced Indexing

Boolean Indexing

X = np.array([[1, -2, 3], [4, -5, 6], [-7, 8, -9]])

# Select elements matching a condition
positives = X[X > 0]           # [1, 3, 4, 6, 8] - flattened

# Set elements matching a condition
X_clipped = X.copy()
X_clipped[X_clipped < 0] = 0   # ReLU equivalent

# Combine conditions (use & for AND, | for OR, ~ for NOT)
mask = (X > 0) & (X < 5)
selected = X[mask]              # [1, 3, 4]

# np.where: conditional element selection
result = np.where(X > 0, X, 0)        # ReLU
result = np.where(X > 0, X, -X)       # Absolute value

Fancy Indexing

# Index with arrays of indices
arr = np.array([10, 20, 30, 40, 50])
indices = np.array([0, 3, 4])
arr[indices]                    # [10, 40, 50]

# 2D fancy indexing
mat = np.arange(12).reshape(3, 4)
rows = np.array([0, 2])
cols = np.array([1, 3])
mat[rows, cols]                 # [1, 11] - elements (0,1) and (2,3)

# Select specific rows
mat[rows]                       # Shape: (2, 4) - rows 0 and 2

# np.take_along_axis: gather along an axis
# Common in top-k selection, beam search
scores = np.array([[0.1, 0.9, 0.3],
                   [0.8, 0.1, 0.5]])
top_k_indices = np.argsort(scores, axis=1)[:, -2:]  # Top-2 per row
top_k_values = np.take_along_axis(scores, top_k_indices, axis=1)

# np.put_along_axis: scatter along an axis
# Useful for one-hot encoding
n_classes = 5
labels = np.array([0, 3, 1, 4])
one_hot = np.zeros((len(labels), n_classes))
np.put_along_axis(one_hot, labels.reshape(-1, 1), 1, axis=1)

Part 6 - Memory Layout: C vs Fortran Order

How Arrays Are Stored

a = np.arange(6).reshape(2, 3)
# [[0, 1, 2],
#  [3, 4, 5]]

# C-order (row-major, default): elements in a row are contiguous
# Memory: [0, 1, 2, 3, 4, 5]
a_c = np.ascontiguousarray(a)   # Ensure C-order

# Fortran-order (column-major): elements in a column are contiguous
# Memory: [0, 3, 1, 4, 2, 5]
a_f = np.asfortranarray(a)     # Convert to Fortran-order

# Check order
a.flags['C_CONTIGUOUS']         # True (default)
a.flags['F_CONTIGUOUS']         # False

# Strides: bytes to step in each dimension
a.strides    # (24, 8) for float64 C-order: 3*8=24 to next row, 8 to next col

Why Memory Layout Matters

n = 10000

# Row-major access (fast for C-order)
a = np.random.randn(n, n)
# Summing along axis=1 (across columns within a row) is fast
# because elements in the same row are contiguous in memory
row_sum = a.sum(axis=1)   # Fast: sequential memory access

# Column-major access (slow for C-order)
col_sum = a.sum(axis=0)   # Slower: strided memory access (cache misses)

Interviewer's Perspective

When I ask about memory layout, I am testing whether the candidate understands performance beyond algorithmic complexity. A candidate who knows that sum(axis=1) is faster than sum(axis=0) for row-major arrays - and can explain why in terms of cache lines - demonstrates systems-level understanding. This matters for production ML code where data loading and preprocessing are often the bottleneck.

Views vs Copies

a = np.arange(10)

# Views: share the same memory (modifications propagate)
b = a[2:5]          # Slice -> view
b[0] = 99           # a[2] is now 99
c = a.reshape(2, 5) # Reshape -> view (if contiguous)

# Copies: independent memory
d = a.copy()         # Explicit copy
e = a[[0, 1, 2]]    # Fancy indexing -> always a copy
f = a[a > 5]         # Boolean indexing -> always a copy

# How to check
np.shares_memory(a, b)   # True (view)
np.shares_memory(a, d)   # False (copy)

Part 7 - Random Sampling for ML

rng = np.random.default_rng(42)

# Common distributions for ML
normal = rng.normal(loc=0, scale=1, size=(100, 10))    # Weight init
uniform = rng.uniform(0, 1, size=(100,))                # Dropout masks
bernoulli = rng.binomial(1, p=0.5, size=(100,))         # Binary masks

# Sampling without replacement (train/test split)
n_samples = 1000
indices = np.arange(n_samples)
rng.shuffle(indices)
train_idx = indices[:800]
test_idx = indices[800:]

# Or use choice
train_idx = rng.choice(n_samples, size=800, replace=False)

# Xavier/Glorot initialization
fan_in, fan_out = 256, 128
std = np.sqrt(2.0 / (fan_in + fan_out))
weights = rng.normal(0, std, size=(fan_in, fan_out))

# He initialization (for ReLU networks)
std = np.sqrt(2.0 / fan_in)
weights = rng.normal(0, std, size=(fan_in, fan_out))

Practice Problems

Problem 1: Implement Softmax

Implement the softmax function for a 2D input (batch of logits). Handle numerical stability.

Input: logits with shape (batch_size, num_classes) Output: probabilities with shape (batch_size, num_classes), rows sum to 1

Hint 1 - Direction

The naive formula is exp(x) / sum(exp(x)), but this overflows for large values. Subtract the maximum value first - this does not change the result mathematically but prevents overflow.

Hint 2 - Broadcasting

Use keepdims=True when computing the max and sum so that broadcasting works correctly along the class axis.

Hint 3 - Full Solution

def softmax(logits):
    """
    Numerically stable softmax for a batch of logits.

    Args:
        logits: np.ndarray of shape (batch_size, num_classes)
    Returns:
        np.ndarray of shape (batch_size, num_classes), rows sum to 1
    """
    # Subtract max for numerical stability (prevents exp overflow)
    # keepdims=True ensures shape (batch_size, 1) for broadcasting
    shifted = logits - logits.max(axis=1, keepdims=True)

    # Exponentiate
    exp_shifted = np.exp(shifted)

    # Normalize: divide by sum along class axis
    return exp_shifted / exp_shifted.sum(axis=1, keepdims=True)

# Test
logits = np.array([[1.0, 2.0, 3.0],
                    [1000.0, 1001.0, 1002.0]])  # Large values
probs = softmax(logits)
print(probs)
# [[0.0900, 0.2447, 0.6652],
#  [0.0900, 0.2447, 0.6652]]
print(probs.sum(axis=1))  # [1.0, 1.0]

Why the max trick works: softmax(x) = softmax(x - c) for any constant c. Subtracting the max ensures all exponents are <= 0, preventing overflow. The largest exponent is exp(0) = 1.

Scoring:

Strong Hire: Implements with numerical stability, correct axis handling, and keepdims. Explains why the max trick works mathematically.
Lean Hire: Correct formula but forgets numerical stability or has axis errors.
No Hire: Uses a loop over the batch dimension.

Problem 2: Implement Batch Normalization

Implement batch normalization for a 2D input (batch of feature vectors).

Input: x shape (batch_size, features), gamma shape (features,), beta shape (features,), eps=1e-5 Output: normalized, scaled, and shifted output of same shape

Hint 1 - Direction

Batch norm: normalize each feature across the batch (axis=0), then scale by gamma and shift by beta.

Hint 2 - Formula

Formula: y = gamma * (x - mean) / sqrt(var + eps) + beta where mean and var are computed per feature across the batch.

Hint 3 - Full Solution

def batch_norm(x, gamma, beta, eps=1e-5):
    """
    Batch normalization for 2D input.

    Args:
        x: np.ndarray of shape (batch_size, features)
        gamma: np.ndarray of shape (features,) - scale parameter
        beta: np.ndarray of shape (features,) - shift parameter
        eps: small constant for numerical stability
    Returns:
        Normalized output of shape (batch_size, features)
        Cache tuple for backward pass: (x_norm, std, x_centered)
    """
    # Step 1: Compute per-feature mean across batch
    mean = x.mean(axis=0)  # Shape: (features,)

    # Step 2: Center the data
    x_centered = x - mean  # Broadcasting: (batch, features) - (features,)

    # Step 3: Compute per-feature variance
    var = (x_centered ** 2).mean(axis=0)  # Shape: (features,)

    # Step 4: Normalize
    std = np.sqrt(var + eps)
    x_norm = x_centered / std  # Broadcasting: (batch, features) / (features,)

    # Step 5: Scale and shift
    out = gamma * x_norm + beta  # Broadcasting with (features,)

    return out, (x_norm, std, x_centered)

# Test
batch_size, features = 64, 128
x = np.random.randn(batch_size, features) * 5 + 3  # Arbitrary scale and shift
gamma = np.ones(features)
beta = np.zeros(features)

out, cache = batch_norm(x, gamma, beta)
print(f"Output mean per feature (should be ~0): {out.mean(axis=0)[:5]}")
print(f"Output var per feature (should be ~1):  {out.var(axis=0)[:5]}")

Scoring:

Strong Hire: Correct formula, proper axis usage, returns cache for backward pass. Can extend to 4D (conv layers).
Lean Hire: Correct formula but computes mean/var along wrong axis.
No Hire: Uses loops or cannot handle the broadcasting.

Problem 3: Cosine Similarity Matrix

Compute the pairwise cosine similarity between all pairs of vectors in a matrix.

Input: X with shape (n_samples, n_features) Output: similarity with shape (n_samples, n_samples)

Hint 1 - Direction

Cosine similarity = dot(a, b) / (norm(a) * norm(b)). For all pairs, the numerator is X @ X.T.

Hint 2 - Norms

Compute norms as a column vector, then use broadcasting for the denominator: norms[:, None] * norms[None, :].

Hint 3 - Full Solution

def cosine_similarity_matrix(X):
    """
    Compute pairwise cosine similarity.

    Args:
        X: np.ndarray of shape (n_samples, n_features)
    Returns:
        np.ndarray of shape (n_samples, n_samples) with similarities in [-1, 1]
    """
    # Step 1: Compute all pairwise dot products
    dot_products = X @ X.T  # Shape: (n, n)

    # Step 2: Compute L2 norms
    norms = np.linalg.norm(X, axis=1)  # Shape: (n,)

    # Step 3: Outer product of norms for denominator
    # Broadcasting: (n, 1) * (1, n) -> (n, n)
    norm_products = norms[:, None] * norms[None, :]

    # Step 4: Divide (handle zero norms)
    similarity = np.divide(dot_products, norm_products,
                          out=np.zeros_like(dot_products),
                          where=norm_products != 0)

    return similarity

# Test
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 1, 0]])
sim = cosine_similarity_matrix(X)
print(sim)
# [[1.   , 0.   , 0.707],
#  [0.   , 1.   , 0.707],
#  [0.707, 0.707, 1.   ]]

Scoring:

Strong Hire: Vectorized solution with zero-norm handling. Can discuss when to use cosine vs Euclidean distance.
Lean Hire: Correct but uses a loop over pairs.
No Hire: Cannot vectorize the computation.

Problem 4: Implement Cross-Entropy Loss

Compute the categorical cross-entropy loss for a batch of predictions.

Input: y_pred shape (batch_size, num_classes) - predicted probabilities, y_true shape (batch_size,) - integer labels Output: scalar loss value

Hint 1 - Direction

Cross-entropy: -log(p_correct_class). Use fancy indexing to select the predicted probability for the true class.

Hint 2 - Numerical stability

Clip predictions to avoid log(0). Use np.clip(y_pred, eps, 1 - eps).

Hint 3 - Full Solution

def cross_entropy_loss(y_pred, y_true, eps=1e-12):
    """
    Categorical cross-entropy loss.

    Args:
        y_pred: np.ndarray of shape (batch_size, num_classes) - probabilities
        y_true: np.ndarray of shape (batch_size,) - integer class labels
    Returns:
        Scalar loss value
    """
    batch_size = y_pred.shape[0]

    # Clip to avoid log(0)
    y_pred_clipped = np.clip(y_pred, eps, 1 - eps)

    # Select the predicted probability for the true class
    # Using advanced indexing: for each sample i, get y_pred[i, y_true[i]]
    correct_probs = y_pred_clipped[np.arange(batch_size), y_true]

    # Compute negative log likelihood and average over batch
    loss = -np.mean(np.log(correct_probs))

    return loss

# Test
y_pred = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.2, 0.3, 0.5]])
y_true = np.array([0, 1, 2])
loss = cross_entropy_loss(y_pred, y_true)
print(f"Loss: {loss:.4f}")  # ~0.3567

Key insight: The line y_pred_clipped[np.arange(batch_size), y_true] is fancy indexing - it selects one element per row using the label as the column index. No loops needed.

Scoring:

Strong Hire: Correct with clipping, fancy indexing, and can derive the gradient.
Lean Hire: Correct but uses one-hot encoding instead of fancy indexing (wasteful).
No Hire: Uses a loop over the batch.

Problem 5: Pairwise Euclidean Distance

Compute all pairwise Euclidean distances between two sets of points without any loops.

Input: X shape (n, d), Y shape (m, d) Output: distances shape (n, m)

Hint 1 - Direction

Use the identity: ||a - b||^2 = ||a||^2 + ||b||^2 - 2ab. This avoids the O(nmd) explicit difference.

Hint 2 - Broadcasting

||X||^2 has shape (n,), ||Y||^2 has shape (m,). Reshape to (n, 1) and (1, m) for broadcasting, then add the cross term X @ Y.T.

Hint 3 - Full Solution

def pairwise_euclidean(X, Y):
    """
    Compute pairwise Euclidean distances between X and Y.

    Args:
        X: np.ndarray of shape (n, d)
        Y: np.ndarray of shape (m, d)
    Returns:
        np.ndarray of shape (n, m) with distances
    """
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * x . y
    X_sq = np.sum(X ** 2, axis=1)[:, None]  # Shape: (n, 1)
    Y_sq = np.sum(Y ** 2, axis=1)[None, :]  # Shape: (1, m)
    cross = X @ Y.T                          # Shape: (n, m)

    # Clamp to avoid negative values from floating point errors
    sq_dist = np.maximum(X_sq + Y_sq - 2 * cross, 0.0)

    return np.sqrt(sq_dist)

# Test
X = np.array([[0, 0], [1, 0], [0, 1]])
Y = np.array([[1, 1], [2, 2]])
print(pairwise_euclidean(X, Y))
# [[1.414, 2.828],
#  [1.000, 2.236],
#  [1.000, 2.236]]

Why clamp? Floating point arithmetic can produce small negative values in X_sq + Y_sq - 2 * cross when points are very close. Without clamping, np.sqrt would produce NaN.

Scoring:

Strong Hire: Uses the expansion trick, handles numerical issues, discusses complexity O(nm + nd + m*d).
Lean Hire: Correct but uses explicit subtraction with broadcasting: np.sqrt(((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)) - this works but creates an O(nmd) intermediate array.
No Hire: Uses loops.

Problem 6: Implement k-Nearest Neighbors Prediction

Given training data, labels, and a query point, predict the label using k-NN. Use NumPy only.

Hint 1 - Direction

Compute distances from the query to all training points, find the k smallest, and take a majority vote.

Hint 2 - Useful functions

Use np.argpartition for O(n) partial sort instead of O(n log n) full sort. Use np.bincount for majority vote.

Hint 3 - Full Solution

def knn_predict(X_train, y_train, X_query, k=5):
    """
    k-NN prediction for multiple query points.

    Args:
        X_train: np.ndarray of shape (n_train, d)
        y_train: np.ndarray of shape (n_train,) - integer labels
        X_query: np.ndarray of shape (n_query, d)
        k: number of neighbors
    Returns:
        predictions: np.ndarray of shape (n_query,)
    """
    # Step 1: Compute all pairwise distances
    # Using the expansion trick from Problem 5
    X_sq = np.sum(X_train ** 2, axis=1)[None, :]  # (1, n_train)
    Q_sq = np.sum(X_query ** 2, axis=1)[:, None]  # (n_query, 1)
    cross = X_query @ X_train.T                     # (n_query, n_train)
    distances = np.sqrt(np.maximum(Q_sq + X_sq - 2 * cross, 0.0))

    # Step 2: Find k nearest neighbors (O(n) with argpartition)
    # argpartition gives indices of k smallest elements (unordered)
    knn_indices = np.argpartition(distances, k, axis=1)[:, :k]

    # Step 3: Get labels of k nearest neighbors
    knn_labels = y_train[knn_indices]  # Shape: (n_query, k)

    # Step 4: Majority vote per query
    n_classes = y_train.max() + 1
    predictions = np.zeros(X_query.shape[0], dtype=int)
    for i in range(X_query.shape[0]):
        counts = np.bincount(knn_labels[i], minlength=n_classes)
        predictions[i] = np.argmax(counts)

    return predictions

# Test
X_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_query = np.array([[0.5, 0.5], [5.5, 5.5]])
print(knn_predict(X_train, y_train, X_query, k=3))  # [0, 1]

Note: The loop in Step 4 is acceptable because it is over queries (often small), not over features or training samples. A fully vectorized version would use np.apply_along_axis or a sparse matrix trick, but the loop here is pragmatic.

Scoring:

Strong Hire: Uses argpartition instead of argsort. Handles ties. Discusses time complexity.
Lean Hire: Correct but uses full sort (argsort).
No Hire: Cannot compute distances without loops.

Problem 7: Implement Matrix Factorization (Truncated SVD)

Implement dimensionality reduction using SVD. Given a data matrix, return the k-dimensional representation.

Hint 1 - Direction

Center the data, compute SVD, and project onto the top-k right singular vectors.

Hint 2 - The projection

After SVD: X = U @ diag(S) @ Vt. The k-dimensional projection is X @ Vt[:k].T, or equivalently U[:, :k] @ diag(S[:k]).

Hint 3 - Full Solution

def truncated_svd(X, k):
    """
    Dimensionality reduction via truncated SVD.

    Args:
        X: np.ndarray of shape (n_samples, n_features)
        k: number of components to keep
    Returns:
        X_reduced: np.ndarray of shape (n_samples, k)
        explained_variance_ratio: np.ndarray of shape (k,)
        components: np.ndarray of shape (k, n_features) - the projection directions
    """
    # Step 1: Center the data
    mean = X.mean(axis=0)
    X_centered = X - mean

    # Step 2: Compute SVD
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

    # Step 3: Truncate to k components
    U_k = U[:, :k]           # Shape: (n, k)
    S_k = S[:k]               # Shape: (k,)
    Vt_k = Vt[:k, :]          # Shape: (k, d)

    # Step 4: Project (two equivalent ways)
    X_reduced = U_k * S_k     # Shape: (n, k) - element-wise multiply
    # Equivalent: X_reduced = X_centered @ Vt_k.T

    # Step 5: Explained variance ratio
    total_variance = np.sum(S ** 2) / (X.shape[0] - 1)
    explained_variance = S_k ** 2 / (X.shape[0] - 1)
    explained_variance_ratio = explained_variance / total_variance

    return X_reduced, explained_variance_ratio, Vt_k

# Test
X = np.random.randn(100, 10)
X_reduced, evr, components = truncated_svd(X, k=3)
print(f"Reduced shape: {X_reduced.shape}")  # (100, 3)
print(f"Explained variance ratios: {evr}")
print(f"Total explained: {evr.sum():.2%}")

Scoring:

Strong Hire: Centers data first, computes explained variance ratio, discusses connection to PCA. Mentions that full_matrices=False is important for memory when n >> d.
Lean Hire: Correct SVD but forgets to center data.
No Hire: Cannot implement without sklearn.

Problem 8: Implement Attention Scores

Compute scaled dot-product attention scores for a batch of queries and keys.

Input: Q shape (batch, seq_len, d_model), K shape (batch, seq_len, d_model) Output: attention_weights shape (batch, seq_len, seq_len)

Hint 1 - Direction

Attention(Q, K) = softmax(Q @ K^T / sqrt(d_model)). Use batch matrix multiplication.

Hint 2 - Masking

For causal attention, add a mask of -inf to the upper triangle before softmax.

Hint 3 - Full Solution

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Scaled dot-product attention.

    Args:
        Q: np.ndarray of shape (batch, seq_len_q, d_k)
        K: np.ndarray of shape (batch, seq_len_k, d_k)
        V: np.ndarray of shape (batch, seq_len_k, d_v)
        mask: optional bool array, True = masked (set to -inf)
    Returns:
        output: np.ndarray of shape (batch, seq_len_q, d_v)
        weights: np.ndarray of shape (batch, seq_len_q, seq_len_k)
    """
    d_k = Q.shape[-1]

    # Step 1: Compute raw scores
    # Batch matmul: (batch, seq_q, d_k) @ (batch, d_k, seq_k) -> (batch, seq_q, seq_k)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)

    # Step 2: Apply mask (for causal/padding attention)
    if mask is not None:
        scores = np.where(mask, -1e9, scores)

    # Step 3: Softmax along last axis (key dimension)
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

    # Step 4: Weighted sum of values
    output = weights @ V  # (batch, seq_q, seq_k) @ (batch, seq_k, d_v)

    return output, weights

# Test with causal mask
batch, seq_len, d_k, d_v = 2, 4, 8, 8
Q = np.random.randn(batch, seq_len, d_k)
K = np.random.randn(batch, seq_len, d_k)
V = np.random.randn(batch, seq_len, d_v)

# Causal mask: prevent attending to future positions
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
output, weights = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(f"Output shape: {output.shape}")      # (2, 4, 8)
print(f"Weights shape: {weights.shape}")     # (2, 4, 4)
print(f"Weights row sums: {weights.sum(axis=-1)}")  # All 1.0

Scoring:

Strong Hire: Implements with scaling, masking, numerical stability. Explains why scaling by sqrt(d_k) prevents softmax saturation. Can extend to multi-head attention.
Lean Hire: Correct basic implementation but misses scaling or masking.
No Hire: Cannot express attention as matrix operations.

Problem 9: Implement Batch Matrix Operations for Multi-Head Attention

Split an input into multiple attention heads, process each, and concatenate results.

Hint 1 - Direction

Reshape (batch, seq, d_model) into (batch, n_heads, seq, d_head), then use the attention function from Problem 8.

Hint 2 - Reshape trick

x.reshape(batch, seq, n_heads, d_head).transpose(0, 2, 1, 3) gives shape (batch, n_heads, seq, d_head).

Hint 3 - Full Solution

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads, mask=None):
    """
    Multi-head attention.

    Args:
        X: np.ndarray of shape (batch, seq_len, d_model)
        W_Q, W_K, W_V: np.ndarray of shape (d_model, d_model)
        W_O: np.ndarray of shape (d_model, d_model) - output projection
        n_heads: int
        mask: optional
    Returns:
        output: np.ndarray of shape (batch, seq_len, d_model)
    """
    batch, seq_len, d_model = X.shape
    d_head = d_model // n_heads

    # Step 1: Project to Q, K, V
    Q = X @ W_Q  # (batch, seq, d_model)
    K = X @ W_K
    V = X @ W_V

    # Step 2: Reshape into heads
    # (batch, seq, d_model) -> (batch, seq, n_heads, d_head) -> (batch, n_heads, seq, d_head)
    Q = Q.reshape(batch, seq_len, n_heads, d_head).transpose(0, 2, 1, 3)
    K = K.reshape(batch, seq_len, n_heads, d_head).transpose(0, 2, 1, 3)
    V = V.reshape(batch, seq_len, n_heads, d_head).transpose(0, 2, 1, 3)

    # Merge batch and head dims for attention: (batch * n_heads, seq, d_head)
    Q = Q.reshape(batch * n_heads, seq_len, d_head)
    K = K.reshape(batch * n_heads, seq_len, d_head)
    V = V.reshape(batch * n_heads, seq_len, d_head)

    # Step 3: Apply scaled dot-product attention
    attn_output, attn_weights = scaled_dot_product_attention(Q, K, V, mask)

    # Step 4: Reshape back: (batch * n_heads, seq, d_head) -> (batch, seq, d_model)
    attn_output = attn_output.reshape(batch, n_heads, seq_len, d_head)
    attn_output = attn_output.transpose(0, 2, 1, 3)  # (batch, seq, n_heads, d_head)
    attn_output = attn_output.reshape(batch, seq_len, d_model)  # Concatenate heads

    # Step 5: Output projection
    output = attn_output @ W_O

    return output

# Test
batch, seq_len, d_model, n_heads = 2, 8, 64, 8
X = np.random.randn(batch, seq_len, d_model)
W_Q = np.random.randn(d_model, d_model) * 0.02
W_K = np.random.randn(d_model, d_model) * 0.02
W_V = np.random.randn(d_model, d_model) * 0.02
W_O = np.random.randn(d_model, d_model) * 0.02

output = multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads)
print(f"Output shape: {output.shape}")  # (2, 8, 64)

Scoring:

Strong Hire: Correct reshape/transpose sequence, can explain why we split into heads (different representation subspaces), discusses parameter count.
Lean Hire: Correct but uses loops over heads.
No Hire: Cannot manage the 4D tensor manipulations.

Problem 10: Implement Conv2D Forward Pass

Implement a basic 2D convolution forward pass using NumPy (im2col approach).

Hint 1 - Direction

The im2col trick converts convolution to matrix multiplication: extract all patches into columns, then multiply by the flattened kernel.

Hint 2 - Patch extraction

Use stride tricks or nested indexing to extract patches. The output spatial size is (H - kH + 2*pad) // stride + 1.

Hint 3 - Full Solution

def conv2d_forward(x, kernel, bias=None, stride=1, padding=0):
    """
    2D convolution forward pass using im2col.

    Args:
        x: np.ndarray of shape (batch, in_channels, H, W)
        kernel: np.ndarray of shape (out_channels, in_channels, kH, kW)
        bias: optional np.ndarray of shape (out_channels,)
        stride: int
        padding: int
    Returns:
        output: np.ndarray of shape (batch, out_channels, H_out, W_out)
    """
    batch, in_ch, H, W = x.shape
    out_ch, _, kH, kW = kernel.shape

    # Step 1: Pad input if needed
    if padding > 0:
        x = np.pad(x, ((0,0), (0,0), (padding,padding), (padding,padding)))
        _, _, H, W = x.shape

    # Step 2: Compute output dimensions
    H_out = (H - kH) // stride + 1
    W_out = (W - kW) // stride + 1

    # Step 3: im2col - extract patches as columns
    # Each patch is (in_ch * kH * kW) elements
    # Total patches per image: H_out * W_out
    cols = np.zeros((batch, in_ch * kH * kW, H_out * W_out))

    col_idx = 0
    for i in range(H_out):
        for j in range(W_out):
            h_start = i * stride
            w_start = j * stride
            patch = x[:, :, h_start:h_start+kH, w_start:w_start+kW]
            cols[:, :, col_idx] = patch.reshape(batch, -1)
            col_idx += 1

    # Step 4: Reshape kernel: (out_ch, in_ch*kH*kW)
    kernel_flat = kernel.reshape(out_ch, -1)

    # Step 5: Matrix multiply: (out_ch, in_ch*kH*kW) @ (batch, in_ch*kH*kW, patches)
    # Use einsum for batch matmul
    output = np.einsum('oi,bio->bo', kernel_flat, cols)
    # Reshape: actually we need per-sample:
    output = np.zeros((batch, out_ch, H_out * W_out))
    for b_idx in range(batch):
        output[b_idx] = kernel_flat @ cols[b_idx]

    output = output.reshape(batch, out_ch, H_out, W_out)

    # Step 6: Add bias
    if bias is not None:
        output += bias.reshape(1, -1, 1, 1)  # Broadcasting

    return output

# Test
batch, in_ch, H, W = 2, 3, 8, 8
out_ch, kH, kW = 16, 3, 3
x = np.random.randn(batch, in_ch, H, W)
kernel = np.random.randn(out_ch, in_ch, kH, kW) * 0.01
bias = np.zeros(out_ch)

output = conv2d_forward(x, kernel, bias, stride=1, padding=1)
print(f"Output shape: {output.shape}")  # (2, 16, 8, 8) with padding=1

Note: The im2col approach trades memory for speed - the column matrix can be large. Production implementations use optimized GEMM libraries. The loops here are over spatial positions (small), not over batch or channels.

Scoring:

Strong Hire: Implements im2col correctly, discusses the memory-speed tradeoff, mentions that cuDNN uses different algorithms (Winograd, FFT).
Lean Hire: Correct but uses 6 nested loops (batch, out_ch, H_out, W_out, kH, kW).
No Hire: Cannot implement convolution at all.

Interview Cheat Sheet

Concept	Key Pattern	One-Liner	Red Flag
Array creation	`np.zeros`, `np.eye`, `default_rng()`	Fixed-type contiguous memory blocks	Using Python lists for numerical computation
Broadcasting	Shapes align from the right, size-1 stretches	Eliminates loops for element-wise ops	Cannot state the three rules
Vectorization	Replace loops with array operations	100-1000x speedup over Python loops	Writing `for i in range(n)` over array elements
Matrix multiply	`A @ B` or `np.matmul`	Use @ operator, not np.dot for clarity	Using np.dot for batch matmul
Solving Ax=b	`np.linalg.solve(A, b)`	Never invert A explicitly	Writing `np.linalg.inv(A) @ b`
Fancy indexing	`a[indices]`, `a[np.arange(n), labels]`	Select arbitrary elements without loops	Not knowing that fancy indexing copies
Boolean indexing	`a[a > 0]`, `np.where(cond, x, y)`	Conditional selection/assignment	Using loops for conditional operations
Memory layout	C-order (row-major) vs F-order (col-major)	Determines which axis iterations are fast	Not knowing views vs copies
Numerical stability	Subtract max before exp, clip before log	Prevent overflow/underflow in softmax/CE	Implementing naive softmax that overflows
Partial sort	`np.argpartition(a, k)`	O(n) for top-k instead of O(n log n) sort	Using full sort for top-k selection

Spaced Repetition Checkpoints

Day 0 - Initial Learning

Read this entire page and run all code examples
Implement softmax and batch norm from memory
State the three broadcasting rules without looking
Complete the self-assessment

Day 3 - First Recall

Implement cosine similarity matrix and cross-entropy loss from scratch
Explain views vs copies and when each occurs
Solve Problem 5 (pairwise distance) without hints

Day 7 - Connections

Implement k-NN prediction using vectorized distance computation
Implement truncated SVD and explain its connection to PCA
Write scaled dot-product attention from memory

Day 14 - Application

Implement multi-head attention from scratch under timed conditions (20 min)
Implement a conv2d forward pass from memory
Explain memory layout implications for ML data loading

Day 21 - Mock Interview

Have someone pick 3 problems randomly - solve each in under 10 minutes
Explain broadcasting rules, then demonstrate with 3 examples on a whiteboard
Implement batch normalization for 4D tensors (batch, channels, H, W) from memory

Key Takeaways

Broadcasting is the foundation. Every vectorized ML operation - normalization, distance computation, attention - relies on broadcasting. Master the three rules and you can vectorize anything.
Vectorization is not optional. In AI interviews, a loop over array elements is treated as a correctness error, not just a style issue. Interviewers interpret it as a signal that you will write slow production code.
Numerical stability is expected. Softmax, cross-entropy, and distance computations all have well-known numerical failure modes. Production-ready implementations handle them. Interview implementations should too.
Memory awareness separates senior from junior. Understanding views vs copies, C-order vs Fortran-order, and the memory implications of operations like im2col demonstrates systems-level thinking that senior ML roles require.

Next Steps

Continue to Pandas for Interviews to master data manipulation and feature engineering - the other half of the numerical computing interview.

The Real Interview Moment​

What You Will Master​

Self-Assessment: Where Are You Now?​

Part 1 - Array Creation and Manipulation​

Essential Array Creation​

Shape Manipulation​

Part 2 - Broadcasting Rules​

The Three Rules​

Broadcasting Examples​

Broadcasting Compatibility Table​

Part 3 - Vectorization vs Loops​

The Performance Gap​

Common Vectorization Patterns​

Part 4 - Linear Algebra Operations​

Essential Operations for ML​

Part 5 - Advanced Indexing​

Boolean Indexing​

Fancy Indexing​

Part 6 - Memory Layout: C vs Fortran Order​

How Arrays Are Stored​

Why Memory Layout Matters​

Views vs Copies​

Part 7 - Random Sampling for ML​

Practice Problems​

Problem 1: Implement Softmax​

Problem 2: Implement Batch Normalization​

Problem 3: Cosine Similarity Matrix​

Problem 4: Implement Cross-Entropy Loss​

Problem 5: Pairwise Euclidean Distance​

Problem 6: Implement k-Nearest Neighbors Prediction​

Problem 7: Implement Matrix Factorization (Truncated SVD)​

Problem 8: Implement Attention Scores​

Problem 9: Implement Batch Matrix Operations for Multi-Head Attention​

Problem 10: Implement Conv2D Forward Pass​

Interview Cheat Sheet​

Spaced Repetition Checkpoints​

Day 0 - Initial Learning​

Day 3 - First Recall​

Day 7 - Connections​

Day 14 - Application​

Day 21 - Mock Interview​

Key Takeaways​

Next Steps​

The Real Interview Moment

What You Will Master

Self-Assessment: Where Are You Now?

Part 1 - Array Creation and Manipulation

Essential Array Creation

Shape Manipulation

Part 2 - Broadcasting Rules

The Three Rules

Broadcasting Examples

Broadcasting Compatibility Table

Part 3 - Vectorization vs Loops

The Performance Gap

Common Vectorization Patterns

Part 4 - Linear Algebra Operations

Essential Operations for ML

Part 5 - Advanced Indexing

Boolean Indexing

Fancy Indexing

Part 6 - Memory Layout: C vs Fortran Order

How Arrays Are Stored

Why Memory Layout Matters

Views vs Copies

Part 7 - Random Sampling for ML

Practice Problems

Problem 1: Implement Softmax

Problem 2: Implement Batch Normalization

Problem 3: Cosine Similarity Matrix

Problem 4: Implement Cross-Entropy Loss

Problem 5: Pairwise Euclidean Distance

Problem 6: Implement k-Nearest Neighbors Prediction

Problem 7: Implement Matrix Factorization (Truncated SVD)

Problem 8: Implement Attention Scores

Problem 9: Implement Batch Matrix Operations for Multi-Head Attention

Problem 10: Implement Conv2D Forward Pass

Interview Cheat Sheet

Spaced Repetition Checkpoints

Day 0 - Initial Learning

Day 3 - First Recall

Day 7 - Connections

Day 14 - Application

Day 21 - Mock Interview

Key Takeaways

Next Steps