Matrix Operations - The Engine of the Neural Network Forward Pass
Reading time: ~24 minutes | Level: Mathematical Foundations → ML Engineering
A neural network forward pass is a sequence of matrix multiplications. A weight update in backpropagation is a matrix product. The attention scores in a transformer are computed by a matrix multiplication QKᵀ. The Gram matrix XᵀX appears in kernel SVMs, ridge regression, and PCA. The entire computational graph of a deep learning model is, at its core, a composition of linear transformations represented as matrices.
If you think of matrix multiplication as "row times column equals a number," you are missing the geometric story - and without the geometric story, you cannot reason about why neural networks work, when they fail, or what the weight matrices are actually doing to the data.
This lesson fixes that.
What You Will Learn
- What matrix multiplication actually does: linear transformation composition
- Transpose: when it appears, what it means, why
XᵀXis everywhere - Matrix inverse: when it exists, why you almost never compute it directly
- Rank: what it reveals about your data and your model's expressiveness
- Determinant: the volume-scaling factor and its relationship to invertibility
- NumPy:
@,np.linalg.inv,np.linalg.solve,matrix_rank,det - ML connections: attention scores, Gram matrix, normal equations for regression
Prerequisites
- Lesson 01: Vectors and Vector Spaces
- NumPy array indexing and shape
Part 1 - What Matrix Multiplication Really Does
The standard definition: to multiply matrices A (m×n) and B (n×p), take the dot product of each row of A with each column of B. But this definition describes the computation, not the meaning.
The geometric meaning
A matrix is a linear transformation - a function that takes vectors as input and produces vectors as output, while preserving the vector space structure (addition and scalar multiplication).
When you multiply a matrix A by a vector x, you get:
y = Ax
This transforms the vector x into a new vector y. The matrix A encodes how to transform.
Matrix multiplication composes two transformations:
C = AB means: "first apply B, then apply A"
If B transforms x → Bx, and A transforms Bx → A(Bx), then C = AB transforms x → Cx in one step.
A neural network layer is a matrix transformation
A single fully-connected layer computes:
output = f(Wx + b)
Where:
- W is the weight matrix (m×n): maps n-dimensional input to m-dimensional output
- b is the bias vector (m-dimensional)
- f is a nonlinearity (ReLU, GELU, etc.)
The linear part Wx is matrix-vector multiplication: a linear transformation.
A neural network with L layers applies L transformations in sequence:
x₁ = f₁(W₁x₀ + b₁)
x₂ = f₂(W₂x₁ + b₂)
...
xL = fL(WLxL₋₁ + bL)
Each layer's weight matrix Wₗ is a linear transformation. The nonlinearity fₗ is what prevents the composition from collapsing to a single matrix (without nonlinearities, the entire network is equivalent to a single matrix multiplication).
Matrix dimensions and shapes
import numpy as np
# Matrix A: (m×n) transforms n-dimensional vectors to m-dimensional vectors
A = np.random.randn(4, 3) # 4×3 matrix
# Column vector x: n-dimensional
x = np.random.randn(3) # 3-dimensional vector
# Matrix-vector multiply: (4×3) @ (3,) → (4,)
y = A @ x # y is 4-dimensional
# Matrix-matrix multiply: (4×3) @ (3×5) → (4×5)
B = np.random.randn(3, 5)
C = A @ B # C is 4×5
# Batch matrix multiply: (batch, m, n) @ (batch, n, p) → (batch, m, p)
# This is what happens in attention across all heads in parallel
batch_A = np.random.randn(32, 8, 64) # (batch, heads, d_k)
batch_B = np.random.randn(32, 64, 10) # (batch, d_k, seq_len)
batch_C = batch_A @ batch_B # (32, 8, 10)
print(f"Batch matmul result shape: {batch_C.shape}")
# Attention scores: Q @ Kᵀ
# Q: (batch, heads, seq_len, d_k)
# K: (batch, heads, seq_len, d_k)
# Kᵀ: (batch, heads, d_k, seq_len)
# Q @ Kᵀ: (batch, heads, seq_len, seq_len) - pairwise attention scores
batch_size, n_heads, seq_len, d_k = 8, 12, 128, 64
Q = np.random.randn(batch_size, n_heads, seq_len, d_k)
K = np.random.randn(batch_size, n_heads, seq_len, d_k)
attention_scores = Q @ K.transpose(0, 1, 3, 2) # (8, 12, 128, 128)
print(f"Attention scores shape: {attention_scores.shape}")
What matrix multiplication is NOT
Matrix multiplication is:
- Not commutative: AB ≠ BA in general
- Associative: (AB)C = A(BC) - important for understanding how layers compose
- Distributive: A(B+C) = AB + AC
Part 2 - The Transpose
The transpose of a matrix A (m×n) is the matrix Aᵀ (n×m) obtained by reflecting along the main diagonal - swapping rows and columns:
A = [1 2 3] Aᵀ = [1 4]
[4 5 6] [2 5]
[3 6]
When the transpose appears in ML
Attention mechanism: attention_scores = Q @ Kᵀ
The query matrix Q has shape (seq_len, d_k) and the key matrix K also has shape (seq_len, d_k). To compute pairwise dot products between all queries and all keys, we need K transposed:
scores = Q @ Kᵀ # (seq_len, d_k) @ (d_k, seq_len) → (seq_len, seq_len)
The (i,j) element of scores is the dot product between query i and key j - how much position i should attend to position j.
Gram matrix: G = XᵀX
The Gram matrix of a data matrix X (n×d, where n is the number of samples and d is the number of features) is:
G = XᵀX # (d×n) @ (n×d) → (d×d)
The (i,j) element of G is Σₖ Xₖᵢ Xₖⱼ - the dot product between feature i and feature j across all data points. The Gram matrix:
- Is symmetric (G = Gᵀ)
- Is always positive semidefinite (all eigenvalues ≥ 0)
- Appears in: ridge regression normal equations, PCA covariance matrices, kernel SVMs
Symmetric matrices satisfy A = Aᵀ. They are extremely important in ML because:
- Covariance matrices are symmetric
- All eigenvalues of a symmetric matrix are real
- Eigenvectors of a symmetric matrix are orthogonal
- They are always diagonalizable
import numpy as np
# Transpose
A = np.array([[1, 2, 3],
[4, 5, 6]]) # shape (2, 3)
AT = A.T # shape (3, 2)
print(f"A shape: {A.shape}, Aᵀ shape: {AT.shape}")
# Property: (AB)ᵀ = BᵀAᵀ (note the reversal!)
B = np.random.randn(3, 4)
AB_T = (A @ B).T # Transpose of product
BT_AT = B.T @ A.T # Product of transposes (reversed order)
print(f"(AB)ᵀ == BᵀAᵀ: {np.allclose(AB_T, BT_AT)}") # True
# Gram matrix: XᵀX
n_samples, n_features = 100, 20
X = np.random.randn(n_samples, n_features)
gram = X.T @ X # (20, 20) - feature covariance structure
# Check symmetry: XᵀX is always symmetric
print(f"Gram matrix is symmetric: {np.allclose(gram, gram.T)}") # True
# Check positive semidefiniteness: all eigenvalues ≥ 0
eigenvalues = np.linalg.eigvalsh(gram) # eigvalsh for symmetric matrices
print(f"All eigenvalues ≥ 0: {np.all(eigenvalues >= -1e-10)}") # True
# Attention scores (simplified)
seq_len, d_k = 10, 64
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
scores = Q @ K.T / np.sqrt(d_k) # scaled dot-product attention
print(f"Attention scores shape: {scores.shape}") # (10, 10)
Part 3 - The Matrix Inverse
The inverse of a square matrix A (n×n), written A⁻¹, is the unique matrix satisfying:
AA⁻¹ = A⁻¹A = I (where I is the n×n identity matrix)
A matrix has an inverse if and only if it is square and has full rank (all columns/rows are linearly independent, equivalently det(A) ≠ 0).
When the inverse exists: geometric meaning
If A represents a linear transformation (stretching, rotating, reflecting), then A⁻¹ represents the reverse transformation (undo the stretch, undo the rotation).
If A does not have an inverse (singular matrix), the transformation is irreversible - it maps multiple input vectors to the same output vector, losing information. You cannot recover the original.
The normal equations for linear regression
The ordinary least squares solution to min ‖y - Xw‖₂² is:
w* = (XᵀX)⁻¹ Xᵀy
This is the normal equation. It appears in every linear regression derivation. However...
:::danger Never use np.linalg.inv() to solve linear systems
Computing (XᵀX)⁻¹ Xᵀy by explicitly computing the matrix inverse is:
- Numerically unstable: matrix inversion amplifies floating-point errors
- Computationally wasteful: O(n³) for the inverse when O(n²) is possible
- Almost always wrong: in practice, use
np.linalg.solve
# WRONG: explicitly computing the inverse
import numpy as np
X = np.random.randn(100, 10)
y = np.random.randn(100)
XTX = X.T @ X
XTX_inv = np.linalg.inv(XTX) # Never do this in production!
w = XTX_inv @ X.T @ y
# RIGHT: use solve, which uses LU decomposition internally
w = np.linalg.solve(X.T @ X, X.T @ y) # More stable, faster
# EVEN BETTER: use lstsq, which handles rank-deficient cases
w, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
Rule: If you ever find yourself writing np.linalg.inv(A) @ b, replace it with np.linalg.solve(A, b). Always.
:::
The pseudoinverse
When A is not square (rectangular matrix, m×n with m ≠ n) or is rank-deficient, the regular inverse does not exist. The Moore-Penrose pseudoinverse A⁺ is the generalization:
import numpy as np
# Rectangular matrix (overdetermined system: more equations than unknowns)
A = np.random.randn(5, 3) # 5 equations, 3 unknowns
# Pseudoinverse: A⁺ = (AᵀA)⁻¹Aᵀ for full column rank
A_pinv = np.linalg.pinv(A) # (3, 5) matrix
print(f"A shape: {A.shape}, A⁺ shape: {A_pinv.shape}")
# Property: A⁺A ≈ I (for full column rank A)
print(f"A⁺A ≈ I: {np.allclose(A_pinv @ A, np.eye(3))}") # True
# Least squares solution using pseudoinverse: w = A⁺y
y = np.random.randn(5)
w = A_pinv @ y # same as np.linalg.lstsq(A, y, rcond=None)[0]
Part 4 - Rank: The Intrinsic Dimensionality
The rank of a matrix A (m×n) is the maximum number of linearly independent columns (equivalently, rows). It satisfies:
rank(A) ≤ min(m, n)
A matrix has full rank if rank(A) = min(m, n). Otherwise it is rank-deficient.
What rank reveals about data
If your data matrix X has shape (n_samples=1000, n_features=50), the rank of X tells you the intrinsic dimensionality of your data:
rank(X) = 50: All 50 features are linearly independent - full rank, no redundancyrank(X) = 10: Only 10 directions in feature space have actual variation - the data lies on a 10-dimensional subspacerank(X) = 1: All features are proportional - all data lies on a 1D line
A low-rank data matrix is a redundancy signal: the data has fewer independent dimensions than features. PCA finds these low-dimensional subspaces.
What rank reveals about weight matrices
If a neural network weight matrix W has rank r < min(m,n), the layer can only produce r linearly independent output directions. Information in the other (min(m,n)-r) dimensions is destroyed.
This is used intentionally in Low-Rank Adaptation (LoRA) - fine-tuning LLMs by learning low-rank updates to weight matrices:
W_adapted = W_pretrained + BA where B is (d × r) and A is (r × d) with r << d
The update BA has rank ≤ r, capturing the "essential" adaptation in a low-dimensional subspace.
import numpy as np
# Matrix rank
A = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
print(f"Rank of A: {np.linalg.matrix_rank(A)}") # 2 (not 3!)
# Row 3 = 2*Row2 - Row1 → linearly dependent
# Full rank matrix
B = np.eye(3) # Identity matrix
print(f"Rank of identity: {np.linalg.matrix_rank(B)}") # 3
# Rank of a data matrix
n_samples, n_features = 100, 20
X = np.random.randn(n_samples, n_features)
print(f"Rank of full-rank data matrix: {np.linalg.matrix_rank(X)}") # 20
# Low-rank data (data on a lower-dimensional manifold)
# Create rank-5 data in 20D space
low_rank_X = np.random.randn(100, 5) @ np.random.randn(5, 20)
print(f"Rank of low-rank matrix: {np.linalg.matrix_rank(low_rank_X)}") # 5
# Numerical rank with tolerance
eps_X = low_rank_X + 1e-10 * np.random.randn(100, 20) # tiny noise
print(f"Rank with default tol: {np.linalg.matrix_rank(eps_X)}")
print(f"Rank with loose tol: {np.linalg.matrix_rank(eps_X, tol=1e-8)}") # 5
Rank and linear systems
A linear system Ax = b has:
- Unique solution: rank(A) = n (square, full rank)
- Infinitely many solutions: rank(A) < n (underdetermined)
- No solution: rank([A|b]) > rank(A) (inconsistent)
Part 5 - The Determinant: Volume Scaling Factor
The determinant of a square matrix A, written det(A) or |A|, measures the volume scaling factor of the linear transformation A.
Concretely: if you take the unit square (or unit cube in 3D), apply the transformation A, the resulting parallelogram (or parallelepiped) has area (volume) equal to |det(A)| times the original.
Key properties
| Condition | Meaning |
|---|---|
| det(A) > 0 | Transformation preserves orientation |
| det(A) < 0 | Transformation flips orientation |
| det(A) = 0 | Transformation is singular (not invertible) - volume → 0 |
| det(A) | |
| det(A) | |
| det(A) |
Determinant and invertibility
A matrix is invertible ↔ det(A) ≠ 0 ↔ full rank ↔ no zero eigenvalues.
These are all equivalent conditions for invertibility.
ML relevance: change of variables
Determinants appear in:
- Normalizing flows: generative models that learn invertible transformations. The log-determinant of the Jacobian appears in the change-of-variables formula.
- Multivariate Gaussians: the PDF includes
1/√det(Σ)where Σ is the covariance matrix. - LU decomposition: solving linear systems involves the determinant implicitly.
import numpy as np
# 2×2 determinant: ad - bc
A = np.array([[3, 1],
[2, 4]])
det_A = np.linalg.det(A)
print(f"det(A) = {det_A:.1f}") # 3*4 - 1*2 = 10
# Singular matrix: det = 0
singular = np.array([[1, 2],
[2, 4]]) # Row 2 = 2 × Row 1
print(f"det(singular) = {np.linalg.det(singular):.6f}") # ≈ 0
# Rotation matrix: det = 1 (preserves area)
theta = np.pi / 4 # 45 degrees
rot = np.array([[np.cos(theta), -np.sin(theta)],
[np.sin(theta), np.cos(theta)]])
print(f"det(rotation matrix) = {np.linalg.det(rot):.6f}") # 1.0
# Product rule: det(AB) = det(A) × det(B)
B = np.random.randn(3, 3)
A = np.random.randn(3, 3)
print(f"det(AB) == det(A)*det(B): "
f"{np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))}")
# Log-determinant: use for numerical stability with very large/small matrices
# det can overflow/underflow for large matrices
large_matrix = np.random.randn(100, 100)
sign, log_det = np.linalg.slogdet(large_matrix) # More stable than np.linalg.det
print(f"Sign: {sign}, Log|det|: {log_det:.2f}")
:::tip Use slogdet instead of det for large matrices
For matrices larger than ~20×20, the determinant can overflow (become Inf) or underflow (become 0) in floating-point arithmetic. np.linalg.slogdet returns the sign and log of the absolute determinant, which is much more numerically stable and should always be preferred for large matrices (e.g., computing log-likelihoods of multivariate Gaussians).
:::
Part 6 - NumPy: Complete Matrix Operation Reference
import numpy as np
# ── Matrix creation ────────────────────────────────────────────────────────
A = np.array([[1, 2], [3, 4]], dtype=np.float64)
identity = np.eye(3) # 3×3 identity
zeros = np.zeros((3, 4)) # 3×4 matrix of zeros
random = np.random.randn(5, 5) # 5×5 standard normal
# ── Basic operations ───────────────────────────────────────────────────────
B = np.array([[5, 6], [7, 8]], dtype=np.float64)
# Matrix multiplication (NOT element-wise!)
C = A @ B # Preferred syntax
C = np.matmul(A, B) # Equivalent
C = np.dot(A, B) # Also works but @ is clearer
# Element-wise operations (different from matrix multiply!)
elem_product = A * B # Hadamard product - each element multiplied
elem_add = A + B
elem_scalar = 3.0 * A
# ── Transpose ─────────────────────────────────────────────────────────────
AT = A.T # 2×2 → 2×2 (square in this case)
AT = A.transpose() # Same
AT = np.transpose(A) # Same
# For batch operations (e.g., in attention): transpose specific axes
batch = np.random.randn(8, 12, 64, 128) # (batch, heads, d, seq)
transposed = batch.transpose(0, 1, 3, 2) # (batch, heads, seq, d)
# ── Linear system solving ──────────────────────────────────────────────────
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
# CORRECT: use solve
x = np.linalg.solve(A, b)
print(f"Solution: {x}") # [2. 3.] → verify: 3*2 + 1*3 = 9 ✓
# Verify solution
print(f"A @ x = {A @ x}") # [9. 8.] ✓
# For rectangular systems (least squares)
X = np.random.randn(100, 10) # 100 equations, 10 unknowns
y = np.random.randn(100)
w, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(f"Least squares solution shape: {w.shape}") # (10,)
print(f"Matrix rank: {rank}")
# ── Matrix properties ─────────────────────────────────────────────────────
A = np.random.randn(5, 5)
print(f"Rank: {np.linalg.matrix_rank(A)}")
print(f"Determinant: {np.linalg.det(A):.4f}")
print(f"Trace: {np.trace(A):.4f}") # Sum of diagonal = sum of eigenvalues
# Condition number: ratio of largest to smallest singular value
cond = np.linalg.cond(A)
print(f"Condition number: {cond:.2f}")
if cond > 1e10:
print("WARNING: matrix is poorly conditioned!")
# ── Frobenius norm ─────────────────────────────────────────────────────────
A = np.random.randn(3, 4)
frob = np.linalg.norm(A, 'fro') # sqrt(sum of squared elements)
frob_manual = np.sqrt(np.sum(A**2)) # same
print(f"Frobenius norm: {frob:.4f}")
# ── Common ML patterns ─────────────────────────────────────────────────────
# Gram matrix (appears in kernel methods, PCA, ridge regression)
n_samples, n_features = 50, 10
X = np.random.randn(n_samples, n_features)
gram = X @ X.T # (n_samples, n_samples) - sample similarity
feature_gram = X.T @ X # (n_features, n_features) - feature covariance
# Covariance matrix
X_centered = X - X.mean(axis=0)
cov = (X_centered.T @ X_centered) / (n_samples - 1)
print(f"Covariance matrix shape: {cov.shape}") # (10, 10)
Part 7 - ML Connections: Matrices in Production Systems
Attention mechanism: full derivation
import numpy as np
def scaled_dot_product_attention(
Q: np.ndarray, # (seq_len, d_k)
K: np.ndarray, # (seq_len, d_k)
V: np.ndarray, # (seq_len, d_v)
mask: np.ndarray = None # Optional: (seq_len, seq_len)
) -> np.ndarray:
"""
Scaled dot-product attention.
Every matrix operation here has a linear algebra meaning:
- Q @ K.T: pairwise dot products between queries and keys
→ "how much does each query align with each key?"
- / np.sqrt(d_k): normalize to prevent softmax saturation
- softmax: convert scores to probabilities (attention weights)
- @ V: weighted sum of values
→ "take a weighted average of values, weighted by relevance"
"""
d_k = Q.shape[-1]
# Step 1: Compute attention scores
scores = Q @ K.T / np.sqrt(d_k)
# Step 2: Apply optional mask
if mask is not None:
scores = np.where(mask, scores, -1e9)
# Step 3: Softmax over the last dimension (key dimension)
scores_max = scores.max(axis=-1, keepdims=True) # for numerical stability
exp_scores = np.exp(scores - scores_max)
attention_weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)
# Step 4: Weighted sum of values
output = attention_weights @ V # (seq_len, d_v)
return output, attention_weights
# Example: 4 tokens, 8-dim keys, 16-dim values
seq_len, d_k, d_v = 4, 8, 16
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_v)
output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Attention output shape: {output.shape}") # (4, 16)
print(f"Attention weights shape: {weights.shape}") # (4, 4)
print(f"Weights sum to 1: {np.allclose(weights.sum(axis=1), 1.0)}") # True
Ridge regression normal equations
import numpy as np
def ridge_regression(X: np.ndarray, y: np.ndarray, alpha: float = 1.0) -> np.ndarray:
"""
Ridge regression: min ‖y - Xw‖₂² + α‖w‖₂²
Normal equation: w* = (XᵀX + αI)⁻¹Xᵀy
We solve this using np.linalg.solve, NOT by computing the inverse.
The αI term adds a positive diagonal to XᵀX:
- Ensures the matrix is invertible even when XᵀX is singular
- This is the mathematical reason ridge regression handles multicollinearity
"""
n, d = X.shape
A = X.T @ X + alpha * np.eye(d) # (d, d)
b = X.T @ y # (d,)
# Solve the linear system - do NOT use np.linalg.inv!
w = np.linalg.solve(A, b)
return w
# Generate data
np.random.seed(42)
n, d = 100, 20
X = np.random.randn(n, d)
true_w = np.random.randn(d)
y = X @ true_w + 0.1 * np.random.randn(n)
w = ridge_regression(X, y, alpha=0.1)
print(f"Coefficient error: {np.linalg.norm(w - true_w):.4f}")
Part 8 - Common Failure Modes
:::danger Shape mismatches are the #1 NumPy error
Matrix multiplication requires matching inner dimensions: (m×n) @ (n×p).
import numpy as np
A = np.array([[1, 2, 3], # (2, 3)
[4, 5, 6]])
B = np.array([[7, 8], # (2, 2) - WRONG for A @ B
[9, 10]])
try:
C = A @ B # ValueError: matmul operands have incompatible shapes
except ValueError as e:
print(f"Error: {e}")
# Fix: B must have shape (3, ?) for A @ B
B_correct = np.array([[1, 2], # (3, 2) - inner dims match: 3 == 3
[3, 4],
[5, 6]])
C = A @ B_correct # (2, 2) - valid!
# Common mistake: confusing (n,) and (n,1)
v1 = np.array([1, 2, 3]) # shape: (3,) - 1D array
v2 = np.array([[1], [2], [3]]) # shape: (3, 1) - 2D column vector
# Be explicit about shapes in production code
:::
:::danger Never invert a matrix to solve a linear system
# WRONG - numerically unstable and slower
w = np.linalg.inv(A) @ b
# RIGHT - uses LU decomposition, more stable
w = np.linalg.solve(A, b)
# For least squares with rectangular/rank-deficient matrices:
w = np.linalg.lstsq(A, b, rcond=None)[0]
The reason: computing inv(A) explicitly accumulates floating-point errors. np.linalg.solve uses LU decomposition to solve the system in fewer numerical operations. For an ill-conditioned matrix, the difference can be catastrophic.
:::
:::tip Check the condition number before solving
cond = np.linalg.cond(A)
if cond > 1e10:
print(f"WARNING: condition number {cond:.1e} - results may be inaccurate")
print("Consider using regularization (ridge) or pseudoinverse")
A condition number > 1e10 means approximately 10 digits of precision are lost. If you are working in float64 (16 significant digits), you are down to 6 digits of accuracy. :::
Interview Questions
Q1: What does matrix multiplication mean geometrically?
Matrix multiplication composes two linear transformations. If A represents "stretch along x-axis" and B represents "rotate 90°", then AB represents "rotate 90°, then stretch along x-axis" - applied to any input vector.
This is the geometric foundation of why neural networks can approximate complex functions by composing simple linear transformations (weight matrices) with nonlinearities. Each layer applies a linear transformation; the composition of many such transformations can approximate any continuous function (universal approximation theorem).
Key property: matrix multiplication is associative (AB)C = A(BC) - important for understanding how forward pass computations can be reordered. It is NOT commutative: AB ≠ BA in general.
Q2: Why does the attention mechanism use QKᵀ and not a direct element-wise product?
Q (queries) has shape (seq_len, d_k). K (keys) has shape (seq_len, d_k). We want to compute the dot product between every pair of query and key vectors - all (seq_len)² pairs simultaneously.
- Element-wise product Q·K: gives (seq_len, d_k), only computes pair (i,i) interactions
Q @ K.T: gives (seq_len, seq_len), computes ALL pairs (i,j) simultaneously
The (i,j) element of Q @ K.T is Qᵢ · Kⱼ - the dot product between query i and key j. This is exactly the attention score: "how much should position i attend to position j?"
The scaling by 1/√d_k prevents the dot products from growing too large in magnitude as d_k increases (each element contributes O(1), but d_k elements sum to O(√d_k) for random vectors), which would push softmax into regions of very small gradients.
Q3: What is the rank of a matrix and what does it mean for an ML model?
The rank is the number of linearly independent columns (or rows) - the intrinsic dimensionality of the transformation.
For a neural network weight matrix W (m×n):
- rank(W) = min(m,n): full rank - W can transform inputs to any output direction
- rank(W) < min(m,n): rank-deficient - W collapses some input dimensions, losing information
This matters for:
- LoRA fine-tuning: Hypothesis that fine-tuning updates have low intrinsic rank → parameterize updates as BA (rank-r product) instead of full matrix
- Dimensionality reduction: If data matrix X has rank r ≪ d, data lives on an r-dimensional subspace → PCA compresses to r dimensions losslessly
- Overfitting detection: A rank-deficient feature matrix (collinear features) causes the normal equations to be singular - regularization (ridge) is needed
Q4: Why should you never use np.linalg.inv to solve a linear system?
Two reasons:
-
Numerical instability: Explicit matrix inversion amplifies floating-point errors, especially for ill-conditioned matrices. The condition number κ(A) measures amplification: relative error ≈ κ(A) × machine_epsilon.
np.linalg.solveuses LU decomposition which is more numerically stable. -
Computational cost: Solving Ax = b via x = A⁻¹b requires computing the inverse (O(n³)) and then multiplying (O(n²)).
np.linalg.solvealso costs O(n³) but with a smaller constant - LAPACK's optimized routine is significantly faster in practice.
Correct alternatives:
- Square system:
np.linalg.solve(A, b) - Least squares (rectangular):
np.linalg.lstsq(A, b, rcond=None) - Multiple RHS:
np.linalg.solve(A, B)where B is a matrix
Practice Challenges
Level 1: Predict
Challenge: Without running code, predict the output shape of each operation:
A = np.random.randn(3, 4)
B = np.random.randn(4, 2)
C = np.random.randn(3, 3)
result1 = A @ B
result2 = A.T @ A
result3 = A @ A.T
result4 = C + C.T
Answer
A @ B: (3,4) @ (4,2) → (3, 2)A.T @ A: (4,3) @ (3,4) → (4, 4) - Gram matrix (feature covariance)A @ A.T: (3,4) @ (4,3) → (3, 3) - sample similarity matrixC + C.T: (3,3) + (3,3) → (3, 3) - always symmetric!
Note: A.T @ A is the d×d feature Gram matrix; A @ A.T is the n×n sample Gram matrix. PCA uses the feature Gram matrix; kernel methods use the sample Gram matrix.
Level 2: Debug
Challenge: The following linear regression code is correct but numerically unstable. Identify and fix it:
def linear_regression(X, y):
return np.linalg.inv(X.T @ X) @ X.T @ y
Answer
Issue 1: np.linalg.inv is numerically unstable.
Issue 2: No handling of rank-deficient cases (collinear features → singular XᵀX).
def linear_regression_fixed(X, y, regularization=0.0):
if regularization > 0:
n, d = X.shape
X_aug = np.vstack([X, np.sqrt(regularization) * np.eye(d)])
y_aug = np.concatenate([y, np.zeros(d)])
return np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]
else:
return np.linalg.lstsq(X, y, rcond=None)[0]
Level 3: Design
Challenge: Implement multi-head attention from scratch using only NumPy matrix operations.
Answer
import numpy as np
def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
seq_len, d_model = X.shape
d_k = d_model // n_heads
Q = (X @ W_Q).reshape(seq_len, n_heads, d_k).transpose(1, 0, 2)
K = (X @ W_K).reshape(seq_len, n_heads, d_k).transpose(1, 0, 2)
V = (X @ W_V).reshape(seq_len, n_heads, d_k).transpose(1, 0, 2)
# (n_heads, seq_len, seq_len)
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
scores -= scores.max(axis=-1, keepdims=True) # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
attended = (weights @ V).transpose(1, 0, 2).reshape(seq_len, d_model)
return attended @ W_O
# Test
seq_len, d_model, n_heads = 10, 64, 8
X = np.random.randn(seq_len, d_model)
W_Q = np.random.randn(d_model, d_model) * 0.01
W_K = np.random.randn(d_model, d_model) * 0.01
W_V = np.random.randn(d_model, d_model) * 0.01
W_O = np.random.randn(d_model, d_model) * 0.01
output = multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads)
print(f"MHA output shape: {output.shape}") # (10, 64)
Quick Reference Cheatsheet
| Operation | Math | NumPy | Shape | Notes |
|---|---|---|---|---|
| Matrix multiply | AB | A @ B | (m,n)@(n,p)→(m,p) | Composition of transforms |
| Transpose | Aᵀ | A.T | (m,n)→(n,m) | Swap rows/cols |
| Element-wise | A⊙B | A * B | same shape | Hadamard product |
| Linear solve | Ax=b | np.linalg.solve(A,b) | (n,n),(n,)→(n,) | NOT inv! |
| Least squares | min‖Ax-b‖ | np.linalg.lstsq(A,b) | For rectangular A | |
| Pseudoinverse | A⁺ | np.linalg.pinv(A) | (m,n)→(n,m) | |
| Rank | rank(A) | np.linalg.matrix_rank(A) | scalar | |
| Determinant | det(A) | np.linalg.det(A) | scalar | Use slogdet for large A |
| Log-det | log|det(A)| | np.linalg.slogdet(A) | (sign, logdet) | Numerically stable |
| Trace | tr(A) | np.trace(A) | scalar | Sum of eigenvalues |
| Frobenius norm | ‖A‖_F | np.linalg.norm(A,'fro') | scalar | √(sum of squares) |
| Condition number | κ(A) | np.linalg.cond(A) | scalar | Large → ill-conditioned |
Key Takeaways
- Matrix multiplication composes linear transformations - a neural network forward pass is a sequence of composed transformations
- The transpose Aᵀ appears everywhere: in attention (
QKᵀ), in the Gram matrix (XᵀX), in the normal equations ((XᵀX)⁻¹Xᵀy) - A matrix's rank reveals the intrinsic dimensionality of its transformation - rank-deficient matrices lose information
- Never use
np.linalg.invto solve linear systems - usenp.linalg.solve(square) ornp.linalg.lstsq(rectangular/rank-deficient) - The determinant is the volume-scaling factor of the transformation; det = 0 means the transformation is singular (non-invertible)
- Condition numbers above 1e10 signal numerical instability - regularization or pseudoinverses are needed
Next: Eigenvalues and Eigenvectors - How PageRank Ranks the Internet →
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Matrix Transformations in 3D demo on the EngineersOfAI Playground - no code required.
:::
