Skip to main content

Dot Products and Projections - The Math Behind Attention

Reading time: ~25 minutes | Level: Mathematical Foundations → ML Engineering

The transformer's attention mechanism is, at its core, a scaled dot product. The reason attention works - the reason GPT and BERT work - is a geometric property of dot products: they measure alignment. If you understand projections, you understand attention.

When query vector q and key vector k have a large dot product, q is "aligned" with k - the query is looking for exactly what the key is offering. When the dot product is small (near zero), the query and key are nearly orthogonal - the key is irrelevant to the query. The attention weight is just the softmax of these alignment scores.

This lesson builds the full geometric machinery of dot products and projections, then connects it to attention, cosine similarity, and least squares regression - three of the most important computational primitives in modern ML.

What You Will Learn

  • Dot product: algebraic and geometric definitions, and why they measure alignment
  • Orthogonality: when dot product = 0 and why independent directions matter
  • Projection of one vector onto another: the fundamental building block
  • Projection matrices: idempotent, symmetric, and what they represent geometrically
  • Gram-Schmidt: constructing orthonormal bases from arbitrary bases
  • Least squares via projection: the cleanest derivation of the normal equations
  • NumPy: dot products, projections, Gram-Schmidt, projection matrices
  • ML: attention mechanism, cosine similarity, least squares regression

Prerequisites

Part 1 - The Dot Product: Algebraic vs Geometric

The dot product (inner product) of two vectors u, v ∈ ℝⁿ has two equivalent definitions:

Algebraic definition

u · v = u^T v = sum_{i=1}^{n} u_i v_i

It is a sum of pairwise products. Simple, computational, fast.

Geometric definition

u · v = ||u|| * ||v|| * cos(theta)

where θ is the angle between u and v, and ‖u‖ = sqrt(uᵀu) is the Euclidean norm.

Why the geometric definition reveals alignment

The cosine function ranges from -1 to 1:

θcos(θ)Dot productInterpretation
1‖u‖‖v‖Perfectly aligned (same direction)
90°00Orthogonal (perpendicular, independent)
180°-1-‖u‖‖v‖Exactly opposite directions

The dot product measures how much u and v point in the same direction, scaled by their magnitudes.

import numpy as np

def dot_product_demo():
"""Demonstrate algebraic vs geometric definitions of dot product."""

u = np.array([3.0, 4.0]) # ||u|| = 5
v = np.array([1.0, 0.0]) # ||v|| = 1 (unit vector along x-axis)

# Algebraic
dot_algebraic = u @ v # or np.dot(u, v)

# Geometric
norm_u = np.linalg.norm(u)
norm_v = np.linalg.norm(v)
cos_theta = dot_algebraic / (norm_u * norm_v)
theta_deg = np.degrees(np.arccos(np.clip(cos_theta, -1, 1)))

print(f"u = {u}, ||u|| = {norm_u}")
print(f"v = {v}, ||v|| = {norm_v}")
print(f"\nAlgebraic: u·v = {dot_algebraic}")
print(f"Geometric: ||u||·||v||·cos(theta) = {norm_u:.2f} * {norm_v:.2f} * {cos_theta:.4f} = {dot_algebraic}")
print(f"Angle between u and v: {theta_deg:.2f} degrees")

# Key properties
print(f"\nProperties:")
print(f" Commutativity: u·v = v·u: {np.isclose(u@v, v@u)}")
print(f" Distributivity: u·(v+w) = u·v + u·w:")

w = np.array([2.0, 1.0])
print(f" u·(v+w) = {u@(v+w):.4f}")
print(f" u·v + u·w = {u@v + u@w:.4f}")
print(f" Match: {np.isclose(u@(v+w), u@v + u@w)}")

# Dot product as projection (preview)
# u·v = the component of u in the direction of v, times ||v||
u_component_along_v = dot_algebraic / norm_v # scalar projection
print(f"\n Scalar projection of u onto v: {u_component_along_v:.4f}")
# This is the 'shadow' of u cast along the direction of v

dot_product_demo()

Dot product in high dimensions

The geometric interpretation holds in any dimension. In high-dimensional embedding spaces (d = 512, 1024, 4096), dot products between embedding vectors measure alignment - the foundation of semantic similarity, retrieval, and attention.

import numpy as np

def high_dim_alignment():
"""
In high dimensions, random vectors are nearly orthogonal.
This is why random initialization does not cause attention saturation.
"""
np.random.seed(42)
results = {}

for d in [2, 10, 100, 1000]:
u = np.random.randn(d)
v = np.random.randn(d)
# Normalize
u /= np.linalg.norm(u)
v /= np.linalg.norm(v)

cos_sim = u @ v
results[d] = cos_sim

print("Cosine similarity between random unit vectors:")
for d, cos in results.items():
print(f" d={d:5d}: cos(theta) = {cos:.4f} (theta ~= {np.degrees(np.arccos(abs(cos))):.1f} deg)")

# In high dimensions: random unit vectors are nearly orthogonal
# This is the "blessing of dimensionality" for attention
print("\nAs d increases, random vectors become nearly orthogonal.")
print("Attention can distinguish relevant (aligned) from irrelevant (orthogonal) keys.")

high_dim_alignment()

Part 2 - Orthogonality: When the Dot Product Is Zero

Two vectors u and v are orthogonal (perpendicular) if:

u · v = 0

Geometrically: they form a 90° angle. Algebraically: their inner product vanishes.

Why orthogonality matters in ML

Orthogonality = independence in the linear sense. Orthogonal features carry non-overlapping information.

  1. Principal components are orthogonal: PCA produces directions with zero pairwise dot products. Each PC captures information not already in the others.

  2. Attention heads can specialize: if different attention heads learn nearly orthogonal query/key subspaces, they capture different types of relationships in the input.

  3. Regularization via orthogonality: orthogonal weight matrices preserve gradient norms during backpropagation (orthogonal matrices have singular values all equal to 1 → condition number = 1 → perfect gradient flow).

  4. Gram-Schmidt: the process for constructing an orthogonal basis from any set of vectors - the foundation of QR decomposition.

import numpy as np

def orthogonality_demo():
"""Orthogonality properties and verification."""

# Standard basis: perfectly orthogonal
e1 = np.array([1.0, 0.0, 0.0])
e2 = np.array([0.0, 1.0, 0.0])
e3 = np.array([0.0, 0.0, 1.0])

print("Standard basis dot products:")
print(f" e1 · e2 = {e1 @ e2}") # 0
print(f" e1 · e3 = {e1 @ e3}") # 0
print(f" e1 · e1 = {e1 @ e1}") # 1

# Orthonormal matrix: Q^T Q = I
np.random.seed(42)
A = np.random.randn(4, 4)
Q, R = np.linalg.qr(A) # Q is orthogonal (Q^T = Q^{-1})

print(f"\nOrthogonal matrix Q^T @ Q:")
print(np.round(Q.T @ Q, 10)) # Identity

# Key property: orthogonal matrices preserve norms and angles
v = np.random.randn(4)
print(f"\n||v|| = {np.linalg.norm(v):.6f}")
print(f"||Qv|| = {np.linalg.norm(Q @ v):.6f}") # Same! Orthogonal map = isometry

# Orthogonal matrices preserve dot products
u = np.random.randn(4)
print(f"\nu · v = {u @ v:.6f}")
print(f"(Qu) · (Qv) = {(Q@u) @ (Q@v):.6f}") # Same

# Orthogonal initialization: why it helps training
# Gradient propagates through Q unscaled: ||Q^T g|| = ||g||
g = np.random.randn(4) # gradient
print(f"\n||grad|| = {np.linalg.norm(g):.6f}")
print(f"||Q^T grad|| = {np.linalg.norm(Q.T @ g):.6f}") # Preserved

orthogonality_demo()

Orthogonal complement

The orthogonal complement of a subspace S is:

S^perp = {v : v · s = 0 for all s in S}

Every vector x in ℝⁿ can be uniquely decomposed as:

x = x_S + x_{S^perp}

where x_S ∈ S and x_{S^perp} ∈ S^⊥. This is the fundamental decomposition theorem - and projection computes x_S.

Part 3 - Projection of One Vector Onto Another

The scalar projection of u onto v (the component of u in the direction of v):

scalar_proj = (u · v) / ||v|| = ||u|| cos(theta)

The vector projection of u onto v (the actual vector component along v):

proj_v(u) = ((u · v) / ||v||^2) * v = ((u · v) / (v · v)) * v

This is the closest point to u in the span of v.

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

def vector_projection(u: np.ndarray, v: np.ndarray) -> np.ndarray:
"""
Compute the orthogonal projection of u onto v.

Returns the vector component of u in the direction of v.
The residual u - proj_v(u) is orthogonal to v.
"""
v_unit = v / np.linalg.norm(v)
scalar = u @ v_unit # scalar projection
return scalar * v_unit # vector projection

def projection_demo():
"""Demonstrate vector projection with verification."""
u = np.array([4.0, 3.0])
v = np.array([2.0, 0.0])

proj = vector_projection(u, v)
residual = u - proj

print(f"u = {u}")
print(f"v = {v}")
print(f"proj_v(u) = {proj}")
print(f"residual = {residual}")

# Key property: residual is orthogonal to v
print(f"\nresidual · v = {residual @ v:.10f}") # should be 0
print(f"Orthogonality holds: {np.isclose(residual @ v, 0)}")

# Decomposition: u = proj + residual
print(f"\nu = proj + residual: {np.allclose(u, proj + residual)}")

# Scalar projection = dot product with unit vector
v_hat = v / np.linalg.norm(v)
scalar_proj = u @ v_hat
print(f"\nScalar projection: {scalar_proj:.4f}")
print(f"||proj||: {np.linalg.norm(proj):.4f}") # Should match

# Visual
fig, ax = plt.subplots(figsize=(7, 6))
ax.set_aspect('equal')
ax.axhline(0, color='gray', linewidth=0.5)
ax.axvline(0, color='gray', linewidth=0.5)

def draw_arrow(ax, vec, color, label, style='-'):
ax.annotate('', xy=vec, xytext=[0, 0],
arrowprops=dict(arrowstyle='->', color=color, lw=2))
ax.text(vec[0] + 0.1, vec[1] + 0.1, label, color=color, fontsize=12)

draw_arrow(ax, u, '#f59e0b', 'u')
draw_arrow(ax, v, '#6366f1', 'v')
draw_arrow(ax, proj, '#10b981', 'proj_v(u)')
ax.annotate('', xy=u, xytext=proj,
arrowprops=dict(arrowstyle='->', color='#ef4444', lw=1.5,
linestyle='dashed'))
ax.text((proj[0]+u[0])/2 + 0.1, (proj[1]+u[1])/2 + 0.1,
'residual', color='#ef4444', fontsize=11)

# Right-angle mark at proj
ax.plot([proj[0], proj[0]+0.15], [proj[1], proj[1]+0.15], 'k-', linewidth=1)
ax.plot([proj[0]+0.15, proj[0]+0.3], [proj[1]+0.15, proj[1]], 'k-', linewidth=1)

ax.set_xlim(-0.5, 5.5)
ax.set_ylim(-0.5, 4.5)
ax.set_title('Vector Projection: u = proj_v(u) + residual')
plt.tight_layout()
plt.savefig('/tmp/vector_projection.png', dpi=100, bbox_inches='tight')
print("\nSaved: /tmp/vector_projection.png")

projection_demo()

Projection onto a subspace

To project onto a higher-dimensional subspace spanned by columns of A ∈ ℝⁿˣᵏ:

proj_A(b) = A (A^T A)^{-1} A^T b (when A has full column rank)

The residual b - proj_A(b) is orthogonal to every column of A (and to the entire column space of A).

import numpy as np

def project_onto_subspace(A: np.ndarray, b: np.ndarray) -> np.ndarray:
"""
Project vector b onto the column space of A.
Uses the pseudoinverse to handle rank-deficient A.
"""
# A @ A_pinv = P (projection matrix onto col(A))
A_pinv = np.linalg.pinv(A)
return A @ (A_pinv @ b)

# Example: project b onto a 2D plane in 3D space
A = np.array([[1, 0],
[0, 1],
[0, 0]], dtype=float) # span = xy-plane in R^3

b = np.array([3.0, 4.0, 5.0]) # has a z-component

proj = project_onto_subspace(A, b)
residual = b - proj

print(f"b = {b}")
print(f"proj onto xy-plane: {proj}") # [3, 4, 0]
print(f"residual: {residual}") # [0, 0, 5]

# Verify residual is orthogonal to each column of A
print(f"\nresidual . col1 = {residual @ A[:, 0]:.4f}") # 0
print(f"residual . col2 = {residual @ A[:, 1]:.4f}") # 0

Part 4 - Projection Matrices: Idempotent and Symmetric

The projection matrix onto the subspace spanned by a unit vector v is:

P = v v^T / (v^T v) = v v^T (if ||v|| = 1)

More generally, the projection matrix onto the column space of A:

P = A (A^T A)^{-1} A^T (hat matrix in statistics)

Key properties of projection matrices

  1. Idempotent: P² = P (applying the projection twice = applying it once)
  2. Symmetric: Pᵀ = P (projection is a self-adjoint operator)
  3. Eigenvalues are 0 or 1: the subspace being projected onto has eigenvalue 1; its orthogonal complement has eigenvalue 0
  4. trace(P) = rank(P) = dimension of the subspace
import numpy as np

def projection_matrix_properties():
"""Verify the key properties of projection matrices."""

# Projection onto a line (1D subspace)
v = np.array([3.0, 4.0, 0.0])
v_hat = v / np.linalg.norm(v)

P_line = np.outer(v_hat, v_hat) # v v^T (for unit vector)
print("Projection onto line spanned by v:")
print(f"P =\n{P_line.round(4)}")

# Property 1: Idempotent P^2 = P
P_sq = P_line @ P_line
print(f"\nIdempotent (P^2 = P): {np.allclose(P_sq, P_line)}")

# Property 2: Symmetric P^T = P
print(f"Symmetric (P^T = P): {np.allclose(P_line.T, P_line)}")

# Property 3: Eigenvalues are 0 or 1
eigenvalues = np.linalg.eigvalsh(P_line)
print(f"Eigenvalues: {eigenvalues.round(6)}") # [0, 0, 1]
print(f"All eigenvalues in {{0, 1}}: "
f"{np.all(np.isclose(eigenvalues, 0) | np.isclose(eigenvalues, 1))}")

# Property 4: trace = rank = dim of subspace
print(f"trace(P) = {np.trace(P_line):.4f} (should be 1 for 1D subspace)")
print(f"rank(P) = {np.linalg.matrix_rank(P_line)}")

# Projection onto a 2D plane in R^3
A = np.column_stack([
np.array([1.0, 0.0, 0.0]),
np.array([0.0, 1.0, 0.0])
]) # xy-plane

P_plane = A @ np.linalg.pinv(A) # A (A^T A)^{-1} A^T
print("\nProjection onto xy-plane (2D subspace of R^3):")
print(f"P =\n{P_plane.round(4)}")
print(f"Idempotent: {np.allclose(P_plane @ P_plane, P_plane)}")
print(f"Symmetric: {np.allclose(P_plane.T, P_plane)}")
print(f"trace(P) = {np.trace(P_plane):.4f} (should be 2 for 2D subspace)")

# Complementary projection: I - P projects onto the orthogonal complement
P_complement = np.eye(3) - P_plane
print(f"\n(I - P) projects onto z-axis:")
b = np.array([2.0, 3.0, 7.0])
print(f" P @ b = {P_plane @ b}") # [2, 3, 0]
print(f" (I-P) @ b = {P_complement @ b}") # [0, 0, 7]

projection_matrix_properties()

The orthogonal complement projection

If P projects onto S, then I - P projects onto S^⊥:

x = P @ x + (I - P) @ x
\_____/ \_________/
in S in S^perp

Part 5 - Gram-Schmidt: Constructing Orthonormal Bases

Gram-Schmidt is the process for converting any set of linearly independent vectors into an orthonormal set that spans the same subspace.

Algorithm (for vectors v₁, v₂, ..., vₖ):

u₁ = v₁
u₂ = v₂ - proj_{u₁}(v₂) (subtract component along u₁)
u₃ = v₃ - proj_{u₁}(v₃) - proj_{u₂}(v₃) (subtract components along u₁ and u₂)
...
uᵢ = vᵢ - sum_{j=1}^{i-1} proj_{uⱼ}(vᵢ)

Then normalize: eᵢ = uᵢ / ||uᵢ||
import numpy as np

def gram_schmidt(V: np.ndarray) -> np.ndarray:
"""
Gram-Schmidt orthogonalization.

Args:
V: matrix whose columns are the input vectors (n, k)

Returns:
Q: matrix whose columns are orthonormal (n, k)

Note: numerically unstable for nearly-dependent vectors.
For production use, prefer np.linalg.qr (modified Gram-Schmidt internally).
"""
n, k = V.shape
Q = np.zeros((n, k))

for i in range(k):
# Start with the i-th input vector
q = V[:, i].copy()

# Subtract projections onto all previous orthonormal vectors
for j in range(i):
q -= (Q[:, j] @ V[:, i]) * Q[:, j]
# Equivalent: q -= np.dot(q, Q[:, j]) * Q[:, j]
# (But we use the original V[:, i] to avoid numerical drift)

# Normalize
norm = np.linalg.norm(q)
if norm < 1e-12:
raise ValueError(f"Vector {i} is linearly dependent on previous vectors")
Q[:, i] = q / norm

return Q

# Test: orthogonalize three vectors in R^4
np.random.seed(42)
V = np.random.randn(4, 3)

Q = gram_schmidt(V)
print("Gram-Schmidt result:")
print(f"Q shape: {Q.shape}")
print(f"Orthonormal: Q^T Q =\n{np.round(Q.T @ Q, 10)}") # Identity

# Verify: same column space as V
# (each column of V should be in the column space of Q)
for j in range(V.shape[1]):
v_j = V[:, j]
proj = Q @ Q.T @ v_j
residual = np.linalg.norm(v_j - proj)
print(f"Column {j} reconstruction residual: {residual:.2e}") # near 0

# Compare to np.linalg.qr (production Gram-Schmidt)
Q_np, R_np = np.linalg.qr(V)
print(f"\nnp.linalg.qr orthonormal: {np.allclose(Q_np.T @ Q_np, np.eye(3))}")

# Column spaces should be identical (up to sign)
# Check: columns of Q span the same space as columns of Q_np
for j in range(3):
q_mine = Q[:, j]
q_np = Q_np[:, j]
# One should be a scalar multiple of the other (with possible sign flip)
cos_sim = abs(q_mine @ q_np)
print(f" Column {j} cos similarity: {cos_sim:.8f}") # should be 1.0

Modified Gram-Schmidt (numerically stable version)

The classical Gram-Schmidt above is numerically unstable when vectors are nearly dependent. Modified Gram-Schmidt avoids the issue by re-orthogonalizing against the already-computed (updated) vectors:

import numpy as np

def modified_gram_schmidt(V: np.ndarray) -> np.ndarray:
"""
Modified Gram-Schmidt orthogonalization.
Numerically more stable than classical Gram-Schmidt.
"""
n, k = V.shape
Q = V.copy().astype(float)

for i in range(k):
# Normalize the i-th column
norm = np.linalg.norm(Q[:, i])
if norm < 1e-12:
raise ValueError(f"Vector {i} is linearly dependent")
Q[:, i] /= norm

# Remove component along Q[:, i] from all subsequent columns
for j in range(i + 1, k):
Q[:, j] -= (Q[:, i] @ Q[:, j]) * Q[:, i]

return Q

# Show numerical stability difference
# Create nearly-dependent vectors (bad case for classical GS)
np.random.seed(42)
eps = 1e-8
V_bad = np.array([[1, 1, 1],
[eps, eps, 0],
[0, eps, eps]], dtype=float)

Q_classical = gram_schmidt(V_bad)
Q_modified = modified_gram_schmidt(V_bad)

print("Classical GS orthogonality error:",
np.max(np.abs(Q_classical.T @ Q_classical - np.eye(3))))
print("Modified GS orthogonality error:",
np.max(np.abs(Q_modified.T @ Q_modified - np.eye(3))))

QR decomposition

Gram-Schmidt applied to the columns of A ∈ ℝⁿˣᵏ gives the QR decomposition:

A = Q R

where Q ∈ ℝⁿˣᵏ is orthonormal (QᵀQ = I) and R ∈ ℝᵏˣᵏ is upper triangular.

QR decomposition is the backbone of:

  • Solving least squares (QR is more stable than normal equations)
  • Computing eigenvalues (QR iteration algorithm)
  • Householder reflections in numerical linear algebra

Part 6 - Least Squares via Projection

The geometric picture

Given an overdetermined system Ax = b (more equations than unknowns, A ∈ ℝⁿˣᵈ with n > d), there is usually no exact solution. Instead, we want the x that minimizes ‖Ax - b‖².

Geometric insight: the vector Ax lives in the column space of A (as x varies). We want the point in col(A) closest to b - this is exactly the orthogonal projection of b onto col(A).

Ax* = proj_{col(A)}(b) = A (A^T A)^{-1} A^T b

From Ax* = A(AᵀA)⁻¹Aᵀb, multiply both sides by (AᵀA)⁻¹Aᵀ on the left:

x* = (A^T A)^{-1} A^T b

This is the normal equation - the least squares solution.

Why "normal" equations?

The residual r = b - Ax* must be orthogonal (normal) to the column space of A:

A^T r = A^T (b - Ax*) = 0
A^T b = A^T A x*
x* = (A^T A)^{-1} A^T b

The condition Aᵀr = 0 says every column of A is orthogonal to the residual. This is the orthogonality condition for projection.

import numpy as np

def least_squares_via_projection():
"""
Derive and verify the least squares solution via projection.
"""
np.random.seed(42)
n, d = 100, 5 # overdetermined: 100 equations, 5 unknowns

# True parameters
x_true = np.array([2.0, -1.0, 3.0, 0.5, -2.0])
A = np.random.randn(n, d) # design matrix
b = A @ x_true + 0.1 * np.random.randn(n) # noisy observations

# Method 1: Normal equations (closed form)
# x* = (A^T A)^{-1} A^T b
ATA = A.T @ A # (d, d) -- Gram matrix
ATb = A.T @ b # (d,)
x_normal = np.linalg.solve(ATA, ATb) # faster than inv(ATA) @ ATb

# Method 2: Projection matrix approach
# b_proj = A (A^T A)^{-1} A^T b (project b onto col(A))
P = A @ np.linalg.solve(ATA, A.T) # hat matrix (n, n)
b_proj = P @ b
x_proj = np.linalg.solve(ATA, A.T @ b_proj)

# Method 3: np.linalg.lstsq (production: uses SVD, most stable)
x_lstsq, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)

# Method 4: QR decomposition
Q, R = np.linalg.qr(A)
# A = QR -> A^T A x* = A^T b -> R^T Q^T Q R x* = R^T Q^T b -> R x* = Q^T b
x_qr = np.linalg.solve(R, Q.T @ b)

print("Least squares solutions:")
print(f" Normal equations: {x_normal.round(4)}")
print(f" Projection: {x_proj.round(4)}")
print(f" lstsq (SVD): {x_lstsq.round(4)}")
print(f" QR decomposition: {x_qr.round(4)}")
print(f" True x*: {x_true}")

print(f"\nAll methods agree: {np.allclose(x_normal, x_lstsq, atol=1e-8)}")

# Key projection properties
residual = b - A @ x_normal
print(f"\nProjection verification:")
print(f" Residual orthogonal to col(A): "
f"{np.allclose(A.T @ residual, 0, atol=1e-8)}")
print(f" ||residual||^2 (minimum error): {np.sum(residual**2):.4f}")

# Projection matrix properties
print(f"\nHat matrix P properties:")
print(f" Symmetric: {np.allclose(P, P.T)}")
print(f" Idempotent: {np.allclose(P @ P, P)}")
print(f" trace(P) = rank(A) = {np.trace(P):.4f}") # = d = 5

least_squares_via_projection()

Linear regression as projection

import numpy as np

class LinearRegressionProjection:
"""
Linear regression via projection (OLS).
Demonstrates the geometric interpretation of regression.
"""

def __init__(self, fit_intercept: bool = True):
self.fit_intercept = fit_intercept
self.coef_ = None
self.intercept_ = None

def fit(self, X: np.ndarray, y: np.ndarray) -> 'LinearRegressionProjection':
n, d = X.shape

if self.fit_intercept:
# Augment X with a column of ones for the intercept
A = np.column_stack([np.ones(n), X]) # (n, d+1)
else:
A = X

# Least squares via normal equations (or QR for stability)
Q, R = np.linalg.qr(A)
theta = np.linalg.solve(R, Q.T @ y) # (d+1,) or (d,)

if self.fit_intercept:
self.intercept_ = theta[0]
self.coef_ = theta[1:]
else:
self.intercept_ = 0.0
self.coef_ = theta

return self

def predict(self, X: np.ndarray) -> np.ndarray:
return X @ self.coef_ + self.intercept_

def score(self, X: np.ndarray, y: np.ndarray) -> float:
"""R^2 score: fraction of variance explained."""
y_pred = self.predict(X)
ss_res = np.sum((y - y_pred)**2)
ss_tot = np.sum((y - y.mean())**2)
return 1 - ss_res / ss_tot

# Test
np.random.seed(42)
n = 200
X = np.random.randn(n, 3)
beta_true = np.array([1.5, -2.0, 0.8])
y = X @ beta_true + 1.2 + 0.5 * np.random.randn(n) # intercept = 1.2

lr = LinearRegressionProjection(fit_intercept=True)
lr.fit(X, y)

print(f"True coefficients: {beta_true}")
print(f"Estimated: {lr.coef_.round(4)}")
print(f"True intercept: 1.2000")
print(f"Estimated: {lr.intercept_:.4f}")
print(f"R^2 score: {lr.score(X, y):.4f}")

# Compare to numpy lstsq
A_aug = np.column_stack([np.ones(n), X])
theta_np, _, _, _ = np.linalg.lstsq(A_aug, y, rcond=None)
print(f"\nnumpy lstsq intercept: {theta_np[0]:.4f}")
print(f"numpy lstsq coefs: {theta_np[1:].round(4)}")
print(f"Match: {np.allclose([lr.intercept_] + list(lr.coef_), theta_np, atol=1e-8)}")

Part 7 - NumPy: Dot Products, Projections, Gram-Schmidt

Complete reference

import numpy as np

# ── Dot product ──────────────────────────────────────────────────────────────
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

dot1 = u @ v # operator syntax (preferred)
dot2 = np.dot(u, v) # function syntax
dot3 = np.inner(u, v) # same as np.dot for 1D
dot4 = u.T @ v # explicit transpose (same result for 1D)

print(f"u @ v = {dot1}") # 32.0

# Batch dot products
U = np.random.randn(10, 512) # 10 query vectors
V = np.random.randn(20, 512) # 20 key vectors

# Dot products between all pairs (10, 20)
scores = U @ V.T # shape (10, 20)
print(f"\nAttention score matrix shape: {scores.shape}") # (10, 20)

# ── Norms ────────────────────────────────────────────────────────────────────
norm_u = np.linalg.norm(u) # L2 norm (default)
norm_u_sq = u @ u # squared L2 norm
u_hat = u / norm_u # unit vector

print(f"\n||u|| = {norm_u:.4f}")
print(f"u_hat: {u_hat.round(4)}, ||u_hat|| = {np.linalg.norm(u_hat):.6f}")

# ── Cosine similarity ────────────────────────────────────────────────────────
def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
"""Cosine similarity between vectors u and v."""
return float((u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_similarity_matrix(U: np.ndarray, V: np.ndarray) -> np.ndarray:
"""
Pairwise cosine similarity between rows of U and rows of V.
U: (n, d), V: (m, d) -> output: (n, m)
"""
U_norm = U / np.linalg.norm(U, axis=1, keepdims=True)
V_norm = V / np.linalg.norm(V, axis=1, keepdims=True)
return U_norm @ V_norm.T

print(f"\nCosine similarity (u, v): {cosine_similarity(u, v):.4f}")

A = np.random.randn(5, 3)
B = np.random.randn(4, 3)
cos_mat = cosine_similarity_matrix(A, B)
print(f"Cosine similarity matrix: {cos_mat.shape}") # (5, 4)

# ── Projections ──────────────────────────────────────────────────────────────
def proj_vector(u: np.ndarray, v: np.ndarray) -> np.ndarray:
"""Project u onto direction of v."""
v_hat = v / np.linalg.norm(v)
return (u @ v_hat) * v_hat

def proj_subspace(A: np.ndarray, b: np.ndarray) -> np.ndarray:
"""Project b onto column space of A."""
return A @ np.linalg.pinv(A) @ b

def projection_matrix(A: np.ndarray) -> np.ndarray:
"""Compute projection matrix P = A (A^T A)^{-1} A^T."""
A_pinv = np.linalg.pinv(A)
return A @ A_pinv

# ── Gram-Schmidt via QR ───────────────────────────────────────────────────────
V_mat = np.random.randn(5, 3) # 3 vectors in R^5

# Production: use np.linalg.qr
Q, R = np.linalg.qr(V_mat) # Q: (5, 3), R: (3, 3)
print(f"\nQR: Q shape = {Q.shape}, R shape = {R.shape}")
print(f"Q orthonormal: {np.allclose(Q.T @ Q, np.eye(3))}")
print(f"Reconstruction: {np.allclose(V_mat, Q @ R)}")

# ── Least squares ─────────────────────────────────────────────────────────────
n_samples, n_features = 100, 5
A_ls = np.random.randn(n_samples, n_features)
b_ls = np.random.randn(n_samples)

# Method 1: lstsq (recommended for production)
x_lstsq, _, _, _ = np.linalg.lstsq(A_ls, b_ls, rcond=None)

# Method 2: normal equations (less stable)
x_normal = np.linalg.solve(A_ls.T @ A_ls, A_ls.T @ b_ls)

# Method 3: QR
Q_ls, R_ls = np.linalg.qr(A_ls)
x_qr = np.linalg.solve(R_ls, Q_ls.T @ b_ls)

print(f"\nLeast squares methods agree: "
f"{np.allclose(x_lstsq, x_normal) and np.allclose(x_lstsq, x_qr)}")

Part 8 - ML Connections: Attention, Cosine Similarity, Regression

Scaled dot-product attention

The attention mechanism in transformers computes:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

where Q ∈ ℝⁿˣᵈₖ (queries), K ∈ ℝᵐˣᵈₖ (keys), V ∈ ℝᵐˣᵈᵥ (values).

The term QKᵀ is a matrix of dot products - for each query row qᵢ and each key row kⱼ, it computes qᵢ · kⱼ (their alignment score). Softmax converts these scores into weights, and V is the weighted sum of values.

import numpy as np

def scaled_dot_product_attention(
Q: np.ndarray, # (n_q, d_k) -- query vectors
K: np.ndarray, # (n_k, d_k) -- key vectors
V: np.ndarray, # (n_k, d_v) -- value vectors
mask: np.ndarray = None, # (n_q, n_k) -- optional mask (e.g., causal)
) -> tuple:
"""
Scaled dot-product attention (Vaswani et al., 2017).

The scaling by sqrt(d_k) prevents dot products from growing large
in high dimensions, which would push softmax into regions with
near-zero gradients (softmax saturation).

Returns:
output: (n_q, d_v) -- weighted values
weights: (n_q, n_k) -- attention weights (softmax of scores)
"""
d_k = Q.shape[-1]

# Step 1: compute raw alignment scores (dot products)
# scores[i, j] = q_i . k_j (how much query i aligns with key j)
scores = Q @ K.T # (n_q, n_k)

# Step 2: scale by 1/sqrt(d_k)
# Without scaling: in d_k dimensions, ||q||, ||k|| ~ sqrt(d_k)
# so q.k ~ d_k for random unit vectors, pushing softmax to saturation
scores = scores / np.sqrt(d_k)

# Step 3: optional mask (e.g., causal mask for autoregressive models)
if mask is not None:
scores = np.where(mask == 0, -1e9, scores)

# Step 4: softmax over key dimension
# softmax[i, :] is the attention distribution for query i
scores_shifted = scores - scores.max(axis=-1, keepdims=True) # numerical stability
exp_scores = np.exp(scores_shifted)
weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True) # (n_q, n_k)

# Step 5: weighted sum of values
output = weights @ V # (n_q, d_v)

return output, weights

# Example: 4 tokens, d_k = d_v = 8
np.random.seed(42)
n_tokens = 4
d_k = 8
d_v = 8

Q = np.random.randn(n_tokens, d_k)
K = np.random.randn(n_tokens, d_k)
V = np.random.randn(n_tokens, d_v)

output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Q shape: {Q.shape}")
print(f"K shape: {K.shape}")
print(f"V shape: {V.shape}")
print(f"Output shape: {output.shape}") # (4, 8)
print(f"Attention weights shape: {weights.shape}") # (4, 4)
print(f"\nAttention weights (each row sums to 1):")
print(weights.round(3))
print(f"Row sums: {weights.sum(axis=1).round(6)}") # all 1.0

# Causal mask: token i can only attend to tokens j <= i
causal_mask = np.tril(np.ones((n_tokens, n_tokens)))
output_causal, weights_causal = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(f"\nCausal attention weights (upper triangle = 0):")
print(weights_causal.round(3))

Why divide by sqrt(d_k)?

import numpy as np

def attention_scaling_analysis():
"""
Show why the sqrt(d_k) scaling is necessary.
"""
np.random.seed(42)

for d_k in [4, 64, 512, 4096]:
# Random unit query and key vectors
q = np.random.randn(d_k)
q /= np.linalg.norm(q)
K = np.random.randn(20, d_k)
K /= np.linalg.norm(K, axis=1, keepdims=True)

raw_scores = K @ q # dot products
scaled_scores = raw_scores / np.sqrt(d_k)

# After softmax, check if distribution is peaky (bad) or spread (good)
softmax_raw = np.exp(raw_scores - raw_scores.max())
softmax_raw /= softmax_raw.sum()

softmax_scaled = np.exp(scaled_scores - scaled_scores.max())
softmax_scaled /= softmax_scaled.sum()

# Entropy: higher = more spread (better for attention)
entropy_raw = -np.sum(softmax_raw * np.log(softmax_raw + 1e-10))
entropy_scaled = -np.sum(softmax_scaled * np.log(softmax_scaled + 1e-10))

print(f"d_k={d_k:5d}: raw score std={raw_scores.std():.2f}, "
f"entropy (raw)={entropy_raw:.3f}, "
f"entropy (scaled)={entropy_scaled:.3f}")

print("\nAs d_k grows, raw dot products have larger std -> softmax peaks on one key")
print("Scaling by 1/sqrt(d_k) keeps std near 1 -> balanced attention")

attention_scaling_analysis()

Cosine similarity for retrieval

import numpy as np

def cosine_retrieval_demo():
"""
Cosine similarity for nearest-neighbor retrieval.
Used in embedding search (vector databases, RAG systems).
"""
np.random.seed(42)
d = 128 # embedding dimension

# Simulated embeddings: 100 documents
doc_embeddings = np.random.randn(100, d)
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True) # normalize

# Query embedding
query = np.random.randn(d)
query /= np.linalg.norm(query)

# Cosine similarity: for normalized vectors, dot product = cosine similarity
similarities = doc_embeddings @ query # (100,) -- cosine similarities

# Top-5 most similar documents
top_k = 5
top_indices = np.argsort(similarities)[::-1][:top_k]
top_scores = similarities[top_indices]

print(f"Query embedding: {query[:5].round(4)}...")
print(f"\nTop-{top_k} most similar documents:")
for i, (idx, score) in enumerate(zip(top_indices, top_scores)):
print(f" Rank {i+1}: doc {idx:3d}, cosine similarity = {score:.4f}")

# Tip: for large-scale retrieval, use FAISS (Facebook AI Similarity Search)
# FAISS computes dot products between normalized vectors = cosine similarity
# at scale (millions of vectors) using optimized BLAS routines

cosine_retrieval_demo()

Interview Questions

Q1: Why does scaled dot-product attention divide by sqrt(d_k)?

Short answer: To prevent softmax saturation in high dimensions.

Detailed derivation:

Assume query q and key k are independent random vectors with components sampled from a distribution with mean 0 and variance 1 (as produced by standard initialization).

The dot product q · k = Σᵢ qᵢkᵢ has:

  • Mean: E[qᵢkᵢ] = E[qᵢ]E[kᵢ] = 0 (independent, zero-mean)
  • Variance: Var(Σᵢ qᵢkᵢ) = Σᵢ Var(qᵢkᵢ) = Σᵢ E[qᵢ²]E[kᵢ²] = d_k

So the standard deviation of q · k grows as sqrt(d_k).

When d_k is large (e.g., 512), dot products have large magnitude. The softmax function applied to large-magnitude inputs becomes very "peaky" - it approaches a one-hot vector, concentrating all weight on the maximum score.

This causes two problems:

  1. Vanishing gradients: the gradient of softmax is small when the output is near one-hot
  2. Loss of information: attention effectively ignores all but one key

Fix: divide by sqrt(d_k) to normalize the scores. After scaling, q · k / sqrt(d_k) has variance 1 regardless of d_k - the softmax receives inputs with consistent scale.

Interview tip: "We divide by sqrt(d_k) because the dot product q·k has standard deviation proportional to sqrt(d_k). Without scaling, large d_k causes softmax saturation and vanishing gradients. Dividing by sqrt(d_k) keeps the variance at 1, maintaining a well-spread attention distribution."

Q2: Derive the least squares normal equations from the projection interpretation.

Setup: minimize ‖Ax - b‖² over x, where A ∈ ℝⁿˣᵈ (n > d), b ∈ ℝⁿ.

Geometric insight: Ax lives in the column space of A (col(A)). We want the point Ax* ∈ col(A) closest to b. This is the orthogonal projection of b onto col(A).

Derivation:

The optimal solution satisfies: the residual r = b - Ax* must be orthogonal to every vector in col(A).

Column space of A = span{a₁, a₂, ..., aₐ} (columns of A).

Orthogonality condition: for every column aⱼ of A:

a_j^T (b - Ax*) = 0 for j = 1, ..., d

Collecting all j into matrix form:

A^T (b - Ax*) = 0
A^T b = A^T A x*

Assuming A has full column rank (AᵀA is invertible):

x* = (A^T A)^{-1} A^T b

These are the normal equations. The name "normal" comes from the geometric fact that the residual b - Ax* is normal (perpendicular) to col(A).

Verification: plugging x* back:

A^T (b - Ax*) = A^T b - A^T A (A^T A)^{-1} A^T b = A^T b - A^T b = 0 ✓

Why AᵀA is invertible (when A has full column rank):

If Ax = 0 → xᵀAᵀAx = 0 → ‖Ax‖² = 0 → Ax = 0 → x = 0 (full column rank).

So AᵀA is positive definite, hence invertible.

Q3: What is an orthogonal projection matrix and what are its properties?

An orthogonal projection matrix P ∈ ℝⁿˣⁿ is a matrix that projects every vector in ℝⁿ onto some subspace S, such that the residual is orthogonal to S.

For the subspace col(A) (column space of A with full column rank):

P = A (A^T A)^{-1} A^T

For a single unit vector v (1D subspace):

P = v v^T

Properties:

  1. Idempotent: P² = P

    • Proof: P² = A(AᵀA)⁻¹AᵀA(AᵀA)⁻¹Aᵀ = A(AᵀA)⁻¹Aᵀ = P
    • Geometric: projecting twice = projecting once (already in the subspace)
  2. Symmetric: Pᵀ = P

    • Proof: [A(AᵀA)⁻¹Aᵀ]ᵀ = A[(AᵀA)⁻¹]ᵀAᵀ = A(AᵀA)⁻¹Aᵀ = P
    • (since AᵀA is symmetric, its inverse is symmetric)
  3. Eigenvalues are 0 or 1:

    • For v ∈ col(A): Pv = v (eigenvalue 1)
    • For v ∈ col(A)^⊥: Pv = 0 (eigenvalue 0)
    • So P has exactly rank(P) eigenvalues equal to 1
  4. trace(P) = rank(P) = dim(S):

    • trace = sum of eigenvalues = number of 1-eigenvalues = dimension of subspace
  5. Complementary projection: I - P projects onto S^⊥

    • (I - P)² = I - P (idempotent), (I - P)ᵀ = I - P (symmetric)
    • x = Px + (I-P)x decomposes x into component in S and component in S^⊥

ML use: in linear regression, P = X(XᵀX)⁻¹Xᵀ is the "hat matrix" (since Py = ŷ). Its diagonal entries Pᵢᵢ are "leverage scores" - how much influence observation i has on its own predicted value.

Q4: Why is the Gram-Schmidt process important for QR decomposition?

Gram-Schmidt is the algorithmic core of QR decomposition. Given A ∈ ℝⁿˣᵈ with linearly independent columns, Gram-Schmidt produces Q (orthonormal) and R (upper triangular) such that A = QR.

The connection:

At each step of Gram-Schmidt applied to columns a₁, a₂, ..., aₐ of A:

Step 1: q₁ = a₁ / ||a₁||
Step 2: u₂ = a₂ - (a₂·q₁)q₁; q₂ = u₂ / ||u₂||
Step 3: u₃ = a₃ - (a₃·q₁)q₁ - (a₃·q₂)q₂; q₃ = u₃/||u₃||

The coefficients in the subtraction are exactly the entries of R:

R[1,1] = ||a₁||
R[1,2] = a₂ · q₁ R[2,2] = ||u₂||
R[1,3] = a₃ · q₁ R[2,3] = a₃ · q₂ R[3,3] = ||u₃||

So A = QR is literally the encoding of the Gram-Schmidt process.

Why QR decomposition is important in ML:

  1. Solving least squares: Ax = b → QRx = b → Rx = Qᵀb. Since R is triangular, this is solved by back-substitution - O(d²) instead of the O(d³) inversion of AᵀA.

  2. Numerical stability: the QR-based least squares avoids forming AᵀA (which squares the condition number). np.linalg.lstsq uses QR internally.

  3. QR iteration for eigenvalues: the standard algorithm for computing all eigenvalues of a matrix is repeated QR decomposition (QR iteration / Francis QR algorithm).

  4. Orthogonal initialization: QR decomposition of a random matrix gives an orthogonal matrix Q - used as initialization for RNNs and some weight matrices to preserve gradient norms.

  5. Gram-Schmidt instability: the classical Gram-Schmidt is numerically unstable; Modified Gram-Schmidt and Householder reflections (used in production QR) are the stable variants.

Practice Challenges

Level 1: Predict

Challenge: Without computing, predict the result:

  1. The projection matrix P = vvᵀ/‖v‖² where v = [1, 0, 0]ᵀ. What is P?
  2. If P is a projection matrix, what is the projection matrix onto the orthogonal complement?
  3. u = [3, 4] and v = [4, 3]. Without computing, which is larger: ‖proj_v(u)‖ or ‖proj_u(v)‖?
Answer
  1. P = vvᵀ/‖v‖² where v = [1, 0, 0]ᵀ:
P = [1] [1 0 0] = [1 0 0]
[0] [0 0 0]
[0] [0 0 0]

This projects onto the x-axis. P maps [x, y, z] → [x, 0, 0]. Verified: P² = P, Pᵀ = P.

  1. Orthogonal complement projection: I - P. If P projects onto S, then I - P projects onto S^⊥. For the x-axis projection: I - P maps [x, y, z] → [0, y, z] (the yz-plane).

  2. Which projection is larger? Both have magnitude ‖u‖‖v‖cos(θ)/‖v‖ or ‖u‖‖v‖cos(θ)/‖u‖.

  • ‖proj_v(u)‖ = |u·v| / ‖v‖ = |(3)(4)+(4)(3)| / 5 = 24/5 = 4.8
  • ‖proj_u(v)‖ = |u·v| / ‖u‖ = 24/5 = 4.8

They are equal! Both equal |u·v| / ‖v‖ = |u·v| / ‖u‖ only when ‖u‖ = ‖v‖. Here ‖u‖ = ‖v‖ = 5, so indeed equal.

import numpy as np
u, v = np.array([3.0, 4.0]), np.array([4.0, 3.0])
print(np.linalg.norm(u), np.linalg.norm(v)) # 5.0, 5.0
proj_vu = (u @ v / (v @ v)) * v
proj_uv = (u @ v / (u @ u)) * u
print(np.linalg.norm(proj_vu), np.linalg.norm(proj_uv)) # 4.8, 4.8

Level 2: Debug

Challenge: The following attention implementation has a bug that causes incorrect results. Find and fix it:

import numpy as np

def buggy_attention(Q, K, V):
d_k = Q.shape[-1]
# Bug somewhere in here
scores = Q @ K / np.sqrt(d_k)
weights = np.exp(scores) / np.exp(scores).sum()
return weights @ V
Answer

Three bugs:

Bug 1: Q @ K should be Q @ K.T. Q has shape (n_q, d_k), K has shape (n_k, d_k). The dot product between each query and each key requires K to be transposed.

Bug 2: np.exp(scores).sum() sums over all elements, not along the key dimension. Each query should have its own attention distribution. Should be .sum(axis=-1, keepdims=True).

Bug 3: No numerical stability trick. np.exp(scores) can overflow for large scores. Should subtract the max per row before exponentiating.

import numpy as np

def fixed_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
"""
Fixed scaled dot-product attention.
Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
Returns: (n_q, d_v)
"""
d_k = Q.shape[-1]

# Fix 1: transpose K
scores = Q @ K.T / np.sqrt(d_k) # (n_q, n_k)

# Fix 3: subtract max for numerical stability
scores -= scores.max(axis=-1, keepdims=True)

exp_scores = np.exp(scores)

# Fix 2: normalize along key dimension
weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True) # (n_q, n_k)

return weights @ V # (n_q, d_v)

# Verify
np.random.seed(42)
Q = np.random.randn(3, 8)
K = np.random.randn(5, 8)
V = np.random.randn(5, 4)

output = fixed_attention(Q, K, V)
print(f"Output shape: {output.shape}") # (3, 4)

# Check: each row of weights sums to 1
d_k = 8
scores = Q @ K.T / np.sqrt(d_k)
scores -= scores.max(axis=-1, keepdims=True)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
print(f"Attention weight row sums: {weights.sum(axis=1).round(6)}") # all 1.0

Level 3: Design

Challenge: Implement multi-head attention from scratch using only NumPy. Each head independently computes scaled dot-product attention in a lower-dimensional subspace. Verify that the concatenated output has the correct shape and that each head produces a valid attention distribution.

Answer
import numpy as np

class MultiHeadAttention:
"""
Multi-Head Attention (Vaswani et al., 2017) implemented in NumPy.

Each head projects Q, K, V into d_k, d_k, d_v dimensional subspaces,
computes attention independently, then concatenates.

Total computation: h heads x attention(d_k) -- each head sees a
different projection of the input, enabling specialization.
"""

def __init__(self, d_model: int, n_heads: int, seed: int = 42):
assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
self.d_model = d_model
self.n_heads = n_heads
self.d_k = d_model // n_heads # dimension per head
self.d_v = d_model // n_heads

rng = np.random.default_rng(seed)

# Weight matrices for each head
# W_Q, W_K, W_V: (n_heads, d_model, d_k) -- one matrix per head
self.W_Q = rng.standard_normal((n_heads, d_model, self.d_k)) * 0.02
self.W_K = rng.standard_normal((n_heads, d_model, self.d_k)) * 0.02
self.W_V = rng.standard_normal((n_heads, d_model, self.d_v)) * 0.02

# Output projection: concatenated heads -> d_model
self.W_O = rng.standard_normal((n_heads * self.d_v, d_model)) * 0.02

def _attention(self, Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> tuple:
"""
Single head attention.
Q: (seq_len, d_k), K: (seq_len, d_k), V: (seq_len, d_v)
"""
d_k = Q.shape[-1]
scores = Q @ K.T / np.sqrt(d_k) # (seq_len, seq_len)
scores -= scores.max(axis=-1, keepdims=True) # stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True) # (seq_len, seq_len)
return weights @ V, weights # (seq_len, d_v), (seq_len, seq_len)

def forward(self, X: np.ndarray) -> tuple:
"""
X: (seq_len, d_model) -- input sequence
Returns: output (seq_len, d_model), attention_weights (n_heads, seq_len, seq_len)
"""
seq_len, d_model = X.shape
head_outputs = []
all_weights = []

for h in range(self.n_heads):
# Project to head-specific subspaces
Q_h = X @ self.W_Q[h] # (seq_len, d_k)
K_h = X @ self.W_K[h] # (seq_len, d_k)
V_h = X @ self.W_V[h] # (seq_len, d_v)

# Compute attention in this head's subspace
out_h, w_h = self._attention(Q_h, K_h, V_h) # (seq_len, d_v)
head_outputs.append(out_h)
all_weights.append(w_h)

# Concatenate all heads: (seq_len, n_heads * d_v)
concat = np.concatenate(head_outputs, axis=-1)

# Output projection: (seq_len, d_model)
output = concat @ self.W_O

attention_weights = np.stack(all_weights, axis=0) # (n_heads, seq_len, seq_len)
return output, attention_weights

# Test
d_model = 64
n_heads = 8
seq_len = 10

mha = MultiHeadAttention(d_model=d_model, n_heads=n_heads)
X = np.random.randn(seq_len, d_model)
output, weights = mha.forward(X)

print(f"Input shape: {X.shape}") # (10, 64)
print(f"Output shape: {output.shape}") # (10, 64)
print(f"Weights shape: {weights.shape}") # (8, 10, 10)

# Verify each head has valid attention distribution
print(f"\nEach head's attention weights sum to 1 per query:")
for h in range(n_heads):
row_sums = weights[h].sum(axis=-1)
print(f" Head {h}: min={row_sums.min():.6f}, max={row_sums.max():.6f}") # all 1.0

# Each head's d_k
print(f"\nd_k per head: {mha.d_k}") # 8 (= 64 / 8)

Quick Reference Cheatsheet

OperationMathNumPyNotes
Dot productu · v = Σᵢ uᵢvᵢu @ v or np.dot(u, v)Also np.inner(u, v)
Norm‖u‖ = sqrt(u·u)np.linalg.norm(u)Default is L2
Unit vectoru / ‖u‖u / np.linalg.norm(u)For cosine similarity
Cosine similarityu·v / (‖u‖‖v‖)(u@v)/(norm(u)*norm(v))In [-1, 1]
Anglearccos(u·v / ‖u‖‖v‖)np.arccos(np.clip(cos, -1, 1))In [0, π]
Scalar projectionu·v / ‖v‖u @ v_hatComponent of u along v
Vector projection(u·v / ‖v‖²) v(u@v / (v@v)) * vVector along v
Projection matrixA(AᵀA)⁻¹AᵀA @ np.linalg.pinv(A)Onto col(A)
QR decompositionA = QRnp.linalg.qr(A)Q orthonormal, R upper triangular
Least squaresx* = (AᵀA)⁻¹Aᵀbnp.linalg.lstsq(A, b, rcond=None)[0]Use lstsq, not inv
Attention scoresQKᵀ / sqrt(d_k)Q @ K.T / np.sqrt(d_k)Pre-softmax alignment

Key Takeaways

  • The dot product u · v = ‖u‖‖v‖cos(θ) measures alignment - how much two vectors point in the same direction. This geometric interpretation is the foundation of attention.
  • Orthogonal vectors have zero dot product. Orthogonality means independence in the linear sense. Principal components are orthogonal, ensuring they capture non-overlapping information.
  • The vector projection of u onto v is (u·v / ‖v‖²) v - the component of u in the direction of v. The residual u - proj is orthogonal to v.
  • Projection matrices P satisfy P² = P (idempotent) and Pᵀ = P (symmetric). Their eigenvalues are 0 or 1. trace(P) = dim of the subspace projected onto.
  • Gram-Schmidt converts any basis into an orthonormal basis by repeatedly subtracting projections. It is the algorithm underlying QR decomposition.
  • Least squares minimizes ‖Ax - b‖² by projecting b onto col(A). The solution x* = (AᵀA)⁻¹Aᵀb satisfies the orthogonality condition Aᵀ(b - Ax*) = 0.
  • Scaled dot-product attention computes alignment scores QKᵀ / sqrt(d_k). The scaling prevents softmax saturation: without it, dot products grow as sqrt(d_k), making gradients vanish.

Next: Norms and Distance Metrics →

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Dot Products & Projections demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.