Dot Products and Projections - The Math Behind Attention

Reading time: ~25 minutes | Level: Mathematical Foundations → ML Engineering

The transformer's attention mechanism is, at its core, a scaled dot product. The reason attention works - the reason GPT and BERT work - is a geometric property of dot products: they measure alignment. If you understand projections, you understand attention.

When query vector q and key vector k have a large dot product, q is "aligned" with k - the query is looking for exactly what the key is offering. When the dot product is small (near zero), the query and key are nearly orthogonal - the key is irrelevant to the query. The attention weight is just the softmax of these alignment scores.

This lesson builds the full geometric machinery of dot products and projections, then connects it to attention, cosine similarity, and least squares regression - three of the most important computational primitives in modern ML.

What You Will Learn

Dot product: algebraic and geometric definitions, and why they measure alignment
Orthogonality: when dot product = 0 and why independent directions matter
Projection of one vector onto another: the fundamental building block
Projection matrices: idempotent, symmetric, and what they represent geometrically
Gram-Schmidt: constructing orthonormal bases from arbitrary bases
Least squares via projection: the cleanest derivation of the normal equations
NumPy: dot products, projections, Gram-Schmidt, projection matrices
ML: attention mechanism, cosine similarity, least squares regression

Prerequisites

Part 1 - The Dot Product: Algebraic vs Geometric

The dot product (inner product) of two vectors u, v ∈ ℝⁿ has two equivalent definitions:

Algebraic definition

u · v = u^T v = sum_{i=1}^{n} u_i v_i

It is a sum of pairwise products. Simple, computational, fast.

Geometric definition

u · v = ||u|| * ||v|| * cos(theta)

where θ is the angle between u and v, and ‖u‖ = sqrt(uᵀu) is the Euclidean norm.

Why the geometric definition reveals alignment

The cosine function ranges from -1 to 1:

θ	cos(θ)	Dot product	Interpretation
0°	1	‖u‖‖v‖	Perfectly aligned (same direction)
90°	0	0	Orthogonal (perpendicular, independent)
180°	-1	-‖u‖‖v‖	Exactly opposite directions

The dot product measures how much u and v point in the same direction, scaled by their magnitudes.

import numpy as np

def dot_product_demo():
    """Demonstrate algebraic vs geometric definitions of dot product."""

    u = np.array([3.0, 4.0])   # ||u|| = 5
    v = np.array([1.0, 0.0])   # ||v|| = 1 (unit vector along x-axis)

    # Algebraic
    dot_algebraic = u @ v      # or np.dot(u, v)

    # Geometric
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    cos_theta = dot_algebraic / (norm_u * norm_v)
    theta_deg = np.degrees(np.arccos(np.clip(cos_theta, -1, 1)))

    print(f"u = {u}, ||u|| = {norm_u}")
    print(f"v = {v}, ||v|| = {norm_v}")
    print(f"\nAlgebraic: u·v = {dot_algebraic}")
    print(f"Geometric: ||u||·||v||·cos(theta) = {norm_u:.2f} * {norm_v:.2f} * {cos_theta:.4f} = {dot_algebraic}")
    print(f"Angle between u and v: {theta_deg:.2f} degrees")

    # Key properties
    print(f"\nProperties:")
    print(f"  Commutativity: u·v = v·u: {np.isclose(u@v, v@u)}")
    print(f"  Distributivity: u·(v+w) = u·v + u·w:")

    w = np.array([2.0, 1.0])
    print(f"    u·(v+w) = {u@(v+w):.4f}")
    print(f"    u·v + u·w = {u@v + u@w:.4f}")
    print(f"    Match: {np.isclose(u@(v+w), u@v + u@w)}")

    # Dot product as projection (preview)
    # u·v = the component of u in the direction of v, times ||v||
    u_component_along_v = dot_algebraic / norm_v  # scalar projection
    print(f"\n  Scalar projection of u onto v: {u_component_along_v:.4f}")
    # This is the 'shadow' of u cast along the direction of v

dot_product_demo()

Dot product in high dimensions

The geometric interpretation holds in any dimension. In high-dimensional embedding spaces (d = 512, 1024, 4096), dot products between embedding vectors measure alignment - the foundation of semantic similarity, retrieval, and attention.

import numpy as np

def high_dim_alignment():
    """
    In high dimensions, random vectors are nearly orthogonal.
    This is why random initialization does not cause attention saturation.
    """
    np.random.seed(42)
    results = {}

    for d in [2, 10, 100, 1000]:
        u = np.random.randn(d)
        v = np.random.randn(d)
        # Normalize
        u /= np.linalg.norm(u)
        v /= np.linalg.norm(v)

        cos_sim = u @ v
        results[d] = cos_sim

    print("Cosine similarity between random unit vectors:")
    for d, cos in results.items():
        print(f"  d={d:5d}: cos(theta) = {cos:.4f}  (theta ~= {np.degrees(np.arccos(abs(cos))):.1f} deg)")

    # In high dimensions: random unit vectors are nearly orthogonal
    # This is the "blessing of dimensionality" for attention
    print("\nAs d increases, random vectors become nearly orthogonal.")
    print("Attention can distinguish relevant (aligned) from irrelevant (orthogonal) keys.")

high_dim_alignment()

Part 2 - Orthogonality: When the Dot Product Is Zero

Two vectors u and v are orthogonal (perpendicular) if:

u · v = 0

Geometrically: they form a 90° angle. Algebraically: their inner product vanishes.

Why orthogonality matters in ML

Orthogonality = independence in the linear sense. Orthogonal features carry non-overlapping information.

Principal components are orthogonal: PCA produces directions with zero pairwise dot products. Each PC captures information not already in the others.
Attention heads can specialize: if different attention heads learn nearly orthogonal query/key subspaces, they capture different types of relationships in the input.
Regularization via orthogonality: orthogonal weight matrices preserve gradient norms during backpropagation (orthogonal matrices have singular values all equal to 1 → condition number = 1 → perfect gradient flow).
Gram-Schmidt: the process for constructing an orthogonal basis from any set of vectors - the foundation of QR decomposition.

import numpy as np

def orthogonality_demo():
    """Orthogonality properties and verification."""

    # Standard basis: perfectly orthogonal
    e1 = np.array([1.0, 0.0, 0.0])
    e2 = np.array([0.0, 1.0, 0.0])
    e3 = np.array([0.0, 0.0, 1.0])

    print("Standard basis dot products:")
    print(f"  e1 · e2 = {e1 @ e2}")   # 0
    print(f"  e1 · e3 = {e1 @ e3}")   # 0
    print(f"  e1 · e1 = {e1 @ e1}")   # 1

    # Orthonormal matrix: Q^T Q = I
    np.random.seed(42)
    A = np.random.randn(4, 4)
    Q, R = np.linalg.qr(A)   # Q is orthogonal (Q^T = Q^{-1})

    print(f"\nOrthogonal matrix Q^T @ Q:")
    print(np.round(Q.T @ Q, 10))    # Identity

    # Key property: orthogonal matrices preserve norms and angles
    v = np.random.randn(4)
    print(f"\n||v|| = {np.linalg.norm(v):.6f}")
    print(f"||Qv|| = {np.linalg.norm(Q @ v):.6f}")  # Same! Orthogonal map = isometry

    # Orthogonal matrices preserve dot products
    u = np.random.randn(4)
    print(f"\nu · v = {u @ v:.6f}")
    print(f"(Qu) · (Qv) = {(Q@u) @ (Q@v):.6f}")  # Same

    # Orthogonal initialization: why it helps training
    # Gradient propagates through Q unscaled: ||Q^T g|| = ||g||
    g = np.random.randn(4)  # gradient
    print(f"\n||grad||     = {np.linalg.norm(g):.6f}")
    print(f"||Q^T grad|| = {np.linalg.norm(Q.T @ g):.6f}")  # Preserved

orthogonality_demo()

Orthogonal complement

The orthogonal complement of a subspace S is:

S^perp = {v : v · s = 0 for all s in S}

Every vector x in ℝⁿ can be uniquely decomposed as:

x = x_S + x_{S^perp}

where x_S ∈ S and x_{S^perp} ∈ S^⊥. This is the fundamental decomposition theorem - and projection computes x_S.

Part 3 - Projection of One Vector Onto Another

The scalar projection of u onto v (the component of u in the direction of v):

scalar_proj = (u · v) / ||v||  =  ||u|| cos(theta)

The vector projection of u onto v (the actual vector component along v):

proj_v(u) = ((u · v) / ||v||^2) * v  =  ((u · v) / (v · v)) * v

This is the closest point to u in the span of v.

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

def vector_projection(u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """
    Compute the orthogonal projection of u onto v.

    Returns the vector component of u in the direction of v.
    The residual u - proj_v(u) is orthogonal to v.
    """
    v_unit = v / np.linalg.norm(v)
    scalar = u @ v_unit      # scalar projection
    return scalar * v_unit   # vector projection

def projection_demo():
    """Demonstrate vector projection with verification."""
    u = np.array([4.0, 3.0])
    v = np.array([2.0, 0.0])

    proj = vector_projection(u, v)
    residual = u - proj

    print(f"u = {u}")
    print(f"v = {v}")
    print(f"proj_v(u) = {proj}")
    print(f"residual  = {residual}")

    # Key property: residual is orthogonal to v
    print(f"\nresidual · v = {residual @ v:.10f}")  # should be 0
    print(f"Orthogonality holds: {np.isclose(residual @ v, 0)}")

    # Decomposition: u = proj + residual
    print(f"\nu = proj + residual: {np.allclose(u, proj + residual)}")

    # Scalar projection = dot product with unit vector
    v_hat = v / np.linalg.norm(v)
    scalar_proj = u @ v_hat
    print(f"\nScalar projection: {scalar_proj:.4f}")
    print(f"||proj||:           {np.linalg.norm(proj):.4f}")   # Should match

    # Visual
    fig, ax = plt.subplots(figsize=(7, 6))
    ax.set_aspect('equal')
    ax.axhline(0, color='gray', linewidth=0.5)
    ax.axvline(0, color='gray', linewidth=0.5)

    def draw_arrow(ax, vec, color, label, style='-'):
        ax.annotate('', xy=vec, xytext=[0, 0],
                    arrowprops=dict(arrowstyle='->', color=color, lw=2))
        ax.text(vec[0] + 0.1, vec[1] + 0.1, label, color=color, fontsize=12)

    draw_arrow(ax, u, '#f59e0b', 'u')
    draw_arrow(ax, v, '#6366f1', 'v')
    draw_arrow(ax, proj, '#10b981', 'proj_v(u)')
    ax.annotate('', xy=u, xytext=proj,
                arrowprops=dict(arrowstyle='->', color='#ef4444', lw=1.5,
                                linestyle='dashed'))
    ax.text((proj[0]+u[0])/2 + 0.1, (proj[1]+u[1])/2 + 0.1,
            'residual', color='#ef4444', fontsize=11)

    # Right-angle mark at proj
    ax.plot([proj[0], proj[0]+0.15], [proj[1], proj[1]+0.15], 'k-', linewidth=1)
    ax.plot([proj[0]+0.15, proj[0]+0.3], [proj[1]+0.15, proj[1]], 'k-', linewidth=1)

    ax.set_xlim(-0.5, 5.5)
    ax.set_ylim(-0.5, 4.5)
    ax.set_title('Vector Projection: u = proj_v(u) + residual')
    plt.tight_layout()
    plt.savefig('/tmp/vector_projection.png', dpi=100, bbox_inches='tight')
    print("\nSaved: /tmp/vector_projection.png")

projection_demo()

Projection onto a subspace

To project onto a higher-dimensional subspace spanned by columns of A ∈ ℝⁿˣᵏ:

proj_A(b) = A (A^T A)^{-1} A^T b    (when A has full column rank)

The residual b - proj_A(b) is orthogonal to every column of A (and to the entire column space of A).

import numpy as np

def project_onto_subspace(A: np.ndarray, b: np.ndarray) -> np.ndarray:
    """
    Project vector b onto the column space of A.
    Uses the pseudoinverse to handle rank-deficient A.
    """
    # A @ A_pinv = P (projection matrix onto col(A))
    A_pinv = np.linalg.pinv(A)
    return A @ (A_pinv @ b)

# Example: project b onto a 2D plane in 3D space
A = np.array([[1, 0],
              [0, 1],
              [0, 0]], dtype=float)  # span = xy-plane in R^3

b = np.array([3.0, 4.0, 5.0])       # has a z-component

proj = project_onto_subspace(A, b)
residual = b - proj

print(f"b = {b}")
print(f"proj onto xy-plane: {proj}")   # [3, 4, 0]
print(f"residual: {residual}")         # [0, 0, 5]

# Verify residual is orthogonal to each column of A
print(f"\nresidual . col1 = {residual @ A[:, 0]:.4f}")  # 0
print(f"residual . col2 = {residual @ A[:, 1]:.4f}")  # 0

Part 4 - Projection Matrices: Idempotent and Symmetric

The projection matrix onto the subspace spanned by a unit vector v is:

P = v v^T / (v^T v)  =  v v^T        (if ||v|| = 1)

More generally, the projection matrix onto the column space of A:

P = A (A^T A)^{-1} A^T    (hat matrix in statistics)

Key properties of projection matrices

Idempotent: P² = P (applying the projection twice = applying it once)
Symmetric: Pᵀ = P (projection is a self-adjoint operator)
Eigenvalues are 0 or 1: the subspace being projected onto has eigenvalue 1; its orthogonal complement has eigenvalue 0
trace(P) = rank(P) = dimension of the subspace

import numpy as np

def projection_matrix_properties():
    """Verify the key properties of projection matrices."""

    # Projection onto a line (1D subspace)
    v = np.array([3.0, 4.0, 0.0])
    v_hat = v / np.linalg.norm(v)

    P_line = np.outer(v_hat, v_hat)  # v v^T (for unit vector)
    print("Projection onto line spanned by v:")
    print(f"P =\n{P_line.round(4)}")

    # Property 1: Idempotent P^2 = P
    P_sq = P_line @ P_line
    print(f"\nIdempotent (P^2 = P): {np.allclose(P_sq, P_line)}")

    # Property 2: Symmetric P^T = P
    print(f"Symmetric (P^T = P): {np.allclose(P_line.T, P_line)}")

    # Property 3: Eigenvalues are 0 or 1
    eigenvalues = np.linalg.eigvalsh(P_line)
    print(f"Eigenvalues: {eigenvalues.round(6)}")   # [0, 0, 1]
    print(f"All eigenvalues in {{0, 1}}: "
          f"{np.all(np.isclose(eigenvalues, 0) | np.isclose(eigenvalues, 1))}")

    # Property 4: trace = rank = dim of subspace
    print(f"trace(P) = {np.trace(P_line):.4f}  (should be 1 for 1D subspace)")
    print(f"rank(P)  = {np.linalg.matrix_rank(P_line)}")

    # Projection onto a 2D plane in R^3
    A = np.column_stack([
        np.array([1.0, 0.0, 0.0]),
        np.array([0.0, 1.0, 0.0])
    ])  # xy-plane

    P_plane = A @ np.linalg.pinv(A)   # A (A^T A)^{-1} A^T
    print("\nProjection onto xy-plane (2D subspace of R^3):")
    print(f"P =\n{P_plane.round(4)}")
    print(f"Idempotent: {np.allclose(P_plane @ P_plane, P_plane)}")
    print(f"Symmetric:  {np.allclose(P_plane.T, P_plane)}")
    print(f"trace(P) = {np.trace(P_plane):.4f}  (should be 2 for 2D subspace)")

    # Complementary projection: I - P projects onto the orthogonal complement
    P_complement = np.eye(3) - P_plane
    print(f"\n(I - P) projects onto z-axis:")
    b = np.array([2.0, 3.0, 7.0])
    print(f"  P @ b      = {P_plane @ b}")         # [2, 3, 0]
    print(f"  (I-P) @ b  = {P_complement @ b}")    # [0, 0, 7]

projection_matrix_properties()

The orthogonal complement projection

If P projects onto S, then I - P projects onto S^⊥:

x = P @ x + (I - P) @ x
    \_____/   \_________/
    in S      in S^perp

Part 5 - Gram-Schmidt: Constructing Orthonormal Bases

Gram-Schmidt is the process for converting any set of linearly independent vectors into an orthonormal set that spans the same subspace.

Algorithm (for vectors v₁, v₂, ..., vₖ):

u₁ = v₁
u₂ = v₂ - proj_{u₁}(v₂)                     (subtract component along u₁)
u₃ = v₃ - proj_{u₁}(v₃) - proj_{u₂}(v₃)    (subtract components along u₁ and u₂)
...
uᵢ = vᵢ - sum_{j=1}^{i-1} proj_{uⱼ}(vᵢ)

Then normalize: eᵢ = uᵢ / ||uᵢ||

import numpy as np

def gram_schmidt(V: np.ndarray) -> np.ndarray:
    """
    Gram-Schmidt orthogonalization.

    Args:
        V: matrix whose columns are the input vectors (n, k)

    Returns:
        Q: matrix whose columns are orthonormal (n, k)

    Note: numerically unstable for nearly-dependent vectors.
    For production use, prefer np.linalg.qr (modified Gram-Schmidt internally).
    """
    n, k = V.shape
    Q = np.zeros((n, k))

    for i in range(k):
        # Start with the i-th input vector
        q = V[:, i].copy()

        # Subtract projections onto all previous orthonormal vectors
        for j in range(i):
            q -= (Q[:, j] @ V[:, i]) * Q[:, j]
            # Equivalent: q -= np.dot(q, Q[:, j]) * Q[:, j]
            # (But we use the original V[:, i] to avoid numerical drift)

        # Normalize
        norm = np.linalg.norm(q)
        if norm < 1e-12:
            raise ValueError(f"Vector {i} is linearly dependent on previous vectors")
        Q[:, i] = q / norm

    return Q

# Test: orthogonalize three vectors in R^4
np.random.seed(42)
V = np.random.randn(4, 3)

Q = gram_schmidt(V)
print("Gram-Schmidt result:")
print(f"Q shape: {Q.shape}")
print(f"Orthonormal: Q^T Q =\n{np.round(Q.T @ Q, 10)}")   # Identity

# Verify: same column space as V
# (each column of V should be in the column space of Q)
for j in range(V.shape[1]):
    v_j = V[:, j]
    proj = Q @ Q.T @ v_j
    residual = np.linalg.norm(v_j - proj)
    print(f"Column {j} reconstruction residual: {residual:.2e}")  # near 0

# Compare to np.linalg.qr (production Gram-Schmidt)
Q_np, R_np = np.linalg.qr(V)
print(f"\nnp.linalg.qr orthonormal: {np.allclose(Q_np.T @ Q_np, np.eye(3))}")

# Column spaces should be identical (up to sign)
# Check: columns of Q span the same space as columns of Q_np
for j in range(3):
    q_mine = Q[:, j]
    q_np = Q_np[:, j]
    # One should be a scalar multiple of the other (with possible sign flip)
    cos_sim = abs(q_mine @ q_np)
    print(f"  Column {j} cos similarity: {cos_sim:.8f}")  # should be 1.0

Modified Gram-Schmidt (numerically stable version)

The classical Gram-Schmidt above is numerically unstable when vectors are nearly dependent. Modified Gram-Schmidt avoids the issue by re-orthogonalizing against the already-computed (updated) vectors:

import numpy as np

def modified_gram_schmidt(V: np.ndarray) -> np.ndarray:
    """
    Modified Gram-Schmidt orthogonalization.
    Numerically more stable than classical Gram-Schmidt.
    """
    n, k = V.shape
    Q = V.copy().astype(float)

    for i in range(k):
        # Normalize the i-th column
        norm = np.linalg.norm(Q[:, i])
        if norm < 1e-12:
            raise ValueError(f"Vector {i} is linearly dependent")
        Q[:, i] /= norm

        # Remove component along Q[:, i] from all subsequent columns
        for j in range(i + 1, k):
            Q[:, j] -= (Q[:, i] @ Q[:, j]) * Q[:, i]

    return Q

# Show numerical stability difference
# Create nearly-dependent vectors (bad case for classical GS)
np.random.seed(42)
eps = 1e-8
V_bad = np.array([[1, 1, 1],
                   [eps, eps, 0],
                   [0, eps, eps]], dtype=float)

Q_classical = gram_schmidt(V_bad)
Q_modified = modified_gram_schmidt(V_bad)

print("Classical GS orthogonality error:",
      np.max(np.abs(Q_classical.T @ Q_classical - np.eye(3))))
print("Modified  GS orthogonality error:",
      np.max(np.abs(Q_modified.T @ Q_modified - np.eye(3))))

QR decomposition

Gram-Schmidt applied to the columns of A ∈ ℝⁿˣᵏ gives the QR decomposition:

A = Q R

where Q ∈ ℝⁿˣᵏ is orthonormal (QᵀQ = I) and R ∈ ℝᵏˣᵏ is upper triangular.

QR decomposition is the backbone of:

Solving least squares (QR is more stable than normal equations)
Computing eigenvalues (QR iteration algorithm)
Householder reflections in numerical linear algebra

Part 6 - Least Squares via Projection

The geometric picture

Given an overdetermined system Ax = b (more equations than unknowns, A ∈ ℝⁿˣᵈ with n > d), there is usually no exact solution. Instead, we want the x that minimizes ‖Ax - b‖².

Geometric insight: the vector Ax lives in the column space of A (as x varies). We want the point in col(A) closest to b - this is exactly the orthogonal projection of b onto col(A).

Ax* = proj_{col(A)}(b)  =  A (A^T A)^{-1} A^T b

From Ax* = A(AᵀA)⁻¹Aᵀb, multiply both sides by (AᵀA)⁻¹Aᵀ on the left:

x* = (A^T A)^{-1} A^T b

This is the normal equation - the least squares solution.

Why "normal" equations?

The residual r = b - Ax* must be orthogonal (normal) to the column space of A:

A^T r = A^T (b - Ax*) = 0
A^T b = A^T A x*
x* = (A^T A)^{-1} A^T b

The condition Aᵀr = 0 says every column of A is orthogonal to the residual. This is the orthogonality condition for projection.

import numpy as np

def least_squares_via_projection():
    """
    Derive and verify the least squares solution via projection.
    """
    np.random.seed(42)
    n, d = 100, 5  # overdetermined: 100 equations, 5 unknowns

    # True parameters
    x_true = np.array([2.0, -1.0, 3.0, 0.5, -2.0])
    A = np.random.randn(n, d)  # design matrix
    b = A @ x_true + 0.1 * np.random.randn(n)  # noisy observations

    # Method 1: Normal equations (closed form)
    # x* = (A^T A)^{-1} A^T b
    ATA = A.T @ A               # (d, d) -- Gram matrix
    ATb = A.T @ b               # (d,)
    x_normal = np.linalg.solve(ATA, ATb)   # faster than inv(ATA) @ ATb

    # Method 2: Projection matrix approach
    # b_proj = A (A^T A)^{-1} A^T b    (project b onto col(A))
    P = A @ np.linalg.solve(ATA, A.T)   # hat matrix (n, n)
    b_proj = P @ b
    x_proj = np.linalg.solve(ATA, A.T @ b_proj)

    # Method 3: np.linalg.lstsq (production: uses SVD, most stable)
    x_lstsq, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)

    # Method 4: QR decomposition
    Q, R = np.linalg.qr(A)
    # A = QR -> A^T A x* = A^T b -> R^T Q^T Q R x* = R^T Q^T b -> R x* = Q^T b
    x_qr = np.linalg.solve(R, Q.T @ b)

    print("Least squares solutions:")
    print(f"  Normal equations: {x_normal.round(4)}")
    print(f"  Projection:       {x_proj.round(4)}")
    print(f"  lstsq (SVD):      {x_lstsq.round(4)}")
    print(f"  QR decomposition: {x_qr.round(4)}")
    print(f"  True x*:          {x_true}")

    print(f"\nAll methods agree: {np.allclose(x_normal, x_lstsq, atol=1e-8)}")

    # Key projection properties
    residual = b - A @ x_normal
    print(f"\nProjection verification:")
    print(f"  Residual orthogonal to col(A): "
          f"{np.allclose(A.T @ residual, 0, atol=1e-8)}")
    print(f"  ||residual||^2 (minimum error): {np.sum(residual**2):.4f}")

    # Projection matrix properties
    print(f"\nHat matrix P properties:")
    print(f"  Symmetric:   {np.allclose(P, P.T)}")
    print(f"  Idempotent:  {np.allclose(P @ P, P)}")
    print(f"  trace(P) = rank(A) = {np.trace(P):.4f}")   # = d = 5

least_squares_via_projection()

Linear regression as projection

import numpy as np

class LinearRegressionProjection:
    """
    Linear regression via projection (OLS).
    Demonstrates the geometric interpretation of regression.
    """

    def __init__(self, fit_intercept: bool = True):
        self.fit_intercept = fit_intercept
        self.coef_ = None
        self.intercept_ = None

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'LinearRegressionProjection':
        n, d = X.shape

        if self.fit_intercept:
            # Augment X with a column of ones for the intercept
            A = np.column_stack([np.ones(n), X])   # (n, d+1)
        else:
            A = X

        # Least squares via normal equations (or QR for stability)
        Q, R = np.linalg.qr(A)
        theta = np.linalg.solve(R, Q.T @ y)   # (d+1,) or (d,)

        if self.fit_intercept:
            self.intercept_ = theta[0]
            self.coef_ = theta[1:]
        else:
            self.intercept_ = 0.0
            self.coef_ = theta

        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        return X @ self.coef_ + self.intercept_

    def score(self, X: np.ndarray, y: np.ndarray) -> float:
        """R^2 score: fraction of variance explained."""
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred)**2)
        ss_tot = np.sum((y - y.mean())**2)
        return 1 - ss_res / ss_tot

# Test
np.random.seed(42)
n = 200
X = np.random.randn(n, 3)
beta_true = np.array([1.5, -2.0, 0.8])
y = X @ beta_true + 1.2 + 0.5 * np.random.randn(n)  # intercept = 1.2

lr = LinearRegressionProjection(fit_intercept=True)
lr.fit(X, y)

print(f"True coefficients: {beta_true}")
print(f"Estimated:         {lr.coef_.round(4)}")
print(f"True intercept: 1.2000")
print(f"Estimated:      {lr.intercept_:.4f}")
print(f"R^2 score: {lr.score(X, y):.4f}")

# Compare to numpy lstsq
A_aug = np.column_stack([np.ones(n), X])
theta_np, _, _, _ = np.linalg.lstsq(A_aug, y, rcond=None)
print(f"\nnumpy lstsq intercept: {theta_np[0]:.4f}")
print(f"numpy lstsq coefs:     {theta_np[1:].round(4)}")
print(f"Match: {np.allclose([lr.intercept_] + list(lr.coef_), theta_np, atol=1e-8)}")

Part 7 - NumPy: Dot Products, Projections, Gram-Schmidt

Complete reference

import numpy as np

# ── Dot product ──────────────────────────────────────────────────────────────
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

dot1 = u @ v                   # operator syntax (preferred)
dot2 = np.dot(u, v)           # function syntax
dot3 = np.inner(u, v)         # same as np.dot for 1D
dot4 = u.T @ v                # explicit transpose (same result for 1D)

print(f"u @ v = {dot1}")       # 32.0

# Batch dot products
U = np.random.randn(10, 512)  # 10 query vectors
V = np.random.randn(20, 512)  # 20 key vectors

# Dot products between all pairs (10, 20)
scores = U @ V.T               # shape (10, 20)
print(f"\nAttention score matrix shape: {scores.shape}")  # (10, 20)

# ── Norms ────────────────────────────────────────────────────────────────────
norm_u = np.linalg.norm(u)           # L2 norm (default)
norm_u_sq = u @ u                    # squared L2 norm
u_hat = u / norm_u                   # unit vector

print(f"\n||u|| = {norm_u:.4f}")
print(f"u_hat: {u_hat.round(4)}, ||u_hat|| = {np.linalg.norm(u_hat):.6f}")

# ── Cosine similarity ────────────────────────────────────────────────────────
def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between vectors u and v."""
    return float((u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_similarity_matrix(U: np.ndarray, V: np.ndarray) -> np.ndarray:
    """
    Pairwise cosine similarity between rows of U and rows of V.
    U: (n, d), V: (m, d) -> output: (n, m)
    """
    U_norm = U / np.linalg.norm(U, axis=1, keepdims=True)
    V_norm = V / np.linalg.norm(V, axis=1, keepdims=True)
    return U_norm @ V_norm.T

print(f"\nCosine similarity (u, v): {cosine_similarity(u, v):.4f}")

A = np.random.randn(5, 3)
B = np.random.randn(4, 3)
cos_mat = cosine_similarity_matrix(A, B)
print(f"Cosine similarity matrix: {cos_mat.shape}")  # (5, 4)

# ── Projections ──────────────────────────────────────────────────────────────
def proj_vector(u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Project u onto direction of v."""
    v_hat = v / np.linalg.norm(v)
    return (u @ v_hat) * v_hat

def proj_subspace(A: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Project b onto column space of A."""
    return A @ np.linalg.pinv(A) @ b

def projection_matrix(A: np.ndarray) -> np.ndarray:
    """Compute projection matrix P = A (A^T A)^{-1} A^T."""
    A_pinv = np.linalg.pinv(A)
    return A @ A_pinv

# ── Gram-Schmidt via QR ───────────────────────────────────────────────────────
V_mat = np.random.randn(5, 3)       # 3 vectors in R^5

# Production: use np.linalg.qr
Q, R = np.linalg.qr(V_mat)         # Q: (5, 3), R: (3, 3)
print(f"\nQR: Q shape = {Q.shape}, R shape = {R.shape}")
print(f"Q orthonormal: {np.allclose(Q.T @ Q, np.eye(3))}")
print(f"Reconstruction: {np.allclose(V_mat, Q @ R)}")

# ── Least squares ─────────────────────────────────────────────────────────────
n_samples, n_features = 100, 5
A_ls = np.random.randn(n_samples, n_features)
b_ls = np.random.randn(n_samples)

# Method 1: lstsq (recommended for production)
x_lstsq, _, _, _ = np.linalg.lstsq(A_ls, b_ls, rcond=None)

# Method 2: normal equations (less stable)
x_normal = np.linalg.solve(A_ls.T @ A_ls, A_ls.T @ b_ls)

# Method 3: QR
Q_ls, R_ls = np.linalg.qr(A_ls)
x_qr = np.linalg.solve(R_ls, Q_ls.T @ b_ls)

print(f"\nLeast squares methods agree: "
      f"{np.allclose(x_lstsq, x_normal) and np.allclose(x_lstsq, x_qr)}")

Part 8 - ML Connections: Attention, Cosine Similarity, Regression

Scaled dot-product attention

The attention mechanism in transformers computes:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

where Q ∈ ℝⁿˣᵈₖ (queries), K ∈ ℝᵐˣᵈₖ (keys), V ∈ ℝᵐˣᵈᵥ (values).

The term QKᵀ is a matrix of dot products - for each query row qᵢ and each key row kⱼ, it computes qᵢ · kⱼ (their alignment score). Softmax converts these scores into weights, and V is the weighted sum of values.

import numpy as np

def scaled_dot_product_attention(
    Q: np.ndarray,    # (n_q, d_k) -- query vectors
    K: np.ndarray,    # (n_k, d_k) -- key vectors
    V: np.ndarray,    # (n_k, d_v) -- value vectors
    mask: np.ndarray = None,  # (n_q, n_k) -- optional mask (e.g., causal)
) -> tuple:
    """
    Scaled dot-product attention (Vaswani et al., 2017).

    The scaling by sqrt(d_k) prevents dot products from growing large
    in high dimensions, which would push softmax into regions with
    near-zero gradients (softmax saturation).

    Returns:
        output: (n_q, d_v) -- weighted values
        weights: (n_q, n_k) -- attention weights (softmax of scores)
    """
    d_k = Q.shape[-1]

    # Step 1: compute raw alignment scores (dot products)
    # scores[i, j] = q_i . k_j  (how much query i aligns with key j)
    scores = Q @ K.T          # (n_q, n_k)

    # Step 2: scale by 1/sqrt(d_k)
    # Without scaling: in d_k dimensions, ||q||, ||k|| ~ sqrt(d_k)
    # so q.k ~ d_k for random unit vectors, pushing softmax to saturation
    scores = scores / np.sqrt(d_k)

    # Step 3: optional mask (e.g., causal mask for autoregressive models)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)

    # Step 4: softmax over key dimension
    # softmax[i, :] is the attention distribution for query i
    scores_shifted = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    exp_scores = np.exp(scores_shifted)
    weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)  # (n_q, n_k)

    # Step 5: weighted sum of values
    output = weights @ V       # (n_q, d_v)

    return output, weights

# Example: 4 tokens, d_k = d_v = 8
np.random.seed(42)
n_tokens = 4
d_k = 8
d_v = 8

Q = np.random.randn(n_tokens, d_k)
K = np.random.randn(n_tokens, d_k)
V = np.random.randn(n_tokens, d_v)

output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Q shape: {Q.shape}")
print(f"K shape: {K.shape}")
print(f"V shape: {V.shape}")
print(f"Output shape: {output.shape}")        # (4, 8)
print(f"Attention weights shape: {weights.shape}")  # (4, 4)
print(f"\nAttention weights (each row sums to 1):")
print(weights.round(3))
print(f"Row sums: {weights.sum(axis=1).round(6)}")   # all 1.0

# Causal mask: token i can only attend to tokens j <= i
causal_mask = np.tril(np.ones((n_tokens, n_tokens)))
output_causal, weights_causal = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(f"\nCausal attention weights (upper triangle = 0):")
print(weights_causal.round(3))

Why divide by sqrt(d_k)?

import numpy as np

def attention_scaling_analysis():
    """
    Show why the sqrt(d_k) scaling is necessary.
    """
    np.random.seed(42)

    for d_k in [4, 64, 512, 4096]:
        # Random unit query and key vectors
        q = np.random.randn(d_k)
        q /= np.linalg.norm(q)
        K = np.random.randn(20, d_k)
        K /= np.linalg.norm(K, axis=1, keepdims=True)

        raw_scores = K @ q            # dot products
        scaled_scores = raw_scores / np.sqrt(d_k)

        # After softmax, check if distribution is peaky (bad) or spread (good)
        softmax_raw = np.exp(raw_scores - raw_scores.max())
        softmax_raw /= softmax_raw.sum()

        softmax_scaled = np.exp(scaled_scores - scaled_scores.max())
        softmax_scaled /= softmax_scaled.sum()

        # Entropy: higher = more spread (better for attention)
        entropy_raw = -np.sum(softmax_raw * np.log(softmax_raw + 1e-10))
        entropy_scaled = -np.sum(softmax_scaled * np.log(softmax_scaled + 1e-10))

        print(f"d_k={d_k:5d}: raw score std={raw_scores.std():.2f}, "
              f"entropy (raw)={entropy_raw:.3f}, "
              f"entropy (scaled)={entropy_scaled:.3f}")

    print("\nAs d_k grows, raw dot products have larger std -> softmax peaks on one key")
    print("Scaling by 1/sqrt(d_k) keeps std near 1 -> balanced attention")

attention_scaling_analysis()

Cosine similarity for retrieval

import numpy as np

def cosine_retrieval_demo():
    """
    Cosine similarity for nearest-neighbor retrieval.
    Used in embedding search (vector databases, RAG systems).
    """
    np.random.seed(42)
    d = 128    # embedding dimension

    # Simulated embeddings: 100 documents
    doc_embeddings = np.random.randn(100, d)
    doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)  # normalize

    # Query embedding
    query = np.random.randn(d)
    query /= np.linalg.norm(query)

    # Cosine similarity: for normalized vectors, dot product = cosine similarity
    similarities = doc_embeddings @ query    # (100,) -- cosine similarities

    # Top-5 most similar documents
    top_k = 5
    top_indices = np.argsort(similarities)[::-1][:top_k]
    top_scores = similarities[top_indices]

    print(f"Query embedding: {query[:5].round(4)}...")
    print(f"\nTop-{top_k} most similar documents:")
    for i, (idx, score) in enumerate(zip(top_indices, top_scores)):
        print(f"  Rank {i+1}: doc {idx:3d}, cosine similarity = {score:.4f}")

    # Tip: for large-scale retrieval, use FAISS (Facebook AI Similarity Search)
    # FAISS computes dot products between normalized vectors = cosine similarity
    # at scale (millions of vectors) using optimized BLAS routines

cosine_retrieval_demo()

Interview Questions

Q1: Why does scaled dot-product attention divide by sqrt(d_k)?

Short answer: To prevent softmax saturation in high dimensions.

Detailed derivation:

Assume query q and key k are independent random vectors with components sampled from a distribution with mean 0 and variance 1 (as produced by standard initialization).

The dot product q · k = Σᵢ qᵢkᵢ has:

Mean: E[qᵢkᵢ] = E[qᵢ]E[kᵢ] = 0 (independent, zero-mean)
Variance: Var(Σᵢ qᵢkᵢ) = Σᵢ Var(qᵢkᵢ) = Σᵢ E[qᵢ²]E[kᵢ²] = d_k

So the standard deviation of q · k grows as sqrt(d_k).

When d_k is large (e.g., 512), dot products have large magnitude. The softmax function applied to large-magnitude inputs becomes very "peaky" - it approaches a one-hot vector, concentrating all weight on the maximum score.

This causes two problems:

Vanishing gradients: the gradient of softmax is small when the output is near one-hot
Loss of information: attention effectively ignores all but one key

Fix: divide by sqrt(d_k) to normalize the scores. After scaling, q · k / sqrt(d_k) has variance 1 regardless of d_k - the softmax receives inputs with consistent scale.

Interview tip: "We divide by sqrt(d_k) because the dot product q·k has standard deviation proportional to sqrt(d_k). Without scaling, large d_k causes softmax saturation and vanishing gradients. Dividing by sqrt(d_k) keeps the variance at 1, maintaining a well-spread attention distribution."

Q2: Derive the least squares normal equations from the projection interpretation.

Setup: minimize ‖Ax - b‖² over x, where A ∈ ℝⁿˣᵈ (n > d), b ∈ ℝⁿ.

Geometric insight: Ax lives in the column space of A (col(A)). We want the point Ax* ∈ col(A) closest to b. This is the orthogonal projection of b onto col(A).

Derivation:

The optimal solution satisfies: the residual r = b - Ax* must be orthogonal to every vector in col(A).

Column space of A = span{a₁, a₂, ..., aₐ} (columns of A).

Orthogonality condition: for every column aⱼ of A:

a_j^T (b - Ax*) = 0    for j = 1, ..., d

Collecting all j into matrix form:

A^T (b - Ax*) = 0
A^T b = A^T A x*

Assuming A has full column rank (AᵀA is invertible):

x* = (A^T A)^{-1} A^T b

These are the normal equations. The name "normal" comes from the geometric fact that the residual b - Ax* is normal (perpendicular) to col(A).

Verification: plugging x* back:

A^T (b - Ax*) = A^T b - A^T A (A^T A)^{-1} A^T b = A^T b - A^T b = 0 ✓

Why AᵀA is invertible (when A has full column rank):

If Ax = 0 → xᵀAᵀAx = 0 → ‖Ax‖² = 0 → Ax = 0 → x = 0 (full column rank).

So AᵀA is positive definite, hence invertible.

Q3: What is an orthogonal projection matrix and what are its properties?

An orthogonal projection matrix P ∈ ℝⁿˣⁿ is a matrix that projects every vector in ℝⁿ onto some subspace S, such that the residual is orthogonal to S.

For the subspace col(A) (column space of A with full column rank):

P = A (A^T A)^{-1} A^T

For a single unit vector v (1D subspace):

P = v v^T

Properties:

Idempotent: P² = P
- Proof: P² = A(AᵀA)⁻¹AᵀA(AᵀA)⁻¹Aᵀ = A(AᵀA)⁻¹Aᵀ = P
- Geometric: projecting twice = projecting once (already in the subspace)
Symmetric: Pᵀ = P
- Proof: [A(AᵀA)⁻¹Aᵀ]ᵀ = A[(AᵀA)⁻¹]ᵀAᵀ = A(AᵀA)⁻¹Aᵀ = P
- (since AᵀA is symmetric, its inverse is symmetric)
Eigenvalues are 0 or 1:
- For v ∈ col(A): Pv = v (eigenvalue 1)
- For v ∈ col(A)^⊥: Pv = 0 (eigenvalue 0)
- So P has exactly rank(P) eigenvalues equal to 1
trace(P) = rank(P) = dim(S):
- trace = sum of eigenvalues = number of 1-eigenvalues = dimension of subspace
Complementary projection: I - P projects onto S^⊥
- (I - P)² = I - P (idempotent), (I - P)ᵀ = I - P (symmetric)
- x = Px + (I-P)x decomposes x into component in S and component in S^⊥

ML use: in linear regression, P = X(XᵀX)⁻¹Xᵀ is the "hat matrix" (since Py = ŷ). Its diagonal entries Pᵢᵢ are "leverage scores" - how much influence observation i has on its own predicted value.

Q4: Why is the Gram-Schmidt process important for QR decomposition?

Gram-Schmidt is the algorithmic core of QR decomposition. Given A ∈ ℝⁿˣᵈ with linearly independent columns, Gram-Schmidt produces Q (orthonormal) and R (upper triangular) such that A = QR.

The connection:

At each step of Gram-Schmidt applied to columns a₁, a₂, ..., aₐ of A:

Step 1: q₁ = a₁ / ||a₁||
Step 2: u₂ = a₂ - (a₂·q₁)q₁;   q₂ = u₂ / ||u₂||
Step 3: u₃ = a₃ - (a₃·q₁)q₁ - (a₃·q₂)q₂;   q₃ = u₃/||u₃||

The coefficients in the subtraction are exactly the entries of R:

R[1,1] = ||a₁||
R[1,2] = a₂ · q₁    R[2,2] = ||u₂||
R[1,3] = a₃ · q₁    R[2,3] = a₃ · q₂    R[3,3] = ||u₃||

So A = QR is literally the encoding of the Gram-Schmidt process.

Why QR decomposition is important in ML:

Solving least squares: Ax = b → QRx = b → Rx = Qᵀb. Since R is triangular, this is solved by back-substitution - O(d²) instead of the O(d³) inversion of AᵀA.
Numerical stability: the QR-based least squares avoids forming AᵀA (which squares the condition number). np.linalg.lstsq uses QR internally.
QR iteration for eigenvalues: the standard algorithm for computing all eigenvalues of a matrix is repeated QR decomposition (QR iteration / Francis QR algorithm).
Orthogonal initialization: QR decomposition of a random matrix gives an orthogonal matrix Q - used as initialization for RNNs and some weight matrices to preserve gradient norms.
Gram-Schmidt instability: the classical Gram-Schmidt is numerically unstable; Modified Gram-Schmidt and Householder reflections (used in production QR) are the stable variants.

Practice Challenges

Level 1: Predict

Challenge: Without computing, predict the result:

The projection matrix P = vvᵀ/‖v‖² where v = [1, 0, 0]ᵀ. What is P?
If P is a projection matrix, what is the projection matrix onto the orthogonal complement?
u = [3, 4] and v = [4, 3]. Without computing, which is larger: ‖proj_v(u)‖ or ‖proj_u(v)‖?

Answer

P = vvᵀ/‖v‖² where v = [1, 0, 0]ᵀ:

P = [1] [1 0 0] = [1 0 0]
    [0]           [0 0 0]
    [0]           [0 0 0]

This projects onto the x-axis. P maps [x, y, z] → [x, 0, 0]. Verified: P² = P, Pᵀ = P.

Orthogonal complement projection: I - P. If P projects onto S, then I - P projects onto S^⊥. For the x-axis projection: I - P maps [x, y, z] → [0, y, z] (the yz-plane).
Which projection is larger? Both have magnitude ‖u‖‖v‖cos(θ)/‖v‖ or ‖u‖‖v‖cos(θ)/‖u‖.

‖proj_v(u)‖ = |u·v| / ‖v‖ = |(3)(4)+(4)(3)| / 5 = 24/5 = 4.8
‖proj_u(v)‖ = |u·v| / ‖u‖ = 24/5 = 4.8

They are equal! Both equal |u·v| / ‖v‖ = |u·v| / ‖u‖ only when ‖u‖ = ‖v‖. Here ‖u‖ = ‖v‖ = 5, so indeed equal.

import numpy as np
u, v = np.array([3.0, 4.0]), np.array([4.0, 3.0])
print(np.linalg.norm(u), np.linalg.norm(v))  # 5.0, 5.0
proj_vu = (u @ v / (v @ v)) * v
proj_uv = (u @ v / (u @ u)) * u
print(np.linalg.norm(proj_vu), np.linalg.norm(proj_uv))  # 4.8, 4.8

Level 2: Debug

Challenge: The following attention implementation has a bug that causes incorrect results. Find and fix it:

import numpy as np

def buggy_attention(Q, K, V):
    d_k = Q.shape[-1]
    # Bug somewhere in here
    scores = Q @ K / np.sqrt(d_k)
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ V

Answer

Three bugs:

Bug 1: Q @ K should be Q @ K.T. Q has shape (n_q, d_k), K has shape (n_k, d_k). The dot product between each query and each key requires K to be transposed.

Bug 2: np.exp(scores).sum() sums over all elements, not along the key dimension. Each query should have its own attention distribution. Should be .sum(axis=-1, keepdims=True).

Bug 3: No numerical stability trick. np.exp(scores) can overflow for large scores. Should subtract the max per row before exponentiating.

import numpy as np

def fixed_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """
    Fixed scaled dot-product attention.
    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    Returns: (n_q, d_v)
    """
    d_k = Q.shape[-1]

    # Fix 1: transpose K
    scores = Q @ K.T / np.sqrt(d_k)       # (n_q, n_k)

    # Fix 3: subtract max for numerical stability
    scores -= scores.max(axis=-1, keepdims=True)

    exp_scores = np.exp(scores)

    # Fix 2: normalize along key dimension
    weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)   # (n_q, n_k)

    return weights @ V   # (n_q, d_v)

# Verify
np.random.seed(42)
Q = np.random.randn(3, 8)
K = np.random.randn(5, 8)
V = np.random.randn(5, 4)

output = fixed_attention(Q, K, V)
print(f"Output shape: {output.shape}")   # (3, 4)

# Check: each row of weights sums to 1
d_k = 8
scores = Q @ K.T / np.sqrt(d_k)
scores -= scores.max(axis=-1, keepdims=True)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
print(f"Attention weight row sums: {weights.sum(axis=1).round(6)}")  # all 1.0

Level 3: Design

Challenge: Implement multi-head attention from scratch using only NumPy. Each head independently computes scaled dot-product attention in a lower-dimensional subspace. Verify that the concatenated output has the correct shape and that each head produces a valid attention distribution.

Answer

import numpy as np

class MultiHeadAttention:
    """
    Multi-Head Attention (Vaswani et al., 2017) implemented in NumPy.

    Each head projects Q, K, V into d_k, d_k, d_v dimensional subspaces,
    computes attention independently, then concatenates.

    Total computation: h heads x attention(d_k) -- each head sees a
    different projection of the input, enabling specialization.
    """

    def __init__(self, d_model: int, n_heads: int, seed: int = 42):
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads   # dimension per head
        self.d_v = d_model // n_heads

        rng = np.random.default_rng(seed)

        # Weight matrices for each head
        # W_Q, W_K, W_V: (n_heads, d_model, d_k) -- one matrix per head
        self.W_Q = rng.standard_normal((n_heads, d_model, self.d_k)) * 0.02
        self.W_K = rng.standard_normal((n_heads, d_model, self.d_k)) * 0.02
        self.W_V = rng.standard_normal((n_heads, d_model, self.d_v)) * 0.02

        # Output projection: concatenated heads -> d_model
        self.W_O = rng.standard_normal((n_heads * self.d_v, d_model)) * 0.02

    def _attention(self, Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> tuple:
        """
        Single head attention.
        Q: (seq_len, d_k), K: (seq_len, d_k), V: (seq_len, d_v)
        """
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)       # (seq_len, seq_len)
        scores -= scores.max(axis=-1, keepdims=True)  # stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # (seq_len, seq_len)
        return weights @ V, weights            # (seq_len, d_v), (seq_len, seq_len)

    def forward(self, X: np.ndarray) -> tuple:
        """
        X: (seq_len, d_model) -- input sequence
        Returns: output (seq_len, d_model), attention_weights (n_heads, seq_len, seq_len)
        """
        seq_len, d_model = X.shape
        head_outputs = []
        all_weights = []

        for h in range(self.n_heads):
            # Project to head-specific subspaces
            Q_h = X @ self.W_Q[h]    # (seq_len, d_k)
            K_h = X @ self.W_K[h]    # (seq_len, d_k)
            V_h = X @ self.W_V[h]    # (seq_len, d_v)

            # Compute attention in this head's subspace
            out_h, w_h = self._attention(Q_h, K_h, V_h)  # (seq_len, d_v)
            head_outputs.append(out_h)
            all_weights.append(w_h)

        # Concatenate all heads: (seq_len, n_heads * d_v)
        concat = np.concatenate(head_outputs, axis=-1)

        # Output projection: (seq_len, d_model)
        output = concat @ self.W_O

        attention_weights = np.stack(all_weights, axis=0)  # (n_heads, seq_len, seq_len)
        return output, attention_weights

# Test
d_model = 64
n_heads = 8
seq_len = 10

mha = MultiHeadAttention(d_model=d_model, n_heads=n_heads)
X = np.random.randn(seq_len, d_model)
output, weights = mha.forward(X)

print(f"Input shape:  {X.shape}")               # (10, 64)
print(f"Output shape: {output.shape}")           # (10, 64)
print(f"Weights shape: {weights.shape}")         # (8, 10, 10)

# Verify each head has valid attention distribution
print(f"\nEach head's attention weights sum to 1 per query:")
for h in range(n_heads):
    row_sums = weights[h].sum(axis=-1)
    print(f"  Head {h}: min={row_sums.min():.6f}, max={row_sums.max():.6f}")  # all 1.0

# Each head's d_k
print(f"\nd_k per head: {mha.d_k}")   # 8 (= 64 / 8)

Quick Reference Cheatsheet

Operation	Math	NumPy	Notes
Dot product	u · v = Σᵢ uᵢvᵢ	`u @ v` or `np.dot(u, v)`	Also `np.inner(u, v)`
Norm	‖u‖ = sqrt(u·u)	`np.linalg.norm(u)`	Default is L2
Unit vector	u / ‖u‖	`u / np.linalg.norm(u)`	For cosine similarity
Cosine similarity	u·v / (‖u‖‖v‖)	`(u@v)/(norm(u)*norm(v))`	In [-1, 1]
Angle	arccos(u·v / ‖u‖‖v‖)	`np.arccos(np.clip(cos, -1, 1))`	In [0, π]
Scalar projection	u·v / ‖v‖	`u @ v_hat`	Component of u along v
Vector projection	(u·v / ‖v‖²) v	`(u@v / (v@v)) * v`	Vector along v
Projection matrix	A(AᵀA)⁻¹Aᵀ	`A @ np.linalg.pinv(A)`	Onto col(A)
QR decomposition	A = QR	`np.linalg.qr(A)`	Q orthonormal, R upper triangular
Least squares	x* = (AᵀA)⁻¹Aᵀb	`np.linalg.lstsq(A, b, rcond=None)[0]`	Use lstsq, not inv
Attention scores	QKᵀ / sqrt(d_k)	`Q @ K.T / np.sqrt(d_k)`	Pre-softmax alignment

Key Takeaways

The dot product u · v = ‖u‖‖v‖cos(θ) measures alignment - how much two vectors point in the same direction. This geometric interpretation is the foundation of attention.
Orthogonal vectors have zero dot product. Orthogonality means independence in the linear sense. Principal components are orthogonal, ensuring they capture non-overlapping information.
The vector projection of u onto v is (u·v / ‖v‖²) v - the component of u in the direction of v. The residual u - proj is orthogonal to v.
Projection matrices P satisfy P² = P (idempotent) and Pᵀ = P (symmetric). Their eigenvalues are 0 or 1. trace(P) = dim of the subspace projected onto.
Gram-Schmidt converts any basis into an orthonormal basis by repeatedly subtracting projections. It is the algorithm underlying QR decomposition.
Least squares minimizes ‖Ax - b‖² by projecting b onto col(A). The solution x* = (AᵀA)⁻¹Aᵀb satisfies the orthogonality condition Aᵀ(b - Ax*) = 0.
Scaled dot-product attention computes alignment scores QKᵀ / sqrt(d_k). The scaling prevents softmax saturation: without it, dot products grow as sqrt(d_k), making gradients vanish.

Next: Norms and Distance Metrics →

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Dot Products & Projections demo on the EngineersOfAI Playground - no code required.

:::

What You Will Learn​

Prerequisites​

Part 1 - The Dot Product: Algebraic vs Geometric​

Algebraic definition​

Geometric definition​

Why the geometric definition reveals alignment​

Dot product in high dimensions​

Part 2 - Orthogonality: When the Dot Product Is Zero​

Why orthogonality matters in ML​

Orthogonal complement​

Part 3 - Projection of One Vector Onto Another​

Projection onto a subspace​

Part 4 - Projection Matrices: Idempotent and Symmetric​

Key properties of projection matrices​

The orthogonal complement projection​

Part 5 - Gram-Schmidt: Constructing Orthonormal Bases​

Modified Gram-Schmidt (numerically stable version)​

QR decomposition​

Part 6 - Least Squares via Projection​

The geometric picture​

Why "normal" equations?​

Linear regression as projection​

Part 7 - NumPy: Dot Products, Projections, Gram-Schmidt​

Complete reference​

Part 8 - ML Connections: Attention, Cosine Similarity, Regression​

Scaled dot-product attention​

Why divide by sqrt(d_k)?​

Cosine similarity for retrieval​

Interview Questions​

Practice Challenges​

Level 1: Predict​

Level 2: Debug​

Level 3: Design​

Quick Reference Cheatsheet​

Key Takeaways​

What You Will Learn

Prerequisites

Part 1 - The Dot Product: Algebraic vs Geometric

Algebraic definition

Geometric definition

Why the geometric definition reveals alignment

Dot product in high dimensions

Part 2 - Orthogonality: When the Dot Product Is Zero

Why orthogonality matters in ML

Orthogonal complement

Part 3 - Projection of One Vector Onto Another

Projection onto a subspace

Part 4 - Projection Matrices: Idempotent and Symmetric

Key properties of projection matrices

The orthogonal complement projection

Part 5 - Gram-Schmidt: Constructing Orthonormal Bases

Modified Gram-Schmidt (numerically stable version)

QR decomposition

Part 6 - Least Squares via Projection

The geometric picture

Why "normal" equations?

Linear regression as projection

Part 7 - NumPy: Dot Products, Projections, Gram-Schmidt

Complete reference

Part 8 - ML Connections: Attention, Cosine Similarity, Regression

Scaled dot-product attention

Why divide by sqrt(d_k)?

Cosine similarity for retrieval

Interview Questions

Practice Challenges

Level 1: Predict

Level 2: Debug

Level 3: Design

Quick Reference Cheatsheet

Key Takeaways