Linear Transformations - The Geometry of Neural Network Layers
Reading time: ~24 minutes | Level: Mathematical Foundations → ML Engineering
Every layer of a neural network applies a linear transformation. Every convolutional filter is a linear transformation. The entire field of representation learning is the study of which linear transformations to compose - and in what order.
When you call nn.Linear(512, 256) in PyTorch, you are instantiating a linear map from ℝ⁵¹² to ℝ²⁵⁶. When you stack layers, you are composing those maps. When you analyze what a network has learned, you are examining the geometry of those compositions. Understanding linear transformations is not a prerequisite for running PyTorch - but it is a prerequisite for reasoning about what your network is actually doing.
What You Will Learn
- What makes a function a linear map: the two defining properties
- Kernel (null space): the directions a transformation destroys and why that equals information loss
- Image (column space): the set of everything a transformation can produce
- Rank-nullity theorem: the fundamental constraint on information flow through a layer
- Change of basis: why the same transformation looks different in different coordinate systems
- Composition of linear maps: why matrix multiplication is not arbitrary
- NumPy: implementing and visualizing linear transformations
- ML connections: layer activations, weight matrices, LoRA low-rank adaptation
Prerequisites
- Lesson 01: Vectors and Vector Spaces
- Lesson 02: Matrix Operations
- Lesson 03: Eigenvalues and Eigenvectors
Part 1 - Linear Maps: The Two Defining Properties
A function T: ℝⁿ → ℝᵐ is a linear transformation (also called a linear map) if and only if it satisfies two properties for all vectors u, v and all scalars c:
Additivity: T(u + v) = T(u) + T(v)
Homogeneity: T(cu) = c·T(u)
These two conditions combine into one: T(αu + βv) = αT(u) + βT(v) for all α, β - the superposition principle.
What these properties mean geometrically
Additivity means: transform vectors separately and add, or add first and transform - same result. Grids stay grids. Parallel lines stay parallel. The origin stays fixed (set u = v = 0: T(0) = 0).
Homogeneity means: scaling before or after the transformation produces the same result. If you double the input, you double the output.
A transformation satisfying both is "structure-preserving" - it preserves the linear structure of the vector space.
What is NOT a linear transformation
# These are NOT linear transformations:
def f1(x): return x + 5 # translation: T(0) = 5 ≠ 0
def f2(x): return x ** 2 # squaring: not homogeneous
def f3(x): return abs(x) # absolute value: violates additivity
def relu(x): return max(0, x) # ReLU: piecewise linear, NOT linear
:::danger ReLU is NOT a linear transformation
ReLU(x) = max(0, x) is piecewise linear but not a linear transformation. It violates both additivity and homogeneity when input signs change. This is critical: ReLU is what gives neural networks expressive power beyond linear functions. Without activations like ReLU, stacking linear layers produces another linear layer - W₂(W₁x) = (W₂W₁)x.
:::
Every linear transformation is a matrix
The most fundamental theorem about linear maps:
Every linear transformation T: ℝⁿ → ℝᵐ can be represented as a matrix multiplication T(x) = Ax for some matrix A ∈ ℝᵐˣⁿ.
The columns of A are exactly the images of the standard basis vectors: A = [T(e₁) | T(e₂) | ... | T(eₙ)].
import numpy as np
def transformation_matrix(T, n_in: int, n_out: int) -> np.ndarray:
"""
Construct the matrix A such that T(x) = A @ x.
Works for any linear transformation T: R^n_in → R^n_out.
"""
A = np.zeros((n_out, n_in))
for j in range(n_in):
e_j = np.zeros(n_in)
e_j[j] = 1.0 # standard basis vector
A[:, j] = T(e_j) # j-th column = T(e_j)
return A
# Example: 90 degree counterclockwise rotation in 2D
def rotate_90(x: np.ndarray) -> np.ndarray:
return np.array([-x[1], x[0]])
A_rot = transformation_matrix(rotate_90, n_in=2, n_out=2)
print(f"Rotation matrix:\n{A_rot}")
# [[ 0. -1.]
# [ 1. 0.]]
# Verify linearity
u = np.array([3.0, 1.0])
v = np.array([1.0, 2.0])
alpha, beta = 2.0, -1.0
lhs = rotate_90(alpha * u + beta * v)
rhs = alpha * rotate_90(u) + beta * rotate_90(v)
print(f"Superposition holds: {np.allclose(lhs, rhs)}") # True
# Show: T(x) = A @ x for any x
x = np.array([5.0, 3.0])
print(f"T(x) via function: {rotate_90(x)}")
print(f"T(x) via matrix: {A_rot @ x}")
Part 2 - Kernel (Null Space): What a Transformation Destroys
The kernel (or null space) of a linear transformation T: ℝⁿ → ℝᵐ is the set of all vectors that map to the zero vector:
ker(T) = {x ∈ ℝⁿ : T(x) = 0} = null(A)
The kernel is always a subspace (it contains zero, and is closed under addition and scaling).
Kernel as information loss
If x ∈ ker(T), then T(x) = 0. More generally:
If T(x₁) = T(x₂), then T(x₁ - x₂) = 0, so x₁ - x₂ ∈ ker(T).
This means: two inputs x₁ and x₂ that differ only by a vector in the kernel are indistinguishable after the transformation. The kernel is exactly the information that gets destroyed.
A transformation with a non-trivial kernel (dim(ker) > 0) is not injective - it cannot be inverted. Information is lost. This is the mathematical statement of what happens when a neural network layer reduces dimensionality.
Computing the kernel in NumPy
import numpy as np
def null_space(A: np.ndarray, tol: float = 1e-10) -> np.ndarray:
"""
Compute a basis for the null space of A using SVD.
Returns columns whose span equals ker(A).
"""
_, s, Vt = np.linalg.svd(A, full_matrices=True)
n = A.shape[1]
rank = int(np.sum(s > tol))
# Right singular vectors beyond rank span the null space
null_basis = Vt[rank:].T # columns span ker(A)
return null_basis
# Example: projection matrix onto x-axis in 3D
# A maps [x, y, z] to [x, 0, 0]
A = np.array([[1, 0, 0],
[0, 0, 0],
[0, 0, 0]], dtype=float)
ker = null_space(A)
print(f"Null space basis:\n{ker}")
# Columns span {[0,1,0], [0,0,1]} - the y-z plane gets destroyed
# Verify: A @ ker_vector = 0
for i in range(ker.shape[1]):
v = ker[:, i]
result = A @ v
print(f"A @ ker_vec_{i}: {result}") # [0, 0, 0]
# Neural network weight matrix example
np.random.seed(42)
W = np.random.randn(256, 512) # 256 outputs, 512 inputs
ker_W = null_space(W)
print(f"\nW shape: {W.shape}")
print(f"Null space dimension: {ker_W.shape[1]}")
# At least 256 directions in input space are completely ignored
The kernel tells you about invertibility
import numpy as np
def kernel_dimension(A: np.ndarray, tol: float = 1e-10) -> int:
"""Dimension of the kernel (null space) of A."""
_, s, _ = np.linalg.svd(A, full_matrices=False)
n = A.shape[1]
rank = np.sum(s > tol)
return n - int(rank)
# Square rank-deficient matrix
A_square = np.array([[1.0, 2.0],
[2.0, 4.0]]) # row 2 = 2 * row 1, rank 1
dim_ker = kernel_dimension(A_square)
print(f"dim(ker(A_square)) = {dim_ker}") # 1
print(f"A is invertible: {dim_ker == 0}") # False
# Proof: det(A) = 1*4 - 2*2 = 0
print(f"det(A_square) = {np.linalg.det(A_square):.4f}") # 0
Part 3 - Image (Column Space): What a Transformation Can Produce
The image (or column space, or range) of a linear transformation T: ℝⁿ → ℝᵐ is the set of all vectors that T can produce:
im(T) = {T(x) : x ∈ ℝⁿ} = col(A)
The image is the span of the columns of A. It is a subspace of the output space ℝᵐ.
Why the column space matters
The image tells you what outputs are achievable. A linear system Ax = b has a solution if and only if b ∈ im(A) - i.e., b lies in the column space of A.
In ML terms: the column space of a weight matrix W determines which output activations are reachable, no matter how you set the input.
Computing the column space
import numpy as np
def column_space_basis(A: np.ndarray, tol: float = 1e-10) -> np.ndarray:
"""
Compute an orthonormal basis for the column space of A using SVD.
Returns: matrix whose columns form an orthonormal basis for im(A).
"""
U, s, _ = np.linalg.svd(A, full_matrices=True)
rank = int(np.sum(s > tol))
return U[:, :rank] # left singular vectors with nonzero singular values
# Example
A = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]], dtype=float)
# Rows are linearly dependent: row3 = 2*row2 - row1, so rank 2
col_basis = column_space_basis(A)
print(f"A shape: {A.shape}")
print(f"Column space dimension (rank): {col_basis.shape[1]}") # 2
# Verify: is column 1 of A in the column space?
b = A[:, 0] # first column
proj = col_basis @ col_basis.T @ b
residual = np.linalg.norm(b - proj)
print(f"\nProjection residual for column of A: {residual:.2e}") # ≈ 0
# Is a random vector in the column space?
b_random = np.array([1.0, 0.0, 0.0])
proj_random = col_basis @ col_basis.T @ b_random
residual_random = np.linalg.norm(b_random - proj_random)
print(f"Random vector residual: {residual_random:.4f}") # > 0 (not in col space)
Rank: the dimension of the image
The rank of a matrix A is:
rank(A) = dim(im(A)) = number of linearly independent columns
It equals the number of non-zero singular values. For A ∈ ℝᵐˣⁿ: rank(A) ≤ min(m, n).
import numpy as np
# rank in NumPy
A = np.random.randn(100, 50) # tall matrix, rank <= 50
rank_A = np.linalg.matrix_rank(A)
print(f"rank(A) = {rank_A}") # 50 with probability 1 (random full column rank)
# Low-rank matrix
A_lowrank = np.outer(np.random.randn(100), np.random.randn(50)) # rank-1
print(f"rank(low-rank) = {np.linalg.matrix_rank(A_lowrank)}") # 1
Part 4 - Rank-Nullity Theorem: The Fundamental Information Constraint
The rank-nullity theorem is one of the most important results in linear algebra:
dim(ker(T)) + dim(im(T)) = n
where n = dim(domain) is the number of input dimensions.
In matrix terms: nullity(A) + rank(A) = n (number of columns).
What it says about information flow
The theorem states a hard constraint: input dimensions are either preserved into the image, or destroyed into the kernel. Every input dimension goes somewhere.
dim(im(A)) = rank(A)input dimensions survive into the outputdim(ker(A)) = n - rank(A)input dimensions are destroyed
There is no third option. This is the conservation law for linear transformations.
import numpy as np
def verify_rank_nullity(A: np.ndarray, tol: float = 1e-10) -> None:
"""Verify the rank-nullity theorem for matrix A."""
m, n = A.shape
_, s, _ = np.linalg.svd(A, full_matrices=False)
rank = int(np.sum(s > tol))
nullity = n - rank
print(f"A: {m}x{n}")
print(f" rank(A) = {rank}")
print(f" nullity(A) = {nullity}")
print(f" rank + nullity = {rank + nullity} (should be n = {n})")
print(f" Theorem holds: {rank + nullity == n}")
# Tall matrix (more rows than columns)
A1 = np.random.randn(100, 20)
verify_rank_nullity(A1)
# rank = 20, nullity = 0 (injective: no information lost)
print()
# Wide matrix (more columns than rows)
A2 = np.random.randn(20, 100)
verify_rank_nullity(A2)
# rank = 20, nullity = 80 (80 input dimensions destroyed)
print()
# Square singular matrix
A3 = np.array([[1, 2, 3],
[2, 4, 6],
[1, 1, 1]], dtype=float)
verify_rank_nullity(A3)
# rank = 2, nullity = 1
Rank-nullity in neural network layers
Consider a linear layer W ∈ ℝᵐˣⁿ (n inputs, m outputs):
| Case | Condition | Meaning |
|---|---|---|
| m = n, rank = n | Square, full rank | Invertible layer: no information lost |
| m < n (compression) | rank = m | At least n - m dimensions destroyed |
| m > n (expansion) | rank = n | All input dimensions preserved; output has null space |
| rank < min(m, n) | Rank-deficient | Fewer effective dimensions than size suggests |
:::tip Rank-nullity explains why over-parameterized models can generalize A wide layer W ∈ ℝᵐˣⁿ with m ≪ n has nullity ≥ n - m. The optimization landscape is highly degenerate - many weight combinations produce identical outputs. This degeneracy is not a bug: it provides the redundancy that allows finding a well-generalizing solution via implicit regularization from gradient descent. :::
Part 5 - Change of Basis: Same Transformation, New Coordinates
Every vector exists in some coordinate system (basis). A change of basis is a linear transformation that re-expresses the same geometric object using a different coordinate system.
What a basis is
A basis for ℝⁿ is a set of n linearly independent vectors {b₁, b₂, ..., bₙ} such that every vector in ℝⁿ can be written as a unique linear combination of them.
The standard basis is {e₁, e₂, ..., eₙ} where eᵢ has a 1 in position i and 0 elsewhere. But any n linearly independent vectors form a valid basis.
Change of basis matrix
If B = [b₁ | b₂ | ... | bₙ] is a matrix whose columns are the new basis vectors, then:
x_new = B⁻¹ @ x_standard (from standard to new coordinates)
x_standard = B @ x_new (from new to standard coordinates)
A linear transformation A expressed in the standard basis becomes:
A_new = B⁻¹ @ A @ B
This is a similarity transformation. The matrix looks different in different bases, but it represents the same geometric transformation.
PCA as change of basis
PCA computes the eigenvectors of the covariance matrix and uses them as a new basis. Projecting data into this basis rotates the coordinate system so that:
- The first axis points along maximum variance
- The second axis points along the next-maximum variance (orthogonal to first)
- And so on
import numpy as np
def change_of_basis_demo():
"""Demonstrate change of basis using PCA as the example."""
np.random.seed(42)
# Generate correlated 2D data
mean = [0, 0]
cov = [[3, 2], [2, 2]]
X = np.random.multivariate_normal(mean, cov, size=200) # (200, 2)
# Step 1: center
X_centered = X - X.mean(axis=0)
# Step 2: covariance matrix
Sigma = X_centered.T @ X_centered / (len(X) - 1) # (2, 2)
# Step 3: eigendecomposition -> new basis
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
# eigh returns ascending order -> flip for descending
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
B = eigenvectors[:, idx] # change-of-basis matrix: columns = new basis vectors
print(f"New basis vectors (principal components):")
print(f" PC1: {B[:, 0].round(4)}")
print(f" PC2: {B[:, 1].round(4)}")
print(f"Eigenvalues (variances): {eigenvalues.round(4)}")
# Step 4: project into new basis (change coordinates)
X_new = X_centered @ B
# In the new basis, axes are uncorrelated:
new_cov = X_new.T @ X_new / (len(X_new) - 1)
print(f"\nCovariance in new basis:")
print(f" Diagonal entries (variances): {np.diag(new_cov).round(4)}")
print(f" Off-diagonal (should be near 0): {new_cov[0,1]:.6f}")
# The same linear transformation, expressed in the new basis
A = np.array([[2.0, 1.0], [0.0, 1.5]]) # some linear map in standard basis
A_new_basis = np.linalg.inv(B) @ A @ B
print(f"\nA in standard basis:\n{A}")
print(f"A in PCA basis:\n{A_new_basis.round(4)}")
change_of_basis_demo()
Why change of basis is everywhere in ML
| ML Concept | Change of Basis Interpretation |
|---|---|
| PCA | Rotate to covariance eigenvector basis |
| Whitening | Rescale PCA basis to unit variance |
| Word embeddings | Words expressed in a learned semantic basis |
| Attention heads | Each head operates in a subspace of the embedding basis |
| Neural net hidden layer | Nonlinear map followed by new linear basis |
| Fourier transform | Express signal in frequency basis |
Part 6 - Composition of Linear Maps = Matrix Multiplication
If T₁: ℝⁿ → ℝᵐ and T₂: ℝᵐ → ℝᵖ are linear maps, their composition T₂ ∘ T₁: ℝⁿ → ℝᵖ is also a linear map.
The matrix of the composition is the product of the matrices:
Matrix of (T₂ ∘ T₁) = A₂ @ A₁
This is why matrix multiplication is defined the way it is - it is not an arbitrary rule, it is the algebraic encoding of function composition.
Sequential neural network layers
import numpy as np
# A two-layer linear network
# x in R^512 -> h in R^256 -> y in R^10
W1 = np.random.randn(256, 512) * 0.01 # layer 1 weight
W2 = np.random.randn(10, 256) * 0.01 # layer 2 weight
x = np.random.randn(512)
# Forward pass: apply each linear map in sequence
h = W1 @ x # first transformation
y = W2 @ h # second transformation
# Composition: the combined map is W2 @ W1
W_combined = W2 @ W1 # shape (10, 512)
y_combined = W_combined @ x
print(f"Two-pass output match: {np.allclose(y, y_combined)}") # True
# Key insight: two linear layers with no activation = one linear layer
# -> without nonlinearity, depth adds NO expressive power
rank_combined = np.linalg.matrix_rank(W_combined)
rank_W1 = np.linalg.matrix_rank(W1)
rank_W2 = np.linalg.matrix_rank(W2)
print(f"rank(W1) = {rank_W1}, rank(W2) = {rank_W2}, rank(W2@W1) = {rank_combined}")
# rank(W_combined) <= 10 (bottlenecked by the 10-dimensional output)
Why composition limits expressiveness
The rank of a product is bounded:
rank(A @ B) <= min(rank(A), rank(B))
This means: the effective rank of a composed linear map can never exceed the rank of the narrowest layer. A bottleneck layer with m hidden units can represent at most m linearly independent output directions - no matter how many layers come before or after it (without nonlinearity).
Part 7 - NumPy: Implementing and Visualizing Linear Transformations
Core diagnostic toolkit
import numpy as np
def linear_map_diagnostics(A: np.ndarray, tol: float = 1e-10) -> dict:
"""
Comprehensive diagnostics for a linear transformation A: R^n to R^m.
Returns rank, nullity, column space basis, null space basis.
"""
m, n = A.shape
U, s, Vt = np.linalg.svd(A, full_matrices=True)
rank = int(np.sum(s > tol))
nullity = n - rank
# Column space basis: first `rank` left singular vectors
col_space_basis = U[:, :rank]
# Null space basis: right singular vectors beyond rank
null_space_basis = Vt[rank:].T # shape (n, nullity)
return {
"shape": (m, n),
"rank": rank,
"nullity": nullity,
"column_space_basis": col_space_basis,
"null_space_basis": null_space_basis,
"singular_values": s[:min(m, n)],
}
# Test with a concrete matrix
A = np.array([[1, 2, 0, 1],
[2, 4, 1, 3],
[3, 6, 1, 4]], dtype=float)
diag = linear_map_diagnostics(A)
print(f"Shape: {diag['shape']}")
print(f"Rank: {diag['rank']}")
print(f"Nullity: {diag['nullity']}")
print(f"Rank + Nullity = {diag['rank'] + diag['nullity']} (should be n={A.shape[1]})")
# Verify null space: A @ null_vector should equal 0
null_basis = diag["null_space_basis"]
for i in range(null_basis.shape[1]):
v = null_basis[:, i]
print(f"A @ null_vec_{i}: max abs = {np.abs(A @ v).max():.2e}") # near 0
# Project a vector onto the column space
b = np.array([1.0, 2.0, 4.0])
col_basis = diag["column_space_basis"]
b_proj = col_basis @ col_basis.T @ b # orthogonal projection onto col(A)
b_residual = b - b_proj
print(f"\nresidual orthogonal to col(A): "
f"{np.allclose(col_basis.T @ b_residual, 0, atol=1e-10)}")
Visualizing 2D transformations
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
def visualize_transformation(A: np.ndarray, title: str, ax) -> None:
"""Show how a 2x2 matrix transforms the unit circle and basis vectors."""
theta = np.linspace(0, 2 * np.pi, 200)
circle = np.vstack([np.cos(theta), np.sin(theta)]) # (2, 200)
circle_t = A @ circle
ax.set_aspect('equal')
ax.axhline(0, color='gray', linewidth=0.5)
ax.axvline(0, color='gray', linewidth=0.5)
ax.plot(circle[0], circle[1], 'b--', alpha=0.4, linewidth=1.5, label='original')
ax.plot(circle_t[0], circle_t[1], 'r-', linewidth=2, label='transformed')
# Draw original and transformed basis vectors
colors = ['#10b981', '#6366f1']
for j in range(2):
e_j = np.eye(2)[:, j]
e_t = A @ e_j
ax.annotate('', xy=e_j, xytext=[0, 0],
arrowprops=dict(arrowstyle='->', color=colors[j], lw=1.5))
ax.annotate('', xy=e_t, xytext=[0, 0],
arrowprops=dict(arrowstyle='->', color=colors[j], lw=2.5))
ax.set_title(title)
ax.legend(fontsize=8)
ax.set_xlim(-2.5, 2.5)
ax.set_ylim(-2.5, 2.5)
fig, axes = plt.subplots(2, 2, figsize=(10, 10))
transformations = [
(np.array([[2, 0], [0, 0.5]], dtype=float), "Scale (x2, y0.5)"),
(np.array([[0, -1], [1, 0]], dtype=float), "Rotation 90 deg"),
(np.array([[1, 1], [0, 1]], dtype=float), "Shear"),
(np.array([[1, 0], [0, -1]], dtype=float), "Reflection over x-axis"),
]
for ax, (A, title) in zip(axes.flat, transformations):
visualize_transformation(A, title, ax)
plt.tight_layout()
plt.savefig('/tmp/linear_transformations_2d.png', dpi=100, bbox_inches='tight')
print("Saved: /tmp/linear_transformations_2d.png")
Part 8 - ML Connections: Layers, LoRA, and Representation Learning
Weight matrices as linear maps
Every nn.Linear(n, m) layer in PyTorch is a linear map W: ℝⁿ → ℝᵐ (plus a bias, which makes it an affine map). The linear part:
- Maps inputs into a new representation space: hidden activations are coordinates in a new basis
- Has rank(W) ≤ min(n, m): the effective dimensionality of the representation
- Has nullity(W) = n - rank(W): input directions that produce zero activation
import numpy as np
def analyze_layer_geometry(W: np.ndarray) -> None:
"""Analyze the geometry of a neural network weight matrix."""
m, n = W.shape
_, s, _ = np.linalg.svd(W, full_matrices=False)
# Effective rank: singular values > 1% of max
eff_rank = int(np.sum(s > 0.01 * s[0]))
rank = int(np.sum(s > 1e-10 * s[0]))
print(f"Layer: R^{n} -> R^{m}")
print(f" Theoretical max rank: {min(m, n)}")
print(f" Numerical rank: {rank}")
print(f" Effective rank: {eff_rank} (s > 1% of max)")
print(f" Nullity: {n - rank}")
print(f" Condition number: {s[0]/s[-1]:.2f}")
explained = np.cumsum(s**2) / np.sum(s**2)
k90 = np.searchsorted(explained, 0.90) + 1
print(f" Top-{k90} singular values capture 90% of Frobenius norm squared")
np.random.seed(42)
print("=== Random initialization ===")
analyze_layer_geometry(np.random.randn(256, 512) * 0.02)
print("\n=== Low-rank signal (simulates trained matrix) ===")
U_true = np.random.randn(256, 8)
V_true = np.random.randn(8, 512)
W_lowrank = U_true @ V_true + 0.001 * np.random.randn(256, 512)
analyze_layer_geometry(W_lowrank)
LoRA: exploiting low-rank structure
LoRA (Low-Rank Adaptation) is a fine-tuning technique built directly on the rank structure of weight matrices. The observation:
Trained weight matrices often have low intrinsic rank - most of their information is captured by a small number of singular vectors.
Instead of fine-tuning the full weight matrix W ∈ ℝᵐˣⁿ (m×n parameters), LoRA reparameterizes the update as:
W_finetuned = W_pretrained + DeltaW
DeltaW = B @ A where B in R^(m x r), A in R^(r x n), r << min(m, n)
Only B and A are learned. Parameter savings: instead of m×n, train m×r + r×n = r(m+n) parameters.
import numpy as np
class LoRALayer:
"""
LoRA (Low-Rank Adaptation) of a linear layer.
Approximates fine-tuning DeltaW = B @ A where rank(DeltaW) <= r.
"""
def __init__(self, W_pretrained: np.ndarray, rank: int, alpha: float = 1.0):
m, n = W_pretrained.shape
self.W = W_pretrained.copy() # frozen
self.r = rank
self.alpha = alpha
# LoRA parameters (these are trained during fine-tuning)
self.A = np.random.randn(rank, n) * 0.01 # (r, n)
self.B = np.zeros((m, rank)) # (m, r) -- init to zero
# B=0 means DeltaW=0 at initialization (preserves pretrained behavior)
def forward(self, x: np.ndarray) -> np.ndarray:
"""x shape: (n,) or (batch, n)"""
y_pretrained = x @ self.W.T
scale = self.alpha / self.r
y_delta = x @ self.A.T @ self.B.T * scale
return y_pretrained + y_delta
@property
def param_count(self) -> dict:
m, n = self.W.shape
return {
"frozen_W": m * n,
"lora_total": self.r * (m + n),
"savings_pct": 100 * (1 - self.r * (m + n) / (m * n)),
}
# Fine-tuning an attention projection (common LoRA target)
np.random.seed(42)
W = np.random.randn(4096, 4096) * 0.02
lora = LoRALayer(W, rank=16)
params = lora.param_count
print(f"Full layer parameters: {params['frozen_W']:,}")
print(f"LoRA parameters: {params['lora_total']:,}")
print(f"Parameter savings: {params['savings_pct']:.1f}%")
x = np.random.randn(4096)
y = lora.forward(x)
print(f"Output shape: {y.shape}")
:::tip The rank-nullity theorem explains why LoRA works If trained weight matrices have intrinsic rank r ≪ min(m, n), then dim(ker(W)) ≈ n - r. Most input directions map to zero - they are in the null space. Fine-tuning only needs to adjust the r non-trivial directions. LoRA's low-rank decomposition is a direct application of this theorem: represent the weight update in the r-dimensional subspace that actually matters. :::
Layer activations as change of basis
The hidden representation h = W @ x is literally a change of coordinates: x expressed in the basis defined by the rows of W.
import numpy as np
np.random.seed(0)
W1 = np.random.randn(64, 128) # layer 1: R^128 -> R^64
W2 = np.random.randn(32, 64) # layer 2: R^64 -> R^32
x = np.random.randn(128)
h1 = W1 @ x # x re-expressed in W1's row space
h2 = W2 @ h1 # h1 further re-expressed in W2's row space
# Combined: (W2 W1) x -- one linear map from R^128 to R^32
h2_direct = (W2 @ W1) @ x
print(f"Stacked layers match combined: {np.allclose(h2, h2_direct)}") # True
print(f"rank(W2 @ W1): {np.linalg.matrix_rank(W2 @ W1)}")
# Bounded by min(32, rank(W1), rank(W2)) = 32
# No benefit to stacking without nonlinearity
Interview Questions
Q1: What is the null space of a matrix and what does it mean for a neural network layer?
The null space (kernel) of a matrix W is the set of all vectors x such that Wx = 0:
ker(W) = {x : Wx = 0}
It is a subspace of the input space, with dimension = n - rank(W) (by the rank-nullity theorem).
For a neural network layer W ∈ ℝᵐˣⁿ:
-
Information loss: Any input x in the null space produces zero activation. Two inputs x₁ and x₂ that differ only by a null-space vector are indistinguishable: W(x₁) = W(x₁ + z) for any z ∈ ker(W).
-
Effective dimensionality: dim(ker(W)) = n - rank(W) input dimensions are invisible to the layer. Only rank(W) input directions actually influence the output.
-
Compression layers (m < n): the null space has dimension at least n - m. At minimum n - m input directions are destroyed, regardless of what later layers do.
-
LoRA implication: If rank(W) ≈ r for small r, then most input dimensions (n - r of them) are in the null space. Fine-tuning only needs to modify the r active directions - motivation for LoRA's low-rank parameterization.
Q2: What is the rank-nullity theorem and why does it matter for deep learning?
The rank-nullity theorem states: for a linear map T: ℝⁿ → ℝᵐ represented by matrix A:
rank(A) + nullity(A) = n
where rank(A) = dim(im(A)) and nullity(A) = dim(ker(A)).
Why it matters for deep learning:
-
Hard information bottleneck: A layer W ∈ ℝᵐˣⁿ with m < n can preserve at most m input dimensions. The remaining n - m are destroyed. This is a mathematical constraint, not a design choice.
-
Effective capacity: Even a wide layer may have low effective rank. Rank-nullity reveals the true information throughput: rank(W) dimensions in, rank(W) dimensions out.
-
Composition bottleneck: rank(W₂ W₁) ≤ min(rank(W₂), rank(W₁)). The narrowest layer limits the effective dimensionality of all subsequent representations.
-
LoRA: Trained weight matrices have low intrinsic rank (most dimensions in the null space) - the empirical observation that motivates parameter-efficient fine-tuning.
-
Over-parameterization: Wide networks (n ≫ m) have large null spaces, providing many equivalent solutions - key to why gradient descent finds well-generalizing minima.
Q3: How is PCA a change of basis?
PCA is exactly a change of basis where the new basis vectors are the eigenvectors of the data covariance matrix.
The mathematical steps:
- Center: X_c = X - mean(X)
- Covariance: Σ = X_cᵀ X_c / (n-1) - symmetric positive semidefinite
- Eigendecompose: Σ = Q Λ Qᵀ, where Q = [q₁ | ... | qₐ] are orthonormal eigenvectors
- Change of basis: express each data point in the new basis: x_new = Qᵀ x_c
The matrix Q is the change-of-basis matrix. Since Q is orthogonal (Qᵀ = Q⁻¹), it represents a pure rotation with no distortion.
What the new basis achieves:
- In the original basis, features are correlated (off-diagonal Σ entries non-zero)
- In the eigenvector basis, covariance becomes diagonal Λ - all features decorrelated
- Diagonal entries λ₁ ≥ λ₂ ≥ ... are the variances along each new axis
Dimensionality reduction: keeping only the first k columns of Q (largest eigenvalues) gives the least-squares optimal k-dimensional approximation.
Key insight: PCA does not modify the data - it only changes the coordinate system to one that is more informative. The "principal components" are just coordinate axes in a new basis.
Q4: What makes a function linear and why does that matter for ML?
A function T: ℝⁿ → ℝᵐ is linear if and only if:
- Additivity: T(u + v) = T(u) + T(v) for all u, v
- Homogeneity: T(cu) = c·T(u) for all scalars c and vectors u
Combined: T(αu + βv) = αT(u) + βT(v) - the superposition principle.
Equivalently: T is linear iff T(x) = Ax for some matrix A ∈ ℝᵐˣⁿ.
Why it matters for ML:
-
Computability: Linear maps are O(mn) to apply, have well-understood geometry (rank, null space, condition number), and compose cleanly.
-
Non-expressiveness of pure linearity: Stacking n linear layers = one linear layer (A_n ⋯ A₁)x. Without nonlinear activations, any depth network is equivalent to a single matrix multiplication.
-
Nonlinear activations (ReLU, GeLU) break linearity, enabling universal function approximation. Linear layers provide directional control; activations provide bending.
-
Optimization: The Jacobian of T(x) = Ax is constant (equals A). Gradients flow cleanly through linear layers without vanishing or exploding on their own.
-
Analysis: Linear layers can be analyzed with SVD (LoRA), spectral norms (Lipschitz bounds), condition numbers (training stability), rank (effective dimensionality).
Practice Challenges
Level 1: Predict
Challenge: Without computing, predict the rank and nullity for each matrix:
- A ∈ ℝ⁵ˣ³ that is randomly generated (generic case)
- B = vvᵀ where v ∈ ℝ⁴ (outer product of a vector with itself)
- C ∈ ℝ³ˣ⁵ where rows 2 and 3 are linear combinations of row 1
Answer
-
A ∈ ℝ⁵ˣ³ random: rank = 3, nullity = 0. A random matrix has full column rank with probability 1. Rank-nullity: 3 + 0 = 3. The map is injective (no information loss).
-
B = vvᵀ ∈ ℝ⁴ˣ⁴: rank = 1, nullity = 3. Every column of B is a scalar multiple of v: B[:,j] = v·vⱼ. So col(B) = span{v}. Rank-nullity: 1 + 3 = 4.
-
C ∈ ℝ³ˣ⁵ with dependent rows: rank = 1, nullity = 4. Only one independent row. Row rank = column rank = 1. Rank-nullity: 1 + 4 = 5.
import numpy as np
v = np.array([1.0, 2.0, 3.0, 4.0])
B = np.outer(v, v)
print(np.linalg.matrix_rank(B)) # 1
row = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
C = np.vstack([row, 2*row, 3*row])
print(np.linalg.matrix_rank(C)) # 1
Level 2: Debug
Challenge: The following code attempts to compute the projection of b onto the column space of A, but crashes or gives wrong results for rank-deficient A. Find and fix the bug:
import numpy as np
A = np.array([[1, 2],
[2, 4],
[3, 6]], dtype=float) # rank 1 (columns proportional)
b = np.array([1.0, 3.0, 2.0])
# Attempted projection: b_proj = A (A^T A)^{-1} A^T b
P = A @ np.linalg.inv(A.T @ A) @ A.T
b_proj = P @ b
print(b_proj)
Answer
Bug: A.T @ A is singular when A is rank-deficient (here rank = 1, so A.T @ A ∈ ℝ²ˣ² is singular). np.linalg.inv either raises LinAlgError or returns numerically garbage results.
Fix: Use the pseudoinverse, which handles rank-deficient matrices correctly:
import numpy as np
A = np.array([[1, 2],
[2, 4],
[3, 6]], dtype=float) # rank 1
b = np.array([1.0, 3.0, 2.0])
# Fix 1: pseudoinverse handles rank deficiency
A_pinv = np.linalg.pinv(A)
P_correct = A @ A_pinv
b_proj = P_correct @ b
print(f"Correct projection: {b_proj}")
# Fix 2: SVD-based (most numerically stable)
U, s, Vt = np.linalg.svd(A, full_matrices=True)
rank = int(np.sum(s > 1e-10 * s[0]))
col_basis = U[:, :rank]
b_proj_svd = col_basis @ col_basis.T @ b
print(f"SVD projection: {b_proj_svd}")
# Verify: residual is orthogonal to column space
residual = b - b_proj
print(f"residual dot col1: {np.dot(residual, A[:, 0]):.2e}") # near 0
print(f"residual dot col2: {np.dot(residual, A[:, 1]):.2e}") # near 0
Root cause: The projection formula P = A(AᵀA)⁻¹Aᵀ (the hat matrix in regression) is only valid when A has full column rank. For rank-deficient A, use P = A A⁺ where A⁺ = pinv(A).
Level 3: Design
Challenge: Implement effective_rank(W, threshold=0.99) returning the number of singular values needed to capture threshold of the total Frobenius norm squared. Then use it to compare a random matrix against a simulated trained matrix.
Answer
import numpy as np
def effective_rank(W: np.ndarray, threshold: float = 0.99) -> dict:
"""
Compute the effective rank of W at the given explained-variance threshold.
Args:
W: weight matrix of shape (m, n)
threshold: fraction of Frobenius norm squared to capture
Returns:
dict with rank info and singular value spectrum
"""
_, s, _ = np.linalg.svd(W, full_matrices=False)
s_sq = s ** 2
cumulative = np.cumsum(s_sq) / s_sq.sum()
# Minimum k such that top-k singular values capture >= threshold
k = int(np.searchsorted(cumulative, threshold)) + 1
return {
"matrix_shape": W.shape,
"max_possible_rank": min(W.shape),
"effective_rank": k,
"singular_values": s,
"cumulative_explained": cumulative,
"compression_ratio": k / min(W.shape),
}
np.random.seed(42)
m, n = 256, 512
# Random initialization: singular values roughly uniform
W_random = np.random.randn(m, n) * 0.02
# Simulated trained matrix: low-rank signal dominates
r_true = 32
U_sig = np.random.randn(m, r_true)
V_sig = np.random.randn(r_true, n)
W_trained = W_random + 0.5 * U_sig @ V_sig
for name, W in [("Random init", W_random), ("Trained (low-rank signal)", W_trained)]:
result = effective_rank(W, threshold=0.99)
print(f"\n=== {name} ===")
print(f" Max possible rank: {result['max_possible_rank']}")
print(f" Effective rank (99%): {result['effective_rank']}")
print(f" Compression ratio: {result['compression_ratio']:.3f}")
# LoRA parameter savings at effective rank
r_eff = effective_rank(W_trained, threshold=0.99)["effective_rank"]
params_full = m * n
params_lora = r_eff * (m + n)
print(f"\nLoRA at effective rank {r_eff}:")
print(f" Full params: {params_full:,}")
print(f" LoRA params: {params_lora:,}")
print(f" Savings: {100*(1 - params_lora/params_full):.1f}%")
Quick Reference Cheatsheet
| Concept | Definition | NumPy | ML Use |
|---|---|---|---|
| Kernel / null space | {x : Ax = 0} | Vt[rank:].T from SVD | Destroyed input directions |
| Image / column space | {Ax : x ∈ ℝⁿ} | U[:, :rank] from SVD | Reachable outputs |
| Rank | dim(im(A)) | np.linalg.matrix_rank(A) | Effective dimensionality |
| Nullity | n - rank | n - matrix_rank(A) | Redundant input dimensions |
| Rank-nullity | rank + nullity = n | Always holds | Conservation law for info flow |
| Change of basis | A_new = B⁻¹AB | np.linalg.solve(B, A @ B) | PCA, whitening |
| Composition | T₂ ∘ T₁ → A₂A₁ | A2 @ A1 | Stacking layers = multiplying matrices |
| Projection onto col(A) | A A⁺ b | A @ np.linalg.pinv(A) @ b | Least squares, attention |
Key Takeaways
- A linear map is completely characterized by its matrix: T(x) = Ax. Every neural network linear layer is a linear map from input space to output space.
- The kernel (null space) is the set of inputs destroyed by the transformation. Its dimension equals n - rank(A) - the amount of information permanently lost.
- The image (column space) is the set of achievable outputs. Its dimension equals rank(A).
- Rank-nullity theorem: rank + nullity = n. Every input dimension either survives into the image or is destroyed into the kernel. No middle ground.
- Change of basis re-expresses the same transformation in different coordinates. PCA is change of basis to the eigenvector coordinate system where covariance becomes diagonal.
- Stacking linear layers without nonlinear activations produces a single linear layer. Depth without nonlinearity provides zero additional expressive power.
- LoRA works because trained weight matrices have low intrinsic rank. Most input dimensions are in the null space, so only a small subspace needs fine-tuning.
Next: PCA from Linear Algebra →
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Matrix Transformations in 3D demo on the EngineersOfAI Playground - no code required.
:::
