Skip to main content

Linear Transformations - The Geometry of Neural Network Layers

Reading time: ~24 minutes | Level: Mathematical Foundations → ML Engineering

Every layer of a neural network applies a linear transformation. Every convolutional filter is a linear transformation. The entire field of representation learning is the study of which linear transformations to compose - and in what order.

When you call nn.Linear(512, 256) in PyTorch, you are instantiating a linear map from ℝ⁵¹² to ℝ²⁵⁶. When you stack layers, you are composing those maps. When you analyze what a network has learned, you are examining the geometry of those compositions. Understanding linear transformations is not a prerequisite for running PyTorch - but it is a prerequisite for reasoning about what your network is actually doing.

What You Will Learn

  • What makes a function a linear map: the two defining properties
  • Kernel (null space): the directions a transformation destroys and why that equals information loss
  • Image (column space): the set of everything a transformation can produce
  • Rank-nullity theorem: the fundamental constraint on information flow through a layer
  • Change of basis: why the same transformation looks different in different coordinate systems
  • Composition of linear maps: why matrix multiplication is not arbitrary
  • NumPy: implementing and visualizing linear transformations
  • ML connections: layer activations, weight matrices, LoRA low-rank adaptation

Prerequisites

Part 1 - Linear Maps: The Two Defining Properties

A function T: ℝⁿ → ℝᵐ is a linear transformation (also called a linear map) if and only if it satisfies two properties for all vectors u, v and all scalars c:

Additivity: T(u + v) = T(u) + T(v)
Homogeneity: T(cu) = c·T(u)

These two conditions combine into one: T(αu + βv) = αT(u) + βT(v) for all α, β - the superposition principle.

What these properties mean geometrically

Additivity means: transform vectors separately and add, or add first and transform - same result. Grids stay grids. Parallel lines stay parallel. The origin stays fixed (set u = v = 0: T(0) = 0).

Homogeneity means: scaling before or after the transformation produces the same result. If you double the input, you double the output.

A transformation satisfying both is "structure-preserving" - it preserves the linear structure of the vector space.

What is NOT a linear transformation

# These are NOT linear transformations:
def f1(x): return x + 5 # translation: T(0) = 5 ≠ 0
def f2(x): return x ** 2 # squaring: not homogeneous
def f3(x): return abs(x) # absolute value: violates additivity
def relu(x): return max(0, x) # ReLU: piecewise linear, NOT linear

:::danger ReLU is NOT a linear transformation ReLU(x) = max(0, x) is piecewise linear but not a linear transformation. It violates both additivity and homogeneity when input signs change. This is critical: ReLU is what gives neural networks expressive power beyond linear functions. Without activations like ReLU, stacking linear layers produces another linear layer - W₂(W₁x) = (W₂W₁)x. :::

Every linear transformation is a matrix

The most fundamental theorem about linear maps:

Every linear transformation T: ℝⁿ → ℝᵐ can be represented as a matrix multiplication T(x) = Ax for some matrix A ∈ ℝᵐˣⁿ.

The columns of A are exactly the images of the standard basis vectors: A = [T(e₁) | T(e₂) | ... | T(eₙ)].

import numpy as np

def transformation_matrix(T, n_in: int, n_out: int) -> np.ndarray:
"""
Construct the matrix A such that T(x) = A @ x.
Works for any linear transformation T: R^n_in → R^n_out.
"""
A = np.zeros((n_out, n_in))
for j in range(n_in):
e_j = np.zeros(n_in)
e_j[j] = 1.0 # standard basis vector
A[:, j] = T(e_j) # j-th column = T(e_j)
return A

# Example: 90 degree counterclockwise rotation in 2D
def rotate_90(x: np.ndarray) -> np.ndarray:
return np.array([-x[1], x[0]])

A_rot = transformation_matrix(rotate_90, n_in=2, n_out=2)
print(f"Rotation matrix:\n{A_rot}")
# [[ 0. -1.]
# [ 1. 0.]]

# Verify linearity
u = np.array([3.0, 1.0])
v = np.array([1.0, 2.0])
alpha, beta = 2.0, -1.0

lhs = rotate_90(alpha * u + beta * v)
rhs = alpha * rotate_90(u) + beta * rotate_90(v)
print(f"Superposition holds: {np.allclose(lhs, rhs)}") # True

# Show: T(x) = A @ x for any x
x = np.array([5.0, 3.0])
print(f"T(x) via function: {rotate_90(x)}")
print(f"T(x) via matrix: {A_rot @ x}")

Part 2 - Kernel (Null Space): What a Transformation Destroys

The kernel (or null space) of a linear transformation T: ℝⁿ → ℝᵐ is the set of all vectors that map to the zero vector:

ker(T) = {x ∈ ℝⁿ : T(x) = 0} = null(A)

The kernel is always a subspace (it contains zero, and is closed under addition and scaling).

Kernel as information loss

If x ∈ ker(T), then T(x) = 0. More generally:

If T(x₁) = T(x₂), then T(x₁ - x₂) = 0, so x₁ - x₂ ∈ ker(T).

This means: two inputs x₁ and x₂ that differ only by a vector in the kernel are indistinguishable after the transformation. The kernel is exactly the information that gets destroyed.

A transformation with a non-trivial kernel (dim(ker) > 0) is not injective - it cannot be inverted. Information is lost. This is the mathematical statement of what happens when a neural network layer reduces dimensionality.

Computing the kernel in NumPy

import numpy as np

def null_space(A: np.ndarray, tol: float = 1e-10) -> np.ndarray:
"""
Compute a basis for the null space of A using SVD.
Returns columns whose span equals ker(A).
"""
_, s, Vt = np.linalg.svd(A, full_matrices=True)
n = A.shape[1]
rank = int(np.sum(s > tol))
# Right singular vectors beyond rank span the null space
null_basis = Vt[rank:].T # columns span ker(A)
return null_basis

# Example: projection matrix onto x-axis in 3D
# A maps [x, y, z] to [x, 0, 0]
A = np.array([[1, 0, 0],
[0, 0, 0],
[0, 0, 0]], dtype=float)

ker = null_space(A)
print(f"Null space basis:\n{ker}")
# Columns span {[0,1,0], [0,0,1]} - the y-z plane gets destroyed

# Verify: A @ ker_vector = 0
for i in range(ker.shape[1]):
v = ker[:, i]
result = A @ v
print(f"A @ ker_vec_{i}: {result}") # [0, 0, 0]

# Neural network weight matrix example
np.random.seed(42)
W = np.random.randn(256, 512) # 256 outputs, 512 inputs

ker_W = null_space(W)
print(f"\nW shape: {W.shape}")
print(f"Null space dimension: {ker_W.shape[1]}")
# At least 256 directions in input space are completely ignored

The kernel tells you about invertibility

import numpy as np

def kernel_dimension(A: np.ndarray, tol: float = 1e-10) -> int:
"""Dimension of the kernel (null space) of A."""
_, s, _ = np.linalg.svd(A, full_matrices=False)
n = A.shape[1]
rank = np.sum(s > tol)
return n - int(rank)

# Square rank-deficient matrix
A_square = np.array([[1.0, 2.0],
[2.0, 4.0]]) # row 2 = 2 * row 1, rank 1

dim_ker = kernel_dimension(A_square)
print(f"dim(ker(A_square)) = {dim_ker}") # 1
print(f"A is invertible: {dim_ker == 0}") # False

# Proof: det(A) = 1*4 - 2*2 = 0
print(f"det(A_square) = {np.linalg.det(A_square):.4f}") # 0

Part 3 - Image (Column Space): What a Transformation Can Produce

The image (or column space, or range) of a linear transformation T: ℝⁿ → ℝᵐ is the set of all vectors that T can produce:

im(T) = {T(x) : x ∈ ℝⁿ} = col(A)

The image is the span of the columns of A. It is a subspace of the output space ℝᵐ.

Why the column space matters

The image tells you what outputs are achievable. A linear system Ax = b has a solution if and only if b ∈ im(A) - i.e., b lies in the column space of A.

In ML terms: the column space of a weight matrix W determines which output activations are reachable, no matter how you set the input.

Computing the column space

import numpy as np

def column_space_basis(A: np.ndarray, tol: float = 1e-10) -> np.ndarray:
"""
Compute an orthonormal basis for the column space of A using SVD.
Returns: matrix whose columns form an orthonormal basis for im(A).
"""
U, s, _ = np.linalg.svd(A, full_matrices=True)
rank = int(np.sum(s > tol))
return U[:, :rank] # left singular vectors with nonzero singular values

# Example
A = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]], dtype=float)
# Rows are linearly dependent: row3 = 2*row2 - row1, so rank 2

col_basis = column_space_basis(A)
print(f"A shape: {A.shape}")
print(f"Column space dimension (rank): {col_basis.shape[1]}") # 2

# Verify: is column 1 of A in the column space?
b = A[:, 0] # first column
proj = col_basis @ col_basis.T @ b
residual = np.linalg.norm(b - proj)
print(f"\nProjection residual for column of A: {residual:.2e}") # ≈ 0

# Is a random vector in the column space?
b_random = np.array([1.0, 0.0, 0.0])
proj_random = col_basis @ col_basis.T @ b_random
residual_random = np.linalg.norm(b_random - proj_random)
print(f"Random vector residual: {residual_random:.4f}") # > 0 (not in col space)

Rank: the dimension of the image

The rank of a matrix A is:

rank(A) = dim(im(A)) = number of linearly independent columns

It equals the number of non-zero singular values. For A ∈ ℝᵐˣⁿ: rank(A) ≤ min(m, n).

import numpy as np

# rank in NumPy
A = np.random.randn(100, 50) # tall matrix, rank <= 50
rank_A = np.linalg.matrix_rank(A)
print(f"rank(A) = {rank_A}") # 50 with probability 1 (random full column rank)

# Low-rank matrix
A_lowrank = np.outer(np.random.randn(100), np.random.randn(50)) # rank-1
print(f"rank(low-rank) = {np.linalg.matrix_rank(A_lowrank)}") # 1

Part 4 - Rank-Nullity Theorem: The Fundamental Information Constraint

The rank-nullity theorem is one of the most important results in linear algebra:

dim(ker(T)) + dim(im(T)) = n

where n = dim(domain) is the number of input dimensions.

In matrix terms: nullity(A) + rank(A) = n (number of columns).

What it says about information flow

The theorem states a hard constraint: input dimensions are either preserved into the image, or destroyed into the kernel. Every input dimension goes somewhere.

  • dim(im(A)) = rank(A) input dimensions survive into the output
  • dim(ker(A)) = n - rank(A) input dimensions are destroyed

There is no third option. This is the conservation law for linear transformations.

import numpy as np

def verify_rank_nullity(A: np.ndarray, tol: float = 1e-10) -> None:
"""Verify the rank-nullity theorem for matrix A."""
m, n = A.shape

_, s, _ = np.linalg.svd(A, full_matrices=False)
rank = int(np.sum(s > tol))
nullity = n - rank

print(f"A: {m}x{n}")
print(f" rank(A) = {rank}")
print(f" nullity(A) = {nullity}")
print(f" rank + nullity = {rank + nullity} (should be n = {n})")
print(f" Theorem holds: {rank + nullity == n}")

# Tall matrix (more rows than columns)
A1 = np.random.randn(100, 20)
verify_rank_nullity(A1)
# rank = 20, nullity = 0 (injective: no information lost)

print()

# Wide matrix (more columns than rows)
A2 = np.random.randn(20, 100)
verify_rank_nullity(A2)
# rank = 20, nullity = 80 (80 input dimensions destroyed)

print()

# Square singular matrix
A3 = np.array([[1, 2, 3],
[2, 4, 6],
[1, 1, 1]], dtype=float)
verify_rank_nullity(A3)
# rank = 2, nullity = 1

Rank-nullity in neural network layers

Consider a linear layer W ∈ ℝᵐˣⁿ (n inputs, m outputs):

CaseConditionMeaning
m = n, rank = nSquare, full rankInvertible layer: no information lost
m < n (compression)rank = mAt least n - m dimensions destroyed
m > n (expansion)rank = nAll input dimensions preserved; output has null space
rank < min(m, n)Rank-deficientFewer effective dimensions than size suggests

:::tip Rank-nullity explains why over-parameterized models can generalize A wide layer W ∈ ℝᵐˣⁿ with m ≪ n has nullity ≥ n - m. The optimization landscape is highly degenerate - many weight combinations produce identical outputs. This degeneracy is not a bug: it provides the redundancy that allows finding a well-generalizing solution via implicit regularization from gradient descent. :::

Part 5 - Change of Basis: Same Transformation, New Coordinates

Every vector exists in some coordinate system (basis). A change of basis is a linear transformation that re-expresses the same geometric object using a different coordinate system.

What a basis is

A basis for ℝⁿ is a set of n linearly independent vectors {b₁, b₂, ..., bₙ} such that every vector in ℝⁿ can be written as a unique linear combination of them.

The standard basis is {e₁, e₂, ..., eₙ} where eᵢ has a 1 in position i and 0 elsewhere. But any n linearly independent vectors form a valid basis.

Change of basis matrix

If B = [b₁ | b₂ | ... | bₙ] is a matrix whose columns are the new basis vectors, then:

x_new = B⁻¹ @ x_standard (from standard to new coordinates)
x_standard = B @ x_new (from new to standard coordinates)

A linear transformation A expressed in the standard basis becomes:

A_new = B⁻¹ @ A @ B

This is a similarity transformation. The matrix looks different in different bases, but it represents the same geometric transformation.

PCA as change of basis

PCA computes the eigenvectors of the covariance matrix and uses them as a new basis. Projecting data into this basis rotates the coordinate system so that:

  • The first axis points along maximum variance
  • The second axis points along the next-maximum variance (orthogonal to first)
  • And so on
import numpy as np

def change_of_basis_demo():
"""Demonstrate change of basis using PCA as the example."""
np.random.seed(42)

# Generate correlated 2D data
mean = [0, 0]
cov = [[3, 2], [2, 2]]
X = np.random.multivariate_normal(mean, cov, size=200) # (200, 2)

# Step 1: center
X_centered = X - X.mean(axis=0)

# Step 2: covariance matrix
Sigma = X_centered.T @ X_centered / (len(X) - 1) # (2, 2)

# Step 3: eigendecomposition -> new basis
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
# eigh returns ascending order -> flip for descending
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
B = eigenvectors[:, idx] # change-of-basis matrix: columns = new basis vectors

print(f"New basis vectors (principal components):")
print(f" PC1: {B[:, 0].round(4)}")
print(f" PC2: {B[:, 1].round(4)}")
print(f"Eigenvalues (variances): {eigenvalues.round(4)}")

# Step 4: project into new basis (change coordinates)
X_new = X_centered @ B

# In the new basis, axes are uncorrelated:
new_cov = X_new.T @ X_new / (len(X_new) - 1)
print(f"\nCovariance in new basis:")
print(f" Diagonal entries (variances): {np.diag(new_cov).round(4)}")
print(f" Off-diagonal (should be near 0): {new_cov[0,1]:.6f}")

# The same linear transformation, expressed in the new basis
A = np.array([[2.0, 1.0], [0.0, 1.5]]) # some linear map in standard basis
A_new_basis = np.linalg.inv(B) @ A @ B
print(f"\nA in standard basis:\n{A}")
print(f"A in PCA basis:\n{A_new_basis.round(4)}")

change_of_basis_demo()

Why change of basis is everywhere in ML

ML ConceptChange of Basis Interpretation
PCARotate to covariance eigenvector basis
WhiteningRescale PCA basis to unit variance
Word embeddingsWords expressed in a learned semantic basis
Attention headsEach head operates in a subspace of the embedding basis
Neural net hidden layerNonlinear map followed by new linear basis
Fourier transformExpress signal in frequency basis

Part 6 - Composition of Linear Maps = Matrix Multiplication

If T₁: ℝⁿ → ℝᵐ and T₂: ℝᵐ → ℝᵖ are linear maps, their composition T₂ ∘ T₁: ℝⁿ → ℝᵖ is also a linear map.

The matrix of the composition is the product of the matrices:

Matrix of (T₂ ∘ T₁) = A₂ @ A₁

This is why matrix multiplication is defined the way it is - it is not an arbitrary rule, it is the algebraic encoding of function composition.

Sequential neural network layers

import numpy as np

# A two-layer linear network
# x in R^512 -> h in R^256 -> y in R^10
W1 = np.random.randn(256, 512) * 0.01 # layer 1 weight
W2 = np.random.randn(10, 256) * 0.01 # layer 2 weight

x = np.random.randn(512)

# Forward pass: apply each linear map in sequence
h = W1 @ x # first transformation
y = W2 @ h # second transformation

# Composition: the combined map is W2 @ W1
W_combined = W2 @ W1 # shape (10, 512)
y_combined = W_combined @ x

print(f"Two-pass output match: {np.allclose(y, y_combined)}") # True

# Key insight: two linear layers with no activation = one linear layer
# -> without nonlinearity, depth adds NO expressive power
rank_combined = np.linalg.matrix_rank(W_combined)
rank_W1 = np.linalg.matrix_rank(W1)
rank_W2 = np.linalg.matrix_rank(W2)
print(f"rank(W1) = {rank_W1}, rank(W2) = {rank_W2}, rank(W2@W1) = {rank_combined}")
# rank(W_combined) <= 10 (bottlenecked by the 10-dimensional output)

Why composition limits expressiveness

The rank of a product is bounded:

rank(A @ B) <= min(rank(A), rank(B))

This means: the effective rank of a composed linear map can never exceed the rank of the narrowest layer. A bottleneck layer with m hidden units can represent at most m linearly independent output directions - no matter how many layers come before or after it (without nonlinearity).

Part 7 - NumPy: Implementing and Visualizing Linear Transformations

Core diagnostic toolkit

import numpy as np

def linear_map_diagnostics(A: np.ndarray, tol: float = 1e-10) -> dict:
"""
Comprehensive diagnostics for a linear transformation A: R^n to R^m.
Returns rank, nullity, column space basis, null space basis.
"""
m, n = A.shape
U, s, Vt = np.linalg.svd(A, full_matrices=True)

rank = int(np.sum(s > tol))
nullity = n - rank

# Column space basis: first `rank` left singular vectors
col_space_basis = U[:, :rank]

# Null space basis: right singular vectors beyond rank
null_space_basis = Vt[rank:].T # shape (n, nullity)

return {
"shape": (m, n),
"rank": rank,
"nullity": nullity,
"column_space_basis": col_space_basis,
"null_space_basis": null_space_basis,
"singular_values": s[:min(m, n)],
}

# Test with a concrete matrix
A = np.array([[1, 2, 0, 1],
[2, 4, 1, 3],
[3, 6, 1, 4]], dtype=float)

diag = linear_map_diagnostics(A)
print(f"Shape: {diag['shape']}")
print(f"Rank: {diag['rank']}")
print(f"Nullity: {diag['nullity']}")
print(f"Rank + Nullity = {diag['rank'] + diag['nullity']} (should be n={A.shape[1]})")

# Verify null space: A @ null_vector should equal 0
null_basis = diag["null_space_basis"]
for i in range(null_basis.shape[1]):
v = null_basis[:, i]
print(f"A @ null_vec_{i}: max abs = {np.abs(A @ v).max():.2e}") # near 0

# Project a vector onto the column space
b = np.array([1.0, 2.0, 4.0])
col_basis = diag["column_space_basis"]
b_proj = col_basis @ col_basis.T @ b # orthogonal projection onto col(A)
b_residual = b - b_proj
print(f"\nresidual orthogonal to col(A): "
f"{np.allclose(col_basis.T @ b_residual, 0, atol=1e-10)}")

Visualizing 2D transformations

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

def visualize_transformation(A: np.ndarray, title: str, ax) -> None:
"""Show how a 2x2 matrix transforms the unit circle and basis vectors."""
theta = np.linspace(0, 2 * np.pi, 200)
circle = np.vstack([np.cos(theta), np.sin(theta)]) # (2, 200)
circle_t = A @ circle

ax.set_aspect('equal')
ax.axhline(0, color='gray', linewidth=0.5)
ax.axvline(0, color='gray', linewidth=0.5)
ax.plot(circle[0], circle[1], 'b--', alpha=0.4, linewidth=1.5, label='original')
ax.plot(circle_t[0], circle_t[1], 'r-', linewidth=2, label='transformed')

# Draw original and transformed basis vectors
colors = ['#10b981', '#6366f1']
for j in range(2):
e_j = np.eye(2)[:, j]
e_t = A @ e_j
ax.annotate('', xy=e_j, xytext=[0, 0],
arrowprops=dict(arrowstyle='->', color=colors[j], lw=1.5))
ax.annotate('', xy=e_t, xytext=[0, 0],
arrowprops=dict(arrowstyle='->', color=colors[j], lw=2.5))

ax.set_title(title)
ax.legend(fontsize=8)
ax.set_xlim(-2.5, 2.5)
ax.set_ylim(-2.5, 2.5)

fig, axes = plt.subplots(2, 2, figsize=(10, 10))
transformations = [
(np.array([[2, 0], [0, 0.5]], dtype=float), "Scale (x2, y0.5)"),
(np.array([[0, -1], [1, 0]], dtype=float), "Rotation 90 deg"),
(np.array([[1, 1], [0, 1]], dtype=float), "Shear"),
(np.array([[1, 0], [0, -1]], dtype=float), "Reflection over x-axis"),
]

for ax, (A, title) in zip(axes.flat, transformations):
visualize_transformation(A, title, ax)

plt.tight_layout()
plt.savefig('/tmp/linear_transformations_2d.png', dpi=100, bbox_inches='tight')
print("Saved: /tmp/linear_transformations_2d.png")

Part 8 - ML Connections: Layers, LoRA, and Representation Learning

Weight matrices as linear maps

Every nn.Linear(n, m) layer in PyTorch is a linear map W: ℝⁿ → ℝᵐ (plus a bias, which makes it an affine map). The linear part:

  • Maps inputs into a new representation space: hidden activations are coordinates in a new basis
  • Has rank(W) ≤ min(n, m): the effective dimensionality of the representation
  • Has nullity(W) = n - rank(W): input directions that produce zero activation
import numpy as np

def analyze_layer_geometry(W: np.ndarray) -> None:
"""Analyze the geometry of a neural network weight matrix."""
m, n = W.shape
_, s, _ = np.linalg.svd(W, full_matrices=False)

# Effective rank: singular values > 1% of max
eff_rank = int(np.sum(s > 0.01 * s[0]))
rank = int(np.sum(s > 1e-10 * s[0]))

print(f"Layer: R^{n} -> R^{m}")
print(f" Theoretical max rank: {min(m, n)}")
print(f" Numerical rank: {rank}")
print(f" Effective rank: {eff_rank} (s > 1% of max)")
print(f" Nullity: {n - rank}")
print(f" Condition number: {s[0]/s[-1]:.2f}")

explained = np.cumsum(s**2) / np.sum(s**2)
k90 = np.searchsorted(explained, 0.90) + 1
print(f" Top-{k90} singular values capture 90% of Frobenius norm squared")

np.random.seed(42)
print("=== Random initialization ===")
analyze_layer_geometry(np.random.randn(256, 512) * 0.02)

print("\n=== Low-rank signal (simulates trained matrix) ===")
U_true = np.random.randn(256, 8)
V_true = np.random.randn(8, 512)
W_lowrank = U_true @ V_true + 0.001 * np.random.randn(256, 512)
analyze_layer_geometry(W_lowrank)

LoRA: exploiting low-rank structure

LoRA (Low-Rank Adaptation) is a fine-tuning technique built directly on the rank structure of weight matrices. The observation:

Trained weight matrices often have low intrinsic rank - most of their information is captured by a small number of singular vectors.

Instead of fine-tuning the full weight matrix W ∈ ℝᵐˣⁿ (m×n parameters), LoRA reparameterizes the update as:

W_finetuned = W_pretrained + DeltaW
DeltaW = B @ A where B in R^(m x r), A in R^(r x n), r << min(m, n)

Only B and A are learned. Parameter savings: instead of m×n, train m×r + r×n = r(m+n) parameters.

import numpy as np

class LoRALayer:
"""
LoRA (Low-Rank Adaptation) of a linear layer.
Approximates fine-tuning DeltaW = B @ A where rank(DeltaW) <= r.
"""

def __init__(self, W_pretrained: np.ndarray, rank: int, alpha: float = 1.0):
m, n = W_pretrained.shape
self.W = W_pretrained.copy() # frozen
self.r = rank
self.alpha = alpha

# LoRA parameters (these are trained during fine-tuning)
self.A = np.random.randn(rank, n) * 0.01 # (r, n)
self.B = np.zeros((m, rank)) # (m, r) -- init to zero
# B=0 means DeltaW=0 at initialization (preserves pretrained behavior)

def forward(self, x: np.ndarray) -> np.ndarray:
"""x shape: (n,) or (batch, n)"""
y_pretrained = x @ self.W.T
scale = self.alpha / self.r
y_delta = x @ self.A.T @ self.B.T * scale
return y_pretrained + y_delta

@property
def param_count(self) -> dict:
m, n = self.W.shape
return {
"frozen_W": m * n,
"lora_total": self.r * (m + n),
"savings_pct": 100 * (1 - self.r * (m + n) / (m * n)),
}

# Fine-tuning an attention projection (common LoRA target)
np.random.seed(42)
W = np.random.randn(4096, 4096) * 0.02

lora = LoRALayer(W, rank=16)
params = lora.param_count
print(f"Full layer parameters: {params['frozen_W']:,}")
print(f"LoRA parameters: {params['lora_total']:,}")
print(f"Parameter savings: {params['savings_pct']:.1f}%")

x = np.random.randn(4096)
y = lora.forward(x)
print(f"Output shape: {y.shape}")

:::tip The rank-nullity theorem explains why LoRA works If trained weight matrices have intrinsic rank r ≪ min(m, n), then dim(ker(W)) ≈ n - r. Most input directions map to zero - they are in the null space. Fine-tuning only needs to adjust the r non-trivial directions. LoRA's low-rank decomposition is a direct application of this theorem: represent the weight update in the r-dimensional subspace that actually matters. :::

Layer activations as change of basis

The hidden representation h = W @ x is literally a change of coordinates: x expressed in the basis defined by the rows of W.

import numpy as np

np.random.seed(0)
W1 = np.random.randn(64, 128) # layer 1: R^128 -> R^64
W2 = np.random.randn(32, 64) # layer 2: R^64 -> R^32

x = np.random.randn(128)

h1 = W1 @ x # x re-expressed in W1's row space
h2 = W2 @ h1 # h1 further re-expressed in W2's row space

# Combined: (W2 W1) x -- one linear map from R^128 to R^32
h2_direct = (W2 @ W1) @ x

print(f"Stacked layers match combined: {np.allclose(h2, h2_direct)}") # True
print(f"rank(W2 @ W1): {np.linalg.matrix_rank(W2 @ W1)}")
# Bounded by min(32, rank(W1), rank(W2)) = 32
# No benefit to stacking without nonlinearity

Interview Questions

Q1: What is the null space of a matrix and what does it mean for a neural network layer?

The null space (kernel) of a matrix W is the set of all vectors x such that Wx = 0:

ker(W) = {x : Wx = 0}

It is a subspace of the input space, with dimension = n - rank(W) (by the rank-nullity theorem).

For a neural network layer W ∈ ℝᵐˣⁿ:

  1. Information loss: Any input x in the null space produces zero activation. Two inputs x₁ and x₂ that differ only by a null-space vector are indistinguishable: W(x₁) = W(x₁ + z) for any z ∈ ker(W).

  2. Effective dimensionality: dim(ker(W)) = n - rank(W) input dimensions are invisible to the layer. Only rank(W) input directions actually influence the output.

  3. Compression layers (m < n): the null space has dimension at least n - m. At minimum n - m input directions are destroyed, regardless of what later layers do.

  4. LoRA implication: If rank(W) ≈ r for small r, then most input dimensions (n - r of them) are in the null space. Fine-tuning only needs to modify the r active directions - motivation for LoRA's low-rank parameterization.

Q2: What is the rank-nullity theorem and why does it matter for deep learning?

The rank-nullity theorem states: for a linear map T: ℝⁿ → ℝᵐ represented by matrix A:

rank(A) + nullity(A) = n

where rank(A) = dim(im(A)) and nullity(A) = dim(ker(A)).

Why it matters for deep learning:

  1. Hard information bottleneck: A layer W ∈ ℝᵐˣⁿ with m < n can preserve at most m input dimensions. The remaining n - m are destroyed. This is a mathematical constraint, not a design choice.

  2. Effective capacity: Even a wide layer may have low effective rank. Rank-nullity reveals the true information throughput: rank(W) dimensions in, rank(W) dimensions out.

  3. Composition bottleneck: rank(W₂ W₁) ≤ min(rank(W₂), rank(W₁)). The narrowest layer limits the effective dimensionality of all subsequent representations.

  4. LoRA: Trained weight matrices have low intrinsic rank (most dimensions in the null space) - the empirical observation that motivates parameter-efficient fine-tuning.

  5. Over-parameterization: Wide networks (n ≫ m) have large null spaces, providing many equivalent solutions - key to why gradient descent finds well-generalizing minima.

Q3: How is PCA a change of basis?

PCA is exactly a change of basis where the new basis vectors are the eigenvectors of the data covariance matrix.

The mathematical steps:

  1. Center: X_c = X - mean(X)
  2. Covariance: Σ = X_cᵀ X_c / (n-1) - symmetric positive semidefinite
  3. Eigendecompose: Σ = Q Λ Qᵀ, where Q = [q₁ | ... | qₐ] are orthonormal eigenvectors
  4. Change of basis: express each data point in the new basis: x_new = Qᵀ x_c

The matrix Q is the change-of-basis matrix. Since Q is orthogonal (Qᵀ = Q⁻¹), it represents a pure rotation with no distortion.

What the new basis achieves:

  • In the original basis, features are correlated (off-diagonal Σ entries non-zero)
  • In the eigenvector basis, covariance becomes diagonal Λ - all features decorrelated
  • Diagonal entries λ₁ ≥ λ₂ ≥ ... are the variances along each new axis

Dimensionality reduction: keeping only the first k columns of Q (largest eigenvalues) gives the least-squares optimal k-dimensional approximation.

Key insight: PCA does not modify the data - it only changes the coordinate system to one that is more informative. The "principal components" are just coordinate axes in a new basis.

Q4: What makes a function linear and why does that matter for ML?

A function T: ℝⁿ → ℝᵐ is linear if and only if:

  1. Additivity: T(u + v) = T(u) + T(v) for all u, v
  2. Homogeneity: T(cu) = c·T(u) for all scalars c and vectors u

Combined: T(αu + βv) = αT(u) + βT(v) - the superposition principle.

Equivalently: T is linear iff T(x) = Ax for some matrix A ∈ ℝᵐˣⁿ.

Why it matters for ML:

  1. Computability: Linear maps are O(mn) to apply, have well-understood geometry (rank, null space, condition number), and compose cleanly.

  2. Non-expressiveness of pure linearity: Stacking n linear layers = one linear layer (A_n ⋯ A₁)x. Without nonlinear activations, any depth network is equivalent to a single matrix multiplication.

  3. Nonlinear activations (ReLU, GeLU) break linearity, enabling universal function approximation. Linear layers provide directional control; activations provide bending.

  4. Optimization: The Jacobian of T(x) = Ax is constant (equals A). Gradients flow cleanly through linear layers without vanishing or exploding on their own.

  5. Analysis: Linear layers can be analyzed with SVD (LoRA), spectral norms (Lipschitz bounds), condition numbers (training stability), rank (effective dimensionality).

Practice Challenges

Level 1: Predict

Challenge: Without computing, predict the rank and nullity for each matrix:

  1. A ∈ ℝ⁵ˣ³ that is randomly generated (generic case)
  2. B = vvᵀ where v ∈ ℝ⁴ (outer product of a vector with itself)
  3. C ∈ ℝ³ˣ⁵ where rows 2 and 3 are linear combinations of row 1
Answer
  1. A ∈ ℝ⁵ˣ³ random: rank = 3, nullity = 0. A random matrix has full column rank with probability 1. Rank-nullity: 3 + 0 = 3. The map is injective (no information loss).

  2. B = vvᵀ ∈ ℝ⁴ˣ⁴: rank = 1, nullity = 3. Every column of B is a scalar multiple of v: B[:,j] = v·vⱼ. So col(B) = span{v}. Rank-nullity: 1 + 3 = 4.

  3. C ∈ ℝ³ˣ⁵ with dependent rows: rank = 1, nullity = 4. Only one independent row. Row rank = column rank = 1. Rank-nullity: 1 + 4 = 5.

import numpy as np

v = np.array([1.0, 2.0, 3.0, 4.0])
B = np.outer(v, v)
print(np.linalg.matrix_rank(B)) # 1

row = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
C = np.vstack([row, 2*row, 3*row])
print(np.linalg.matrix_rank(C)) # 1

Level 2: Debug

Challenge: The following code attempts to compute the projection of b onto the column space of A, but crashes or gives wrong results for rank-deficient A. Find and fix the bug:

import numpy as np

A = np.array([[1, 2],
[2, 4],
[3, 6]], dtype=float) # rank 1 (columns proportional)

b = np.array([1.0, 3.0, 2.0])

# Attempted projection: b_proj = A (A^T A)^{-1} A^T b
P = A @ np.linalg.inv(A.T @ A) @ A.T
b_proj = P @ b
print(b_proj)
Answer

Bug: A.T @ A is singular when A is rank-deficient (here rank = 1, so A.T @ A ∈ ℝ²ˣ² is singular). np.linalg.inv either raises LinAlgError or returns numerically garbage results.

Fix: Use the pseudoinverse, which handles rank-deficient matrices correctly:

import numpy as np

A = np.array([[1, 2],
[2, 4],
[3, 6]], dtype=float) # rank 1

b = np.array([1.0, 3.0, 2.0])

# Fix 1: pseudoinverse handles rank deficiency
A_pinv = np.linalg.pinv(A)
P_correct = A @ A_pinv
b_proj = P_correct @ b
print(f"Correct projection: {b_proj}")

# Fix 2: SVD-based (most numerically stable)
U, s, Vt = np.linalg.svd(A, full_matrices=True)
rank = int(np.sum(s > 1e-10 * s[0]))
col_basis = U[:, :rank]
b_proj_svd = col_basis @ col_basis.T @ b
print(f"SVD projection: {b_proj_svd}")

# Verify: residual is orthogonal to column space
residual = b - b_proj
print(f"residual dot col1: {np.dot(residual, A[:, 0]):.2e}") # near 0
print(f"residual dot col2: {np.dot(residual, A[:, 1]):.2e}") # near 0

Root cause: The projection formula P = A(AᵀA)⁻¹Aᵀ (the hat matrix in regression) is only valid when A has full column rank. For rank-deficient A, use P = A A⁺ where A⁺ = pinv(A).

Level 3: Design

Challenge: Implement effective_rank(W, threshold=0.99) returning the number of singular values needed to capture threshold of the total Frobenius norm squared. Then use it to compare a random matrix against a simulated trained matrix.

Answer
import numpy as np

def effective_rank(W: np.ndarray, threshold: float = 0.99) -> dict:
"""
Compute the effective rank of W at the given explained-variance threshold.

Args:
W: weight matrix of shape (m, n)
threshold: fraction of Frobenius norm squared to capture

Returns:
dict with rank info and singular value spectrum
"""
_, s, _ = np.linalg.svd(W, full_matrices=False)

s_sq = s ** 2
cumulative = np.cumsum(s_sq) / s_sq.sum()

# Minimum k such that top-k singular values capture >= threshold
k = int(np.searchsorted(cumulative, threshold)) + 1

return {
"matrix_shape": W.shape,
"max_possible_rank": min(W.shape),
"effective_rank": k,
"singular_values": s,
"cumulative_explained": cumulative,
"compression_ratio": k / min(W.shape),
}

np.random.seed(42)
m, n = 256, 512

# Random initialization: singular values roughly uniform
W_random = np.random.randn(m, n) * 0.02

# Simulated trained matrix: low-rank signal dominates
r_true = 32
U_sig = np.random.randn(m, r_true)
V_sig = np.random.randn(r_true, n)
W_trained = W_random + 0.5 * U_sig @ V_sig

for name, W in [("Random init", W_random), ("Trained (low-rank signal)", W_trained)]:
result = effective_rank(W, threshold=0.99)
print(f"\n=== {name} ===")
print(f" Max possible rank: {result['max_possible_rank']}")
print(f" Effective rank (99%): {result['effective_rank']}")
print(f" Compression ratio: {result['compression_ratio']:.3f}")

# LoRA parameter savings at effective rank
r_eff = effective_rank(W_trained, threshold=0.99)["effective_rank"]
params_full = m * n
params_lora = r_eff * (m + n)
print(f"\nLoRA at effective rank {r_eff}:")
print(f" Full params: {params_full:,}")
print(f" LoRA params: {params_lora:,}")
print(f" Savings: {100*(1 - params_lora/params_full):.1f}%")

Quick Reference Cheatsheet

ConceptDefinitionNumPyML Use
Kernel / null space{x : Ax = 0}Vt[rank:].T from SVDDestroyed input directions
Image / column space{Ax : x ∈ ℝⁿ}U[:, :rank] from SVDReachable outputs
Rankdim(im(A))np.linalg.matrix_rank(A)Effective dimensionality
Nullityn - rankn - matrix_rank(A)Redundant input dimensions
Rank-nullityrank + nullity = nAlways holdsConservation law for info flow
Change of basisA_new = B⁻¹ABnp.linalg.solve(B, A @ B)PCA, whitening
CompositionT₂ ∘ T₁ → A₂A₁A2 @ A1Stacking layers = multiplying matrices
Projection onto col(A)A A⁺ bA @ np.linalg.pinv(A) @ bLeast squares, attention

Key Takeaways

  • A linear map is completely characterized by its matrix: T(x) = Ax. Every neural network linear layer is a linear map from input space to output space.
  • The kernel (null space) is the set of inputs destroyed by the transformation. Its dimension equals n - rank(A) - the amount of information permanently lost.
  • The image (column space) is the set of achievable outputs. Its dimension equals rank(A).
  • Rank-nullity theorem: rank + nullity = n. Every input dimension either survives into the image or is destroyed into the kernel. No middle ground.
  • Change of basis re-expresses the same transformation in different coordinates. PCA is change of basis to the eigenvector coordinate system where covariance becomes diagonal.
  • Stacking linear layers without nonlinear activations produces a single linear layer. Depth without nonlinearity provides zero additional expressive power.
  • LoRA works because trained weight matrices have low intrinsic rank. Most input dimensions are in the null space, so only a small subspace needs fine-tuning.

Next: PCA from Linear Algebra →

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Matrix Transformations in 3D demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.