Linear Algebra for Machine Learning - Module Overview
Reading time: ~10 minutes | Level: Mathematical Foundations → ML Engineering
Nearly every ML algorithm you will use is, underneath, a sequence of linear algebra operations.
Attention is a scaled dot product. Backpropagation is a chain of Jacobians. PCA is eigendecomposition. A neural network forward pass is a sequence of matrix multiplications. A word embedding is a vector in 512-dimensional space. The distance between two embeddings determines whether a RAG system retrieves the right document.
If you use these tools without understanding the linear algebra underneath, you are flying blind. You can call functions, but you cannot reason about why they work, when they fail, or how to fix them.
This module teaches you to see the linear algebra inside the ML.
What This Module Covers
| Lesson | Topic | ML Algorithm It Unlocks |
|---|---|---|
| 01 | Vectors and Vector Spaces | Embeddings, KNN, cosine similarity, RAG retrieval |
| 02 | Matrix Operations | Neural network forward pass, attention, backprop |
| 03 | Eigenvalues and Eigenvectors | PCA, PageRank, graph neural networks |
| 04 | SVD and Matrix Decompositions | Recommender systems, image compression, LSA |
| 05 | Linear Transformations | Layer activations, representation learning |
| 06 | PCA from Linear Algebra | Dimensionality reduction, feature preprocessing |
| 07 | Dot Products and Projections | Attention mechanism, least squares regression |
| 08 | Norms and Distance Metrics | Regularization (L1/L2), embedding search |
| 09 | Tensors for Deep Learning | Batch operations, convolution, transformer attention |
| 10 | Linear Algebra in NumPy | Implementation, debugging, performance |
How the Concepts Connect
Part 1 - Why Linear Algebra, Why Now
The embedding explosion
In 2017, the original Transformer represented each token as a 512-dimensional vector. By 2024, some widely used embedding models produce vectors with 3,072 dimensions. Semantic search, RAG pipelines, and embedding-based recommendation systems all operate in these high-dimensional spaces.
To reason about them - to understand why cosine similarity works, why L2 distance sometimes fails, why approximate nearest neighbor algorithms are needed - you need vector spaces.
The attention mechanism is dot products
The transformer architecture, which underlies GPT, BERT, Claude, and nearly every modern LLM, is built on one operation: the scaled dot product.
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
This is not magic. It is:
- A matrix multiplication (`QKᵀ`) - covered in Lesson 02
- A scaling by a scalar (`/ √d_k`) - motivated in Lesson 08 (norms)
- A softmax (not linear algebra, but the output is)
- Another matrix multiplication (`· V`)
If you understand matrix multiplication geometrically, you understand why attention works. Lesson 07 (Dot Products and Projections) shows you exactly how.
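The four steps above can be sketched in a few lines of NumPy. This is a minimal single-head illustration on random toy data, not a production implementation: the function names `softmax` and `scaled_dot_product_attention` and the shapes are assumptions chosen for the example, and masking and batching are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_queries, n_keys) pairwise dot products
    weights = softmax(scores, axis=-1)  # each row is a distribution over keys
    return weights @ V, weights         # weighted average of the value rows

# Toy shapes (assumed for illustration): 4 queries, 6 keys/values, d_k = 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 16))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)   # (4, 16)
print(weights.shape)  # (4, 6)
```

Each output row is a convex combination of the rows of `V`, with weights set by how strongly the query aligns with each key.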
PCA is eigendecomposition
Principal Component Analysis appears throughout ML workflows: visualizing high-dimensional data, reducing feature dimensions before training, compressing representations. Once the data is centered, it comes down to one mathematical step:
Find the eigenvectors of the covariance matrix.
That is it. Lesson 03 teaches eigenvalues. Lesson 06 applies them to PCA. Lesson 04 shows you the numerically stable path through SVD.
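As a rough sketch of that one step, the snippet below centers a toy dataset, builds its covariance matrix, and takes the eigenvectors with `np.linalg.eigh`. The data, the per-feature scales, and the choice `k = 2` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data (assumed): 200 samples, 3 features with very different spreads.
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])

# Center the data, then form the sample covariance matrix.
X_centered = X - X.mean(axis=0)
cov = X_centered.T @ X_centered / (X.shape[0] - 1)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]           # re-sort to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the top-k eigenvectors and project the centered data onto them.
k = 2
components = eigvecs[:, :k]
X_reduced = X_centered @ components
print(X_reduced.shape)  # (200, 2)
```

The variance of each projected column equals the corresponding eigenvalue, which is why the eigenvalues are read as "variance explained".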
Part 2 - What Each Lesson Teaches
Lesson 01: Vectors and Vector Spaces
The fundamental object: a vector. Not just [1, 2, 3], but the geometric object it represents - a direction and magnitude in space. This lesson covers:
- What a vector space is and why the 8 axioms matter for ML
- L1, L2, and L∞ norms - and why they induce different ML behaviors
- Inner products and the angle between vectors
- High-dimensional geometry: why intuition breaks down above 3 dimensions
- NumPy vector operations and cosine similarity from scratch
Unlocks: Understanding why two embeddings that look close in L2 can point in completely different directions. Understanding why RAG uses cosine similarity instead of Euclidean distance.
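A small illustration of the first point, using hand-picked 2D vectors rather than real embeddings (the values and the helper name `cosine_similarity` are assumptions for the example):

```python
import numpy as np

def cosine_similarity(a, b):
    # Inner product divided by the product of the L2 norms.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two short vectors pointing in opposite directions.
a = np.array([0.01, 0.0])
b = np.array([-0.01, 0.0])
# A long vector pointing the same way as a.
c = np.array([10.0, 0.0])

dist_ab = np.linalg.norm(a - b)   # small L2 distance
dist_ac = np.linalg.norm(a - c)   # large L2 distance

print(dist_ab, cosine_similarity(a, b))  # close in L2, opposite in direction
print(dist_ac, cosine_similarity(a, c))  # far in L2, identical in direction
```

L2 distance ranks `b` as the nearer neighbor of `a`, while cosine similarity ranks `c` first. The two metrics disagree whenever vector lengths vary.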
Lesson 02: Matrix Operations
A matrix is a linear transformation. Multiplying two matrices composes two transformations. This lesson covers:
- What matrix multiplication actually does (not just row×column)
- Transpose: symmetric matrices, Gram matrix, and why `XᵀX` appears everywhere
- Matrix inverse: when it exists, why you almost never compute it directly
- Rank: what it reveals about your data's intrinsic dimensionality
- Determinant: the volume-scaling factor
Unlocks: Understanding why QKᵀ in attention computes pairwise similarities. Understanding why the normal equations for linear regression involve (XᵀX)⁻¹Xᵀy.
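To make the normal equations concrete, here is a hedged sketch on synthetic regression data (the weights `true_w` and the noise level are invented for the example). It solves `(XᵀX) w = Xᵀy` as a linear system rather than forming the inverse, and compares the result with NumPy's least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (assumed): 100 samples, 3 features, small additive noise.
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=100)

# Normal equations: solve (XᵀX) w = Xᵀy without computing (XᵀX)⁻¹ explicitly.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's SVD-based least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_normal)
print(w_lstsq)
```

Both routes agree on well-conditioned data like this. They can diverge when columns of `X` are nearly collinear, which is one reason the explicit inverse is usually avoided.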
Lesson 03: Eigenvalues and Eigenvectors
Some vectors pass through a linear transformation unchanged in direction - only their magnitude scales. These are eigenvectors. The scaling factors are eigenvalues. This lesson covers:
- Geometric meaning: eigenvectors as invariant directions
- The characteristic polynomial (intuition, not memorization)
- Eigendecomposition and when it exists
- Real symmetric matrices: guaranteed real eigenvalues and orthogonal eigenvectors
- Power iteration: the simplest iterative way to compute the dominant eigenvalue and eigenvector
Unlocks: Understanding PCA, PageRank, graph Laplacian, spectral clustering, and why covariance matrices are always eigendecomposable.
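Power iteration is short enough to sketch directly. The version below assumes a symmetric positive definite matrix with a single dominant eigenvalue. The function name, the fixed iteration count, and the 2×2 example matrix are illustrative choices, not a robust solver.

```python
import numpy as np

def power_iteration(A, num_iters=500):
    # Start from a random unit vector.
    rng = np.random.default_rng(0)
    v = rng.normal(size=A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = A @ v                  # apply the transformation
        v /= np.linalg.norm(v)     # renormalize so the vector does not blow up
    eigenvalue = v @ A @ v         # Rayleigh quotient for a unit vector
    return eigenvalue, v

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
lam, v = power_iteration(A)

print(lam)                         # estimate of the largest eigenvalue
print(np.linalg.eigvalsh(A)[-1])   # reference value from NumPy
```

Repeated multiplication amplifies the component along the dominant eigenvector faster than any other, so the direction of `v` settles onto it.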
Lesson 04: SVD and Matrix Decompositions
The Singular Value Decomposition generalizes eigendecomposition to any matrix (not just square ones). It is among the most widely used decompositions in applied mathematics. This lesson covers:
- SVD: the factorization A = UΣVᵀ and why it exists for every matrix
- Geometric interpretation: rotate → scale → rotate
- Truncated SVD: dimensionality reduction without computing full eigendecomposition
- LU, QR, and Cholesky decompositions
- How to compress an image using k singular values
Unlocks: Understanding collaborative filtering (Netflix Prize), LSA for text, image compression, and why sklearn.decomposition.PCA actually uses SVD internally.
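The idea behind truncated SVD and "compress with k singular values" can be sketched on a random matrix standing in for an image or ratings table (the shape and `k = 5` are assumptions). Keeping the top `k` singular triplets gives a rank-`k` approximation whose Frobenius error is determined by the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 30))   # stand-in for an image or a user-item matrix

# Thin SVD: U is (50, 30), s has 30 entries in descending order, Vt is (30, 30).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values and their vectors.
k = 5
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius error of the approximation vs. the energy in the dropped singular values.
error = np.linalg.norm(A - A_k, ord="fro")
expected = np.sqrt(np.sum(s[k:] ** 2))
print(error, expected)
```

Storing `A_k` in factored form takes `k * (50 + 30 + 1)` numbers instead of `50 * 30`, which is where the compression comes from.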
Lesson 05: Linear Transformations
A function between vector spaces that preserves addition and scalar multiplication is called a linear map. Every dense layer of a neural network applies one (plus a bias shift) before its nonlinearity. This lesson covers:
- The two defining properties of linearity
- Kernel (null space): what the transformation destroys
- Image (column space): what the transformation can produce
- Rank-nullity theorem: the fundamental constraint on information flow
- Change of basis: same transformation, different coordinate system
Unlocks: Understanding why residual connections in ResNets work (each block adds the identity map to its learned transformation, so the input always has a direct path through). Understanding what a neural network layer is geometrically doing to its inputs.
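The kernel, the rank, and the rank-nullity theorem can be checked numerically. In this sketch the 3×4 matrix is hand-built so that its third row is the sum of the first two. The matrix and the variable names are assumptions for illustration.

```python
import numpy as np

# Third row = first row + second row, so only two rows are independent.
A = np.array([[1.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 1.0, 3.0],
              [1.0, 1.0, 3.0, 4.0]])

rank = np.linalg.matrix_rank(A)

# With the full SVD, Vt is 4x4; its rows past the rank span the kernel of A.
_, s, Vt = np.linalg.svd(A)
null_basis = Vt[rank:].T          # columns are kernel basis vectors
nullity = null_basis.shape[1]

print(rank, nullity)              # 2 2
print(np.abs(A @ null_basis).max())  # numerically zero
```

Rank plus nullity equals the input dimension (here 2 + 2 = 4): every input direction is either carried into the image or collapsed to zero.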
Lesson 06: PCA from Linear Algebra
PCA is not a black box. It is the direct application of eigendecomposition to the covariance matrix of centered data. This lesson covers:
- What PCA is trying to do: find directions of maximum variance
- The covariance matrix: what it encodes about your data distribution
- Eigendecomposition → principal components
- Explained variance ratio and the scree plot
- When to use PCA and when NOT to
- PCA via SVD - the numerically stable path
Unlocks: Knowing what sklearn.decomposition.PCA actually computes. Being able to implement PCA from scratch. Understanding Eigenfaces (face recognition). Knowing why PCA fails on nonlinear manifolds.
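The two routes listed above, eigendecomposition of the covariance matrix and SVD of the centered data, yield the same variances. The sketch below checks this on correlated toy data (the data generation is an assumption) and computes the explained variance ratio from the singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated toy features (assumed): mix 4 independent columns with a random matrix.
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
X_centered = X - X.mean(axis=0)
n_samples = X.shape[0]

# SVD path: works on the data directly and never forms the covariance matrix.
_, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
variances_svd = s ** 2 / (n_samples - 1)
explained_variance_ratio = variances_svd / variances_svd.sum()

# Eigendecomposition path, for comparison (eigvalsh returns ascending order).
cov = np.cov(X_centered, rowvar=False)
variances_eig = np.linalg.eigvalsh(cov)[::-1]

print(explained_variance_ratio)
print(np.allclose(variances_svd, variances_eig))  # True
```

The rows of `Vt` are the principal directions, and plotting `explained_variance_ratio` against the component index gives the scree plot.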
Lesson 07: Dot Products and Projections
The dot product measures alignment between two vectors. Projection takes one vector and finds its shadow along another direction. These two operations are behind regression, attention, and retrieval. This lesson covers:
- Algebraic vs. geometric definition of the dot product
- Orthogonality: when dot product = 0 and why it matters for independence
- Vector projection and projection matrices
- Gram-Schmidt orthogonalization: building an orthonormal basis
- Least squares via projection: the cleanest derivation
Unlocks: Understanding why scaled dot-product attention works geometrically. Deriving the normal equations for linear regression. Understanding why Gram-Schmidt is behind QR decomposition.
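Least squares as projection can be sketched with a QR factorization, which supplies the kind of orthonormal basis that Gram-Schmidt constructs (NumPy's `qr` uses a different, more stable algorithm internally). The data here is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)

# Reduced QR: columns of Q are an orthonormal basis for the column space of X.
Q, R = np.linalg.qr(X)            # Q is (20, 3), R is (3, 3) upper triangular

# Projection matrix onto col(X), the projected target, and what is left over.
P = Q @ Q.T
y_hat = P @ y
residual = y - y_hat

# Least-squares coefficients from the triangular system R w = Qᵀ y.
w = np.linalg.solve(R, Q.T @ y)

print(np.abs(X.T @ residual).max())  # numerically zero
```

The residual is orthogonal to every column of `X`. Writing that condition out, `Xᵀ(y - Xw) = 0`, gives the normal equations.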
Lesson 08: Norms and Distance Metrics
A norm measures the size of a vector. Different norms induce different geometries, and different geometries produce different ML behaviors. L1 norms make models sparse. L2 norms make models smooth. This lesson covers:
- The three axioms that define a norm
- L1 geometry (diamond shape) and why it induces sparsity
- L2 geometry (sphere shape) and why it induces smoothness
- Frobenius norm for matrices
- Nuclear norm: the convex relaxation of rank
- Distance metrics from norms: Euclidean, Manhattan, Chebyshev
- When to use cosine similarity vs. Euclidean distance for embeddings
Unlocks: Understanding Lasso (L1) vs. Ridge (L2) regularization geometrically. Knowing when to use L2 distance vs. cosine similarity in vector search.
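A short sketch of the three norms, plus one identity that is useful for the cosine-versus-Euclidean question: for unit-length vectors, squared L2 distance equals `2 - 2·cos`, so both metrics produce the same ranking. The vectors below are toy values chosen for the example.

```python
import numpy as np

v = np.array([3.0, -4.0, 0.0])
l1 = np.linalg.norm(v, ord=1)         # |3| + |-4| + |0| = 7
l2 = np.linalg.norm(v)                # sqrt(9 + 16) = 5
linf = np.linalg.norm(v, ord=np.inf)  # max(|3|, |-4|, |0|) = 4
print(l1, l2, linf)

# Two random 64-dimensional vectors, normalized to unit length.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 64))
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

sq_dist = np.sum((a - b) ** 2)        # squared Euclidean distance
cos_sim = a @ b                       # cosine similarity of unit vectors
print(sq_dist, 2 - 2 * cos_sim)       # the two values agree
```

If embeddings are normalized before indexing, the choice between cosine similarity and L2 distance affects the scores but not the order of the results.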
Lesson 09: Tensors for Deep Learning
A tensor is a generalization of scalars, vectors, and matrices to arbitrary dimensions. Everything in deep learning is tensor algebra. This lesson covers:
- Tensors as generalized arrays: shapes, axes, and how to read them
- Tensor contractions: generalizing matrix multiplication
- Einstein summation notation: the compact language of tensor ops
- Broadcasting: how NumPy and PyTorch extend operations across dimensions
- Vectorization: why loops are slow and tensor ops are fast (SIMD, GPU)
- Implementing scaled dot-product attention using `einsum`
Unlocks: Understanding batch matrix multiplication in transformers. Understanding how convolution is a tensor contraction. Reading and writing PyTorch code that manipulates 4D tensors.
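As a taste of the einsum notation, the sketch below computes batched, multi-head attention scores two ways and checks that they match. The shapes and index letters are assumptions picked for the example. It uses NumPy, though `torch.einsum` accepts the same subscript string.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, heads, seq, d_k = 2, 4, 5, 8
Q = rng.normal(size=(batch, heads, seq, d_k))
K = rng.normal(size=(batch, heads, seq, d_k))

# b = batch, h = head, q = query position, k = key position, d = feature.
# The repeated index d is summed over; b and h are carried along unchanged.
scores_einsum = np.einsum("bhqd,bhkd->bhqk", Q, K) / np.sqrt(d_k)

# Same contraction as a batched matmul: swap the last two axes of K, then use @.
scores_matmul = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)

print(scores_einsum.shape)  # (2, 4, 5, 5)
```

The subscript string doubles as documentation: it names every axis and states which one is contracted, which helps when reading 4D tensor code.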
Lesson 10: Linear Algebra in NumPy
NumPy is the array and linear algebra layer underneath scikit-learn, and PyTorch, TensorFlow, and JAX expose closely related array APIs on their own backends. This lesson is a complete engineering reference. It covers:
- `np.linalg` module: every function explained with ML context
- Solving linear systems correctly (not with `inv`)
- Performance: memory layout, vectorization, avoiding Python loops
- Numerical stability: condition number, floating-point pitfalls
- Common ML patterns: Gram matrix, covariance, whitening, rotation
- PyTorch `torch.linalg`: the GPU-accelerated equivalent
Unlocks: Implementing any ML algorithm from scratch. Debugging numerical instability. Writing fast, vectorized ML code.
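One stability fact behind the list above can be sketched briefly: forming the Gram matrix `XᵀX` squares the 2-norm condition number of `X`. The nearly collinear toy design matrix below is an assumption chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two nearly collinear columns make X moderately ill-conditioned.
x = rng.normal(size=100)
X = np.column_stack([x, x + 1e-3 * rng.normal(size=100)])

cond_X = np.linalg.cond(X)           # ratio of largest to smallest singular value
cond_gram = np.linalg.cond(X.T @ X)  # condition number of the Gram matrix

print(cond_X)
print(cond_gram)
print(cond_X ** 2)                   # close to cond_gram
```

A larger condition number means input and rounding errors are amplified more in the solution, so solvers that work on `X` directly (QR or SVD based, such as `np.linalg.lstsq`) are generally preferred over routes that pass through `XᵀX`.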
Part 3 - How to Use This Module
If you are time-constrained
Study in this priority order:
:::tip Priority Path (4 lessons)
- Lesson 01 (Vectors) - foundational for everything
- Lesson 02 (Matrices) - needed for forward pass reasoning
- Lesson 07 (Dot Products) - needed for attention understanding
- Lesson 06 (PCA) - most commonly needed in practice
:::
If you are preparing for ML interviews
Focus on:
- Lessons 01–03: core mathematical definitions
- Lesson 06: PCA from scratch (very common interview question)
- Lesson 08: L1 vs L2 regularization (appears in almost every ML interview)
- Lesson 10: NumPy implementation patterns
If you are building production ML systems
Focus on:
- Lesson 04: SVD (used in recommender systems, dimensionality reduction)
- Lesson 08: Norms and distances (used in vector search, embedding similarity)
- Lesson 09: Tensors (needed for efficient batch processing)
- Lesson 10: NumPy performance and stability
Part 4 - Prerequisites
This module assumes:
- Comfort with Python and NumPy arrays
- High school algebra (variables, functions, equations)
- Some exposure to ML (you know what training and inference are)
This module does not assume:
- Prior university linear algebra coursework
- Deep calculus knowledge (we introduce what we need)
- Advanced mathematical maturity
Part 5 - What You Will Be Able to Do
After completing this module, you will be able to:
- Read ML papers: When a paper writes `Attention(Q,K,V) = softmax(QKᵀ/√d)V`, you will understand every symbol geometrically.
- Implement from scratch: PCA, cosine similarity, least squares regression, and the attention mechanism - all from NumPy primitives.
- Debug ML systems: When your embedding search returns wrong results, you will know whether it's a norm issue, a distance metric issue, or a high-dimensional geometry issue.
- Reason about model capacity: Rank deficiency in a weight matrix means information is being lost. You will know when this is a problem and when it is a feature.
- Write efficient ML code: Broadcasting, einsum, and vectorization instead of Python loops.
- Pass ML interviews: Every major ML interview includes linear algebra. You will be able to derive, not just recite.
Quick Reference: Linear Algebra in ML Systems
| ML Concept | Linear Algebra Behind It |
|---|---|
| Word/document embeddings | Vectors in high-dimensional space |
| Cosine similarity | Inner product / (L2 norm × L2 norm) |
| Neural network layer | Matrix multiplication + nonlinearity |
| Attention mechanism | Scaled dot product: softmax(QKᵀ/√d)V |
| Backpropagation | Chain rule = Jacobian matrix multiplication |
| PCA | Eigendecomposition of covariance matrix |
| Recommender systems | Matrix factorization via SVD |
| L1 regularization (Lasso) | L1 norm constraint on weight vector |
| L2 regularization (Ridge) | L2 norm constraint on weight vector |
| Least squares regression | Projection onto column space of X |
| Batch normalization | Per-feature centering + scaling (standardization) |
| Convolutional layer | Tensor contraction with filter tensor |
Key Takeaways
- Linear algebra is not abstract mathematics - it is the computational substrate of every ML algorithm
- Vectors represent data points, embeddings, and features in high-dimensional spaces
- Matrices represent linear transformations, weight matrices, and attention scores
- Eigenvalues and SVD reveal the intrinsic structure of data and transformations
- Norms define what "small" means and determine the geometry of regularization
- Tensors generalize everything to the batch dimensions required for GPU-accelerated ML
