Linear Algebra for Machine Learning - Module Overview

Reading time: ~10 minutes | Level: Mathematical Foundations → ML Engineering

Nearly every ML algorithm you will use is, underneath, a sequence of linear algebra operations.

Attention is a scaled dot product. Backpropagation is a chain of Jacobians. PCA is eigendecomposition. A neural network forward pass is a sequence of matrix multiplications. A word embedding is a vector in 512-dimensional space. The distance between two embeddings determines whether a RAG system retrieves the right document.

If you use these tools without understanding the linear algebra underneath, you are flying blind. You can call functions, but you cannot reason about why they work, when they fail, or how to fix them.

This module teaches you to see the linear algebra inside the ML.

What This Module Covers

Lesson	Topic	ML Algorithm It Unlocks
01	Vectors and Vector Spaces	Embeddings, KNN, cosine similarity, RAG retrieval
02	Matrix Operations	Neural network forward pass, attention, backprop
03	Eigenvalues and Eigenvectors	PCA, PageRank, graph neural networks
04	SVD and Matrix Decompositions	Recommender systems, image compression, LSA
05	Linear Transformations	Layer activations, representation learning
06	PCA from Linear Algebra	Dimensionality reduction, feature preprocessing
07	Dot Products and Projections	Attention mechanism, least squares regression
08	Norms and Distance Metrics	Regularization (L1/L2), embedding search
09	Tensors for Deep Learning	Batch operations, convolution, transformer attention
10	Linear Algebra in NumPy	Implementation, debugging, performance

How the Concepts Connect

Part 1 - Why Linear Algebra, Why Now

The embedding explosion

In 2017, the original Transformer represented each token as a 512-dimensional vector. By 2024, some state-of-the-art embedding models produced vectors with 3,072 dimensions. Semantic search, RAG pipelines, and recommendation systems all operate in these high-dimensional spaces.

To reason about them - to understand why cosine similarity works, why L2 distance sometimes fails, why approximate nearest neighbor algorithms are needed - you need vector spaces.

The attention mechanism is dot products

The transformer architecture, which underlies GPT, BERT, Claude, and nearly every modern LLM, is built on one operation: the scaled dot product.

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

This is not magic. It is:

A matrix multiplication (QKᵀ) - covered in Lesson 02
A scaling by a scalar (/ √d_k) - motivated in Lesson 08 (norms)
A softmax (not a linear operation, but its output is a matrix of weights that feeds the next multiplication)
Another matrix multiplication ( · V)

If you understand matrix multiplication geometrically, you understand why attention works. Lesson 07 (Dot Products and Projections) shows you exactly how.
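The four steps listed above can be sketched directly in NumPy. This is a minimal single-head illustration, not a production implementation: the function names, the random inputs, and the shapes (4 queries, 6 keys, dimension 8) are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys) pairwise dot products
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V, weights          # weighted average of the value rows

# Hypothetical toy inputs: 4 queries, 6 keys/values, d_k = 8, d_v = 16.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 16))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)
```

Each row of `weights` sums to 1, so every output row is a convex combination of the rows of `V`.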

PCA is eigendecomposition

Principal Component Analysis appears throughout ML workflows: visualizing high-dimensional data, reducing feature dimensions before training, compressing representations. Once the data is centered, it comes down to one mathematical step:

Find the eigenvectors of the covariance matrix.

That is it. Lesson 03 teaches eigenvalues. Lesson 06 applies them to PCA. Lesson 04 shows you the numerically stable path through SVD.

Part 2 - What Each Lesson Teaches

Lesson 01: Vectors and Vector Spaces

The fundamental object: a vector. Not just [1, 2, 3], but the geometric object it represents - a direction and magnitude in space. This lesson covers:

What a vector space is and why the 8 axioms matter for ML
L1, L2, and L∞ norms - and why they induce different ML behaviors
Inner products and the angle between vectors
High-dimensional geometry: why intuition breaks down above 3 dimensions
NumPy vector operations and cosine similarity from scratch

Unlocks: Understanding why two embeddings that look close in L2 can point in completely different directions. Understanding why RAG uses cosine similarity instead of Euclidean distance.
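A small sketch of the first point, using made-up 2-dimensional vectors in place of real embeddings: `a` and `b` are close in L2 distance yet point in opposite directions, while `a` and `c` are far apart in L2 yet point the same way. The helper name `cosine_similarity` is an assumption for this example.

```python
import numpy as np

def cosine_similarity(a, b):
    # Inner product divided by the product of the L2 norms.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.01, 0.0])
b = np.array([-0.01, 0.0])   # opposite direction, tiny norm
c = np.array([10.0, 0.0])    # same direction as a, large norm

l2_ab = np.linalg.norm(a - b)   # small distance
l2_ac = np.linalg.norm(a - c)   # large distance
cos_ab = cosine_similarity(a, b)
cos_ac = cosine_similarity(a, c)

print(f"L2(a, b) = {l2_ab:.2f}, cos(a, b) = {cos_ab:+.1f}")
print(f"L2(a, c) = {l2_ac:.2f}, cos(a, c) = {cos_ac:+.1f}")
```

L2 distance mixes direction and magnitude; cosine similarity looks at direction only, which is often what matters when comparing embeddings.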

Lesson 02: Matrix Operations

A matrix is a linear transformation. Multiplying two matrices composes two transformations. This lesson covers:

What matrix multiplication actually does (not just row×column)
Transpose: symmetric matrices, Gram matrix, and why XᵀX appears everywhere
Matrix inverse: when it exists, why you almost never compute it directly
Rank: what it reveals about your data's intrinsic dimensionality
Determinant: the volume-scaling factor

Unlocks: Understanding why QKᵀ in attention computes pairwise similarities. Understanding why the normal equations for linear regression involve (XᵀX)⁻¹Xᵀy.
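As a hedged illustration of the normal equations, the sketch below fits a linear regression on synthetic data (the data, `true_w`, and the noise level are assumptions for the example). It solves the system (XᵀX)w = Xᵀy with `np.linalg.solve` rather than forming the inverse, and cross-checks against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem: 100 samples, 3 features, known weights.
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=100)

# Normal equations: (X^T X) w = X^T y, solved as a linear system.
gram = X.T @ X                          # the Gram matrix X^T X (3 x 3, symmetric)
w_solve = np.linalg.solve(gram, X.T @ y)

# Reference solution from NumPy's least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_solve)
print(w_lstsq)
```

Both paths agree here because X is well conditioned; for ill-conditioned problems `lstsq` is generally the safer choice.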

Lesson 03: Eigenvalues and Eigenvectors

Some vectors pass through a linear transformation unchanged in direction - only their magnitude scales. These are eigenvectors. The scaling factors are eigenvalues. This lesson covers:

Geometric meaning: eigenvectors as invariant directions
The characteristic polynomial (intuition, not memorization)
Eigendecomposition and when it exists
Real symmetric matrices: guaranteed real eigenvalues and orthogonal eigenvectors
Power iteration: the simplest iterative method for finding a dominant eigenvector, and the idea behind PageRank

Unlocks: Understanding PCA, PageRank, graph Laplacian, spectral clustering, and why covariance matrices are always eigendecomposable.
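Power iteration is short enough to sketch in full. The version below is a minimal illustration on a small symmetric matrix; the function name, the fixed iteration count, and the example matrix are assumptions. It repeatedly multiplies by A and renormalizes, then estimates the eigenvalue with the Rayleigh quotient.

```python
import numpy as np

def power_iteration(A, num_iters=200, seed=0):
    # Start from a random unit vector.
    rng = np.random.default_rng(seed)
    v = rng.normal(size=A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        w = A @ v                    # apply the transformation
        v = w / np.linalg.norm(w)    # renormalize so the vector does not blow up
    eigenvalue = v @ A @ v           # Rayleigh quotient for a unit vector
    return eigenvalue, v

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

dominant_value, dominant_vector = power_iteration(A)
reference = np.linalg.eigvalsh(A).max()   # largest eigenvalue from LAPACK

print(dominant_value, reference)
```

The method converges when one eigenvalue strictly dominates in magnitude; the closer the top two eigenvalues are, the slower the convergence.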

Lesson 04: SVD and Matrix Decompositions

The Singular Value Decomposition generalizes eigendecomposition to any matrix (not just square ones). It is the most powerful decomposition in applied mathematics. This lesson covers:

SVD: the factorization A = UΣVᵀ, which exists for every matrix
Geometric interpretation: rotate → scale → rotate
Truncated SVD: dimensionality reduction without computing full eigendecomposition
LU, QR, and Cholesky decompositions
How to compress an image using k singular values

Unlocks: Understanding collaborative filtering (Netflix Prize), LSA for text, image compression, and why sklearn.decomposition.PCA actually uses SVD internally.
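Compression with k singular values can be sketched as follows. A random 50×30 matrix stands in for a grayscale image (an assumption for the example, as is the helper name `rank_k_approximation`). The final check uses the Eckart-Young result that the Frobenius error of the best rank-k approximation equals the root sum of squares of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 30))   # stand-in for a grayscale image

# Thin SVD: U is 50x30, s holds 30 singular values (descending), Vt is 30x30.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k_approximation(k):
    # Keep only the k largest singular values and their singular vectors.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

k = 5
A_k = rank_k_approximation(k)

error = np.linalg.norm(A - A_k, ord="fro")
expected_error = np.sqrt(np.sum(s[k:] ** 2))   # energy in the discarded values

print(error, expected_error)
```

Storing `U[:, :k]`, `s[:k]`, and `Vt[:k, :]` takes k·(50 + 30 + 1) numbers instead of 50·30, which is where the compression comes from.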

Lesson 05: Linear Transformations

A function between vector spaces that preserves addition and scalar multiplication is called a linear map. The weight matrix of every dense neural network layer is one; the bias and the activation function are what make the full layer nonlinear. This lesson covers:

The two defining properties of linearity
Kernel (null space): what the transformation destroys
Image (column space): what the transformation can produce
Rank-nullity theorem: the fundamental constraint on information flow
Change of basis: same transformation, different coordinate system

Unlocks: Understanding why residual connections in ResNets work (each block adds the identity map to its learned transformation, so the input always has a direct path through). Understanding what a neural network layer is geometrically doing to its inputs.
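The rank-nullity theorem can be checked numerically. In this sketch the 3×4 matrix is a made-up example whose third row is the sum of the first two, so its rank is 2. A null-space basis is read off the SVD as the right singular vectors beyond the rank.

```python
import numpy as np

# Third row = first row + second row, so only 2 rows are independent.
A = np.array([[1.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 1.0, 3.0],
              [1.0, 1.0, 3.0, 4.0]])

rank = np.linalg.matrix_rank(A)   # dimension of the image (column space)

# Full SVD: Vt is 4x4; its rows past the rank span the kernel (null space).
_, s, Vt = np.linalg.svd(A)
null_basis = Vt[rank:].T          # shape (4, nullity)
nullity = null_basis.shape[1]

print(rank, nullity, A.shape[1])  # rank + nullity equals the input dimension
```

Every column of `null_basis` is an input direction that A maps to zero: information along those directions is destroyed by the transformation.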

Lesson 06: PCA from Linear Algebra

PCA is not a black box. It is the direct application of eigendecomposition to the covariance matrix of centered data. This lesson covers:

What PCA is trying to do: find directions of maximum variance
The covariance matrix: what it encodes about your data distribution
Eigendecomposition → principal components
Explained variance ratio and the scree plot
When to use PCA and when NOT to
PCA via SVD - the numerically stable path

Unlocks: Knowing what sklearn.decomposition.PCA actually computes. Being able to implement PCA from scratch. Understanding Eigenfaces (face recognition). Knowing why PCA fails on nonlinear manifolds.
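A from-scratch sketch of both routes, under the assumption of a small synthetic dataset (200 samples, 3 correlated features): eigendecomposition of the covariance matrix, and SVD of the centered data. The two should yield the same variances and, up to sign, the same principal directions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: mix independent Gaussians to get correlated features.
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.5, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing
Xc = X - X.mean(axis=0)                   # center each feature
n = Xc.shape[0]

# Route 1: eigendecomposition of the covariance matrix.
cov = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending order
order = np.argsort(eigvals)[::-1]         # re-sort to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data, without forming the covariance matrix.
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_variances = s ** 2 / (n - 1)

explained_variance_ratio = eigvals / eigvals.sum()
projected = Xc @ eigvecs[:, :2]           # keep the top 2 principal components

print(explained_variance_ratio)
```

The SVD route avoids squaring the data in `Xc.T @ Xc`, which is why it is usually described as the more numerically stable path.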

Lesson 07: Dot Products and Projections

The dot product measures alignment between two vectors. Projection takes one vector and finds its shadow along another direction. These two operations are behind regression, attention, and retrieval. This lesson covers:

Algebraic vs. geometric definition of the dot product
Orthogonality: when dot product = 0 and why it matters for independence
Vector projection and projection matrices
Gram-Schmidt orthogonalization: building an orthonormal basis
Least squares via projection: the cleanest derivation

Unlocks: Understanding why scaled dot-product attention works geometrically. Deriving the normal equations for linear regression. Understanding why Gram-Schmidt is behind QR decomposition.
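Projection and its link to least squares can be shown in a few lines. The vectors and the three-point regression problem below are made-up examples, and `project_onto` is a hypothetical helper name. The key fact in both halves is that the residual is orthogonal to what was projected onto.

```python
import numpy as np

def project_onto(v, u):
    # Shadow of v along the direction u: (v·u / u·u) u.
    return (v @ u) / (u @ u) * u

v = np.array([3.0, 4.0])
u = np.array([1.0, 0.0])
shadow = project_onto(v, u)     # component of v along u
residual = v - shadow           # what is left over, orthogonal to u
print(shadow, residual)         # [3. 0.] [0. 4.]

# Least squares: X @ w is the projection of y onto the column space of X.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 2.0, 2.0])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ w

# The fitting error is orthogonal to every column of X.
print(X.T @ (y - y_hat))
```

Setting `X.T @ (y - X @ w)` to zero and rearranging gives the normal equations, which is the projection-based derivation of least squares.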

Lesson 08: Norms and Distance Metrics

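A norm measures the size of a vector. Different norms induce different geometries, and different geometries produce different ML behaviors. An L1 penalty drives many weights to exactly zero, making models sparse. An L2 penalty shrinks all weights toward zero without eliminating them, making models smooth. This lesson covers: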

The three axioms that define a norm
L1 geometry (diamond shape) and why it induces sparsity
L2 geometry (sphere shape) and why it induces smoothness
Frobenius norm for matrices
Nuclear norm: the convex relaxation of rank
Distance metrics from norms: Euclidean, Manhattan, Chebyshev
When to use cosine similarity vs. Euclidean distance for embeddings

Unlocks: Understanding Lasso (L1) vs. Ridge (L2) regularization geometrically. Knowing when to use L2 distance vs. cosine similarity in vector search.
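A short sketch of the three norms on a made-up weight vector, followed by one fact that helps with the cosine-versus-Euclidean question: for unit-normalized vectors, squared Euclidean distance equals 2 − 2·cosine, so both metrics rank neighbors identically. The random 16-dimensional vectors are stand-ins for embeddings.

```python
import numpy as np

w = np.array([3.0, -4.0, 0.0])

l1 = np.linalg.norm(w, ord=1)         # sum of absolute values: 7.0
l2 = np.linalg.norm(w)                # Euclidean length: 5.0
linf = np.linalg.norm(w, ord=np.inf)  # largest absolute entry: 4.0
print(l1, l2, linf)

# Two random stand-in embeddings, normalized to unit L2 length.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 16))
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)

cosine = a_hat @ b_hat
squared_distance = np.sum((a_hat - b_hat) ** 2)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b).
print(squared_distance, 2 - 2 * cosine)
```

The choice between the two metrics therefore only matters when the embeddings are not normalized, in which case vector length starts to influence L2 distance.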

Lesson 09: Tensors for Deep Learning

A tensor is a generalization of scalars, vectors, and matrices to arbitrary dimensions. Everything in deep learning is tensor algebra. This lesson covers:

Tensors as generalized arrays: shapes, axes, and how to read them
Tensor contractions: generalizing matrix multiplication
Einstein summation notation: the compact language of tensor ops
Broadcasting: how NumPy and PyTorch extend operations across dimensions
Vectorization: why loops are slow and tensor ops are fast (SIMD, GPU)
Implementing scaled dot-product attention using einsum

Unlocks: Understanding batch matrix multiplication in transformers. Understanding how convolution is a tensor contraction. Reading and writing PyTorch code that manipulates 4D tensors.
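The einsum formulation of attention can be sketched on a 4D tensor. The shapes (batch 2, 4 heads, sequence length 5, d_k = 8) and the random inputs are assumptions for illustration. The subscript strings name each axis, which makes the contraction explicit: `d` is summed over for the scores, `k` for the output.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, heads, seq, d_k = 2, 4, 5, 8

Q = rng.normal(size=(batch, heads, seq, d_k))
K = rng.normal(size=(batch, heads, seq, d_k))
V = rng.normal(size=(batch, heads, seq, d_k))

# Contract the feature axis d: one (query, key) score per batch element and head.
scores = np.einsum("bhqd,bhkd->bhqk", Q, K) / np.sqrt(d_k)

# Softmax over the key axis, with the max subtracted for stability.
scores -= scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)

# Contract the key axis k: weighted sum of value vectors.
output = np.einsum("bhqk,bhkd->bhqd", weights, V)

# The same contraction written as a batched matrix multiplication.
reference = weights @ V
print(output.shape)
```

Broadcasting handles the batch and head axes in both forms; no Python loop over batches or heads is needed.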

Lesson 10: Linear Algebra in NumPy

NumPy is the linear algebra engine underneath scikit-learn, and its array API is the model that PyTorch, TensorFlow, and JAX follow (each of those runs its own compute backend). This lesson is a complete engineering reference. It covers:

np.linalg module: every function explained with ML context
Solving linear systems correctly (not with inv)
Performance: memory layout, vectorization, avoiding Python loops
Numerical stability: condition number, floating-point pitfalls
Common ML patterns: Gram matrix, covariance, whitening, rotation
PyTorch torch.linalg: the GPU-accelerated equivalent

Unlocks: Implementing any ML algorithm from scratch. Debugging numerical instability. Writing fast, vectorized ML code.
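Two of the points above, sketched with made-up matrices: the condition number as a warning sign for numerical trouble, and `np.linalg.solve` as the way to solve Ax = b without forming an inverse. The diagonal shift added to `A` is an assumption that keeps the random system well conditioned for the example.

```python
import numpy as np

# Condition number: ratio of largest to smallest singular value.
well = np.array([[2.0, 0.0],
                 [0.0, 1.0]])
ill = np.array([[1.0, 1.0],
                [1.0, 1.0 + 1e-10]])   # rows are almost identical

cond_well = np.linalg.cond(well)
cond_ill = np.linalg.cond(ill)
print(f"{cond_well:.1f}  {cond_ill:.1e}")

# Solving a linear system: use solve, which factorizes A, not inv(A) @ b.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 50)) + 50.0 * np.eye(50)   # shifted to be well conditioned
x_true = rng.normal(size=50)
b = A @ x_true

x_solve = np.linalg.solve(A, b)
print(np.max(np.abs(x_solve - x_true)))
```

As a rough rule, a condition number around 10^k means about k digits of precision can be lost when solving with that matrix, which is why the nearly singular example is a red flag.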

Part 3 - How to Use This Module

If you are time-constrained

Study in this priority order:

:::tip Priority Path (4 lessons)

Lesson 01 (Vectors) - foundational for everything
Lesson 02 (Matrices) - needed for forward pass reasoning
Lesson 07 (Dot Products) - needed for attention understanding
Lesson 06 (PCA) - most commonly needed in practice

:::

If you are preparing for ML interviews

Focus on:

Lessons 01–03: core mathematical definitions
Lesson 06: PCA from scratch (very common interview question)
Lesson 08: L1 vs L2 regularization (appears in almost every ML interview)
Lesson 10: NumPy implementation patterns

If you are building production ML systems

Focus on:

Lesson 04: SVD (used in recommender systems, dimensionality reduction)
Lesson 08: Norms and distances (used in vector search, embedding similarity)
Lesson 09: Tensors (needed for efficient batch processing)
Lesson 10: NumPy performance and stability

Part 4 - Prerequisites

This module assumes:

Comfort with Python and NumPy arrays
High school algebra (variables, functions, equations)
Some exposure to ML (you know what training and inference are)

This module does not assume:

Prior university linear algebra coursework
Deep calculus knowledge (we introduce what we need)
Advanced mathematical maturity

Part 5 - What You Will Be Able to Do

After completing this module, you will be able to:

Read ML papers: When a paper writes Attention(Q,K,V) = softmax(QKᵀ/√d)V, you will understand every symbol geometrically.
Implement from scratch: PCA, cosine similarity, least squares regression, and the attention mechanism - all from NumPy primitives.
Debug ML systems: When your embedding search returns wrong results, you will know whether it's a norm issue, a distance metric issue, or a high-dimensional geometry issue.
Reason about model capacity: Rank deficiency in a weight matrix means information is being lost. You will know when this is a problem and when it is a feature.
Write efficient ML code: Broadcasting, einsum, and vectorization instead of Python loops.
Pass ML interviews: Linear algebra questions come up in many ML interviews. You will be able to derive, not just recite.

Quick Reference: Linear Algebra in ML Systems

ML Concept	Linear Algebra Behind It
Word/document embeddings	Vectors in high-dimensional space
Cosine similarity	Inner product / (L2 norm × L2 norm)
Neural network layer	Matrix multiplication + nonlinearity
Attention mechanism	Scaled dot product: `softmax(QKᵀ/√d)V`
Backpropagation	Chain rule = Jacobian matrix multiplication
PCA	Eigendecomposition of covariance matrix
Recommender systems	Matrix factorization via SVD
L1 regularization (Lasso)	L1 norm constraint on weight vector
L2 regularization (Ridge)	L2 norm constraint on weight vector
Least squares regression	Projection onto column space of X
Batch normalization	Per-feature centering + scaling (standardization)
Convolutional layer	Tensor contraction with filter tensor

Key Takeaways

Linear algebra is not abstract mathematics - it is the computational substrate of every ML algorithm
Vectors represent data points, embeddings, and features in high-dimensional spaces
Matrices represent linear transformations, weight matrices, and attention scores
Eigenvalues and SVD reveal the intrinsic structure of data and transformations
Norms define what "small" means and determine the geometry of regularization
Tensors generalize everything to the batch dimensions required for GPU-accelerated ML

Next: Vectors and Vector Spaces →
