Linear Algebra for Machine Learning - Module Overview

Reading time: ~10 minutes | Level: Mathematical Foundations → ML Engineering

Every ML algorithm you will ever use is secretly a linear algebra operation.

Attention is a scaled dot product. Backpropagation is a chain of Jacobians. PCA is eigendecomposition. A neural network forward pass is a sequence of matrix multiplications. A word embedding is a vector in 512-dimensional space. The distance between two embeddings determines whether a RAG system retrieves the right document.

If you use these tools without understanding the linear algebra underneath, you are flying blind. You can call functions, but you cannot reason about why they work, when they fail, or how to fix them.

This module teaches you to see the linear algebra inside the ML.

What This Module Covers

| Lesson | Topic | ML Algorithm It Unlocks |
| --- | --- | --- |
| 01 | Vectors and Vector Spaces | Embeddings, KNN, cosine similarity, RAG retrieval |
| 02 | Matrix Operations | Neural network forward pass, attention, backprop |
| 03 | Eigenvalues and Eigenvectors | PCA, PageRank, graph neural networks |
| 04 | SVD and Matrix Decompositions | Recommender systems, image compression, LSA |
| 05 | Linear Transformations | Layer activations, representation learning |
| 06 | PCA from Linear Algebra | Dimensionality reduction, feature preprocessing |
| 07 | Dot Products and Projections | Attention mechanism, least squares regression |
| 08 | Norms and Distance Metrics | Regularization (L1/L2), embedding search |
| 09 | Tensors for Deep Learning | Batch operations, convolution, transformer attention |
| 10 | Linear Algebra in NumPy | Implementation, debugging, performance |

How the Concepts Connect

Part 1 - Why Linear Algebra, Why Now

The embedding explosion

In 2017, the original Transformer represented each token as a 512-dimensional vector. By 2024, some widely used embedding models produce vectors with 3,072 dimensions. Semantic search, RAG pipelines, and modern recommendation systems all operate in these high-dimensional spaces.

To reason about them - to understand why cosine similarity works, why L2 distance sometimes fails, why approximate nearest neighbor algorithms are needed - you need vector spaces.

The attention mechanism is dot products

The transformer architecture, which underlies GPT, BERT, Claude, and nearly every modern LLM, is built on one operation: the scaled dot product.

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

This is not magic. It is:

  1. A matrix multiplication (QKᵀ) - covered in Lesson 02
  2. A scaling by a scalar (/ √d_k) - motivated in Lesson 08 (norms)
  3. A softmax (not linear algebra, but the output is)
  4. Another matrix multiplication ( · V)

If you understand matrix multiplication geometrically, you understand why attention works. Lesson 07 (Dot Products and Projections) shows you exactly how.
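The four steps listed above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production implementation: the function names `softmax` and `attention`, the random inputs, and the shapes are assumptions chosen for the example, and masking and batching are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # steps 1 and 2: pairwise dot products, then scaling
    weights = softmax(scores, axis=-1)  # step 3: each row becomes a probability distribution
    return weights @ V, weights         # step 4: weighted average of the value rows

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))    # 4 queries, d_k = 8 (illustrative sizes)
K = rng.normal(size=(6, 8))    # 6 keys
V = rng.normal(size=(6, 16))   # 6 values, d_v = 16

out, weights = attention(Q, K, V)
print(out.shape, weights.shape)  # (4, 16) (4, 6)
```

Each row of `weights` sums to 1, so every output row is a convex combination of the rows of `V`.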

PCA is eigendecomposition

Principal Component Analysis appears throughout ML workflows: visualizing high-dimensional data, reducing feature dimensions before training, compressing representations. Once the data is centered, it has one core mathematical step:

Find the eigenvectors of the covariance matrix.

That is it. Lesson 03 teaches eigenvalues. Lesson 06 applies them to PCA. Lesson 04 shows you the numerically stable path through SVD.
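As a rough sketch of that step, the snippet below centers a toy dataset, builds its covariance matrix, and takes the eigenvectors with `np.linalg.eigh`. The synthetic data, the variable names, and the choice of `k = 2` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 samples, 3 features, with most of the variance in the first feature.
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])

Xc = X - X.mean(axis=0)                # center each feature
cov = Xc.T @ Xc / (Xc.shape[0] - 1)    # sample covariance matrix, shape (3, 3)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]      # reorder so the largest variance comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
Z = Xc @ eigvecs[:, :k]                # project onto the top-k principal components
```

The variance of each column of `Z` equals the corresponding eigenvalue, which is what "direction of maximum variance" means in practice.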

Part 2 - What Each Lesson Teaches

Lesson 01: Vectors and Vector Spaces

The fundamental object: a vector. Not just [1, 2, 3], but the geometric object it represents - a direction and magnitude in space. This lesson covers:

  • What a vector space is and why the 8 axioms matter for ML
  • L1, L2, and L∞ norms - and why they induce different ML behaviors
  • Inner products and the angle between vectors
  • High-dimensional geometry: why intuition breaks down above 3 dimensions
  • NumPy vector operations and cosine similarity from scratch

Unlocks: Understanding why two embeddings that look close in L2 can point in completely different directions. Understanding why RAG systems often use cosine similarity instead of Euclidean distance.
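A small hand-built example makes the L2-versus-direction point concrete. The vectors `a`, `b`, `c` and the helper `cosine_similarity` are made up for illustration; real embeddings have far more dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    # Inner product divided by the product of the L2 norms.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.01, 0.0])
b = np.array([-0.01, 0.0])   # tiny vector pointing the opposite way
c = np.array([10.0, 0.0])    # long vector pointing the same way as a

print(np.linalg.norm(a - b), cosine_similarity(a, b))  # small distance, cosine near -1
print(np.linalg.norm(a - c), cosine_similarity(a, c))  # large distance, cosine near +1
```

By L2 distance, `b` is the nearest neighbor of `a`; by direction, `c` is.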

Lesson 02: Matrix Operations

A matrix is a linear transformation. Multiplying two matrices composes two transformations. This lesson covers:

  • What matrix multiplication actually does (not just row×column)
  • Transpose: symmetric matrices, Gram matrix, and why XᵀX appears everywhere
  • Matrix inverse: when it exists, why you almost never compute it directly
  • Rank: what it reveals about your data's intrinsic dimensionality
  • Determinant: the volume-scaling factor

Unlocks: Understanding why QKᵀ in attention computes pairwise similarities. Understanding why the normal equations for linear regression involve (XᵀX)⁻¹Xᵀy.
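To make the normal equations concrete, here is a hedged sketch on synthetic data. The true weights `w_true` and the noise level are invented for the example. It solves the system (XᵀX)w = Xᵀy rather than forming the inverse, and compares against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([2.0, -1.0, 0.5])             # hypothetical ground-truth weights
y = X @ w_true + 0.01 * rng.normal(size=100)    # targets with a little noise

# Normal equations: (X^T X) w = X^T y. Solve the system instead of inverting.
gram = X.T @ X                                  # 3 x 3 Gram matrix, symmetric
w_normal = np.linalg.solve(gram, X.T @ y)

# Reference answer from NumPy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

For well-conditioned problems like this one the two answers agree; `lstsq` is the safer default when `X` may be rank deficient.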

Lesson 03: Eigenvalues and Eigenvectors

Some vectors pass through a linear transformation unchanged in direction - only their magnitude scales. These are eigenvectors. The scaling factors are eigenvalues. This lesson covers:

  • Geometric meaning: eigenvectors as invariant directions
  • The characteristic polynomial (intuition, not memorization)
  • Eigendecomposition and when it exists
  • Real symmetric matrices: guaranteed real eigenvalues and orthogonal eigenvectors
  • Power iteration: the simplest iterative eigenvalue method, and the idea behind the solvers used in practice

Unlocks: Understanding PCA, PageRank, graph Laplacian, spectral clustering, and why covariance matrices are always eigendecomposable.
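Power iteration from the list above fits in a few lines. This sketch assumes a symmetric matrix with a single dominant eigenvalue; the function name, the fixed iteration count, and the 2×2 matrix are illustrative choices.

```python
import numpy as np

def power_iteration(A, num_iters=500):
    # Repeatedly apply A and renormalize; the vector aligns with the dominant eigenvector.
    v = np.ones(A.shape[0]) / np.sqrt(A.shape[0])
    for _ in range(num_iters):
        v = A @ v
        v /= np.linalg.norm(v)
    eigenvalue = v @ A @ v   # Rayleigh quotient of a unit vector
    return eigenvalue, v

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
lam, v = power_iteration(A)
```

The result satisfies `A @ v ≈ lam * v`, the defining property of an eigenpair, and `lam` matches the largest value returned by `np.linalg.eigvalsh(A)`.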

Lesson 04: SVD and Matrix Decompositions

The Singular Value Decomposition generalizes eigendecomposition to any matrix (not just square ones). It is the most powerful decomposition in applied mathematics. This lesson covers:

  • SVD: the fundamental theorem of linear algebra
  • Geometric interpretation: rotate → scale → rotate
  • Truncated SVD: dimensionality reduction without computing full eigendecomposition
  • LU, QR, and Cholesky decompositions
  • How to compress an image using k singular values

Unlocks: Understanding collaborative filtering (Netflix Prize), LSA for text, image compression, and why sklearn.decomposition.PCA actually uses SVD internally.
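A minimal sketch of truncated SVD, using a random matrix as a stand-in for an image or ratings matrix (the matrix, its shape, and `k = 5` are assumptions). It keeps the top `k` singular values and checks the Eckart-Young result that the Frobenius error equals the energy in the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 30))   # placeholder for an image or a user-item matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation

# Eckart-Young: the Frobenius error is the norm of the dropped singular values.
err = np.linalg.norm(A - A_k, ord="fro")
expected_err = np.sqrt(np.sum(s[k:] ** 2))
```

Storing `U[:, :k]`, `s[:k]`, and `Vt[:k, :]` takes `k * (50 + 30 + 1)` numbers instead of `50 * 30`, which is the whole idea behind SVD-based compression.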

Lesson 05: Linear Transformations

A function between vector spaces that preserves addition and scalar multiplication is called a linear map. The weight matrix of every neural network layer is one; the bias and the activation function are what make the full layer nonlinear. This lesson covers:

  • The two defining properties of linearity
  • Kernel (null space): what the transformation destroys
  • Image (column space): what the transformation can produce
  • Rank-nullity theorem: the fundamental constraint on information flow
  • Change of basis: same transformation, different coordinate system

Unlocks: Understanding why residual connections in ResNets work (they add the identity map to the learned transformation, so the input always has a direct path through the layer). Understanding what a neural network layer is geometrically doing to its inputs.
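The rank-nullity theorem from the list above can be checked numerically. The matrix `W` here is a made-up example whose third row is the sum of the first two, so it maps R⁴ into a 2-dimensional subspace of R³.

```python
import numpy as np

# A map from R^4 to R^3; row 3 = row 1 + row 2, so the rank is 2.
W = np.array([[1.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 1.0, 3.0],
              [1.0, 1.0, 3.0, 4.0]])

rank = np.linalg.matrix_rank(W)

# With the full SVD, the right singular vectors beyond the rank span the kernel.
_, s, Vt = np.linalg.svd(W)
null_basis = Vt[rank:].T          # shape (4, nullity)
nullity = null_basis.shape[1]

print(rank, nullity, W.shape[1])  # rank + nullity equals the input dimension
```

Every column of `null_basis` is an input direction that `W` sends to zero, which is the precise sense in which the transformation "destroys" information.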

Lesson 06: PCA from Linear Algebra

PCA is not a black box. It is the direct application of eigendecomposition to the covariance matrix of centered data. This lesson covers:

  • What PCA is trying to do: find directions of maximum variance
  • The covariance matrix: what it encodes about your data distribution
  • Eigendecomposition → principal components
  • Explained variance ratio and the scree plot
  • When to use PCA and when NOT to
  • PCA via SVD - the numerically stable path

Unlocks: Knowing what sklearn.decomposition.PCA actually computes. Being able to implement PCA from scratch. Understanding Eigenfaces (face recognition). Knowing why PCA fails on nonlinear manifolds.
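The SVD path and the explained variance ratio from the list above can be sketched without ever forming the covariance matrix. The synthetic data and variable names are assumptions. This mirrors the general approach and is not a copy of scikit-learn's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4)) * np.array([3.0, 2.0, 0.5, 0.1])
Xc = X - X.mean(axis=0)

# SVD of the centered data: the rows of Vt are the principal directions.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Squared singular values, rescaled, are the eigenvalues of the covariance matrix.
explained_variance = s ** 2 / (Xc.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

scores = Xc @ Vt.T   # the data in principal-component coordinates
```

Plotting `explained_variance_ratio` against the component index gives the scree plot.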

Lesson 07: Dot Products and Projections

The dot product measures alignment between two vectors. Projection takes one vector and finds its shadow along another direction. These two operations are behind regression, attention, and retrieval. This lesson covers:

  • Algebraic vs. geometric definition of the dot product
  • Orthogonality: when dot product = 0 and why it matters for independence
  • Vector projection and projection matrices
  • Gram-Schmidt orthogonalization: building an orthonormal basis
  • Least squares via projection: the cleanest derivation

Unlocks: Understanding why scaled dot-product attention works geometrically. Deriving the normal equations for linear regression. Understanding why Gram-Schmidt is behind QR decomposition.
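To show how Gram-Schmidt leads to QR, here is a small sketch using the modified variant, which subtracts projections from the running vector. The function name and the 3×3 matrix are illustrative, and the columns are assumed to be linearly independent.

```python
import numpy as np

def gram_schmidt(A):
    # Orthonormalize the columns of A, left to right (modified Gram-Schmidt).
    Q = np.zeros_like(A, dtype=float)
    for j in range(A.shape[1]):
        v = A[:, j].astype(float)
        for i in range(j):
            v = v - (Q[:, i] @ v) * Q[:, i]   # remove the component along q_i
        Q[:, j] = v / np.linalg.norm(v)
    return Q

A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
Q = gram_schmidt(A)
R = Q.T @ A   # upper triangular, so A = Q @ R
```

`Q.T @ Q` is the identity and `R` is upper triangular, which together form a QR factorization of `A`. Library routines such as `np.linalg.qr` use Householder reflections instead, for better numerical stability.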

Lesson 08: Norms and Distance Metrics

A norm measures the size of a vector. Different norms induce different geometries, and different geometries produce different ML behaviors. An L1 penalty drives many weights to exactly zero, making models sparse. An L2 penalty shrinks all weights smoothly toward zero without eliminating them. This lesson covers:

  • The three axioms that define a norm
  • L1 geometry (diamond shape) and why it induces sparsity
  • L2 geometry (sphere shape) and why it induces smoothness
  • Frobenius norm for matrices
  • Nuclear norm: the convex relaxation of rank
  • Distance metrics from norms: Euclidean, Manhattan, Chebyshev
  • When to use cosine similarity vs. Euclidean distance for embeddings

Unlocks: Understanding Lasso (L1) vs. Ridge (L2) regularization geometrically. Knowing when to use L2 distance vs. cosine similarity in vector search.
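A short sketch of the three norms, plus one identity that bears on the cosine-versus-Euclidean question: for unit-length vectors, squared L2 distance equals 2 − 2·cos. The example vectors are arbitrary.

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])
l1 = np.linalg.norm(x, ord=1)          # |3| + |-4| + |0| = 7
l2 = np.linalg.norm(x)                 # sqrt(9 + 16) = 5
linf = np.linalg.norm(x, ord=np.inf)   # max(|3|, |-4|, |0|) = 4

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 64))        # two random 64-dimensional vectors
a_hat = a / np.linalg.norm(a)          # normalize to unit length
b_hat = b / np.linalg.norm(b)

cos = a_hat @ b_hat
sq_dist = np.sum((a_hat - b_hat) ** 2) # equals 2 - 2 * cos for unit vectors
```

One consequence: if embeddings are normalized before indexing, ranking by L2 distance and ranking by cosine similarity return the same neighbors.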

Lesson 09: Tensors for Deep Learning

A tensor is a generalization of scalars, vectors, and matrices to an arbitrary number of axes. Everything in deep learning is tensor algebra. This lesson covers:

  • Tensors as generalized arrays: shapes, axes, and how to read them
  • Tensor contractions: generalizing matrix multiplication
  • Einstein summation notation: the compact language of tensor ops
  • Broadcasting: how NumPy and PyTorch extend operations across dimensions
  • Vectorization: why loops are slow and tensor ops are fast (SIMD, GPU)
  • Implementing scaled dot-product attention using einsum

Unlocks: Understanding batch matrix multiplication in transformers. Understanding how convolution is a tensor contraction. Reading and writing PyTorch code that manipulates 4D tensors.
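As an illustration of einsum and batched matrix multiplication, the sketch below computes multi-head attention scores on a 4D tensor two ways. The axis sizes and index letters are assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, heads, seq, d_k = 2, 3, 5, 8
Q = rng.normal(size=(batch, heads, seq, d_k))
K = rng.normal(size=(batch, heads, seq, d_k))

# Contract over the feature axis d; keep batch b, head h, query position q, key position k.
scores_einsum = np.einsum("bhqd,bhkd->bhqk", Q, K) / np.sqrt(d_k)

# Same computation with batched matmul: @ broadcasts over the leading (batch, heads) axes.
scores_matmul = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)

print(scores_einsum.shape)  # (2, 3, 5, 5)
```

The einsum string documents the shapes in the code itself, which is why it is popular for transformer implementations.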

Lesson 10: Linear Algebra in NumPy

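NumPy is the array and linear algebra layer underneath scikit-learn. PyTorch, TensorFlow, and JAX run on their own backends but interoperate with NumPy arrays and follow many of its conventions. This lesson is a complete engineering reference. It covers: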

  • np.linalg module: every function explained with ML context
  • Solving linear systems correctly (not with inv)
  • Performance: memory layout, vectorization, avoiding Python loops
  • Numerical stability: condition number, floating-point pitfalls
  • Common ML patterns: Gram matrix, covariance, whitening, rotation
  • PyTorch torch.linalg: the GPU-accelerated equivalent

Unlocks: Implementing any ML algorithm from scratch. Debugging numerical instability. Writing fast, vectorized ML code.
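The advice to solve linear systems without `inv` can be tried on a classic ill-conditioned matrix. The Hilbert matrix and the size `n = 10` are illustrative choices, and exact residuals will vary by platform.

```python
import numpy as np

# Hilbert matrix: H[i, j] = 1 / (i + j + 1), a standard ill-conditioned test case.
n = 10
i, j = np.indices((n, n))
H = 1.0 / (i + j + 1)

x_true = np.ones(n)
b = H @ x_true

cond = np.linalg.cond(H)           # very large: many digits of accuracy are at risk
x_solve = np.linalg.solve(H, b)    # factorizes H, never forms the inverse
x_inv = np.linalg.inv(H) @ b       # forms the inverse first; more work, typically no more accurate

residual_solve = np.linalg.norm(H @ x_solve - b)
residual_inv = np.linalg.norm(H @ x_inv - b)
print(f"cond={cond:.2e}  solve residual={residual_solve:.2e}  inv residual={residual_inv:.2e}")
```

Even when the residual is tiny, `x_solve` can differ noticeably from `x_true` here. That gap between a small residual and an accurate solution is what the condition number measures.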

Part 3 - How to Use This Module

If you are time-constrained

Study in this priority order:

:::tip Priority Path (4 lessons)

  1. Lesson 01 (Vectors) - foundational for everything
  2. Lesson 02 (Matrices) - needed for forward pass reasoning
  3. Lesson 07 (Dot Products) - needed for attention understanding
  4. Lesson 06 (PCA) - most commonly needed in practice

:::

If you are preparing for ML interviews

Focus on:

  • Lessons 01–03: core mathematical definitions
  • Lesson 06: PCA from scratch (very common interview question)
  • Lesson 08: L1 vs L2 regularization (appears in almost every ML interview)
  • Lesson 10: NumPy implementation patterns

If you are building production ML systems

Focus on:

  • Lesson 04: SVD (used in recommender systems, dimensionality reduction)
  • Lesson 08: Norms and distances (used in vector search, embedding similarity)
  • Lesson 09: Tensors (needed for efficient batch processing)
  • Lesson 10: NumPy performance and stability

Part 4 - Prerequisites

This module assumes:

  • Comfort with Python and NumPy arrays
  • High school algebra (variables, functions, equations)
  • Some exposure to ML (you know what training and inference are)

This module does not assume:

  • Prior university linear algebra coursework
  • Deep calculus knowledge (we introduce what we need)
  • Advanced mathematical maturity

Part 5 - What You Will Be Able to Do

After completing this module, you will be able to:

  1. Read ML papers: When a paper writes Attention(Q,K,V) = softmax(QKᵀ/√d)V, you will understand every symbol geometrically.

  2. Implement from scratch: PCA, cosine similarity, least squares regression, and the attention mechanism - all from NumPy primitives.

  3. Debug ML systems: When your embedding search returns wrong results, you will know whether it's a norm issue, a distance metric issue, or a high-dimensional geometry issue.

  4. Reason about model capacity: Rank deficiency in a weight matrix means information is being lost. You will know when this is a problem and when it is a feature.

  5. Write efficient ML code: Broadcasting, einsum, and vectorization instead of Python loops.

  6. Pass ML interviews: Every major ML interview includes linear algebra. You will be able to derive, not just recite.

Quick Reference: Linear Algebra in ML Systems

| ML Concept | Linear Algebra Behind It |
| --- | --- |
| Word/document embeddings | Vectors in high-dimensional space |
| Cosine similarity | Inner product / (L2 norm × L2 norm) |
| Neural network layer | Matrix multiplication + nonlinearity |
| Attention mechanism | Scaled dot product: softmax(QKᵀ/√d)V |
| Backpropagation | Chain rule = Jacobian matrix multiplication |
| PCA | Eigendecomposition of covariance matrix |
| Recommender systems | Matrix factorization via SVD |
| L1 regularization (Lasso) | L1 norm constraint on weight vector |
| L2 regularization (Ridge) | L2 norm constraint on weight vector |
| Least squares regression | Projection onto column space of X |
| Batch normalization | Per-feature centering + scaling (standardization, not full whitening) |
| Convolutional layer | Tensor contraction with filter tensor |
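
The backpropagation entry in the table can be verified numerically. This sketch builds a tiny two-layer network (the weights, sizes, and the tanh activation are assumptions), multiplies the per-layer Jacobians, and compares the product with a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))

def f(x):
    # Two-layer network: linear -> tanh -> linear.
    return W2 @ np.tanh(W1 @ x)

x = rng.normal(size=3)
h = W1 @ x

# Chain rule: Jacobian of the whole network is the product of the layer Jacobians.
J_tanh = np.diag(1.0 - np.tanh(h) ** 2)   # elementwise nonlinearity gives a diagonal Jacobian
J = W2 @ J_tanh @ W1                      # shape (2, 3)

# Central finite differences as an independent check, one input coordinate per column.
eps = 1e-6
J_fd = np.column_stack([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
```

Backpropagation computes the same product but multiplies from the output side, one vector at a time, so the full Jacobian never has to be stored.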

Key Takeaways

  • Linear algebra is not abstract mathematics - it is the computational substrate of every ML algorithm
  • Vectors represent data points, embeddings, and features in high-dimensional spaces
  • Matrices represent linear transformations, weight matrices, and attention scores
  • Eigenvalues and SVD reveal the intrinsic structure of data and transformations
  • Norms define what "small" means and determine the geometry of regularization
  • Tensors generalize everything to the batch dimensions required for GPU-accelerated ML

Next: Vectors and Vector Spaces →

© 2026 EngineersOfAI. All rights reserved.