Module 02: Linear Models
Why Linear Models Still Matter in 2026
Every ML engineer eventually realizes the same thing: linear models are not a stepping stone - they are the foundation. When a feature team at Google wants to understand why their ranking model changed behavior, they reach for linear probes. When a production system at Stripe needs to explain a fraud decision, they fall back to logistic regression. When PyTorch implements a fully connected layer (nn.Linear), it computes y = Wx + b - an affine map, which is a linear model at its core.
Linear models teach you three things more directly than any other model class:
- Optimization intuition - gradient descent, convergence, learning rates. You cannot understand training deep networks without understanding these on linear models first.
- Statistical grounding - what assumptions are you making? What breaks when they're violated? Linear models make the implicit explicit.
- Interpretability baseline - in production, you often need to justify predictions. Linear models give you exact feature contributions.
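To make "exact feature contributions" concrete, here is a minimal sketch. The weights, bias, and input row are made-up illustrative values, not taken from any real model. For a linear model, the prediction decomposes exactly into one term per feature plus the bias.

```python
import numpy as np

# Hypothetical fitted linear model and one input row (illustrative values only).
w = np.array([0.8, -1.5, 0.3])   # one weight per feature
b = 2.0                          # bias / intercept
x = np.array([10.0, 2.0, -4.0])  # feature values for a single example

# Each feature contributes exactly w_i * x_i to the prediction.
contributions = w * x
prediction = contributions.sum() + b

print("per-feature contributions:", contributions)
print("prediction:", prediction)
```

The contributions sum to the prediction minus the bias, with no approximation involved. Attribution methods for nonlinear models generally have to estimate this kind of breakdown.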
This module builds linear models from first principles, derives every formula, and connects theory to production engineering decisions.
Module Map
Lesson Table
| # | Lesson | Core Concept | You Will Build |
|---|---|---|---|
| 01 | Linear Regression Internals | OLS, Normal Equation, Geometric Interpretation | Regression from scratch + residual diagnostics |
| 02 | Gradient Descent From Scratch | ∇L = (2/n)Xᵀ(Xw-y), convergence | Full GD with loss curves |
| 03 | Stochastic and Mini-Batch GD | SGD noise, epoch dynamics | Three-variant comparison |
| 04 | Logistic Regression Deep Dive | Sigmoid, cross-entropy, softmax | Binary + multi-class classifier |
| 05 | Regularization: L1, L2, ElasticNet | Sparsity, shrinkage, paths | Regularization path plots |
| 06 | Polynomial Features and Kernels | Feature expansion, kernel trick | Kernel SVM classifier |
| 07 | Maximum Likelihood Estimation | MLE → OLS → Cross-entropy | MLE from scratch for Gaussian + Bernoulli |
| 08 | Generalized Linear Models | Link functions, exponential family | Poisson + Gamma regression |
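
As a preview of Lessons 01 and 02, the sketch below runs gradient descent with the gradient from the table, ∇L = (2/n)Xᵀ(Xw-y), and compares the result with the closed-form least-squares solution. The synthetic data, learning rate, and iteration count are assumptions chosen for illustration, not values prescribed by the lessons.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem (illustrative sizes and coefficients).
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Closed-form least-squares solution (what the normal equation gives).
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Batch gradient descent on mean squared error L(w) = (1/n) * ||Xw - y||^2.
w = np.zeros(d)
lr = 0.1  # assumed learning rate; small enough to converge on this data
for _ in range(500):
    grad = (2.0 / n) * X.T @ (X @ w - y)
    w -= lr * grad

print("gradient descent:", w)
print("closed form:     ", w_ols)
```

On a well-conditioned problem like this one, both routes land on the same weights. The lessons look at what changes when the problem is badly conditioned or too large for a direct solve.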
How Linear Models Connect to the Rest of ML
Linear Models → Neural Networks
A single-layer neural network with no activation function, trained with squared-error loss, is linear regression. Put a sigmoid on the output and train with cross-entropy, and it becomes logistic regression. Use softmax instead and you get multi-class classification. Deep networks are, to a first approximation, linear layers composed with nonlinearities.
Understanding gradient descent on linear models (Lesson 02) directly transfers to understanding backpropagation.
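The progression above can be sketched as one affine map with three different output functions. The shapes and random weights here are placeholders for illustration. Only the forward pass is shown, not training.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the row-wise max for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))  # 4 examples, 3 features (toy sizes)

# One output unit, no activation: the forward pass of linear regression.
W_one, b_one = rng.normal(size=(3, 1)), np.zeros(1)
y_hat = X @ W_one + b_one

# Same affine map followed by a sigmoid: logistic regression.
p_binary = sigmoid(X @ W_one + b_one)

# Five output units followed by softmax: multi-class classification.
W_multi, b_multi = rng.normal(size=(3, 5)), np.zeros(5)
p_multi = softmax(X @ W_multi + b_multi)

print(y_hat.shape, p_binary.shape, p_multi.shape)
```

In all three cases the learnable part is the same Wx + b. Only the output function, and the loss paired with it, changes.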
Regularization → Weight Decay
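L2 regularization in linear models corresponds to weight decay in neural network training: under plain SGD the two produce the same update (up to how the coefficient is scaled), though they differ under adaptive optimizers such as Adam, which is why decoupled weight decay (AdamW) exists. L1 regularization provides much of the intuition behind sparsity in neural networks, including pruning and sparse attention. These are not separate ideas.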
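For a single plain SGD step, the correspondence can be checked numerically. This sketch assumes the penalty is written as (lam/2)·||w||², so that its gradient is lam·w. The data, learning rate, and lam are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
w = rng.normal(size=4)
lr, lam = 0.05, 0.1  # assumed learning rate and regularization strength

# Gradient of the data term only (mean squared error).
grad_data = (2.0 / len(y)) * X.T @ (X @ w - y)

# (a) L2 penalty: add the gradient of (lam/2)*||w||^2, which is lam*w, to the loss gradient.
w_l2 = w - lr * (grad_data + lam * w)

# (b) Weight decay: shrink the weights by a constant factor, then take the data-gradient step.
w_decay = (1.0 - lr * lam) * w - lr * grad_data

print(np.allclose(w_l2, w_decay))
```

The two expressions are the same algebraically, so the check passes for any data. The shrink factor (1 - lr·lam) is where the name "weight decay" comes from.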
Logistic Regression → Softmax Classifier → LLM Output Head
The output layer of a typical language model is a linear layer followed by softmax over the vocabulary - structurally, multinomial logistic regression on the final hidden state. The math in Lesson 04 is the same math used in the output head of GPT-style models.
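A toy version of such an output head is sketched below. The hidden size, vocabulary size, and random weights are placeholders and do not reflect any real model. The last few lines check that with only two classes, softmax reduces to the sigmoid of the logit difference, which is binary logistic regression.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 5             # toy sizes, purely illustrative
h = rng.normal(size=d_model)           # final hidden state for one position
W = rng.normal(size=(vocab_size, d_model))
b = np.zeros(vocab_size)

logits = W @ h + b                     # linear layer
probs = softmax(logits)                # distribution over the toy vocabulary

# Two-class special case: softmax over [z0, z1] equals sigmoid(z1 - z0).
z = logits[:2]
p_softmax = softmax(z)[1]
p_sigmoid = 1.0 / (1.0 + np.exp(-(z[1] - z[0])))

print(probs, p_softmax, p_sigmoid)
```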
GLMs → Probabilistic ML
Understanding that ordinary least squares is the maximum likelihood estimate under Gaussian noise (Lesson 07), and that you can swap in a different noise distribution (Lesson 08), opens the door to Bayesian ML, variational inference, and probabilistic forecasting.
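The link between Gaussian noise and least squares can be checked numerically. The sketch below uses synthetic data and an assumed noise scale sigma. It shows that the least-squares solution is a stationary point of the Gaussian negative log-likelihood and beats nearby weight vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one feature
y = X @ np.array([1.0, 2.0]) + 0.5 * rng.normal(size=n)

SIGMA = 0.5  # assumed noise standard deviation

def gaussian_nll(w, sigma=SIGMA):
    """Negative log-likelihood of y under y ~ Normal(Xw, sigma^2)."""
    resid = y - X @ w
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + (resid @ resid) / (2 * sigma**2)

# Least-squares solution.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gradient of the NLL with respect to w is -X^T (y - Xw) / sigma^2.
grad_at_ols = -X.T @ (y - X @ w_ols) / SIGMA**2

# Nearby weight vectors should all have a higher NLL.
perturbed = [w_ols + 0.1 * rng.normal(size=2) for _ in range(5)]

print("gradient at OLS:", grad_at_ols)
print("NLL at OLS:", gaussian_nll(w_ols))
print("NLL nearby:", [gaussian_nll(w) for w in perturbed])
```

The weights enter the likelihood only through the sum of squared residuals, so the maximizing w does not depend on the value chosen for sigma.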
Prerequisites
- Calculus: partial derivatives, chain rule
- Linear algebra: matrix multiply, transpose, inverse
- Python: comfortable with NumPy, scikit-learn basics
Production Context
Linear models appear in production in ways that might surprise you:
Ranking systems: Companies such as LinkedIn and Twitter have used linear models as the first-stage scorer in multi-stage ranking pipelines. The first stage must score a large candidate set for every request, at query volumes where even a 10 ms latency budget is tight.
Calibration: After training a complex tree or neural model, ML teams often fit a logistic regression on top of its scores to calibrate probabilities, a technique known as Platt scaling. The entire calibration layer is Lesson 04.
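A minimal sketch of this kind of calibration layer follows. The "complex model" is simulated as a hypothetical scorer whose logits are three times too large (overconfident), and a two-parameter logistic regression is fit on its scores by gradient descent. The data, scale factor, learning rate, and iteration count are all assumptions for illustration. In practice the calibrator would be fit on held-out data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
n = 5000
true_logit = rng.normal(size=n)
y = (rng.random(n) < sigmoid(true_logit)).astype(float)

# Hypothetical overconfident model: its logits are 3x too large.
raw_score = 3.0 * true_logit

# Fit p = sigmoid(a * raw_score + b) by gradient descent on log loss.
a, b = 1.0, 0.0  # start from "no calibration"
lr = 0.1
for _ in range(2000):
    p = sigmoid(a * raw_score + b)
    a -= lr * np.mean((p - y) * raw_score)  # d(log loss)/da
    b -= lr * np.mean(p - y)                # d(log loss)/db

loss_before = log_loss(y, sigmoid(raw_score))
loss_after = log_loss(y, sigmoid(a * raw_score + b))
print(f"a={a:.3f}, b={b:.3f}, log loss {loss_before:.3f} -> {loss_after:.3f}")
```

The fitted slope should come out near 1/3, undoing the simulated overconfidence. The ordering of the scores is unchanged; only the probabilities attached to them move.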
Interpretability under regulation: In finance and healthcare, regulators often require model explanations. Logistic regression gives exact feature coefficients. Even when the underlying model is complex, teams often maintain a "shadow" linear model for compliance.
Feature selection: Lasso (L1 regularization, Lesson 05) is used in genomics, drug discovery, and high-dimensional settings to identify the ~50 predictive features out of 50,000 candidates.
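The selection effect can be seen on a toy problem: 50 candidate features, of which only the first 3 actually drive the target. The sketch below solves the Lasso objective with proximal gradient descent (ISTA), chosen here because it fits in a few lines of NumPy. It is not necessarily the solver Lesson 05 uses, and library implementations often use coordinate descent instead. The data sizes, coefficients, and lam are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink toward zero, clip at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (1/(2n)) * ||Xw - y||^2 + lam * ||w||_1 by proximal gradient descent."""
    n, d = X.shape
    # Step size 1/L, where L = sigma_max(X)^2 / n is the gradient's Lipschitz constant.
    step = n / np.linalg.norm(X, 2) ** 2
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [3.0, -2.0, 1.5]  # only the first 3 features matter
y = X @ w_true + 0.1 * rng.normal(size=n)

w = lasso_ista(X, y, lam=0.1)
selected = np.flatnonzero(w)
print("selected features:", selected)
print("their weights:", w[selected])
```

The soft-threshold step sets small weights to exactly zero, which is what makes L1 a feature selector rather than only a shrinkage method. The surviving weights are biased toward zero by roughly lam.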
This module will make you fluent in all of it.
