Module 02: Linear Models

Why Linear Models Still Matter in 2026

Every ML engineer eventually realizes the same thing: linear models are not a stepping stone - they are the foundation. When a feature team at Google wants to understand why their ranking model changed behavior, they reach for linear probes. When a production system at Stripe needs to explain a fraud decision, they fall back to logistic regression. When PyTorch implements a neural network layer, it is doing y = Wx + b - a linear model at its core.

Linear models teach you three things more directly than any other model class:

  1. Optimization intuition - gradient descent, convergence, learning rates. You cannot understand training deep networks without understanding these on linear models first.
  2. Statistical grounding - what assumptions are you making? What breaks when they're violated? Linear models make the implicit explicit.
  3. Interpretability baseline - in production, you often need to justify predictions. Linear models give you exact feature contributions.

This module builds linear models from first principles, derives every formula, and connects theory to production engineering decisions.


Module Map


Lesson Table

| #  | Lesson                             | Core Concept                                   | You Will Build                                 |
|----|------------------------------------|------------------------------------------------|------------------------------------------------|
| 01 | Linear Regression Internals        | OLS, Normal Equation, Geometric Interpretation | Regression from scratch + residual diagnostics |
| 02 | Gradient Descent From Scratch      | ∇L = (2/n)Xᵀ(Xw - y), convergence              | Full GD with loss curves                       |
| 03 | Stochastic and Mini-Batch GD       | SGD noise, epoch dynamics                      | Three-variant comparison                       |
| 04 | Logistic Regression Deep Dive      | Sigmoid, cross-entropy, softmax                | Binary + multi-class classifier                |
| 05 | Regularization: L1, L2, ElasticNet | Sparsity, shrinkage, paths                     | Regularization path plots                      |
| 06 | Polynomial Features and Kernels    | Feature expansion, kernel trick                | Kernel SVM classifier                          |
| 07 | Maximum Likelihood Estimation      | MLE → OLS → Cross-entropy                      | MLE from scratch for Gaussian + Bernoulli      |
| 08 | Generalized Linear Models          | Link functions, exponential family             | Poisson + Gamma regression                     |
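
The gradient listed for Lesson 02, ∇L = (2/n)Xᵀ(Xw - y), is enough to write a working optimizer. What follows is a minimal sketch, not the lesson's own implementation: the synthetic data, the function name `gradient_descent`, the learning rate, and the step count are all assumptions chosen for illustration. It minimizes mean squared error and compares the result with NumPy's least-squares solver.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, steps=500):
    """Full-batch GD on the mean squared error L(w) = (1/n) * ||Xw - y||^2."""
    n, d = X.shape
    w = np.zeros(d)
    losses = []
    for _ in range(steps):
        residual = X @ w - y
        losses.append(float(residual @ residual) / n)
        grad = (2.0 / n) * X.T @ residual  # gradient from the lesson table
        w = w - lr * grad
    return w, losses

# Synthetic data (assumed for illustration): 200 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

w_gd, losses = gradient_descent(X, y)
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]  # closed-form least-squares reference

print("GD weights: ", w_gd)
print("OLS weights:", w_ols)
```

On well-conditioned data like this, the two weight vectors should agree closely; on poorly scaled features the same learning rate may converge slowly or diverge, which is the behavior Lesson 02 examines.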

How Linear Models Connect to the Rest of ML

Linear Models → Neural Networks

A single-layer neural network with no activation function, trained with squared error, is linear regression. Add a sigmoid and it becomes logistic regression. Add a softmax and it becomes multi-class classification. The entire deep learning stack is linear models composed with nonlinearities.

Neural net layer: $\mathbf{h} = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})$

Understanding gradient descent on linear models (Lesson 02) directly transfers to understanding backpropagation.
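
One way to see the transfer is to apply the chain rule by hand to a single layer h = σ(Wx + b) and check the result numerically. The sketch below assumes a squared-error loss on one example and uses hypothetical helper names (`layer`, `backprop_grads`, `numeric_grad_W`); it is an illustration of the idea, not a general backpropagation implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(W, b, x):
    """One neural net layer: h = sigmoid(Wx + b)."""
    return sigmoid(W @ x + b)

def loss(W, b, x, t):
    """Squared error between the layer output and a target t (assumed loss)."""
    h = layer(W, b, x)
    return 0.5 * np.sum((h - t) ** 2)

def backprop_grads(W, b, x, t):
    """Chain rule: dL/dz = (h - t) * sigmoid'(z), with sigmoid'(z) = h * (1 - h)."""
    h = layer(W, b, x)
    delta = (h - t) * h * (1.0 - h)
    return np.outer(delta, x), delta  # dL/dW, dL/db

def numeric_grad_W(W, b, x, t, eps=1e-6):
    """Central finite differences, used only as a reference."""
    g = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += eps
            Wm[i, j] -= eps
            g[i, j] = (loss(Wp, b, x, t) - loss(Wm, b, x, t)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
W = rng.normal(size=(2, 3))
b = rng.normal(size=2)
x = rng.normal(size=3)
t = np.array([0.0, 1.0])

grad_W, grad_b = backprop_grads(W, b, x, t)
num_W = numeric_grad_W(W, b, x, t)
print("max abs difference:", np.max(np.abs(grad_W - num_W)))
```

Stacking layers repeats the same `delta` computation once per layer, which is all backpropagation adds on top of the linear-model gradient.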

Regularization → Weight Decay

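L2 regularization in linear models is equivalent to weight decay in neural network training when the optimizer is plain SGD. With adaptive optimizers such as Adam the two differ, which is why AdamW applies the decay as a separate step. L1 regularization is the intuition behind sparsity techniques such as pruning and sparse attention. These are not separate ideas.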
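
For plain gradient descent, the link between the L2 penalty and weight decay can be checked in a few lines. This sketch assumes a mean-squared-error loss and a penalty of the form (λ/2)‖w‖²; the function names and constants are illustrative, and the equality shown does not carry over to adaptive optimizers such as Adam.

```python
import numpy as np

def grad_mse(w, X, y):
    """Gradient of the mean squared error (1/n) * ||Xw - y||^2."""
    return (2.0 / len(y)) * X.T @ (X @ w - y)

def step_l2_penalty(w, X, y, lr, lam):
    """Gradient step on L(w) + (lam / 2) * ||w||^2."""
    return w - lr * (grad_mse(w, X, y) + lam * w)

def step_weight_decay(w, X, y, lr, lam):
    """Shrink the weights by (1 - lr * lam), then take a plain gradient step."""
    return (1.0 - lr * lam) * w - lr * grad_mse(w, X, y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)

w_a = np.ones(4)
w_b = np.ones(4)
for _ in range(50):
    w_a = step_l2_penalty(w_a, X, y, lr=0.05, lam=0.1)
    w_b = step_weight_decay(w_b, X, y, lr=0.05, lam=0.1)

print("max abs difference:", np.max(np.abs(w_a - w_b)))
```

Expanding `step_l2_penalty` algebraically gives exactly the expression in `step_weight_decay`, so any difference printed is floating-point rounding.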

Logistic Regression → Softmax Classifier → LLM Output Head

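The output layer of a standard autoregressive language model is a linear layer followed by softmax over the vocabulary - exactly multi-class logistic regression applied to the final hidden state. The math in Lesson 04 is the math inside the output head of GPT-style models.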
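
A toy version of such an output head makes the connection concrete. The sizes, the random weights, and the name `output_head` below are assumptions for illustration only; real models use far larger dimensions. The sketch also checks that with two classes, softmax reduces to the sigmoid of the logit difference, which is binary logistic regression.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z, axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def output_head(h, W, b):
    """Linear layer followed by softmax: one probability per class (token)."""
    return softmax(W @ h + b)

rng = np.random.default_rng(2)
hidden_dim, vocab_size = 8, 5  # toy sizes, assumed for illustration
h = rng.normal(size=hidden_dim)              # final hidden state
W = rng.normal(size=(vocab_size, hidden_dim))
b = rng.normal(size=vocab_size)

probs = output_head(h, W, b)

# Two-class special case: softmax equals the sigmoid of the logit difference.
W2, b2 = W[:2], b[:2]
p_softmax = output_head(h, W2, b2)[1]
p_sigmoid = 1.0 / (1.0 + np.exp(-((W2[1] - W2[0]) @ h + (b2[1] - b2[0]))))

print("class probabilities:", probs)
print("two-class softmax vs sigmoid:", p_softmax, p_sigmoid)
```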

GLMs → Probabilistic ML

Understanding that least-squares linear regression is maximum likelihood under Gaussian noise (Lesson 07) and that you can swap that noise distribution (Lesson 08) opens the door to Bayesian ML, variational inference, and probabilistic forecasting.
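
The Gaussian-noise connection can be written out in three lines. The derivation below assumes independent noise with a fixed variance σ² and uses notation (w, xᵢ, yᵢ, εᵢ) introduced here for illustration rather than taken from the lessons.

```latex
\begin{aligned}
y_i &= \mathbf{w}^\top \mathbf{x}_i + \varepsilon_i,
  \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \\
\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})
  &= -\frac{n}{2}\log(2\pi\sigma^2)
     - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2 \\
\hat{\mathbf{w}}_{\text{MLE}}
  &= \arg\max_{\mathbf{w}} \log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})
   = \arg\min_{\mathbf{w}} \sum_{i=1}^{n}\left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2
\end{aligned}
```

The first term of the log-likelihood does not depend on w, so maximizing the likelihood is the same as minimizing the sum of squared residuals. Replacing the Gaussian with a Bernoulli or Poisson distribution changes the second line and therefore the loss, which is the step Lesson 08 takes.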


Prerequisites

  • Calculus: partial derivatives, chain rule
  • Linear algebra: matrix multiply, transpose, inverse
  • Python: comfortable with NumPy, scikit-learn basics

Production Context

Linear models appear in production in ways that might surprise you:

Ranking systems: Large platforms such as LinkedIn and Twitter have used linear models as the first-stage scorer in multi-stage ranking pipelines. That stage scores very large candidate sets at high request volume, where even a 10 ms latency budget is tight.

Calibration: After training a complex tree or neural model, ML teams often fit a logistic regression on the model's raw scores to calibrate its probabilities, a technique known as Platt scaling. The entire calibration layer is Lesson 04.
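
A calibration layer of this kind is a one-feature logistic regression. The sketch below is an assumption-laden illustration: the "complex model" is simulated by overconfident, shifted scores, and `fit_platt` is a hypothetical helper that fits p = sigmoid(a·score + b) by gradient descent on log loss. In practice the calibrator should be fit on held-out data, which this toy example skips.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(p, y, eps=1e-12):
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def fit_platt(scores, y, lr=0.1, steps=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the log loss."""
    a, b = 1.0, 0.0  # start from the uncalibrated mapping sigmoid(score)
    for _ in range(steps):
        err = sigmoid(a * scores + b) - y  # dLoss/dlogit for each example
        a -= lr * np.mean(err * scores)
        b -= lr * np.mean(err)
    return a, b

# Simulated setup (assumed): true logits z, labels drawn from sigmoid(z),
# and a model whose scores are overconfident and shifted.
rng = np.random.default_rng(3)
z = rng.normal(size=2000)
y = (rng.random(2000) < sigmoid(z)).astype(float)
scores = 3.0 * z + 1.0

a, b = fit_platt(scores, y)
loss_before = log_loss(sigmoid(scores), y)
loss_after = log_loss(sigmoid(a * scores + b), y)

print("fitted a, b:", a, b)
print("log loss before / after:", loss_before, loss_after)
```

Because the calibrator only rescales and shifts the scores, it leaves the model's ranking of examples unchanged whenever `a` is positive.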

Interpretability under regulation: In finance and healthcare, regulators require model explanations. Logistic regression gives exact feature coefficients. Even when the underlying model is complex, teams often maintain a "shadow" linear model for compliance.
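
The explanation a logistic regression provides is an additive breakdown of the log-odds. In the sketch below, the feature names, coefficients, and input values are all hypothetical and chosen only to show the arithmetic; they do not come from any real fraud model.

```python
import numpy as np

# Hypothetical model and input, for illustration only.
feature_names = ["amount_zscore", "new_device", "account_age_years"]
weights = np.array([1.2, 0.8, -0.5])
bias = -2.0
x = np.array([2.0, 1.0, 0.5])

# Each feature contributes weight * value to the log-odds.
contributions = weights * x
logit = bias + contributions.sum()
prob = 1.0 / (1.0 + np.exp(-logit))

for name, c in zip(feature_names, contributions):
    print(f"{name:>20}: {c:+.2f}")
print(f"{'bias':>20}: {bias:+.2f}")
print(f"{'log-odds':>20}: {logit:+.2f}  (probability {prob:.3f})")
```

The contributions sum exactly to the log-odds, so the breakdown is exact on that scale. Contributions to the probability itself are not additive because the sigmoid is nonlinear.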

Feature selection: Lasso (L1 regularization, Lesson 05) is used in genomics, drug discovery, and high-dimensional settings to identify the ~50 predictive features out of 50,000 candidates.
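
A scaled-down version of that selection problem fits in a short script. The sketch below assumes synthetic data with 50 candidate features, of which 3 are informative, and solves the Lasso objective (1/(2n))‖Xw - y‖² + λ‖w‖₁ with ISTA (proximal gradient descent). ISTA is chosen here for brevity and is not necessarily the solver Lesson 05 uses; the function names and the value of λ are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink toward zero, clip at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, steps=500):
    """Minimize (1/(2n)) * ||Xw - y||^2 + lam * ||w||_1 by proximal gradient."""
    n, d = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n  # Lipschitz constant of the smooth part
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - grad / L, lam / L)
    return w

# Synthetic data (assumed): only features 0, 1, 2 affect the target.
rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:3] = [3.0, -2.0, 1.5]
y = X @ true_w + 0.1 * rng.normal(size=n)

w_hat = lasso_ista(X, y, lam=0.1)
selected = np.flatnonzero(w_hat)

print("selected feature indices:", selected)
print("their coefficients:", w_hat[selected])
```

The soft-threshold step sets small coefficients to exactly zero, which is what turns a regression fit into a feature selector. The surviving coefficients are shrunk toward zero by roughly λ, so a common follow-up is to refit an unpenalized model on the selected features.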

This module will make you fluent in all of it.

© 2026 EngineersOfAI. All rights reserved.