What is machine learning?

Deep dive into linear regression - OLS derivation, normal equations, geometric interpretation as projection, Gauss-Markov theorem, residual diagnostics, Cook's distance, VIF, multicollinearity, and full NumPy implementation.

How does linear regression work in practice?

Linear Regression Internals covers machine learning, linear regression, ordinary least squares from first principles with code examples. Free lesson at https://engineersofai.com/docs/ml/linear-models/linear-regression-internals

What is the difference between machine learning and ordinary least squares?

See the full breakdown at https://engineersofai.com/docs/ml/linear-models/linear-regression-internals

Linear Regression Internals

The Real Interview Moment

You are sitting across from a senior ML engineer at a top-tier fintech company. The role is a machine learning engineer position. The whiteboard behind the interviewer has exactly one equation written on it: $\hat{\mathbf{w}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ .

The question arrives: "Walk me through how linear regression actually works - not just the sklearn API, but what happens mathematically when you call .fit()."

Most candidates describe the MSE loss. A few mention gradient descent. Some sketch a line through data points. Almost none can derive the normal equation from scratch, explain why the geometric interpretation is a projection, prove that OLS is the Best Linear Unbiased Estimator under Gauss-Markov, or explain what happens to coefficient estimates when features are collinear. Those who can do all of the above - and connect it to practical diagnostics like Cook's distance and VIF - are the candidates who get offers.

This is also the lesson where you realize that calling sklearn.linear_model.LinearRegression().fit() is executing a sophisticated mathematical procedure: projecting your target vector onto the column space of your feature matrix. Understanding why unlocks every advanced topic in statistical machine learning.

By the end of this lesson you will understand what .fit() is actually computing, why it can fail silently, and how to diagnose every major failure mode before your model makes it into production.

Why This Exists

Before linear regression, practitioners fit curves by hand or used cumbersome graphical methods. In 1805, Adrien-Marie Legendre proposed minimizing the sum of squared residuals as a criterion for fitting a line to astronomical data. In 1809, Carl Friedrich Gauss published a probabilistic derivation - he proved that under Gaussian noise, this estimator is the maximum likelihood solution. Gauss also claimed he had used the method since 1795, creating one of the most famous priority disputes in scientific history.

Linear regression exists because it answers a fundamental question: given a set of features and a numeric target, what linear combination of features best predicts that target? "Best" has a precise mathematical meaning: minimum sum of squared errors. Everything else in this lesson flows from that single definition - the derivation, the geometry, the optimality theorem, the diagnostics.

Historical Context

Legendre published the method of least squares in 1805 as an appendix to a paper on comet trajectories. Gauss published his probabilistic derivation in 1809 in Theoria Motus Corporum Coelestium, connecting least squares to the Gaussian distribution for the first time. Francis Galton applied regression to human data in the 1870s while studying the relationship between parents' and children's heights - coining the term "regression to the mean." Karl Pearson formalized the statistical framework around 1900. By the 1950s, matrix notation made multivariate regression tractable. Today, any call to sklearn.linear_model.LinearRegression().fit() executes LAPACK's divide-and-conquer SVD - a numerically refined descendant of Gauss's 1809 derivation.

The Least Squares Objective

Given a dataset of $n$ observations with $d$ features:

Feature matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ (with a bias column of ones prepended, so effectively $d+1$ columns)
Target vector $\mathbf{y} \in \mathbb{R}^n$
Weight vector $\mathbf{w} \in \mathbb{R}^{d+1}$ (what we want to find)

The prediction for observation $i$ is:

$\hat{y}_i = \mathbf{w}^\top \mathbf{x}_i = w_0 + w_1 x_{i1} + w_2 x_{i2} + \cdots + w_d x_{id}$

The residual for observation $i$ is $e_i = y_i - \hat{y}_i$ .

The Ordinary Least Squares (OLS) objective minimizes the sum of squared residuals:

$\mathcal{L}(\mathbf{w}) = \sum_{i=1}^{n} (y_i - \mathbf{w}^\top \mathbf{x}_i)^2 = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$

In matrix form, expanding the squared norm:

$\mathcal{L}(\mathbf{w}) = (\mathbf{y} - \mathbf{X}\mathbf{w})^\top (\mathbf{y} - \mathbf{X}\mathbf{w}) = \mathbf{y}^\top\mathbf{y} - 2\mathbf{w}^\top\mathbf{X}^\top\mathbf{y} + \mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w}$

:::note Why squared residuals and not absolute residuals? Three reasons: (1) squaring penalizes large errors disproportionately - a residual of 10 is penalized 100x more than a residual of 1, not 10x; (2) the squared function is differentiable everywhere, making calculus tractable - the absolute value function is not differentiable at zero; (3) it is the maximum likelihood estimator under Gaussian noise - Lesson 07 on MLE derives this connection precisely. When you want robustness to outliers, use MAE or Huber loss instead. :::

Deriving the Normal Equation

To find the optimal weights, take the gradient of $\mathcal{L}$ with respect to $\mathbf{w}$ and set it to zero. Using matrix calculus identities $\frac{\partial}{\partial\mathbf{w}}\mathbf{a}^\top\mathbf{w} = \mathbf{a}$ and $\frac{\partial}{\partial\mathbf{w}}\mathbf{w}^\top\mathbf{A}\mathbf{w} = 2\mathbf{A}\mathbf{w}$ (for symmetric $\mathbf{A}$ ):

$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = -2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\mathbf{w} = \mathbf{0}$

Rearranging gives the normal equations:

$\mathbf{X}^\top\mathbf{X}\mathbf{w} = \mathbf{X}^\top\mathbf{y}$

If $\mathbf{X}^\top\mathbf{X}$ is invertible (which requires full column rank of $\mathbf{X}$ ), the unique solution is:

$\boxed{\hat{\mathbf{w}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}}$

The matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ is called the hat matrix because it "puts a hat on" $\mathbf{y}$ :

$\hat{\mathbf{y}} = \mathbf{X}\hat{\mathbf{w}} = \mathbf{H}\mathbf{y}$

The hat matrix appears again in leverage and Cook's distance calculations. Note that $\mathbf{H}$ is $n \times n$ - for large $n$ , you never form it explicitly.

Step-by-Step Scalar Derivation (Intuition)

For intuition, derive it for a single weight $w$ with prediction $\hat{y}_i = w x_i$ :

$\mathcal{L}(w) = \sum_{i=1}^n (y_i - w x_i)^2$

$\frac{d\mathcal{L}}{dw} = \sum_{i=1}^n 2(y_i - w x_i)(-x_i) = -2\sum_{i=1}^n x_i(y_i - w x_i) = 0$

$w^* = \frac{\sum x_i y_i}{\sum x_i^2}$

This is the ratio of the covariance of $x$ and $y$ to the variance of $x$ - exactly what you get from the normal equation in the univariate case.

When NOT to Use the Normal Equation

The normal equation gives an exact solution in one shot with no iteration. So why does everyone use gradient descent?

Scenario	Problem with Normal Equation
$n > 100{,}000$ rows	Computing $\mathbf{X}^\top\mathbf{X}$ is $O(nd^2)$ , inverting is $O(d^3)$
$d > 10{,}000$ features	Inverting a $d \times d$ matrix becomes numerically unstable and slow
Multicollinear features	$\mathbf{X}^\top\mathbf{X}$ is singular - inversion fails
Online/streaming data	Normal equation requires all data at once
Lasso/nonlinear models	No closed form exists

:::warning Numerical stability - never use np.linalg.inv directly Never invert $\mathbf{X}^\top\mathbf{X}$ directly in code. Use np.linalg.lstsq or solve the linear system np.linalg.solve(X.T @ X, X.T @ y). Direct inversion amplifies floating-point errors. sklearn's LinearRegression uses LAPACK's _gelsd - a divide-and-conquer SVD approach that is numerically rock-solid even when $\mathbf{X}^\top\mathbf{X}$ is nearly singular. :::

Geometric Interpretation: Projection onto Column Space

This is the insight that makes linear algebra click for machine learning.

The columns of $\mathbf{X}$ span a subspace of $\mathbb{R}^n$ called the column space $\text{col}(\mathbf{X})$ . The vector $\mathbf{y}$ (our targets) generally does not lie in this column space - it is a point somewhere in $\mathbb{R}^n$ . Linear regression finds the point $\hat{\mathbf{y}} = \mathbf{X}\hat{\mathbf{w}}$ in $\text{col}(\mathbf{X})$ that is closest to $\mathbf{y}$ in Euclidean distance.

The residual vector $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ must be perpendicular to every column of $\mathbf{X}$ - i.e., perpendicular to the entire column space:

$\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\hat{\mathbf{w}}) = \mathbf{0}$

This is exactly the normal equation. The geometric picture: we are projecting $\mathbf{y}$ onto $\text{col}(\mathbf{X})$ .

      y (true targets, lives in R^n)
      |
      |  e = y - ŷ  (perpendicular to column space)
      |
 -----+--------  col(X)  (the subspace spanned by feature columns)
      ŷ  (the projection - our predictions)

This geometry immediately answers several questions:

Why does adding a collinear feature not change predictions? Because it does not expand $\text{col}(\mathbf{X})$ , so the projection is unchanged.
Why do more features never decrease training R²? Because a larger column space can only move $\hat{\mathbf{y}}$ closer to $\mathbf{y}$ or keep it the same.
Why is the hat matrix idempotent ( $\mathbf{H}^2 = \mathbf{H}$ )? Because projecting twice is the same as projecting once.
Why does OLS minimize squared error? Because the closest point in a subspace to an external point is the orthogonal projection - this is a fundamental theorem of Euclidean geometry.

The Gauss-Markov Theorem

The Gauss-Markov theorem is the theoretical foundation for OLS. It states:

Under the LINE assumptions (see below), the OLS estimator is the Best Linear Unbiased Estimator (BLUE) - it has the minimum variance among all linear unbiased estimators.

Formally, the OLS estimator $\hat{\mathbf{w}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ is linear in $\mathbf{y}$ . For any other linear unbiased estimator $\tilde{\mathbf{w}} = \mathbf{C}\mathbf{y}$ where $\mathbb{E}[\tilde{\mathbf{w}}] = \mathbf{w}$ :

$\text{Var}(\hat{w}_j) \leq \text{Var}(\tilde{w}_j) \quad \forall j$

What Gauss-Markov requires:

Linearity: true model is $\mathbf{y} = \mathbf{X}\mathbf{w} + \boldsymbol{\epsilon}$
Independence: errors $\epsilon_i$ are uncorrelated: $\text{Cov}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$
Zero mean: $\mathbb{E}[\boldsymbol{\epsilon}] = \mathbf{0}$
Homoscedasticity: $\text{Var}(\epsilon_i) = \sigma^2$ (constant across all observations)

What Gauss-Markov does NOT require:

Normality of errors (only needed for hypothesis testing and confidence intervals)
Independence across different features (multicollinearity affects precision, not unbiasedness)

This is why OLS remains the workhorse of statistics even when residuals are not perfectly Gaussian. When homoscedasticity fails, Weighted Least Squares (WLS) becomes BLUE instead.

The LINE Assumptions

The four OLS assumptions - LINE - are what you need to check before trusting any linear regression model in production.

L - Linearity

The relationship between features and target is truly linear in the parameters:

$\mathbb{E}[y | \mathbf{x}] = \mathbf{w}^\top\mathbf{x}$

What breaks: Predictions become systematically biased in a predictable direction. Residuals show a curved pattern when plotted against fitted values - you can see the model consistently over-predicting in one range and under-predicting in another.

Detection: Plot residuals vs. fitted values. A U-shaped or inverted-U curve reveals nonlinearity. The smooth trend line through the residuals should be flat at zero.

Fix: Add polynomial features (Lesson 06), apply log/sqrt transforms to skewed predictors, or switch to a nonlinear model. For physical processes with known functional forms, derive the correct transformation analytically.

I - Independence

Observations are independent of each other:

$\text{Cov}(\epsilon_i, \epsilon_j) = 0 \text{ for } i \neq j$

What breaks: Standard errors are underestimated. Confidence intervals are too narrow. Hypothesis tests give false significance - you think coefficients are statistically significant when they are not. In time series, consecutive residuals have a wave pattern: above zero for a stretch, then below zero for a stretch.

Detection: Durbin-Watson test (value of 2 = no autocorrelation; near 0 = strong positive autocorrelation; near 4 = strong negative autocorrelation). Plot the ACF (autocorrelation function) of residuals - significant spikes at non-zero lags indicate autocorrelation.

Fix: Use time-series models (ARIMA), add lag features as predictors, use GLS (Generalized Least Squares) with an autocorrelation model, or cluster standard errors.

N - Normality of Residuals

Residuals are approximately normally distributed:

$\epsilon_i \sim \mathcal{N}(0, \sigma^2)$

What breaks: Prediction intervals become inaccurate. The OLS estimator is still unbiased (Gauss-Markov holds without normality for point estimates), but confidence intervals, hypothesis tests, and prediction intervals lose validity. In very non-normal cases, outliers have excessive influence.

Detection: Q-Q plot of residuals - points should fall close to the diagonal. Shapiro-Wilk test for small samples ( $n < 50$ ). In large samples, by CLT, even non-Gaussian residuals give approximately valid inference.

Fix: Log-transform the target variable (common for right-skewed targets like salaries, prices, counts), use a Generalized Linear Model (Lesson 08) appropriate for the data distribution, or use bootstrapped confidence intervals.

:::note Gauss-Markov and normality The OLS estimator is BLUE even without Gaussian residuals. Normality only matters for the validity of t-tests, F-tests, confidence intervals, and prediction intervals - not for the quality of the point estimate. In large samples, the Central Limit Theorem ensures normality of the coefficient estimates approximately regardless of the residual distribution. :::

E - Equal Variance (Homoscedasticity)

The variance of residuals is constant across all values of the predictors:

$\text{Var}(\epsilon_i | \mathbf{x}_i) = \sigma^2 \quad \text{(constant for all } i\text{)}$

What breaks: OLS is no longer BLUE - Weighted Least Squares is better. More importantly, standard errors are wrong. If variance increases with the fitted value (fan shape), OLS underestimates standard errors for large predictions and overestimates them for small predictions. This creates false precision exactly where it matters most.

Detection: Plot $\sqrt{|\text{standardized residuals}|}$ against fitted values (the scale-location plot). A horizontal band = homoscedasticity; an upward slope = heteroscedasticity.

Fix: Log-transform the target (if variance scales with mean - common in economics), use Weighted Least Squares (WLS) with weights proportional to $1/\hat{\sigma}^2_i$ , or switch to a GLM with appropriate variance function (Lesson 08).

Heteroscedasticity in Detail

Heteroscedasticity is common whenever the magnitude of a quantity scales with its value: revenue, salaries, house prices, biological measurements.

The classic formal test is the Breusch-Pagan test: regress the squared residuals $e_i^2$ on the original features. If this regression is statistically significant (the features predict the variance), heteroscedasticity is present.

import numpy as np
from scipy import stats

def breusch_pagan_test(X, residuals):
    """
    Breusch-Pagan test for heteroscedasticity.
    H0: homoscedastic (constant variance)
    H1: heteroscedastic (variance depends on X)
    Returns: (LM_statistic, p_value)
    """
    n = len(residuals)
    e_sq = residuals ** 2  # squared residuals
    X_aug = np.column_stack([np.ones(n), X])

    # Regress e^2 on X: does variance depend on the predictors?
    w, _, _, _ = np.linalg.lstsq(X_aug, e_sq, rcond=None)
    e_sq_hat = X_aug @ w

    # R^2 of the auxiliary regression
    ss_res = np.sum((e_sq - e_sq_hat) ** 2)
    ss_tot = np.sum((e_sq - np.mean(e_sq)) ** 2)
    r2_aux = 1 - ss_res / ss_tot

    # Test statistic: n * R^2 ~ chi^2(d) under H0
    lm_stat = n * r2_aux
    d = X.shape[1]
    p_value = 1 - stats.chi2.cdf(lm_stat, df=d)

    return lm_stat, p_value


# Example: heteroscedastic data (variance grows with x)
np.random.seed(42)
n = 300
X_het = np.random.uniform(0, 10, n).reshape(-1, 1)
# Variance proportional to X: classic heteroscedastic pattern
y_het = 2 * X_het.ravel() + np.random.randn(n) * X_het.ravel()

X_aug_het = np.column_stack([np.ones(n), X_het])
w_het, _, _, _ = np.linalg.lstsq(X_aug_het, y_het, rcond=None)
residuals_het = y_het - X_aug_het @ w_het

lm, pval = breusch_pagan_test(X_het, residuals_het)
print(f"Breusch-Pagan test: LM={lm:.2f}, p={pval:.4f}")
if pval < 0.05:
    print("  Heteroscedasticity detected! Consider log-transform or WLS.")
else:
    print("  No significant heteroscedasticity detected.")

Outlier Influence: Leverage and Cook's Distance

Not all data points influence the model equally. A point with extreme feature values that fits the overall trend (high leverage, small residual) has little influence on model fit. A point with extreme feature values that violates the overall trend (high leverage, large residual) has enormous influence and can completely distort the regression.

Leverage

How unusual is a point's feature vector relative to the rest of the data? The leverage of point $i$ is the $i$ -th diagonal of the hat matrix:

$h_{ii} = \mathbf{x}_i^\top (\mathbf{X}^\top\mathbf{X})^{-1} \mathbf{x}_i$

Leverage ranges from $0$ to $1$ . The average leverage is $(d+1)/n$ . A common threshold for high leverage is:

$h_{ii} > \frac{2(d+1)}{n}$

High leverage means the point is far from the center of the feature space. It has potential influence on the model, but only becomes actually influential if it also has a large residual.

Cook's Distance

Cook's distance $D_i$ measures how much all fitted values $\hat{\mathbf{y}}$ would change if point $i$ were removed from the dataset:

$D_i = \frac{\sum_{j=1}^n (\hat{y}_j - \hat{y}_{j(i)})^2}{(d+1) \cdot \hat{\sigma}^2}$

where $\hat{y}_{j(i)}$ is the prediction for observation $j$ from the model fit without observation $i$ . This can be computed efficiently from the hat matrix and residuals:

$D_i = \frac{e_i^2}{(d+1)\hat{\sigma}^2} \cdot \frac{h_{ii}}{(1-h_{ii})^2}$

Common threshold: $D_i > 4/n$ flags potentially influential points. $D_i > 1$ is a severe warning.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

def compute_leverage_cooks(X, y, y_pred):
    """
    Compute leverage (h_ii) and Cook's distance (D_i) for all points.
    Efficient computation without forming the full n×n hat matrix.

    Returns:
        h: leverage values, shape (n,)
        cooks_d: Cook's distance, shape (n,)
    """
    n, d = X.shape
    residuals = y - y_pred

    # Augment with intercept column
    X_aug = np.column_stack([np.ones(n), X])

    # Compute (X^T X)^{-1} - shape (d+1) x (d+1), manageable
    XtXinv = np.linalg.pinv(X_aug.T @ X_aug)

    # h_ii = x_i^T (X^T X)^{-1} x_i for each row
    # Efficient: h_ii = diag(X_aug @ XtXinv @ X_aug.T)
    # Compute row-by-row to avoid forming n×n matrix
    h = np.einsum('ij,jk,ik->i', X_aug, XtXinv, X_aug)

    # Residual variance estimate: MSE using n-d-1 degrees of freedom
    mse = np.sum(residuals ** 2) / (n - d - 1)

    # Cook's distance: D_i = (e_i^2 / ((d+1) * MSE)) * (h_ii / (1-h_ii)^2)
    safe_h = np.clip(h, 0, 0.9999)
    cooks_d = (residuals ** 2 * safe_h) / ((d + 1) * mse * (1 - safe_h) ** 2)

    return h, cooks_d


# Generate regression data and inject known outliers
np.random.seed(42)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# High-leverage, low-influence outlier: extreme X, normal y for that X
X_hl = np.random.randn(3, 5) * 8   # very far from data center
y_hl = np.zeros(3)                   # target also unusual (won't fit trend)

# Low-leverage, high-residual outlier: normal X, extreme y
X_lr = np.random.randn(3, 5)
y_lr = np.ones(3) * 500              # massive outlier in y

X_all = np.vstack([X, X_hl, X_lr])
y_all = np.concatenate([y, y_hl, y_lr])

# Fit OLS
X_aug = np.column_stack([np.ones(len(y_all)), X_all])
w_ols, _, _, _ = np.linalg.lstsq(X_aug, y_all, rcond=None)
y_pred = X_aug @ w_ols

h, cooks_d = compute_leverage_cooks(X_all, y_all, y_pred)

n_all = len(y_all)
threshold_h = 2 * (X_all.shape[1] + 1) / n_all
threshold_d = 4 / n_all

print(f"Sample size n = {n_all}")
print(f"Leverage threshold 2(d+1)/n = {threshold_h:.3f}")
print(f"Cook's D threshold 4/n = {threshold_d:.4f}")
print(f"High leverage points: {np.sum(h > threshold_h)}")
print(f"High Cook's D points: {np.sum(cooks_d > threshold_d)}")

top10 = np.argsort(cooks_d)[::-1][:10]
print("\nTop 10 most influential points:")
for idx in top10:
    print(f"  i={idx:3d}: Cook's D={cooks_d[idx]:.4f}, "
          f"leverage={h[idx]:.4f}, residual={y_all[idx]-y_pred[idx]:.2f}")

Full NumPy Implementation from Scratch

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

class LinearRegressionOLS:
    """
    OLS linear regression via the normal equation.
    Uses lstsq (SVD-based) for numerical stability - never directly inverts X^TX.
    Includes full summary statistics and diagnostic metrics.
    """

    def __init__(self, fit_intercept=True):
        self.fit_intercept = fit_intercept
        self.weights_ = None
        self.intercept_ = None
        self.coef_ = None
        self._singular_values = None
        self._rank = None

    def _augment(self, X):
        """Prepend a column of ones for the intercept term."""
        if self.fit_intercept:
            return np.column_stack([np.ones(X.shape[0]), X])
        return X

    def fit(self, X, y):
        X_aug = self._augment(X)
        # lstsq uses SVD decomposition - numerically stable, handles near-singular cases
        self.weights_, residuals, self._rank, self._singular_values = (
            np.linalg.lstsq(X_aug, y, rcond=None)
        )
        if self.fit_intercept:
            self.intercept_ = self.weights_[0]
            self.coef_ = self.weights_[1:]
        else:
            self.intercept_ = 0.0
            self.coef_ = self.weights_
        self._X_train = X_aug  # store for diagnostics
        self._y_train = y
        return self

    def predict(self, X):
        return self._augment(X) @ self.weights_

    def score(self, X, y):
        """Compute R² on the given data."""
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - ss_res / ss_tot

    def summary(self, X, y):
        """Print an R-style model summary with all key statistics."""
        y_pred = self.predict(X)
        n, d = X.shape

        residuals = y - y_pred
        mse = np.mean(residuals ** 2)
        rmse = np.sqrt(mse)
        mae = np.mean(np.abs(residuals))

        ss_res = np.sum(residuals ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        r2 = 1 - ss_res / ss_tot
        # Adjusted R² penalizes for number of features
        adj_r2 = 1 - (1 - r2) * (n - 1) / (n - d - 1)

        # AIC: penalizes for model complexity (2k penalty per parameter)
        sigma2_mle = ss_res / n
        aic = n * np.log(sigma2_mle) + 2 * (d + 2)

        # Condition number: ratio of largest to smallest singular value
        cond_num = (self._singular_values.max() / self._singular_values.min()
                    if len(self._singular_values) > 1 else 1.0)

        print("=" * 55)
        print("OLS Linear Regression Summary")
        print("=" * 55)
        print(f"  n (samples):      {n}")
        print(f"  d (features):     {d}")
        print(f"  Matrix rank:      {self._rank}")
        print(f"  Intercept:        {self.intercept_:.4f}")
        print(f"  Coefficients:     {np.round(self.coef_, 4)}")
        print(f"  R²:               {r2:.4f}")
        print(f"  Adjusted R²:      {adj_r2:.4f}")
        print(f"  RMSE:             {rmse:.4f}")
        print(f"  MAE:              {mae:.4f}")
        print(f"  AIC:              {aic:.2f}")
        print(f"  Condition number: {cond_num:.1f}")
        if cond_num > 1000:
            print("  *** WARNING: High condition number - possible multicollinearity ***")
        print("=" * 55)
        return r2, adj_r2, rmse


# ---- Experiment ----
np.random.seed(42)
X, y = make_regression(n_samples=500, n_features=5, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegressionOLS()
model.fit(X_train, y_train)
model.summary(X_train, y_train)

print(f"\nTest R²: {model.score(X_test, y_test):.4f}")

Residual Diagnostic Plots

Every linear model needs four canonical diagnostic plots before deployment. These match R's plot.lm() output:

import matplotlib.pyplot as plt
from scipy import stats

def plot_residual_diagnostics(y_true, y_pred, title="Residual Diagnostics"):
    """
    Four canonical OLS diagnostic plots:
    1. Residuals vs Fitted - checks linearity
    2. Normal Q-Q - checks normality
    3. Scale-Location - checks homoscedasticity
    4. Residual histogram - visual normality check
    """
    residuals = y_true - y_pred
    standardized = (residuals - residuals.mean()) / residuals.std()

    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    fig.suptitle(title, fontsize=14, fontweight='bold')

    # 1. Residuals vs Fitted - key diagnostic for linearity
    ax = axes[0, 0]
    ax.scatter(y_pred, residuals, alpha=0.4, s=20, color='steelblue')
    ax.axhline(0, color='red', linestyle='--', linewidth=1.5)
    # Smooth trend via low-degree polynomial
    z = np.polyfit(y_pred, residuals, 2)
    p = np.poly1d(z)
    x_line = np.linspace(y_pred.min(), y_pred.max(), 300)
    ax.plot(x_line, p(x_line), 'orange', linewidth=2, alpha=0.9, label='Smooth trend')
    ax.set_xlabel("Fitted values")
    ax.set_ylabel("Residuals")
    ax.set_title("Residuals vs Fitted\n(check: random scatter around zero = linearity OK)")
    ax.legend()

    # 2. Normal Q-Q plot - checks normality of residuals
    ax = axes[0, 1]
    (osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
    ax.scatter(osm, osr, alpha=0.4, s=20, color='steelblue')
    ax.plot(osm, slope * np.array(osm) + intercept, 'r--', linewidth=1.5, label=f'R={r:.3f}')
    ax.set_xlabel("Theoretical Quantiles (Standard Normal)")
    ax.set_ylabel("Sample Quantiles")
    ax.set_title("Normal Q-Q Plot\n(check: points on line = normality OK)")
    ax.legend()

    # 3. Scale-Location - checks homoscedasticity
    ax = axes[1, 0]
    ax.scatter(y_pred, np.sqrt(np.abs(standardized)), alpha=0.4, s=20, color='steelblue')
    z2 = np.polyfit(y_pred, np.sqrt(np.abs(standardized)), 1)
    p2 = np.poly1d(z2)
    ax.plot(x_line, p2(x_line), 'orange', linewidth=2, alpha=0.9)
    ax.set_xlabel("Fitted values")
    ax.set_ylabel("√|Standardized Residuals|")
    ax.set_title("Scale-Location\n(check: horizontal band = homoscedasticity OK)")

    # 4. Residual distribution - visual normality check
    ax = axes[1, 1]
    ax.hist(residuals, bins=30, edgecolor='black', alpha=0.7, color='steelblue',
            density=True)
    x_norm = np.linspace(residuals.min(), residuals.max(), 300)
    ax.plot(x_norm, stats.norm.pdf(x_norm, residuals.mean(), residuals.std()),
            'r-', linewidth=2, label='Normal PDF')
    ax.set_xlabel("Residual value")
    ax.set_ylabel("Density")
    ax.set_title("Residual Distribution\n(check: symmetric bell shape = normality OK)")
    ax.legend()

    plt.tight_layout()
    plt.savefig("residual_diagnostics.png", dpi=150, bbox_inches="tight")
    plt.show()
    print("Saved: residual_diagnostics.png")


# Generate diagnostics
y_pred_train = model.predict(X_train)
plot_residual_diagnostics(y_train, y_pred_train, "OLS Residual Diagnostics")

R² and Adjusted R²

$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$

R² tells you the proportion of variance in $y$ explained by the model. However, R² always increases when you add more features - even random noise features improve training R². This is because more features expand the column space and bring $\hat{\mathbf{y}}$ closer to $\mathbf{y}$ . Use Adjusted R² to penalize for model complexity:

$\bar{R}^2 = 1 - (1 - R^2) \cdot \frac{n-1}{n-d-1}$

where $n$ is the number of samples and $d$ is the number of features (excluding the intercept). Adding a useless feature increases $d$ , which increases the penalty term $(n-1)/(n-d-1)$ , potentially decreasing adjusted R² even though R² increased.

def regression_metrics(y_true, y_pred, n_features):
    """Compute all standard regression metrics for reporting."""
    n = len(y_true)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    rmse = np.sqrt(ss_res / n)
    mae = np.mean(np.abs(y_true - y_pred))

    # Corrected AIC (AICc) for small samples
    sigma2 = ss_res / n
    k = n_features + 2  # features + intercept + sigma
    aic = n * np.log(sigma2) + 2 * k
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1) if n > k + 1 else float('inf')

    return {
        "R2": round(r2, 5),
        "Adjusted_R2": round(adj_r2, 5),
        "RMSE": round(rmse, 4),
        "MAE": round(mae, 4),
        "AIC": round(aic, 2),
        "AICc": round(aicc, 2),
    }


y_pred_test = model.predict(X_test)
metrics = regression_metrics(y_test, y_pred_test, n_features=5)
print("Test metrics:")
for k, v in metrics.items():
    print(f"  {k}: {v}")

Multiple Regression: Interpreting Coefficients

In simple regression ( $d=1$ ), $w_1$ is the slope: for every unit increase in $x_1$ , $y$ increases by $w_1$ .

In multiple regression ( $d > 1$ ), $w_j$ is the partial regression coefficient: holding all other features constant, a unit increase in $x_j$ changes $y$ by $w_j$ . This conditional interpretation is powerful but fragile - it only holds if the features are truly independent of each other.

:::warning Coefficient interpretation with correlated features If features are correlated, partial regression coefficients are unreliable. Adding a correlated feature can flip the sign of an existing coefficient. This is Simpson's paradox in regression form - a phenomenon where a relationship reverses when a lurking variable is included. Always examine the correlation matrix and VIF before interpreting coefficients. :::

Detecting and Handling Multicollinearity

When features are correlated, $\mathbf{X}^\top\mathbf{X}$ becomes near-singular. The normal equation still produces a solution, but the coefficients are large, unstable, and have inflated standard errors. Diagnose with the Variance Inflation Factor (VIF):

$\text{VIF}_j = \frac{1}{1 - R^2_j}$

where $R^2_j$ is the R² obtained by regressing feature $j$ on all other features. VIF measures how much the variance of $\hat{w}_j$ is inflated by multicollinearity relative to orthogonal features:

VIF = 1: feature is completely uncorrelated with all others
VIF = 5: moderate multicollinearity - coefficient estimates are somewhat unstable
VIF > 10: severe multicollinearity - coefficients are unreliable

import pandas as pd
from sklearn.linear_model import LinearRegression

def compute_vif(X_df):
    """
    Compute VIF for each feature column in a DataFrame.
    VIF_j = 1/(1 - R^2_j) where R^2_j comes from regressing
    feature j on all other features.

    Parameters:
        X_df: pd.DataFrame with named feature columns

    Returns:
        pd.DataFrame with VIF for each feature, sorted descending
    """
    vif_results = {}
    cols = X_df.columns.tolist()

    for col in cols:
        y_vif = X_df[col].values
        X_vif = X_df.drop(columns=[col]).values

        m = LinearRegression().fit(X_vif, y_vif)
        r2 = m.score(X_vif, y_vif)

        # VIF = 1/(1-R^2); cap at 1e6 to avoid division by near-zero
        vif = 1 / (1 - r2) if r2 < 0.9999 else 1e6
        vif_results[col] = round(vif, 3)

    return (pd.DataFrame.from_dict(vif_results, orient='index', columns=['VIF'])
            .sort_values('VIF', ascending=False))


# Demonstrate multicollinearity
np.random.seed(42)
n = 300
x1 = np.random.randn(n)
x2 = x1 + 0.05 * np.random.randn(n)  # x2 nearly identical to x1 - correlation ~0.998
x3 = np.random.randn(n)               # x3 is independent

X_mc = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})
print("VIF analysis:")
print(compute_vif(X_mc))
print("\nInterpretation:")
print("  x1 and x2 have very high VIF (near-perfect collinearity)")
print("  x3 has VIF ≈ 1 (uncorrelated with others)")
print("  Fix: drop x2, use Ridge regression, or apply PCA")

Polynomial Regression

Linear regression generalizes to polynomial regression by expanding the feature set with powers of $x$ . For a single feature $x$ , degree-3 polynomial regression:

$\hat{y} = w_0 + w_1 x + w_2 x^2 + w_3 x^3$

This is still linear regression - the model is linear in the parameters $\mathbf{w}$ , not in $x$ . All OLS theory applies.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Nonlinear 1D data - true quadratic relationship
np.random.seed(42)
n = 150
X_1d = np.random.uniform(-3, 3, n).reshape(-1, 1)
y_nl = X_1d.ravel() ** 2 - 2 * X_1d.ravel() + np.random.randn(n) * 0.5
# True: y = x^2 - 2x + noise

print("Polynomial degree selection via 5-fold CV R²:")
print(f"{'Degree':>8} | {'CV R²':>10} | {'CV Std':>8} | {'Features':>10}")
print("-" * 45)

for degree in [1, 2, 3, 5, 8, 12]:
    pipe = Pipeline([
        ('poly', PolynomialFeatures(degree=degree, include_bias=True)),
        ('lr', LinearRegression()),
    ])
    cv_scores = cross_val_score(pipe, X_1d, y_nl, cv=5, scoring='r2')
    n_features = PolynomialFeatures(degree=degree).fit_transform(X_1d).shape[1]
    print(f"{degree:>8} | {cv_scores.mean():>10.4f} | {cv_scores.std():>8.4f} | {n_features:>10}")

# Expected: degree 2 should win (matches true quadratic)
# Degree 1: underfits. Degree 5+: overfits (CV R² drops)

scikit-learn with Full Diagnostics

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# sklearn uses LAPACK gelsd (divide-and-conquer SVD) - the gold standard
sk_model = LinearRegression(fit_intercept=True)
sk_model.fit(X_train, y_train)

print(f"sklearn R² (test): {sk_model.score(X_test, y_test):.5f}")
print(f"Intercept:         {sk_model.intercept_:.4f}")
print(f"Coefficients:      {sk_model.coef_}")

# Verify our implementation matches sklearn exactly
our_pred = model.predict(X_test)
sk_pred  = sk_model.predict(X_test)
max_diff  = np.max(np.abs(our_pred - sk_pred))
print(f"\nMax prediction difference (ours vs sklearn): {max_diff:.2e}")
# Should be < 1e-8 - only floating-point precision differences

# Production best practice: always scale features in a pipeline
# Note: scaling does NOT change OLS predictions mathematically,
# but it is crucial for gradient descent (Lesson 02) and Lasso (Lesson 05)
pipe_lr = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LinearRegression()),
])
pipe_lr.fit(X_train, y_train)
print(f"\nPipeline (scaled) test R²: {pipe_lr.score(X_test, y_test):.5f}")
# Should be identical to unscaled OLS (scale-equivariant)

Production Engineering Notes

Memory and Compute Scaling

The normal equation requires $O(nd^2)$ to form $\mathbf{X}^\top\mathbf{X}$ and $O(d^3)$ to solve. For $n=10^6, d=1000$ : forming the matrix requires $10^9$ multiply-adds. sklearn automatically uses LAPACK routines that can leverage multi-core CPUs, but for very large $n$ , gradient descent (Lesson 02) or SGDRegressor with mini-batches is the only viable approach.

When OLS Fails Silently

The most dangerous failure mode is when OLS runs successfully but produces misleading results:

Multicollinearity: R² looks fine but individual coefficient interpretations are meaningless. Large, unstable coefficients with wrong signs.
Outlier dominance: One high-leverage point pulls the entire regression line. Remove it and the slope changes dramatically.
Omitted variable bias: A feature correlated with both predictor and outcome is missing. Coefficients are biased - but there is no warning or error.
Extrapolation: Model deployed on feature distributions different from training data. Predictions become unreliable outside the training range.

Common Mistakes

:::danger Direct matrix inversion - always use lstsq or solve

# WRONG - numerically unstable, fails with singular or near-singular XtX
w = np.linalg.inv(X.T @ X) @ X.T @ y

# CORRECT - numerically stable via SVD
w, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

# ALSO CORRECT - solve the system rather than inverting
w = np.linalg.solve(X.T @ X, X.T @ y)

Direct inversion amplifies floating-point errors. The condition number of $\mathbf{X}^\top\mathbf{X}$ is the square of the condition number of $\mathbf{X}$ - any ill-conditioning is squared. With near-collinear features, np.linalg.inv will return wrong results without any warning. :::

:::danger Not checking VIF before interpreting coefficients R² can be 0.95 with VIFs in the hundreds. The model predicts well but individual coefficients are meaningless - they reflect the specific correlations in the training data, not the underlying causal structure. Always compute VIF before reporting coefficient interpretations to stakeholders, business partners, or anyone making decisions. :::

:::warning Evaluating linear models only on training R² R² on training data is essentially useless - it always increases with more features. Use adjusted R², cross-validated R², or hold-out RMSE. A model with training R² = 0.99 and test R² = 0.30 is completely overfit. Report multiple metrics including both train and test performance, and flag large gaps as warning signs. :::

:::warning Forgetting that OLS coefficients are not scale-invariant A coefficient of 0.001 for an income feature in dollars is large; 0.001 for a proportion is tiny. Features on different scales produce coefficients that cannot be compared. Always standardize features before comparing coefficient magnitudes, or report standardized beta coefficients alongside raw coefficients. :::

YouTube Resources

Resource	Channel	Why Watch
Linear Regression, Clearly Explained!	StatQuest with Josh Starmer	Best visual intuition for OLS and R² - start here
MIT 18.650 Statistics for Applications	MIT OpenCourseWare	Rigorous treatment of normal equations and Gauss-Markov theorem
Essence of Linear Algebra: Projections	3Blue1Brown	The geometric interpretation - column space and projection visualized beautifully
Residual Analysis in Linear Regression	StatQuest	Practical diagnostic plots - what to look for and what to do
Multiple Regression and Cook's Distance	StatQuest	Leverage, influence, and outlier detection with worked examples

Interview Q&A

Q1: Derive the normal equation from first principles. When would you not use it?

Starting from $\mathcal{L}(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$ , expand to get $\mathbf{y}^\top\mathbf{y} - 2\mathbf{w}^\top\mathbf{X}^\top\mathbf{y} + \mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w}$ . Take gradient using matrix calculus: $\nabla_\mathbf{w}\mathcal{L} = -2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\mathbf{w}$ . Set to zero and solve: $\hat{\mathbf{w}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ .

Avoid the normal equation when: (1) $n > 100{,}000$ - forming $\mathbf{X}^\top\mathbf{X}$ costs $O(nd^2)$ and inversion costs $O(d^3)$ ; (2) $d > 10{,}000$ - inversion of a $d \times d$ matrix is numerically unstable and slow; (3) features are highly multicollinear - $\mathbf{X}^\top\mathbf{X}$ is near-singular; (4) online/streaming data - normal equation requires all data at once; (5) Lasso - no closed form exists.

Q2: What does it mean geometrically to fit a linear regression?

Fitting a linear regression is projecting the target vector $\mathbf{y} \in \mathbb{R}^n$ onto the column space of the feature matrix $\mathbf{X}$ . The column space is the subspace of $\mathbb{R}^n$ spanned by the feature columns. The prediction $\hat{\mathbf{y}} = \mathbf{X}\hat{\mathbf{w}}$ is the closest point in that subspace to $\mathbf{y}$ , measured by Euclidean distance. The residual $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ is perpendicular to the column space - this is exactly the normal equation $\mathbf{X}^\top\mathbf{e} = \mathbf{0}$ . Adding a feature that is collinear with existing ones does not expand the column space, so predictions are unchanged but coefficient estimates become unstable.

Q3: What is the Gauss-Markov theorem and why does it matter?

Gauss-Markov states that under the LINE assumptions - linearity, independence, zero-mean errors, and homoscedasticity (but NOT necessarily Gaussian errors) - OLS is BLUE: the Best Linear Unbiased Estimator. It has minimum variance among all linear unbiased estimators. It matters because it provides a theoretical guarantee that no linear method can do better than OLS in terms of variance without introducing bias or using nonlinear estimation. Practically: when assumptions hold, OLS is provably optimal. When homoscedasticity fails, Weighted Least Squares (WLS) becomes BLUE because it weights observations inversely proportional to their error variance, giving less weight to noisy observations.

Q4: How does multicollinearity affect linear regression and what do you do about it?

Multicollinearity means features are highly correlated, making $\mathbf{X}^\top\mathbf{X}$ near-singular. Effects: (1) coefficients become large and unstable - small data changes cause large coefficient swings; (2) standard errors inflate, making individual coefficients statistically insignificant even when the joint model predicts well; (3) coefficient signs may flip relative to the true direction; (4) overall predictions remain fine - the model still projects $\mathbf{y}$ correctly onto the (unchanged) column space, but individual feature attributions are meaningless.

Detection: VIF > 10 signals severe multicollinearity. Correlation matrix shows pairwise correlations. Condition number of $\mathbf{X}^\top\mathbf{X}$ above 1000 indicates ill-conditioning.

Fix: Ridge regression adds $\lambda\mathbf{I}$ to $\mathbf{X}^\top\mathbf{X}$ , ensuring invertibility and shrinking unstable directions; or drop one of the highly correlated features; or apply PCA to create orthogonal principal components.

Q5: What is Cook's distance and when would you act on it?

Cook's distance $D_i = \frac{e_i^2}{(d+1)\hat{\sigma}^2} \cdot \frac{h_{ii}}{(1-h_{ii})^2}$ combines a point's residual size ( $e_i^2$ ) with its leverage ( $h_{ii}$ , how far the point's feature values are from the center of the training data). It measures how much all fitted coefficients would change if observation $i$ were removed.

A point with high leverage but small residual (fits the trend at an unusual feature location) has low Cook's D - it is not influential. A point with high leverage AND a large residual pulls the regression line toward it and has high Cook's D. Threshold: $D_i > 4/n$ flags potential influence; $D_i > 1$ is a serious warning.

Action: investigate flagged points first - is it a data entry error? A legitimate extreme case? Then either remove it (if it is a clear error), keep it and apply Huber loss for robustness, or use RANSAC which explicitly handles outliers by fitting the model to the inlier subset.

Q6: Your model has high training R² but poor test R². Walk me through your diagnostic process.

Systematic investigation: (1) Check residual plots for nonlinearity (curved pattern in residuals vs fitted) - may need polynomial features or a GLM; (2) Check for distribution shift between train and production by comparing feature histograms; (3) Compute VIF - if severe multicollinearity, the model is overfit to specific training correlations that may not hold in new data; (4) Examine Cook's distances - high-leverage outliers in training may distort the model; (5) Look for data leakage - features that inadvertently encode information about the target in training but not in production; (6) Check whether any categorical encodings or feature distributions have changed between training and test sets.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Regression Explorer demo on the EngineersOfAI Playground - no code required.

:::

The Real Interview Moment​

Why This Exists​

Historical Context​

The Least Squares Objective​

Deriving the Normal Equation​

Step-by-Step Scalar Derivation (Intuition)​

When NOT to Use the Normal Equation​

Geometric Interpretation: Projection onto Column Space​

The Gauss-Markov Theorem​

The LINE Assumptions​

L - Linearity​

I - Independence​

N - Normality of Residuals​

E - Equal Variance (Homoscedasticity)​

Heteroscedasticity in Detail​

Outlier Influence: Leverage and Cook's Distance​

Leverage​

Cook's Distance​

Full NumPy Implementation from Scratch​

Residual Diagnostic Plots​

R² and Adjusted R²​

Multiple Regression: Interpreting Coefficients​

Detecting and Handling Multicollinearity​

Polynomial Regression​

scikit-learn with Full Diagnostics​

Production Engineering Notes​

Memory and Compute Scaling​

When OLS Fails Silently​

Common Mistakes​

YouTube Resources​

Interview Q&A​