Regression Analysis: The Statistical Foundation of ML Models
Reading time: ~45 min | Interview relevance: Very High | Target roles: MLE, Data Scientist, Research Engineer, AI Engineer
The Production Scenario
A junior ML engineer replaces your carefully tuned neural network with a simple linear regression and achieves nearly identical performance on your tabular sales forecasting task. A more senior engineer explains: "Linear models are often competitive on tabular data with good features, and they are much more interpretable."
But do you know why linear regression works at all? Can you derive its solution from first principles? Do you understand when it will fail? Do you know why adding an penalty to the loss changes the statistical properties of the estimator?
These are not just academic questions. Understanding regression deeply gives you the statistical intuition to reason about all of ML: neural networks are nonlinear regression, logistic regression is the building block of classification, and the bias-variance tradeoff of regression estimators explains why regularisation is necessary.
Simple Linear Regression
The Model
We want to estimate (intercept) and (slope) from observations .
Statistical assumptions (GAUSS-MARKOV):
- Linearity:
- Independence: observations are independent
- Homoscedasticity: (constant variance)
- Normality of errors: (for inference)
OLS Derivation
Ordinary Least Squares (OLS) minimises the sum of squared residuals:
Setting partial derivatives to zero:
Substituting (1) into (2):
This is the sample covariance over sample variance - the slope is how much changes per unit change in , standardised by the spread of .
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
n = 100
x = np.linspace(0, 10, n)
beta_0_true, beta_1_true, sigma_true = 2.0, 1.5, 1.5
y = beta_0_true + beta_1_true * x + np.random.normal(0, sigma_true, n)
# OLS estimation from scratch
x_bar, y_bar = np.mean(x), np.mean(y)
beta_1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar)**2)
beta_0_hat = y_bar - beta_1_hat * x_bar
print(f"True: beta_0={beta_0_true}, beta_1={beta_1_true}")
print(f"Estimated: beta_0={beta_0_hat:.4f}, beta_1={beta_1_hat:.4f}")
# Residuals
y_hat = beta_0_hat + beta_1_hat * x
residuals = y - y_hat
rss = np.sum(residuals**2)
print(f"RSS: {rss:.4f}")
# Verify with numpy least squares
A = np.column_stack([np.ones(n), x])
coeffs, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
print(f"NumPy lstsq: beta_0={coeffs[0]:.4f}, beta_1={coeffs[1]:.4f}")
Multiple Linear Regression: Matrix Form
For predictors, the model is:
where , (design matrix with intercept column), .
OLS in Matrix Form
Minimise
Taking the gradient and setting to zero:
The Normal Equations. Solving:
This is the OLS closed-form solution - the direct solution to linear regression.
import numpy as np
def ols_fit(X, y):
"""OLS via normal equations. X should include intercept column."""
return np.linalg.solve(X.T @ X, X.T @ y)
def ols_fit_qr(X, y):
"""OLS via QR decomposition - numerically stable, preferred."""
Q, R = np.linalg.qr(X)
return np.linalg.solve(R, Q.T @ y)
np.random.seed(42)
n, p = 200, 5
X_raw = np.random.randn(n, p)
X = np.column_stack([np.ones(n), X_raw]) # add intercept
beta_true = np.array([1.0, 2.0, -1.5, 0.8, 0.3, -0.7])
y = X @ beta_true + np.random.normal(0, 0.5, n)
beta_ols = ols_fit(X, y)
beta_qr = ols_fit_qr(X, y)
print(f"True beta: {beta_true}")
print(f"OLS (normal eq): {beta_ols.round(4)}")
print(f"OLS (QR): {beta_qr.round(4)}")
# Standard errors using statsmodels
import statsmodels.api as sm
model = sm.OLS(y, X).fit()
print(f"\nStatsmodels summary:")
print(model.summary().tables[1])
Properties of the OLS Estimator
Under the Gauss-Markov assumptions, OLS is the Best Linear Unbiased Estimator (BLUE):
- Unbiased:
- Variance:
- Best: No other linear unbiased estimator has lower variance (Gauss-Markov theorem)
The covariance matrix reveals the precision of estimates - diagonal elements give the variance of each coefficient, off-diagonals show correlations between estimates.
Goodness of Fit: and Adjusted
- Coefficient of Determination
is the fraction of variance in explained by the model. means the model is no better than predicting the mean; means perfect fit.
Problem: never decreases when you add more predictors, even useless ones. This can mislead you into thinking more features always help.
Adjusted
Penalises for the number of predictors:
Adjusted can decrease when you add uninformative features - it is a better model selection criterion than raw .
import numpy as np
def r_squared(y_true, y_pred):
ss_res = np.sum((y_true - y_pred)**2)
ss_tot = np.sum((y_true - np.mean(y_true))**2)
return 1 - ss_res / ss_tot
def adjusted_r_squared(y_true, y_pred, n_features):
n = len(y_true)
r2 = r_squared(y_true, y_pred)
return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
np.random.seed(42)
n = 100
X_true = np.random.randn(n, 3)
X_noise = np.random.randn(n, 20) # 20 useless features
beta_true = np.array([1.0, -0.5, 0.8])
y = X_true @ beta_true + np.random.normal(0, 1.0, n)
# Model with just the true features
X_good = np.column_stack([np.ones(n), X_true])
beta_good, _, _, _ = np.linalg.lstsq(X_good, y, rcond=None)
y_hat_good = X_good @ beta_good
# Model with true features + noise
X_all = np.column_stack([np.ones(n), X_true, X_noise])
beta_all, _, _, _ = np.linalg.lstsq(X_all, y, rcond=None)
y_hat_all = X_all @ beta_all
r2_good = r_squared(y, y_hat_good)
r2_all = r_squared(y, y_hat_all)
adj_r2_good = adjusted_r_squared(y, y_hat_good, 3)
adj_r2_all = adjusted_r_squared(y, y_hat_all, 23)
print(f"Model with 3 true features: R² = {r2_good:.4f}, Adj R² = {adj_r2_good:.4f}")
print(f"Model with 23 features: R² = {r2_all:.4f}, Adj R² = {adj_r2_all:.4f}")
print("R² increases with more features; Adjusted R² reveals the correct model")
Residual Diagnostics
Good regression requires checking that assumptions are met. The residuals should be:
- Normally distributed (for valid inference)
- Homoscedastic (constant variance)
- Independent (no patterns)
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
def regression_diagnostics(X, y, title=""):
"""
Four standard regression diagnostic plots.
"""
# Fit model
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
residuals = y - y_hat
n, p = X.shape
# Standardised residuals
mse = np.sum(residuals**2) / (n - p)
leverage = np.diag(X @ np.linalg.pinv(X.T @ X) @ X.T)
std_resid = residuals / np.sqrt(mse * (1 - leverage))
print(f"Regression Diagnostics: {title}")
# Shapiro-Wilk normality test
stat, p_shapiro = stats.shapiro(residuals)
print(f"Shapiro-Wilk normality test: p={p_shapiro:.4f} "
f"({'Normal' if p_shapiro > 0.05 else 'Non-normal'})")
# Breusch-Pagan heteroscedasticity test (simplified)
fitted_vals = y_hat - np.mean(y_hat)
resid_sq = residuals**2
_, p_bp = stats.pearsonr(fitted_vals, resid_sq)
print(f"Heteroscedasticity (|corr| of fitted vs resid²): |r|={abs(p_bp):.4f}")
return residuals, std_resid
np.random.seed(42)
n = 100
x = np.random.randn(n)
y_good = 2 + 1.5*x + np.random.normal(0, 1, n) # well-behaved
X = np.column_stack([np.ones(n), x])
diagnostics = regression_diagnostics(X, y_good, "Good model")
Logistic Regression as a Generalised Linear Model
Linear regression assumes continuous outputs with Gaussian noise. Logistic regression extends this to binary outcomes by modelling the probability of class membership.
The Logistic Model
Instead of modelling directly, we model :
where is the sigmoid function. Taking the log-odds (logit transform):
The model is linear in the log-odds space. This makes logistic regression a linear model - it only creates a linear decision boundary.
Why the Sigmoid? A GLM Perspective
Logistic regression belongs to the Generalised Linear Model (GLM) family:
where is the link function. For logistic regression, is the logit function. For linear regression, is the identity. For Poisson regression (count data), is the log.
| Model | Distribution | Link Function | Use Case |
|---|---|---|---|
| Linear regression | Gaussian | Identity | Continuous outputs |
| Logistic regression | Bernoulli | Logit | Binary classification |
| Poisson regression | Poisson | Log | Count data |
| Softmax regression | Categorical | Softmax | Multi-class |
Fitting via MLE (Newton-Raphson)
Logistic regression has no closed-form solution - we use iterative optimisation. The log-likelihood:
The gradient:
The Hessian (used in Newton-Raphson):
where is the weight matrix.
import numpy as np
from scipy.special import expit # numerically stable sigmoid
def logistic_regression_fit(X, y, n_iter=100, tol=1e-6):
"""
Logistic regression via Newton-Raphson (IRLS).
This is what sklearn LogisticRegression uses internally.
"""
n, p = X.shape
beta = np.zeros(p)
for iteration in range(n_iter):
# Forward pass
p_hat = expit(X @ beta)
# Gradient of log-likelihood
gradient = X.T @ (y - p_hat)
# Hessian (negative definite)
W = p_hat * (1 - p_hat)
hessian = -(X.T * W) @ X # X^T W X
# Newton-Raphson update: beta = beta - H^{-1} * grad
# Equivalent to: beta = beta + (X^T W X)^{-1} X^T (y - p_hat)
delta = np.linalg.solve(-hessian, gradient)
beta += delta
if np.max(np.abs(delta)) < tol:
print(f"Converged at iteration {iteration+1}")
break
return beta
# Generate binary classification data
np.random.seed(42)
n = 300
X_raw = np.random.randn(n, 3)
X = np.column_stack([np.ones(n), X_raw])
beta_true = np.array([0.5, 1.2, -0.8, 0.6])
p_true = expit(X @ beta_true)
y = (np.random.rand(n) < p_true).astype(float)
beta_fit = logistic_regression_fit(X, y)
print(f"True beta: {beta_true}")
print(f"Fitted beta: {beta_fit.round(4)}")
# Verify with sklearn
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(fit_intercept=False, C=1e10, max_iter=1000) # no regularisation
lr.fit(X, y)
print(f"Sklearn beta: {lr.coef_[0].round(4)}")
:::tip ML Engineering Connection Understanding logistic regression as a GLM with a Bernoulli likelihood tells you why the sigmoid output of a neural network's final layer is meaningful as a probability - it is the canonical link function for binary outcomes. A binary classifier's output layer (linear + sigmoid + cross-entropy loss) is literally logistic regression applied to the neural network's penultimate feature representation. :::
Ridge Regression: Statistical Perspective
Ridge regression (L2 regularisation) modifies the OLS objective:
The closed-form solution:
Adding to before inverting:
- Makes the problem well-conditioned even when features are collinear
- Shrinks coefficients toward zero - introduces bias but reduces variance
- Is MAP estimation with a Gaussian prior
import numpy as np
import matplotlib.pyplot as plt
def ridge_regression(X, y, lambda_reg):
"""Ridge regression: (X^T X + lambda I)^{-1} X^T y"""
p = X.shape[1]
A = X.T @ X + lambda_reg * np.eye(p)
b = X.T @ y
return np.linalg.solve(A, b)
# Illustrate coefficient shrinkage as lambda increases
np.random.seed(42)
n, p = 100, 10
X = np.random.randn(n, p)
X = np.column_stack([np.ones(n), X])
beta_true = np.array([1.0, 2.0, -1.5, 0.8, 0.3, -0.7, 1.2, -0.4, 0.9, -1.1, 0.6])
y = X @ beta_true + np.random.normal(0, 1.0, n)
lambdas = np.logspace(-3, 3, 100)
coef_paths = np.array([ridge_regression(X, y, l) for l in lambdas])
print("Ridge coefficient shrinkage:")
print(f"{'Lambda':>10} | {'||beta||':>10} | {'Train R²':>10}")
print("-" * 38)
for lam in [0.001, 0.1, 1.0, 10.0, 100.0]:
beta = ridge_regression(X, y, lam)
y_hat = X @ beta
r2 = 1 - np.sum((y - y_hat)**2) / np.sum((y - np.mean(y))**2)
print(f"{lam:>10.3f} | {np.linalg.norm(beta):>10.4f} | {r2:>10.4f}")
Lasso Regression: Sparsity-Inducing
Lasso (Least Absolute Shrinkage and Selection Operator) uses L1 regularisation:
No closed-form solution (L1 norm is not differentiable at 0). Solved with coordinate descent or subgradient methods.
Key difference from Ridge: Lasso sets some coefficients exactly to zero - it performs automatic feature selection. This comes from the geometry of the L1 ball, which has corners at the axes.
from sklearn.linear_model import Ridge, Lasso, ElasticNet
# Compare OLS, Ridge, and Lasso
np.random.seed(42)
n, p_true, p_noise = 100, 5, 45
X_raw = np.random.randn(n, p_true + p_noise)
X = np.column_stack([np.ones(n), X_raw])
beta_sparse = np.zeros(p_true + p_noise + 1)
beta_sparse[:6] = [0.5, 2.0, -1.5, 1.0, 0.8, -1.2] # only first 5 features matter
y = X @ beta_sparse + np.random.normal(0, 0.5, n)
X_feat = X_raw # without intercept for sklearn
# Fit models
from sklearn.linear_model import LinearRegression
ols = LinearRegression().fit(X_feat, y)
ridge = Ridge(alpha=1.0).fit(X_feat, y)
lasso = Lasso(alpha=0.1, max_iter=10000).fit(X_feat, y)
n_nonzero_ols = np.sum(np.abs(ols.coef_) > 1e-10)
n_nonzero_ridge = np.sum(np.abs(ridge.coef_) > 1e-10)
n_nonzero_lasso = np.sum(np.abs(lasso.coef_) > 1e-10)
print(f"True non-zero features: 5")
print(f"OLS non-zero coefs: {n_nonzero_ols}")
print(f"Ridge non-zero coefs: {n_nonzero_ridge}")
print(f"Lasso non-zero coefs: {n_nonzero_lasso} <- Lasso selects features!")
Ridge vs Lasso vs ElasticNet
| Method | Penalty | Solution | Feature Selection | Use Case |
|---|---|---|---|---|
| OLS | None | Closed-form | No | Low-dim, no collinearity |
| Ridge | Closed-form | No (shrinks, not zeros) | Collinear features | |
| Lasso | Iterative | Yes (exact zeros) | Feature selection needed | |
| ElasticNet | Iterative | Yes | Groups of correlated features |
The Geometric View
L2 ball (Ridge): L1 ball (Lasso):
smooth corners at axes
┌──────┐ /\
/ \ / \
│ ● │ / \
\ / ◄ ● ►
└──────┘ \ /
\ /
\/
Minimum of (OLS loss) on the constraint set:
- Ridge: tangent point is rarely at axis → rarely zero coefficients
- Lasso: tangent often hits a corner → exact zeros → sparsity
Complete Regression Pipeline with Statsmodels
import numpy as np
import statsmodels.api as sm
np.random.seed(42)
n = 200
x1 = np.random.randn(n)
x2 = np.random.randn(n)
y = 1.5 + 2.0*x1 - 0.8*x2 + np.random.normal(0, 1.0, n)
X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.OLS(y, X).fit()
print(model.summary())
# Key statistics to understand:
print(f"\nR-squared: {model.rsquared:.4f}")
print(f"Adjusted R-squared: {model.rsquared_adj:.4f}")
print(f"F-statistic: {model.fvalue:.4f} (p = {model.f_pvalue:.6f})")
print(f"\nCoefficients:")
for name, coef, se, t, p in zip(
['const', 'x1', 'x2'],
model.params, model.bse, model.tvalues, model.pvalues
):
sig = "***" if p < 0.001 else "**" if p < 0.01 else "*" if p < 0.05 else ""
print(f" {name:6s}: {coef:7.4f} SE={se:.4f} t={t:7.3f} p={p:.4f} {sig}")
Interview Q&A
Q1: Derive the OLS estimator for simple linear regression.
We minimise . Taking gives . Taking and substituting gives . This equals the covariance of and divided by the variance of - how much varies with relative to how much varies. Under the Gauss-Markov assumptions, OLS is BLUE (Best Linear Unbiased Estimator).
Q2: What is the difference between and Adjusted ? When should you use each?
measures the fraction of variance explained. It never decreases when adding predictors, even irrelevant ones. Adjusted penalises for the number of parameters . It can decrease when adding uninformative features. Use to measure a fixed model's explanatory power. Use adjusted to compare models with different numbers of features. For model selection, AIC and BIC are often better than adjusted .
Q3: Why is logistic regression called "regression" if it does classification?
Logistic regression models the log-odds of the outcome as a linear function: . It is a generalised linear model (GLM) where the link function is the logit. The "regression" refers to the linear model in log-odds space. The predicted probability is , and classification is done by thresholding this probability. The decision boundary is linear in the original feature space. It is "regression" because it regresses the log-odds on the features.
Q4: What is the statistical interpretation of Ridge and Lasso regularisation?
Ridge (L2) regularisation is MAP estimation with a Gaussian prior . The penalty is the negative log of this Gaussian prior. Ridge shrinks all coefficients but keeps all features. Lasso (L1) regularisation is MAP with a Laplace prior . The Laplace distribution has a sharp peak at zero, which promotes sparsity - Lasso sets some coefficients exactly to zero, performing feature selection. The L1 ball's corners at the axes cause this sparsity geometrically.
Q5: What are the Gauss-Markov assumptions, and what happens when they are violated in ML settings?
The Gauss-Markov assumptions for OLS to be BLUE are: (1) Linearity - the true model is linear in parameters; (2) Exogeneity - (errors are uncorrelated with features); (3) Spherical errors - (homoscedastic, uncorrelated errors); (4) No perfect multicollinearity. Violations in ML: if errors are heteroscedastic (variance changes with ), coefficient estimates are still unbiased but confidence intervals are wrong - use robust standard errors. If features are correlated with errors (omitted variable bias), OLS is biased - this is the causal inference problem. If features are collinear, becomes ill-conditioned - use Ridge to stabilise.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Regression Explorer demo on the EngineersOfAI Playground - no code required.
:::
