Skip to main content

Regression Analysis: The Statistical Foundation of ML Models

Reading time: ~45 min | Interview relevance: Very High | Target roles: MLE, Data Scientist, Research Engineer, AI Engineer

The Production Scenario

A junior ML engineer replaces your carefully tuned neural network with a simple linear regression and achieves nearly identical performance on your tabular sales forecasting task. A more senior engineer explains: "Linear models are often competitive on tabular data with good features, and they are much more interpretable."

But do you know why linear regression works at all? Can you derive its solution from first principles? Do you understand when it will fail? Do you know why adding an L2L^2 penalty to the loss changes the statistical properties of the estimator?

These are not just academic questions. Understanding regression deeply gives you the statistical intuition to reason about all of ML: neural networks are nonlinear regression, logistic regression is the building block of classification, and the bias-variance tradeoff of regression estimators explains why regularisation is necessary.

Simple Linear Regression

The Model

yi=β0+β1xi+ϵi,ϵiN(0,σ2)y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2)

We want to estimate β0\beta_0 (intercept) and β1\beta_1 (slope) from nn observations {(xi,yi)}i=1n\{(x_i, y_i)\}_{i=1}^n.

Statistical assumptions (GAUSS-MARKOV):

  1. Linearity: E[yx]=β0+β1x\mathbb{E}[y|x] = \beta_0 + \beta_1 x
  2. Independence: observations are independent
  3. Homoscedasticity: Var(ϵi)=σ2\text{Var}(\epsilon_i) = \sigma^2 (constant variance)
  4. Normality of errors: ϵiN(0,σ2)\epsilon_i \sim \mathcal{N}(0, \sigma^2) (for inference)

OLS Derivation

Ordinary Least Squares (OLS) minimises the sum of squared residuals:

RSS(β0,β1)=i=1n(yiβ0β1xi)2\text{RSS}(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2

Setting partial derivatives to zero:

RSSβ0=2i=1n(yiβ0β1xi)=0\frac{\partial \text{RSS}}{\partial \beta_0} = -2\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i) = 0 β^0=yˉβ^1xˉ...(1)\Rightarrow \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \quad \text{...(1)}

RSSβ1=2i=1nxi(yiβ0β1xi)=0...(2)\frac{\partial \text{RSS}}{\partial \beta_1} = -2\sum_{i=1}^n x_i(y_i - \beta_0 - \beta_1 x_i) = 0 \quad \text{...(2)}

Substituting (1) into (2):

β^1=i=1n(xixˉ)(yiyˉ)i=1n(xixˉ)2=Cov(x,y)Var(x)\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{\text{Cov}(x, y)}{\text{Var}(x)}

This is the sample covariance over sample variance - the slope is how much yy changes per unit change in xx, standardised by the spread of xx.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
n = 100
x = np.linspace(0, 10, n)
beta_0_true, beta_1_true, sigma_true = 2.0, 1.5, 1.5
y = beta_0_true + beta_1_true * x + np.random.normal(0, sigma_true, n)

# OLS estimation from scratch
x_bar, y_bar = np.mean(x), np.mean(y)
beta_1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar)**2)
beta_0_hat = y_bar - beta_1_hat * x_bar

print(f"True: beta_0={beta_0_true}, beta_1={beta_1_true}")
print(f"Estimated: beta_0={beta_0_hat:.4f}, beta_1={beta_1_hat:.4f}")

# Residuals
y_hat = beta_0_hat + beta_1_hat * x
residuals = y - y_hat
rss = np.sum(residuals**2)
print(f"RSS: {rss:.4f}")

# Verify with numpy least squares
A = np.column_stack([np.ones(n), x])
coeffs, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
print(f"NumPy lstsq: beta_0={coeffs[0]:.4f}, beta_1={coeffs[1]:.4f}")

Multiple Linear Regression: Matrix Form

For pp predictors, the model is:

y=Xβ+ϵ\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}

where yRn\mathbf{y} \in \mathbb{R}^n, XRn×(p+1)\mathbf{X} \in \mathbb{R}^{n \times (p+1)} (design matrix with intercept column), βRp+1\boldsymbol{\beta} \in \mathbb{R}^{p+1}.

OLS in Matrix Form

Minimise RSS=yXβ2=(yXβ)(yXβ)\text{RSS} = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})

Taking the gradient and setting to zero:

βRSS=2X(yXβ)=0\nabla_\beta \text{RSS} = -2\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = 0

XXβ^=Xy\mathbf{X}^\top\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^\top\mathbf{y}

The Normal Equations. Solving:

β^=(XX)1Xy\boxed{\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}}

This is the OLS closed-form solution - the direct solution to linear regression.

import numpy as np

def ols_fit(X, y):
"""OLS via normal equations. X should include intercept column."""
return np.linalg.solve(X.T @ X, X.T @ y)

def ols_fit_qr(X, y):
"""OLS via QR decomposition - numerically stable, preferred."""
Q, R = np.linalg.qr(X)
return np.linalg.solve(R, Q.T @ y)

np.random.seed(42)
n, p = 200, 5
X_raw = np.random.randn(n, p)
X = np.column_stack([np.ones(n), X_raw]) # add intercept

beta_true = np.array([1.0, 2.0, -1.5, 0.8, 0.3, -0.7])
y = X @ beta_true + np.random.normal(0, 0.5, n)

beta_ols = ols_fit(X, y)
beta_qr = ols_fit_qr(X, y)

print(f"True beta: {beta_true}")
print(f"OLS (normal eq): {beta_ols.round(4)}")
print(f"OLS (QR): {beta_qr.round(4)}")

# Standard errors using statsmodels
import statsmodels.api as sm
model = sm.OLS(y, X).fit()
print(f"\nStatsmodels summary:")
print(model.summary().tables[1])

Properties of the OLS Estimator

Under the Gauss-Markov assumptions, OLS is the Best Linear Unbiased Estimator (BLUE):

  • Unbiased: E[β^]=β\mathbb{E}[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}
  • Variance: Var(β^)=σ2(XX)1\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}
  • Best: No other linear unbiased estimator has lower variance (Gauss-Markov theorem)

The covariance matrix (XX)1(\mathbf{X}^\top\mathbf{X})^{-1} reveals the precision of estimates - diagonal elements give the variance of each coefficient, off-diagonals show correlations between estimates.

Goodness of Fit: R2R^2 and Adjusted R2R^2

R2R^2 - Coefficient of Determination

R2=1RSSTSS=1i(yiy^i)2i(yiyˉ)2R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

R2R^2 is the fraction of variance in yy explained by the model. R2=0R^2 = 0 means the model is no better than predicting the mean; R2=1R^2 = 1 means perfect fit.

Problem: R2R^2 never decreases when you add more predictors, even useless ones. This can mislead you into thinking more features always help.

Adjusted R2R^2

Penalises for the number of predictors:

Rˉ2=1RSS/(np1)TSS/(n1)=1(1R2)n1np1\bar{R}^2 = 1 - \frac{\text{RSS}/(n-p-1)}{\text{TSS}/(n-1)} = 1 - (1-R^2)\frac{n-1}{n-p-1}

Adjusted R2R^2 can decrease when you add uninformative features - it is a better model selection criterion than raw R2R^2.

import numpy as np

def r_squared(y_true, y_pred):
ss_res = np.sum((y_true - y_pred)**2)
ss_tot = np.sum((y_true - np.mean(y_true))**2)
return 1 - ss_res / ss_tot

def adjusted_r_squared(y_true, y_pred, n_features):
n = len(y_true)
r2 = r_squared(y_true, y_pred)
return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

np.random.seed(42)
n = 100
X_true = np.random.randn(n, 3)
X_noise = np.random.randn(n, 20) # 20 useless features

beta_true = np.array([1.0, -0.5, 0.8])
y = X_true @ beta_true + np.random.normal(0, 1.0, n)

# Model with just the true features
X_good = np.column_stack([np.ones(n), X_true])
beta_good, _, _, _ = np.linalg.lstsq(X_good, y, rcond=None)
y_hat_good = X_good @ beta_good

# Model with true features + noise
X_all = np.column_stack([np.ones(n), X_true, X_noise])
beta_all, _, _, _ = np.linalg.lstsq(X_all, y, rcond=None)
y_hat_all = X_all @ beta_all

r2_good = r_squared(y, y_hat_good)
r2_all = r_squared(y, y_hat_all)
adj_r2_good = adjusted_r_squared(y, y_hat_good, 3)
adj_r2_all = adjusted_r_squared(y, y_hat_all, 23)

print(f"Model with 3 true features: R² = {r2_good:.4f}, Adj R² = {adj_r2_good:.4f}")
print(f"Model with 23 features: R² = {r2_all:.4f}, Adj R² = {adj_r2_all:.4f}")
print("R² increases with more features; Adjusted R² reveals the correct model")

Residual Diagnostics

Good regression requires checking that assumptions are met. The residuals ϵ^i=yiy^i\hat{\epsilon}_i = y_i - \hat{y}_i should be:

  1. Normally distributed (for valid inference)
  2. Homoscedastic (constant variance)
  3. Independent (no patterns)
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

def regression_diagnostics(X, y, title=""):
"""
Four standard regression diagnostic plots.
"""
# Fit model
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
residuals = y - y_hat
n, p = X.shape

# Standardised residuals
mse = np.sum(residuals**2) / (n - p)
leverage = np.diag(X @ np.linalg.pinv(X.T @ X) @ X.T)
std_resid = residuals / np.sqrt(mse * (1 - leverage))

print(f"Regression Diagnostics: {title}")

# Shapiro-Wilk normality test
stat, p_shapiro = stats.shapiro(residuals)
print(f"Shapiro-Wilk normality test: p={p_shapiro:.4f} "
f"({'Normal' if p_shapiro > 0.05 else 'Non-normal'})")

# Breusch-Pagan heteroscedasticity test (simplified)
fitted_vals = y_hat - np.mean(y_hat)
resid_sq = residuals**2
_, p_bp = stats.pearsonr(fitted_vals, resid_sq)
print(f"Heteroscedasticity (|corr| of fitted vs resid²): |r|={abs(p_bp):.4f}")

return residuals, std_resid

np.random.seed(42)
n = 100
x = np.random.randn(n)
y_good = 2 + 1.5*x + np.random.normal(0, 1, n) # well-behaved

X = np.column_stack([np.ones(n), x])
diagnostics = regression_diagnostics(X, y_good, "Good model")

Logistic Regression as a Generalised Linear Model

Linear regression assumes continuous outputs with Gaussian noise. Logistic regression extends this to binary outcomes by modelling the probability of class membership.

The Logistic Model

Instead of modelling yy directly, we model P(y=1x)P(y=1|x):

P(y=1x)=σ(xβ)=11+exβP(y=1 | \mathbf{x}) = \sigma(\mathbf{x}^\top\boldsymbol{\beta}) = \frac{1}{1 + e^{-\mathbf{x}^\top\boldsymbol{\beta}}}

where σ\sigma is the sigmoid function. Taking the log-odds (logit transform):

logP(y=1x)P(y=0x)=xβ\log\frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})} = \mathbf{x}^\top\boldsymbol{\beta}

The model is linear in the log-odds space. This makes logistic regression a linear model - it only creates a linear decision boundary.

Why the Sigmoid? A GLM Perspective

Logistic regression belongs to the Generalised Linear Model (GLM) family:

g(E[yx])=xβg(\mathbb{E}[y|x]) = \mathbf{x}^\top\boldsymbol{\beta}

where gg is the link function. For logistic regression, gg is the logit function. For linear regression, gg is the identity. For Poisson regression (count data), gg is the log.

ModelDistributionLink FunctionUse Case
Linear regressionGaussianIdentityContinuous outputs
Logistic regressionBernoulliLogitBinary classification
Poisson regressionPoissonLogCount data
Softmax regressionCategoricalSoftmaxMulti-class

Fitting via MLE (Newton-Raphson)

Logistic regression has no closed-form solution - we use iterative optimisation. The log-likelihood:

(β)=i=1n[yilogp^i+(1yi)log(1p^i)]\ell(\boldsymbol{\beta}) = \sum_{i=1}^n \left[y_i \log\hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\right]

The gradient:

β=X(yp^)\nabla_\beta \ell = \mathbf{X}^\top(\mathbf{y} - \hat{\mathbf{p}})

The Hessian (used in Newton-Raphson):

H=XWX\mathbf{H} = -\mathbf{X}^\top \mathbf{W} \mathbf{X}

where W=diag(p^i(1p^i))\mathbf{W} = \text{diag}(\hat{p}_i(1-\hat{p}_i)) is the weight matrix.

import numpy as np
from scipy.special import expit # numerically stable sigmoid

def logistic_regression_fit(X, y, n_iter=100, tol=1e-6):
"""
Logistic regression via Newton-Raphson (IRLS).
This is what sklearn LogisticRegression uses internally.
"""
n, p = X.shape
beta = np.zeros(p)

for iteration in range(n_iter):
# Forward pass
p_hat = expit(X @ beta)

# Gradient of log-likelihood
gradient = X.T @ (y - p_hat)

# Hessian (negative definite)
W = p_hat * (1 - p_hat)
hessian = -(X.T * W) @ X # X^T W X

# Newton-Raphson update: beta = beta - H^{-1} * grad
# Equivalent to: beta = beta + (X^T W X)^{-1} X^T (y - p_hat)
delta = np.linalg.solve(-hessian, gradient)
beta += delta

if np.max(np.abs(delta)) < tol:
print(f"Converged at iteration {iteration+1}")
break

return beta

# Generate binary classification data
np.random.seed(42)
n = 300
X_raw = np.random.randn(n, 3)
X = np.column_stack([np.ones(n), X_raw])
beta_true = np.array([0.5, 1.2, -0.8, 0.6])
p_true = expit(X @ beta_true)
y = (np.random.rand(n) < p_true).astype(float)

beta_fit = logistic_regression_fit(X, y)
print(f"True beta: {beta_true}")
print(f"Fitted beta: {beta_fit.round(4)}")

# Verify with sklearn
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(fit_intercept=False, C=1e10, max_iter=1000) # no regularisation
lr.fit(X, y)
print(f"Sklearn beta: {lr.coef_[0].round(4)}")

:::tip ML Engineering Connection Understanding logistic regression as a GLM with a Bernoulli likelihood tells you why the sigmoid output of a neural network's final layer is meaningful as a probability - it is the canonical link function for binary outcomes. A binary classifier's output layer (linear + sigmoid + cross-entropy loss) is literally logistic regression applied to the neural network's penultimate feature representation. :::

Ridge Regression: Statistical Perspective

Ridge regression (L2 regularisation) modifies the OLS objective:

β^Ridge=argminβyXβ2+λβ2\hat{\boldsymbol{\beta}}_{\text{Ridge}} = \arg\min_\beta \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda\|\boldsymbol{\beta}\|^2

The closed-form solution:

β^Ridge=(XX+λI)1Xy\hat{\boldsymbol{\beta}}_{\text{Ridge}} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}

Adding λI\lambda\mathbf{I} to XX\mathbf{X}^\top\mathbf{X} before inverting:

  1. Makes the problem well-conditioned even when features are collinear
  2. Shrinks coefficients toward zero - introduces bias but reduces variance
  3. Is MAP estimation with a Gaussian prior βN(0,σ2/λI)\beta \sim \mathcal{N}(0, \sigma^2/\lambda \cdot I)
import numpy as np
import matplotlib.pyplot as plt

def ridge_regression(X, y, lambda_reg):
"""Ridge regression: (X^T X + lambda I)^{-1} X^T y"""
p = X.shape[1]
A = X.T @ X + lambda_reg * np.eye(p)
b = X.T @ y
return np.linalg.solve(A, b)

# Illustrate coefficient shrinkage as lambda increases
np.random.seed(42)
n, p = 100, 10
X = np.random.randn(n, p)
X = np.column_stack([np.ones(n), X])
beta_true = np.array([1.0, 2.0, -1.5, 0.8, 0.3, -0.7, 1.2, -0.4, 0.9, -1.1, 0.6])
y = X @ beta_true + np.random.normal(0, 1.0, n)

lambdas = np.logspace(-3, 3, 100)
coef_paths = np.array([ridge_regression(X, y, l) for l in lambdas])

print("Ridge coefficient shrinkage:")
print(f"{'Lambda':>10} | {'||beta||':>10} | {'Train R²':>10}")
print("-" * 38)
for lam in [0.001, 0.1, 1.0, 10.0, 100.0]:
beta = ridge_regression(X, y, lam)
y_hat = X @ beta
r2 = 1 - np.sum((y - y_hat)**2) / np.sum((y - np.mean(y))**2)
print(f"{lam:>10.3f} | {np.linalg.norm(beta):>10.4f} | {r2:>10.4f}")

Lasso Regression: Sparsity-Inducing

Lasso (Least Absolute Shrinkage and Selection Operator) uses L1 regularisation:

β^Lasso=argminβyXβ2+λβ1\hat{\boldsymbol{\beta}}_{\text{Lasso}} = \arg\min_\beta \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda\|\boldsymbol{\beta}\|_1

No closed-form solution (L1 norm is not differentiable at 0). Solved with coordinate descent or subgradient methods.

Key difference from Ridge: Lasso sets some coefficients exactly to zero - it performs automatic feature selection. This comes from the geometry of the L1 ball, which has corners at the axes.

from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Compare OLS, Ridge, and Lasso
np.random.seed(42)
n, p_true, p_noise = 100, 5, 45
X_raw = np.random.randn(n, p_true + p_noise)
X = np.column_stack([np.ones(n), X_raw])
beta_sparse = np.zeros(p_true + p_noise + 1)
beta_sparse[:6] = [0.5, 2.0, -1.5, 1.0, 0.8, -1.2] # only first 5 features matter
y = X @ beta_sparse + np.random.normal(0, 0.5, n)

X_feat = X_raw # without intercept for sklearn

# Fit models
from sklearn.linear_model import LinearRegression
ols = LinearRegression().fit(X_feat, y)
ridge = Ridge(alpha=1.0).fit(X_feat, y)
lasso = Lasso(alpha=0.1, max_iter=10000).fit(X_feat, y)

n_nonzero_ols = np.sum(np.abs(ols.coef_) > 1e-10)
n_nonzero_ridge = np.sum(np.abs(ridge.coef_) > 1e-10)
n_nonzero_lasso = np.sum(np.abs(lasso.coef_) > 1e-10)

print(f"True non-zero features: 5")
print(f"OLS non-zero coefs: {n_nonzero_ols}")
print(f"Ridge non-zero coefs: {n_nonzero_ridge}")
print(f"Lasso non-zero coefs: {n_nonzero_lasso} <- Lasso selects features!")

Ridge vs Lasso vs ElasticNet

MethodPenaltySolutionFeature SelectionUse Case
OLSNoneClosed-formNoLow-dim, no collinearity
Ridgeλβ22\lambda\|\beta\|_2^2Closed-formNo (shrinks, not zeros)Collinear features
Lassoλβ1\lambda\|\beta\|_1IterativeYes (exact zeros)Feature selection needed
ElasticNetλ1β1+λ2β22\lambda_1\|\beta\|_1 + \lambda_2\|\beta\|_2^2IterativeYesGroups of correlated features

The Geometric View

L2 ball (Ridge): L1 ball (Lasso):
smooth corners at axes

┌──────┐ /\
/ \ / \
│ ● │ / \
\ / ◄ ● ►
└──────┘ \ /
\ /
\/

Minimum of (OLS loss) on the constraint set:
- Ridge: tangent point is rarely at axis → rarely zero coefficients
- Lasso: tangent often hits a corner → exact zeros → sparsity

Complete Regression Pipeline with Statsmodels

import numpy as np
import statsmodels.api as sm

np.random.seed(42)
n = 200
x1 = np.random.randn(n)
x2 = np.random.randn(n)
y = 1.5 + 2.0*x1 - 0.8*x2 + np.random.normal(0, 1.0, n)

X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.OLS(y, X).fit()

print(model.summary())

# Key statistics to understand:
print(f"\nR-squared: {model.rsquared:.4f}")
print(f"Adjusted R-squared: {model.rsquared_adj:.4f}")
print(f"F-statistic: {model.fvalue:.4f} (p = {model.f_pvalue:.6f})")
print(f"\nCoefficients:")
for name, coef, se, t, p in zip(
['const', 'x1', 'x2'],
model.params, model.bse, model.tvalues, model.pvalues
):
sig = "***" if p < 0.001 else "**" if p < 0.01 else "*" if p < 0.05 else ""
print(f" {name:6s}: {coef:7.4f} SE={se:.4f} t={t:7.3f} p={p:.4f} {sig}")

Interview Q&A

Q1: Derive the OLS estimator for simple linear regression.

We minimise RSS=i(yiβ0β1xi)2\text{RSS} = \sum_i (y_i - \beta_0 - \beta_1 x_i)^2. Taking /β0=0\partial/\partial\beta_0 = 0 gives β^0=yˉβ^1xˉ\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}. Taking /β1=0\partial/\partial\beta_1 = 0 and substituting gives β^1=(xixˉ)(yiyˉ)(xixˉ)2=Cov(x,y)/Var(x)\hat{\beta}_1 = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2} = \text{Cov}(x,y)/\text{Var}(x). This equals the covariance of xx and yy divided by the variance of xx - how much yy varies with xx relative to how much xx varies. Under the Gauss-Markov assumptions, OLS is BLUE (Best Linear Unbiased Estimator).

Q2: What is the difference between R2R^2 and Adjusted R2R^2? When should you use each?

R2=1RSS/TSSR^2 = 1 - \text{RSS}/\text{TSS} measures the fraction of variance explained. It never decreases when adding predictors, even irrelevant ones. Adjusted R2=1(1R2)(n1)/(np1)R^2 = 1 - (1-R^2)(n-1)/(n-p-1) penalises for the number of parameters pp. It can decrease when adding uninformative features. Use R2R^2 to measure a fixed model's explanatory power. Use adjusted R2R^2 to compare models with different numbers of features. For model selection, AIC and BIC are often better than adjusted R2R^2.

Q3: Why is logistic regression called "regression" if it does classification?

Logistic regression models the log-odds of the outcome as a linear function: log(p/(1p))=xβ\log(p/(1-p)) = \mathbf{x}^\top\boldsymbol{\beta}. It is a generalised linear model (GLM) where the link function is the logit. The "regression" refers to the linear model in log-odds space. The predicted probability is p^=σ(xβ)\hat{p} = \sigma(\mathbf{x}^\top\boldsymbol{\beta}), and classification is done by thresholding this probability. The decision boundary is linear in the original feature space. It is "regression" because it regresses the log-odds on the features.

Q4: What is the statistical interpretation of Ridge and Lasso regularisation?

Ridge (L2) regularisation is MAP estimation with a Gaussian prior βjN(0,1/λ)\beta_j \sim \mathcal{N}(0, 1/\lambda). The penalty λβ22\lambda\|\beta\|_2^2 is the negative log of this Gaussian prior. Ridge shrinks all coefficients but keeps all features. Lasso (L1) regularisation is MAP with a Laplace prior βjLaplace(0,1/λ)\beta_j \sim \text{Laplace}(0, 1/\lambda). The Laplace distribution has a sharp peak at zero, which promotes sparsity - Lasso sets some coefficients exactly to zero, performing feature selection. The L1 ball's corners at the axes cause this sparsity geometrically.

Q5: What are the Gauss-Markov assumptions, and what happens when they are violated in ML settings?

The Gauss-Markov assumptions for OLS to be BLUE are: (1) Linearity - the true model is linear in parameters; (2) Exogeneity - E[ϵX]=0\mathbb{E}[\epsilon|X] = 0 (errors are uncorrelated with features); (3) Spherical errors - Cov(ϵ)=σ2I\text{Cov}(\epsilon) = \sigma^2 I (homoscedastic, uncorrelated errors); (4) No perfect multicollinearity. Violations in ML: if errors are heteroscedastic (variance changes with xx), coefficient estimates are still unbiased but confidence intervals are wrong - use robust standard errors. If features are correlated with errors (omitted variable bias), OLS is biased - this is the causal inference problem. If features are collinear, XX\mathbf{X}^\top\mathbf{X} becomes ill-conditioned - use Ridge to stabilise.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Regression Explorer demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.