Regression Analysis: The Statistical Foundation of ML Models

Reading time: ~45 min | Interview relevance: Very High | Target roles: MLE, Data Scientist, Research Engineer, AI Engineer

The Production Scenario

A junior ML engineer replaces your carefully tuned neural network with a simple linear regression and achieves nearly identical performance on your tabular sales forecasting task. A more senior engineer explains: "Linear models are often competitive on tabular data with good features, and they are much more interpretable."

But do you know why linear regression works at all? Can you derive its solution from first principles? Do you understand when it will fail? Do you know why adding an $L^2$ penalty to the loss changes the statistical properties of the estimator?

These are not just academic questions. Understanding regression deeply gives you the statistical intuition to reason about all of ML: neural networks are nonlinear regression, logistic regression is the building block of classification, and the bias-variance tradeoff of regression estimators explains why regularisation is necessary.

Simple Linear Regression

The Model

$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2)$

We want to estimate $\beta_0$ (intercept) and $\beta_1$ (slope) from $n$ observations $\{(x_i, y_i)\}_{i=1}^n$ .

Statistical assumptions (GAUSS-MARKOV):

Linearity: $\mathbb{E}[y|x] = \beta_0 + \beta_1 x$
Independence: observations are independent
Homoscedasticity: $\text{Var}(\epsilon_i) = \sigma^2$ (constant variance)
Normality of errors: $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ (for inference)

OLS Derivation

Ordinary Least Squares (OLS) minimises the sum of squared residuals:

$\text{RSS}(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2$

Setting partial derivatives to zero:

$\frac{\partial \text{RSS}}{\partial \beta_0} = -2\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i) = 0$ $\Rightarrow \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \quad \text{...(1)}$

$\frac{\partial \text{RSS}}{\partial \beta_1} = -2\sum_{i=1}^n x_i(y_i - \beta_0 - \beta_1 x_i) = 0 \quad \text{...(2)}$

Substituting (1) into (2):

$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{\text{Cov}(x, y)}{\text{Var}(x)}$

This is the sample covariance over sample variance - the slope is how much $y$ changes per unit change in $x$ , standardised by the spread of $x$ .

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
n = 100
x = np.linspace(0, 10, n)
beta_0_true, beta_1_true, sigma_true = 2.0, 1.5, 1.5
y = beta_0_true + beta_1_true * x + np.random.normal(0, sigma_true, n)

# OLS estimation from scratch
x_bar, y_bar = np.mean(x), np.mean(y)
beta_1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar)**2)
beta_0_hat = y_bar - beta_1_hat * x_bar

print(f"True:      beta_0={beta_0_true}, beta_1={beta_1_true}")
print(f"Estimated: beta_0={beta_0_hat:.4f}, beta_1={beta_1_hat:.4f}")

# Residuals
y_hat = beta_0_hat + beta_1_hat * x
residuals = y - y_hat
rss = np.sum(residuals**2)
print(f"RSS: {rss:.4f}")

# Verify with numpy least squares
A = np.column_stack([np.ones(n), x])
coeffs, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
print(f"NumPy lstsq: beta_0={coeffs[0]:.4f}, beta_1={coeffs[1]:.4f}")

Multiple Linear Regression: Matrix Form

For $p$ predictors, the model is:

$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$

where $\mathbf{y} \in \mathbb{R}^n$ , $\mathbf{X} \in \mathbb{R}^{n \times (p+1)}$ (design matrix with intercept column), $\boldsymbol{\beta} \in \mathbb{R}^{p+1}$ .

OLS in Matrix Form

Minimise $\text{RSS} = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$

Taking the gradient and setting to zero:

$\nabla_\beta \text{RSS} = -2\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = 0$

$\mathbf{X}^\top\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^\top\mathbf{y}$

The Normal Equations. Solving:

$\boxed{\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}}$

This is the OLS closed-form solution - the direct solution to linear regression.

import numpy as np

def ols_fit(X, y):
    """OLS via normal equations. X should include intercept column."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def ols_fit_qr(X, y):
    """OLS via QR decomposition - numerically stable, preferred."""
    Q, R = np.linalg.qr(X)
    return np.linalg.solve(R, Q.T @ y)

np.random.seed(42)
n, p = 200, 5
X_raw = np.random.randn(n, p)
X = np.column_stack([np.ones(n), X_raw])  # add intercept

beta_true = np.array([1.0, 2.0, -1.5, 0.8, 0.3, -0.7])
y = X @ beta_true + np.random.normal(0, 0.5, n)

beta_ols = ols_fit(X, y)
beta_qr  = ols_fit_qr(X, y)

print(f"True beta:  {beta_true}")
print(f"OLS (normal eq): {beta_ols.round(4)}")
print(f"OLS (QR):        {beta_qr.round(4)}")

# Standard errors using statsmodels
import statsmodels.api as sm
model = sm.OLS(y, X).fit()
print(f"\nStatsmodels summary:")
print(model.summary().tables[1])

Properties of the OLS Estimator

Under the Gauss-Markov assumptions, OLS is the Best Linear Unbiased Estimator (BLUE):

Unbiased: $\mathbb{E}[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}$
Variance: $\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$
Best: No other linear unbiased estimator has lower variance (Gauss-Markov theorem)

The covariance matrix $(\mathbf{X}^\top\mathbf{X})^{-1}$ reveals the precision of estimates - diagonal elements give the variance of each coefficient, off-diagonals show correlations between estimates.

Goodness of Fit: $R^2$ and Adjusted $R^2$

$R^2$ - Coefficient of Determination

$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$

$R^2$ is the fraction of variance in $y$ explained by the model. $R^2 = 0$ means the model is no better than predicting the mean; $R^2 = 1$ means perfect fit.

Problem: $R^2$ never decreases when you add more predictors, even useless ones. This can mislead you into thinking more features always help.

Adjusted $R^2$

Penalises for the number of predictors:

$\bar{R}^2 = 1 - \frac{\text{RSS}/(n-p-1)}{\text{TSS}/(n-1)} = 1 - (1-R^2)\frac{n-1}{n-p-1}$

Adjusted $R^2$ can decrease when you add uninformative features - it is a better model selection criterion than raw $R^2$ .

import numpy as np

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred)**2)
    ss_tot = np.sum((y_true - np.mean(y_true))**2)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(y_true, y_pred, n_features):
    n = len(y_true)
    r2 = r_squared(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

np.random.seed(42)
n = 100
X_true = np.random.randn(n, 3)
X_noise = np.random.randn(n, 20)  # 20 useless features

beta_true = np.array([1.0, -0.5, 0.8])
y = X_true @ beta_true + np.random.normal(0, 1.0, n)

# Model with just the true features
X_good = np.column_stack([np.ones(n), X_true])
beta_good, _, _, _ = np.linalg.lstsq(X_good, y, rcond=None)
y_hat_good = X_good @ beta_good

# Model with true features + noise
X_all = np.column_stack([np.ones(n), X_true, X_noise])
beta_all, _, _, _ = np.linalg.lstsq(X_all, y, rcond=None)
y_hat_all = X_all @ beta_all

r2_good = r_squared(y, y_hat_good)
r2_all  = r_squared(y, y_hat_all)
adj_r2_good = adjusted_r_squared(y, y_hat_good, 3)
adj_r2_all  = adjusted_r_squared(y, y_hat_all, 23)

print(f"Model with 3 true features:  R² = {r2_good:.4f}, Adj R² = {adj_r2_good:.4f}")
print(f"Model with 23 features:      R² = {r2_all:.4f}, Adj R² = {adj_r2_all:.4f}")
print("R² increases with more features; Adjusted R² reveals the correct model")

Residual Diagnostics

Good regression requires checking that assumptions are met. The residuals $\hat{\epsilon}_i = y_i - \hat{y}_i$ should be:

Normally distributed (for valid inference)
Homoscedastic (constant variance)
Independent (no patterns)

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

def regression_diagnostics(X, y, title=""):
    """
    Four standard regression diagnostic plots.
    """
    # Fit model
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta
    residuals = y - y_hat
    n, p = X.shape

    # Standardised residuals
    mse = np.sum(residuals**2) / (n - p)
    leverage = np.diag(X @ np.linalg.pinv(X.T @ X) @ X.T)
    std_resid = residuals / np.sqrt(mse * (1 - leverage))

    print(f"Regression Diagnostics: {title}")

    # Shapiro-Wilk normality test
    stat, p_shapiro = stats.shapiro(residuals)
    print(f"Shapiro-Wilk normality test: p={p_shapiro:.4f} "
          f"({'Normal' if p_shapiro > 0.05 else 'Non-normal'})")

    # Breusch-Pagan heteroscedasticity test (simplified)
    fitted_vals = y_hat - np.mean(y_hat)
    resid_sq = residuals**2
    _, p_bp = stats.pearsonr(fitted_vals, resid_sq)
    print(f"Heteroscedasticity (|corr| of fitted vs resid²): |r|={abs(p_bp):.4f}")

    return residuals, std_resid

np.random.seed(42)
n = 100
x = np.random.randn(n)
y_good = 2 + 1.5*x + np.random.normal(0, 1, n)  # well-behaved

X = np.column_stack([np.ones(n), x])
diagnostics = regression_diagnostics(X, y_good, "Good model")

Logistic Regression as a Generalised Linear Model

Linear regression assumes continuous outputs with Gaussian noise. Logistic regression extends this to binary outcomes by modelling the probability of class membership.

The Logistic Model

Instead of modelling $y$ directly, we model $P(y=1|x)$ :

$P(y=1 | \mathbf{x}) = \sigma(\mathbf{x}^\top\boldsymbol{\beta}) = \frac{1}{1 + e^{-\mathbf{x}^\top\boldsymbol{\beta}}}$

where $\sigma$ is the sigmoid function. Taking the log-odds (logit transform):

$\log\frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})} = \mathbf{x}^\top\boldsymbol{\beta}$

The model is linear in the log-odds space. This makes logistic regression a linear model - it only creates a linear decision boundary.

Why the Sigmoid? A GLM Perspective

Logistic regression belongs to the Generalised Linear Model (GLM) family:

$g(\mathbb{E}[y|x]) = \mathbf{x}^\top\boldsymbol{\beta}$

where $g$ is the link function. For logistic regression, $g$ is the logit function. For linear regression, $g$ is the identity. For Poisson regression (count data), $g$ is the log.

Model	Distribution	Link Function	Use Case
Linear regression	Gaussian	Identity	Continuous outputs
Logistic regression	Bernoulli	Logit	Binary classification
Poisson regression	Poisson	Log	Count data
Softmax regression	Categorical	Softmax	Multi-class

Fitting via MLE (Newton-Raphson)

Logistic regression has no closed-form solution - we use iterative optimisation. The log-likelihood:

$\ell(\boldsymbol{\beta}) = \sum_{i=1}^n \left[y_i \log\hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\right]$

The gradient:

$\nabla_\beta \ell = \mathbf{X}^\top(\mathbf{y} - \hat{\mathbf{p}})$

The Hessian (used in Newton-Raphson):

$\mathbf{H} = -\mathbf{X}^\top \mathbf{W} \mathbf{X}$

where $\mathbf{W} = \text{diag}(\hat{p}_i(1-\hat{p}_i))$ is the weight matrix.

import numpy as np
from scipy.special import expit  # numerically stable sigmoid

def logistic_regression_fit(X, y, n_iter=100, tol=1e-6):
    """
    Logistic regression via Newton-Raphson (IRLS).
    This is what sklearn LogisticRegression uses internally.
    """
    n, p = X.shape
    beta = np.zeros(p)

    for iteration in range(n_iter):
        # Forward pass
        p_hat = expit(X @ beta)

        # Gradient of log-likelihood
        gradient = X.T @ (y - p_hat)

        # Hessian (negative definite)
        W = p_hat * (1 - p_hat)
        hessian = -(X.T * W) @ X  # X^T W X

        # Newton-Raphson update: beta = beta - H^{-1} * grad
        # Equivalent to: beta = beta + (X^T W X)^{-1} X^T (y - p_hat)
        delta = np.linalg.solve(-hessian, gradient)
        beta += delta

        if np.max(np.abs(delta)) < tol:
            print(f"Converged at iteration {iteration+1}")
            break

    return beta

# Generate binary classification data
np.random.seed(42)
n = 300
X_raw = np.random.randn(n, 3)
X = np.column_stack([np.ones(n), X_raw])
beta_true = np.array([0.5, 1.2, -0.8, 0.6])
p_true = expit(X @ beta_true)
y = (np.random.rand(n) < p_true).astype(float)

beta_fit = logistic_regression_fit(X, y)
print(f"True beta:     {beta_true}")
print(f"Fitted beta:   {beta_fit.round(4)}")

# Verify with sklearn
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(fit_intercept=False, C=1e10, max_iter=1000)  # no regularisation
lr.fit(X, y)
print(f"Sklearn beta:  {lr.coef_[0].round(4)}")

:::tip ML Engineering Connection Understanding logistic regression as a GLM with a Bernoulli likelihood tells you why the sigmoid output of a neural network's final layer is meaningful as a probability - it is the canonical link function for binary outcomes. A binary classifier's output layer (linear + sigmoid + cross-entropy loss) is literally logistic regression applied to the neural network's penultimate feature representation. :::

Ridge Regression: Statistical Perspective

Ridge regression (L2 regularisation) modifies the OLS objective:

$\hat{\boldsymbol{\beta}}_{\text{Ridge}} = \arg\min_\beta \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda\|\boldsymbol{\beta}\|^2$

The closed-form solution:

$\hat{\boldsymbol{\beta}}_{\text{Ridge}} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$

Adding $\lambda\mathbf{I}$ to $\mathbf{X}^\top\mathbf{X}$ before inverting:

Makes the problem well-conditioned even when features are collinear
Shrinks coefficients toward zero - introduces bias but reduces variance
Is MAP estimation with a Gaussian prior $\beta \sim \mathcal{N}(0, \sigma^2/\lambda \cdot I)$

import numpy as np
import matplotlib.pyplot as plt

def ridge_regression(X, y, lambda_reg):
    """Ridge regression: (X^T X + lambda I)^{-1} X^T y"""
    p = X.shape[1]
    A = X.T @ X + lambda_reg * np.eye(p)
    b = X.T @ y
    return np.linalg.solve(A, b)

# Illustrate coefficient shrinkage as lambda increases
np.random.seed(42)
n, p = 100, 10
X = np.random.randn(n, p)
X = np.column_stack([np.ones(n), X])
beta_true = np.array([1.0, 2.0, -1.5, 0.8, 0.3, -0.7, 1.2, -0.4, 0.9, -1.1, 0.6])
y = X @ beta_true + np.random.normal(0, 1.0, n)

lambdas = np.logspace(-3, 3, 100)
coef_paths = np.array([ridge_regression(X, y, l) for l in lambdas])

print("Ridge coefficient shrinkage:")
print(f"{'Lambda':>10} | {'||beta||':>10} | {'Train R²':>10}")
print("-" * 38)
for lam in [0.001, 0.1, 1.0, 10.0, 100.0]:
    beta = ridge_regression(X, y, lam)
    y_hat = X @ beta
    r2 = 1 - np.sum((y - y_hat)**2) / np.sum((y - np.mean(y))**2)
    print(f"{lam:>10.3f} | {np.linalg.norm(beta):>10.4f} | {r2:>10.4f}")

Lasso Regression: Sparsity-Inducing

Lasso (Least Absolute Shrinkage and Selection Operator) uses L1 regularisation:

$\hat{\boldsymbol{\beta}}_{\text{Lasso}} = \arg\min_\beta \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda\|\boldsymbol{\beta}\|_1$

No closed-form solution (L1 norm is not differentiable at 0). Solved with coordinate descent or subgradient methods.

Key difference from Ridge: Lasso sets some coefficients exactly to zero - it performs automatic feature selection. This comes from the geometry of the L1 ball, which has corners at the axes.

from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Compare OLS, Ridge, and Lasso
np.random.seed(42)
n, p_true, p_noise = 100, 5, 45
X_raw = np.random.randn(n, p_true + p_noise)
X = np.column_stack([np.ones(n), X_raw])
beta_sparse = np.zeros(p_true + p_noise + 1)
beta_sparse[:6] = [0.5, 2.0, -1.5, 1.0, 0.8, -1.2]  # only first 5 features matter
y = X @ beta_sparse + np.random.normal(0, 0.5, n)

X_feat = X_raw  # without intercept for sklearn

# Fit models
from sklearn.linear_model import LinearRegression
ols = LinearRegression().fit(X_feat, y)
ridge = Ridge(alpha=1.0).fit(X_feat, y)
lasso = Lasso(alpha=0.1, max_iter=10000).fit(X_feat, y)

n_nonzero_ols   = np.sum(np.abs(ols.coef_) > 1e-10)
n_nonzero_ridge = np.sum(np.abs(ridge.coef_) > 1e-10)
n_nonzero_lasso = np.sum(np.abs(lasso.coef_) > 1e-10)

print(f"True non-zero features: 5")
print(f"OLS   non-zero coefs: {n_nonzero_ols}")
print(f"Ridge non-zero coefs: {n_nonzero_ridge}")
print(f"Lasso non-zero coefs: {n_nonzero_lasso}  <- Lasso selects features!")

Ridge vs Lasso vs ElasticNet

Method	Penalty	Solution	Feature Selection	Use Case
OLS	None	Closed-form	No	Low-dim, no collinearity
Ridge	$\lambda\\|\beta\\|_2^2$	Closed-form	No (shrinks, not zeros)	Collinear features
Lasso	$\lambda\\|\beta\\|_1$	Iterative	Yes (exact zeros)	Feature selection needed
ElasticNet	$\lambda_1\\|\beta\\|_1 + \lambda_2\\|\beta\\|_2^2$	Iterative	Yes	Groups of correlated features

The Geometric View

       L2 ball (Ridge):    L1 ball (Lasso):
          smooth            corners at axes

         ┌──────┐               /\
        /        \             /  \
       │    ●    │            /    \
        \        /           ◄  ●  ►
         └──────┘             \    /
                               \  /
                                \/

Minimum of (OLS loss) on the constraint set:
- Ridge: tangent point is rarely at axis → rarely zero coefficients
- Lasso: tangent often hits a corner → exact zeros → sparsity

Complete Regression Pipeline with Statsmodels

import numpy as np
import statsmodels.api as sm

np.random.seed(42)
n = 200
x1 = np.random.randn(n)
x2 = np.random.randn(n)
y = 1.5 + 2.0*x1 - 0.8*x2 + np.random.normal(0, 1.0, n)

X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.OLS(y, X).fit()

print(model.summary())

# Key statistics to understand:
print(f"\nR-squared:          {model.rsquared:.4f}")
print(f"Adjusted R-squared: {model.rsquared_adj:.4f}")
print(f"F-statistic:        {model.fvalue:.4f}  (p = {model.f_pvalue:.6f})")
print(f"\nCoefficients:")
for name, coef, se, t, p in zip(
    ['const', 'x1', 'x2'],
    model.params, model.bse, model.tvalues, model.pvalues
):
    sig = "***" if p < 0.001 else "**" if p < 0.01 else "*" if p < 0.05 else ""
    print(f"  {name:6s}: {coef:7.4f}  SE={se:.4f}  t={t:7.3f}  p={p:.4f}  {sig}")

Interview Q&A

Q1: Derive the OLS estimator for simple linear regression.

We minimise $\text{RSS} = \sum_i (y_i - \beta_0 - \beta_1 x_i)^2$ . Taking $\partial/\partial\beta_0 = 0$ gives $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$ . Taking $\partial/\partial\beta_1 = 0$ and substituting gives $\hat{\beta}_1 = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2} = \text{Cov}(x,y)/\text{Var}(x)$ . This equals the covariance of $x$ and $y$ divided by the variance of $x$ - how much $y$ varies with $x$ relative to how much $x$ varies. Under the Gauss-Markov assumptions, OLS is BLUE (Best Linear Unbiased Estimator).

Q2: What is the difference between $R^2$ and Adjusted $R^2$ ? When should you use each?

$R^2 = 1 - \text{RSS}/\text{TSS}$ measures the fraction of variance explained. It never decreases when adding predictors, even irrelevant ones. Adjusted $R^2 = 1 - (1-R^2)(n-1)/(n-p-1)$ penalises for the number of parameters $p$ . It can decrease when adding uninformative features. Use $R^2$ to measure a fixed model's explanatory power. Use adjusted $R^2$ to compare models with different numbers of features. For model selection, AIC and BIC are often better than adjusted $R^2$ .

Q3: Why is logistic regression called "regression" if it does classification?

Logistic regression models the log-odds of the outcome as a linear function: $\log(p/(1-p)) = \mathbf{x}^\top\boldsymbol{\beta}$ . It is a generalised linear model (GLM) where the link function is the logit. The "regression" refers to the linear model in log-odds space. The predicted probability is $\hat{p} = \sigma(\mathbf{x}^\top\boldsymbol{\beta})$ , and classification is done by thresholding this probability. The decision boundary is linear in the original feature space. It is "regression" because it regresses the log-odds on the features.

Q4: What is the statistical interpretation of Ridge and Lasso regularisation?

Ridge (L2) regularisation is MAP estimation with a Gaussian prior $\beta_j \sim \mathcal{N}(0, 1/\lambda)$ . The penalty $\lambda\|\beta\|_2^2$ is the negative log of this Gaussian prior. Ridge shrinks all coefficients but keeps all features. Lasso (L1) regularisation is MAP with a Laplace prior $\beta_j \sim \text{Laplace}(0, 1/\lambda)$ . The Laplace distribution has a sharp peak at zero, which promotes sparsity - Lasso sets some coefficients exactly to zero, performing feature selection. The L1 ball's corners at the axes cause this sparsity geometrically.

Q5: What are the Gauss-Markov assumptions, and what happens when they are violated in ML settings?

The Gauss-Markov assumptions for OLS to be BLUE are: (1) Linearity - the true model is linear in parameters; (2) Exogeneity - $\mathbb{E}[\epsilon|X] = 0$ (errors are uncorrelated with features); (3) Spherical errors - $\text{Cov}(\epsilon) = \sigma^2 I$ (homoscedastic, uncorrelated errors); (4) No perfect multicollinearity. Violations in ML: if errors are heteroscedastic (variance changes with $x$ ), coefficient estimates are still unbiased but confidence intervals are wrong - use robust standard errors. If features are correlated with errors (omitted variable bias), OLS is biased - this is the causal inference problem. If features are collinear, $\mathbf{X}^\top\mathbf{X}$ becomes ill-conditioned - use Ridge to stabilise.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Regression Explorer demo on the EngineersOfAI Playground - no code required.

:::

The Production Scenario​

Simple Linear Regression​

The Model​

OLS Derivation​

Multiple Linear Regression: Matrix Form​

OLS in Matrix Form​

Properties of the OLS Estimator​

Goodness of Fit: R2R^2R2 and Adjusted R2R^2R2​

R2R^2R2 - Coefficient of Determination​

Adjusted R2R^2R2​

Residual Diagnostics​

Logistic Regression as a Generalised Linear Model​

The Logistic Model​

Why the Sigmoid? A GLM Perspective​

Fitting via MLE (Newton-Raphson)​

Ridge Regression: Statistical Perspective​

Lasso Regression: Sparsity-Inducing​

Ridge vs Lasso vs ElasticNet​

The Geometric View​

Complete Regression Pipeline with Statsmodels​

Interview Q&A​