What is generalized linear models?

Understand the GLM framework - link functions, exponential family distributions, Poisson regression for count data, Gamma regression for positive continuous targets, IRLS algorithm, overdispersion, and deviance-based model comparison.

How does GLM work in practice?

Generalized Linear Models covers generalized linear models, GLM, Poisson regression from first principles with code examples. Free lesson at https://engineersofai.com/docs/ml/linear-models/generalized-linear-models

What is the difference between generalized linear models and Poisson regression?

See the full breakdown at https://engineersofai.com/docs/ml/linear-models/generalized-linear-models

Generalized Linear Models

The Real Interview Moment

The senior ML engineer leans back in her chair and asks: "Our revenue team predicted churn using linear regression. It worked okay, but predicted probabilities sometimes go above 1. They used logistic regression instead and that fixed it. Now the demand team predicts hourly ride counts and they get negative predictions. Logistic regression won't work here either. What framework unifies both problems?"

You answer: Generalized Linear Models.

The interviewer nods and follows up: "Great. Now Poisson regression is showing that our model's residual variance is nine times the mean. What does that tell you, and what do you do?"

This is a working knowledge question. The answer involves overdispersion, its causes, its consequences for inference, and the fix. Getting this right separates engineers who understand statistical machinery from those who just call .fit() and check accuracy.

Linear regression is a special case of a much larger family. The GLM framework - developed by Nelder and Wedderburn in 1972 - gives you a principled way to handle any response type: binary outcomes, counts, strictly positive measurements, proportions, or arbitrary continuous data. The key ingredients are always the same: choose the right distribution for your response, choose the right link function to connect your linear predictor to the response mean, and the IRLS algorithm handles the rest. Every model you already know - OLS, logistic regression, Poisson regression - is a GLM. Once you understand the framework, extending it to new response types costs nothing beyond identifying the correct exponential family member.

This lesson builds the GLM framework from first principles. It derives the exponential family form, explains the three GLM components, implements Poisson and Gamma regression from scratch and with statsmodels, covers overdispersion diagnosis and remediation, walks through the IRLS algorithm, and ends with a complete model-selection workflow using deviance and AIC.

Why This Exists - The Failure of Linear Regression

Ordinary linear regression assumes:

$y_i = \mathbf{w}^\top\mathbf{x}_i + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2)$

This works well when the response is continuous and unbounded. It fails systematically when:

Binary response ( $y \in \{0,1\}$ ): predictions can fall outside $[0,1]$ ; Gaussian residuals make no sense for binary data
Count response ( $y \in \{0, 1, 2, \ldots\}$ ): predictions can be negative; variance grows with the mean (not constant)
Positive continuous response ( $y > 0$ ): predictions can be negative; right-skewed distributions are common (costs, durations, sizes)
Proportional response ( $y \in [0,1]$ ): predictions can exit the unit interval

The ad hoc fix is transforming the target (log-transform for positive data, logit-transform for proportions). But transforming the target changes the noise structure: if $\log(y) \sim \mathcal{N}(\eta, \sigma^2)$ , then predictions of $\log(y)$ are unbiased, but back-transformed predictions $e^{\hat{\eta}}$ are biased for $E[y]$ (by Jensen's inequality). GLMs avoid this by modeling the mean directly through a link function, without transforming the response.

Historical Context

Year	Event
1972	Nelder & Wedderburn publish "Generalized Linear Models" in JRSS-A
1972	IRLS algorithm formalized for GLM fitting
1983	McCullagh & Nelder write the canonical GLM textbook
1984	R's GLM function becomes the reference implementation
1986	Logistic and Poisson regression widely adopted in epidemiology and biostats
1995	Negative Binomial extended to GLM framework for overdispersed counts
2000s	Statsmodels (Python) implements complete GLM suite
2010s	GLMs embedded in tree models (LightGBM, XGBoost support Poisson/Gamma objectives)

Nelder's central insight: all common regression models share the same estimation algorithm (IRLS) when expressed in exponential family form. Rather than deriving a custom algorithm for each distribution, one algorithm handles all of them.

The Exponential Family

Many probability distributions belong to the exponential family, which can be written in the form:

$P(y; \theta, \phi) = \exp\left(\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right)$

where:

$\theta$ is the natural (canonical) parameter
$\phi$ is the dispersion parameter (controls spread)
$b(\theta)$ is the cumulant function (log-normalizer, ensures valid distribution)
$a(\phi)$ is a dispersion scaling factor (usually $\phi/w_i$ for observation weight $w_i = 1$ )
$c(y, \phi)$ is a normalizing function of the data

Two properties link the cumulant function to the distribution moments:

$\mathbb{E}[y] = \mu = b'(\theta), \qquad \text{Var}(y) = a(\phi) \cdot b''(\theta) = a(\phi) \cdot V(\mu)$

where $V(\mu) = b''(\theta)$ is the variance function - it describes how variance changes with the mean.

Exponential Family Members

Gaussian (Linear Regression):

$P(y|\mu) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)$

Natural parameter: $\theta = \mu$ , Cumulant: $b(\theta) = \theta^2/2$ , Variance function: $V(\mu) = 1$ (constant)

Bernoulli (Logistic Regression):

$P(y|\mu) = \mu^y(1-\mu)^{1-y}$

Natural parameter: $\theta = \log\frac{\mu}{1-\mu}$ (log-odds), Cumulant: $b(\theta) = \log(1 + e^\theta)$ , Variance function: $V(\mu) = \mu(1-\mu)$

Poisson (Count Regression):

$P(y|\mu) = \frac{e^{-\mu}\mu^y}{y!}$

Natural parameter: $\theta = \log\mu$ , Cumulant: $b(\theta) = e^\theta$ , Variance function: $V(\mu) = \mu$ (equidispersion: mean = variance)

Gamma (Positive Continuous):

$P(y|\mu, \phi) = \frac{1}{\Gamma(1/\phi)}\left(\frac{y}{\phi\mu}\right)^{1/\phi}\frac{1}{y}\exp\left(-\frac{y}{\phi\mu}\right)$

Natural parameter: $\theta = -1/\mu$ , Cumulant: $b(\theta) = -\log(-\theta)$ , Variance function: $V(\mu) = \mu^2$ (coefficient of variation is constant)

The Three GLM Components

A Generalized Linear Model consists of three choices:

Component 1: Random Component (Distribution)

The response $y_i$ follows a distribution from the exponential family. This choice determines:

The support of $y$ (what values are possible)
The variance function $V(\mu)$ (how variance scales with the mean)
The appropriate range for predictions

Component 2: Systematic Component (Linear Predictor)

The linear predictor is always:

$\eta_i = \mathbf{w}^\top\mathbf{x}_i = w_0 + w_1 x_{i1} + \ldots + w_d x_{id}$

GLMs remain linear in parameters - the systematic component is the same as OLS. This is what makes them tractable and interpretable.

Component 3: Link Function

The link function $g$ connects the mean $\mu_i = \mathbb{E}[y_i | \mathbf{x}_i]$ to the linear predictor:

$g(\mu_i) = \eta_i = \mathbf{w}^\top\mathbf{x}_i$

Equivalently: $\mu_i = g^{-1}(\mathbf{w}^\top\mathbf{x}_i)$

The canonical link is the link that makes $\eta = \theta$ (natural parameter = linear predictor). Using canonical links gives the cleanest score equations and simplest IRLS weights.

Distribution	Support	Canonical Link $g(\mu)$	Inverse $g^{-1}(\eta)$	$V(\mu)$	Model name
Gaussian	$(-\infty, +\infty)$	Identity: $\mu$	$\eta$	1	Linear regression
Bernoulli	$\{0,1\}$	Logit: $\log\frac{\mu}{1-\mu}$	$\sigma(\eta) = \frac{1}{1+e^{-\eta}}$	$\mu(1-\mu)$	Logistic regression
Poisson	$\{0,1,2,\ldots\}$	Log: $\log\mu$	$e^\eta$	$\mu$	Poisson regression
Gamma	$(0,+\infty)$	Inverse: $1/\mu$	$1/\eta$	$\mu^2$	Gamma regression
Neg. Binomial	$\{0,1,2,\ldots\}$	Log: $\log\mu$	$e^\eta$	$\mu + \alpha\mu^2$	NB regression

The GLM Gradient: A Unified Formula

A crucial property of exponential families: the gradient of the log-likelihood has the same structure for all GLMs:

$\frac{\partial\ell}{\partial\mathbf{w}} = \sum_{i=1}^n \frac{(y_i - \mu_i)}{V(\mu_i) \cdot [g'(\mu_i)]^2} \cdot \frac{\partial\mu_i}{\partial\eta_i} \cdot \mathbf{x}_i = \mathbf{X}^\top \mathbf{D}(\mathbf{y} - \boldsymbol{\mu})$

where $\mathbf{D}$ is a diagonal matrix of weights. The gradient is always a weighted sum of residuals $(y_i - \mu_i)$ times features $\mathbf{x}_i$ - the same structural form as OLS gradient, generalized.

For canonical links, this simplifies to:

$\frac{\partial\ell}{\partial\mathbf{w}} = \mathbf{X}^\top(\mathbf{y} - \boldsymbol{\mu})$

For Poisson (log link, canonical): $\nabla_\mathbf{w}\ell = \mathbf{X}^\top(\mathbf{y} - \hat{\boldsymbol{\mu}})$ where $\hat{\mu}_i = e^{\mathbf{w}^\top\mathbf{x}_i}$ .

IRLS: The GLM Fitting Algorithm

GLMs are not fit by standard gradient descent. Instead, Newton-Raphson applied to the log-likelihood leads to Iteratively Reweighted Least Squares (IRLS).

Derivation

The Newton-Raphson update for maximizing $\ell(\mathbf{w})$ :

$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \left[\nabla^2_\mathbf{w}\ell(\mathbf{w}^{(t)})\right]^{-1}\nabla_\mathbf{w}\ell(\mathbf{w}^{(t)})$

The Hessian of the GLM log-likelihood equals:

$\nabla^2_\mathbf{w}\ell = -\mathbf{X}^\top \mathbf{W} \mathbf{X}$

where $\mathbf{W} = \text{diag}(w_1, \ldots, w_n)$ with working weights:

$w_i = \frac{1}{V(\mu_i) \cdot [g'(\mu_i)]^2}$

The Newton step becomes:

$\mathbf{w}^{(t+1)} = \left(\mathbf{X}^\top\mathbf{W}^{(t)}\mathbf{X}\right)^{-1}\mathbf{X}^\top\mathbf{W}^{(t)}\mathbf{z}^{(t)}$

where $\mathbf{z}^{(t)}$ is the adjusted dependent variable (working response):

$z_i^{(t)} = \hat{\eta}_i^{(t)} + (y_i - \hat{\mu}_i^{(t)}) \cdot g'(\hat{\mu}_i^{(t)})$

This is exactly a weighted least squares problem with response $\mathbf{z}^{(t)}$ and weights $\mathbf{W}^{(t)}$ - solved at each iteration, hence "iteratively reweighted least squares."

IRLS vs Gradient Descent

Property	IRLS	Gradient Descent
Convergence	Quadratic (Newton-Raphson)	Linear
Iterations to convergence	~5–10	Hundreds to thousands
Per-iteration cost	$O(nd^2 + d^3)$ (matrix solve)	$O(nd)$
Memory	$O(d^2)$ for Gram matrix	$O(d)$
Best for	Small-medium $n$ , well-conditioned	Very large $n$ , high $d$
Used by	statsmodels, R `glm()`, sklearn GLM	Online/streaming settings

For small-to-medium datasets (n ≤ 100k, d ≤ 10k), IRLS is dramatically faster due to quadratic convergence. For large-scale problems, gradient descent variants (Adam, Adagrad) are preferred.

import numpy as np
from scipy.special import expit  # sigmoid


class PoissonGLMFromScratch:
    """
    Poisson GLM with log link, fit by IRLS.
    Model: log(mu_i) = w'x_i  =>  mu_i = exp(w'x_i)
    Variance function: V(mu) = mu
    Link derivative: g'(mu) = 1/mu  (log link)
    Working weight: w_i = 1 / (V(mu_i) * g'(mu_i)^2) = mu_i
    Working response: z_i = eta_i + (y_i - mu_i) / mu_i
    """

    def __init__(self, max_iter: int = 25, tol: float = 1e-8):
        self.max_iter = max_iter
        self.tol      = tol
        self.w_       = None
        self.ll_history_ = []

    def _log_likelihood(self, y: np.ndarray, mu: np.ndarray) -> float:
        # Poisson NLL: sum[ -mu + y*log(mu) ] (ignore log(y!) constant)
        return np.sum(-mu + y * np.log(np.clip(mu, 1e-12, None)))

    def fit(self, X: np.ndarray, y: np.ndarray, verbose: bool = True):
        n, d  = X.shape
        Xa    = np.column_stack([np.ones(n), X])    # add intercept

        # Initialization: start from eta = log(y_bar)
        eta   = np.full(n, np.log(np.mean(y) + 1e-6))
        mu    = np.exp(eta)
        w     = np.linalg.lstsq(Xa, eta, rcond=None)[0]

        for iteration in range(self.max_iter):
            # --- Compute IRLS components ---
            # Working weight for Poisson + log link: W_i = mu_i
            W  = mu                         # shape (n,)

            # Working response: z_i = eta_i + (y_i - mu_i) * g'(mu_i)
            #   g'(mu) = d(log mu)/dmu = 1/mu
            z  = eta + (y - mu) / mu        # shape (n,)

            # --- Weighted least squares step ---
            # Solve: (Xa' W Xa) w = Xa' W z
            Xw = Xa * W[:, np.newaxis]      # weighted X
            A  = Xw.T @ Xa                  # (d+1, d+1)
            b  = Xw.T @ z                   # (d+1,)

            w_new = np.linalg.solve(A + 1e-10 * np.eye(d + 1), b)

            # --- Update eta and mu ---
            eta_new = Xa @ w_new
            mu_new  = np.exp(eta_new)       # inverse log link

            # --- Check convergence ---
            ll      = self._log_likelihood(y, mu_new)
            self.ll_history_.append(ll)

            delta_w = np.max(np.abs(w_new - w))
            if verbose and (iteration % 5 == 0 or iteration < 5):
                print(f"  Iter {iteration+1:3d}: log-lik = {ll:.4f}, "
                      f"max|Δw| = {delta_w:.2e}")

            w, eta, mu = w_new, eta_new, mu_new

            if delta_w < self.tol:
                if verbose:
                    print(f"  Converged at iteration {iteration+1}")
                break

        self.w_      = w
        self.w_coef_ = w[1:]
        self.intercept_ = w[0]
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        n  = X.shape[0]
        Xa = np.column_stack([np.ones(n), X])
        return np.exp(Xa @ self.w_)

    def deviance(self, X: np.ndarray, y: np.ndarray) -> float:
        mu   = self.predict(X)
        # Poisson deviance: 2 * sum[ y*log(y/mu) - (y - mu) ]
        safe = np.clip(mu, 1e-10, None)
        yc   = np.clip(y, 1e-10, None)
        return 2 * np.sum(y * np.log(yc / safe) - (y - safe))


# Demo
np.random.seed(42)
n  = 400
X_ = np.random.randn(n, 3)
w_true = np.array([0.5, -0.3, 0.7])
eta_true = 2.0 + X_ @ w_true
y_pois   = np.random.poisson(np.exp(eta_true))

print("IRLS Fitting Poisson GLM from scratch:")
poisson_glm = PoissonGLMFromScratch(max_iter=30, tol=1e-8)
poisson_glm.fit(X_, y_pois)

print(f"\nTrue intercept: 2.0,  Estimated: {poisson_glm.intercept_:.4f}")
print(f"True w: {w_true},  Estimated: {poisson_glm.w_coef_}")
print(f"Residual Deviance: {poisson_glm.deviance(X_, y_pois):.3f}")

Poisson Regression: Count Data

The Model

For count response $y_i \in \{0, 1, 2, \ldots\}$ :

$\mu_i = E[y_i | \mathbf{x}_i] = e^{\mathbf{w}^\top\mathbf{x}_i}$

The log link ensures $\mu_i > 0$ always. The loss function (Poisson negative log-likelihood):

$-\ell(\mathbf{w}) = \sum_{i=1}^n \left[e^{\mathbf{w}^\top\mathbf{x}_i} - y_i\mathbf{w}^\top\mathbf{x}_i\right] + \text{const}$

Coefficient interpretation: A one-unit increase in feature $j$ multiplies the expected count by $e^{w_j}$ (multiplicative, not additive). This is the key difference from linear regression.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf


def poisson_regression_demo():
    """
    Fit Poisson GLM to simulated ride demand data.
    Demonstrate coefficient interpretation, overdispersion check.
    """
    np.random.seed(42)
    n = 600

    # Features
    hour       = np.random.randint(0, 24, n)
    temp_c     = np.random.normal(18, 7, n)
    is_weekend = np.random.binomial(1, 0.3, n)
    rain       = np.random.binomial(1, 0.25, n)

    # True log-linear model
    hour_effect = 0.8 * np.sin(2 * np.pi * hour / 24)
    log_mu      = (2.8
                   + 0.025 * temp_c
                   + 0.3   * is_weekend
                   - 0.6   * rain
                   + hour_effect)
    y_rides = np.random.poisson(np.exp(log_mu))

    df = pd.DataFrame({
        'rides':      y_rides,
        'hour':       hour,
        'temp_c':     temp_c,
        'is_weekend': is_weekend,
        'rain':       rain,
        'hour_sin':   np.sin(2 * np.pi * hour / 24),
        'hour_cos':   np.cos(2 * np.pi * hour / 24),
    })

    # Fit Poisson GLM
    formula = 'rides ~ temp_c + is_weekend + rain + hour_sin + hour_cos'
    model   = smf.glm(formula, data=df,
                      family=sm.families.Poisson()).fit()

    print("Poisson GLM Summary:")
    print(model.summary())

    # Coefficient interpretation
    print("\nMultiplicative Effects (exp(coef)):")
    for name, coef in model.params.items():
        if name == 'Intercept':
            continue
        effect = np.exp(coef)
        pct    = (effect - 1) * 100
        sign   = "+" if pct >= 0 else ""
        print(f"  {name:15s}: e^{coef:6.3f} = {effect:.3f}  "
              f"({sign}{pct:.1f}% change in expected rides per unit)")

    # Overdispersion test
    mean_y = df['rides'].mean()
    var_y  = df['rides'].var()
    print(f"\nOverdispersion Check:")
    print(f"  Mean of y:          {mean_y:.3f}")
    print(f"  Variance of y:      {var_y:.3f}")
    print(f"  Variance/Mean:      {var_y/mean_y:.3f}")

    # Pearson chi-squared statistic / df
    mu_hat      = model.predict()
    pearson_chi2 = np.sum((df['rides'] - mu_hat)**2 / mu_hat)
    df_resid    = model.df_resid
    print(f"  Pearson chi2 / df:  {pearson_chi2/df_resid:.3f}"
          f"  (expect ≈1 if well-specified)")
    if pearson_chi2 / df_resid > 2:
        print("  WARNING: Possible overdispersion. Consider Negative Binomial.")

    return df, model


df_rides, poisson_model = poisson_regression_demo()

Overdispersion: When Poisson Fails

The Poisson distribution assumes equidispersion: $\text{Var}(y) = \mu$ . Real count data almost always exhibits overdispersion - variance exceeding the mean. Common causes:

Unobserved heterogeneity: Some individuals have systematically higher rates, not captured by features
Contagion / clustering: Events cluster (positive correlation within groups)
Zero inflation: Excess zeros from a mixture of structural zeros and Poisson process
Measurement aggregation: Counts measured over varying time periods

Detection

Three diagnostics for overdispersion in fitted Poisson models:

Raw variance/mean ratio: If observed $\text{Var}(y)/\bar{y} \gg 1$ , likely overdispersed
Pearson chi-squared: $\chi^2_P = \sum_i(y_i - \hat\mu_i)^2/\hat\mu_i$ . Under Poisson, $\chi^2_P/df \approx 1$ . Ratio $> 2$ is concerning.
Cameron-Trivedi test: Regress $(y_i - \hat\mu_i)^2 - y_i$ on $\hat\mu_i^2$ ; significant positive slope indicates overdispersion

The Negative Binomial Model

The Negative Binomial adds a per-observation random effect (Gamma mixing):

$y_i | \lambda_i \sim \text{Poisson}(\lambda_i), \quad \lambda_i \sim \text{Gamma}\left(\frac{1}{\alpha}, \frac{\mu_i}{\alpha}\right)$

Marginalizing over $\lambda_i$ gives:

$y_i | \mu_i, \alpha \sim \text{NegBin}(\mu_i, \alpha)$

with variance function:

$\text{Var}(y_i) = \mu_i + \alpha\mu_i^2$

The dispersion parameter $\alpha \geq 0$ : when $\alpha \to 0$ , NegBin → Poisson. The quadratic variance function accommodates arbitrarily large overdispersion by estimating $\alpha$ from data.

import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd


def overdispersion_demo():
    """
    Generate overdispersed count data, compare Poisson vs Negative Binomial.
    """
    np.random.seed(42)
    n     = 500
    X_    = np.random.randn(n, 2)
    w_true = np.array([0.4, -0.3])

    # Overdispersed: gamma mixing with shape=2 (alpha=0.5)
    alpha_true = 0.5
    mu_true    = np.exp(1.5 + X_ @ w_true)
    # Generate lambda from Gamma, then counts from Poisson(lambda)
    lam = np.random.gamma(shape=1/alpha_true, scale=mu_true * alpha_true)
    y   = np.random.poisson(lam)

    df_od = pd.DataFrame({'y': y, 'x1': X_[:, 0], 'x2': X_[:, 1]})

    # Dispersion check
    print(f"Overdispersion Diagnosis:")
    print(f"  Mean y: {y.mean():.2f},  Var y: {y.var():.2f}")
    print(f"  Raw Var/Mean ratio: {y.var()/y.mean():.2f}")

    # Fit Poisson
    pois_fit = smf.glm('y ~ x1 + x2', data=df_od,
                        family=sm.families.Poisson()).fit(disp=0)
    mu_pois  = pois_fit.predict()
    pearson  = np.sum((y - mu_pois)**2 / mu_pois)
    print(f"\nPoisson GLM:")
    print(f"  AIC:                    {pois_fit.aic:.2f}")
    print(f"  Residual Deviance:      {pois_fit.deviance:.2f}")
    print(f"  Pearson chi2 / df:      {pearson/pois_fit.df_resid:.3f}")

    # Fit Negative Binomial
    nb_fit   = smf.glm('y ~ x1 + x2', data=df_od,
                        family=sm.families.NegativeBinomial()).fit(disp=0)
    mu_nb    = nb_fit.predict()
    pearson_nb = np.sum((y - mu_nb)**2 / mu_nb)
    print(f"\nNegative Binomial GLM:")
    print(f"  AIC:                    {nb_fit.aic:.2f}")
    print(f"  Residual Deviance:      {nb_fit.deviance:.2f}")
    print(f"  Pearson chi2 / df:      {pearson_nb/nb_fit.df_resid:.3f}")
    print(f"  (Lower AIC + Pearson ≈ 1 confirms NegBin is the right model)")

    # Quasi-Poisson as alternative (inflate SEs, don't change point estimates)
    # In statsmodels, use Poisson with scale='X2' (Pearson chi2 dispersion)
    # quasi_fit = smf.glm('y ~ x1 + x2', data=df_od,
    #                      family=sm.families.Poisson()).fit(scale='X2')
    # This gives same coefficients as Poisson but wider confidence intervals

    return pois_fit, nb_fit


pois_fit, nb_fit = overdispersion_demo()

Gamma Regression: Positive Continuous Data

The Gamma distribution is appropriate for strictly positive, right-skewed continuous responses such as insurance claim amounts, trip durations, costs, or survival times.

Gamma GLM with log link (most common):

$\mu_i = E[y_i | \mathbf{x}_i] = e^{\mathbf{w}^\top\mathbf{x}_i}$

Variance function: $\text{Var}(y_i) = \phi\mu_i^2$ - variance grows quadratically with the mean. This means the coefficient of variation $\text{CV} = \sqrt{\text{Var}(y_i)}/\mu_i = \sqrt{\phi}$ is constant across observations - a property of many financial and engineering quantities.

The Gamma NLL (loss function) with log link:

$-\ell(\mathbf{w}) = \frac{1}{\phi}\sum_{i=1}^n \left[\frac{y_i}{\mu_i} + \log\mu_i\right] + \text{const}$

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt


def gamma_regression_demo():
    """
    Fit Gamma GLM to simulated insurance claim amounts.
    Compare vs OLS on raw vs log-transformed target.
    """
    np.random.seed(42)
    n = 500

    # Features: driver age, vehicle value, risk score
    age       = np.random.uniform(18, 75, n)
    veh_value = np.random.exponential(20000, n)         # in $
    risk_score = np.random.uniform(0, 10, n)

    # True Gamma model: log(E[claim]) = linear predictor
    log_mu = (7.0
              - 0.01 * age
              + 0.000005 * veh_value
              + 0.15 * risk_score)
    mu_true = np.exp(log_mu)

    # Gamma distributed claims: shape k=4 (phi=1/k=0.25)
    k = 4
    y_claims = np.random.gamma(k, mu_true / k)          # shape=k, scale=mu/k

    df_ins = pd.DataFrame({
        'claim_amount': y_claims,
        'age':          age,
        'veh_value':    veh_value,
        'risk_score':   risk_score,
    })

    print(f"Claims summary:")
    print(f"  Mean: ${y_claims.mean():,.0f}")
    print(f"  Std:  ${y_claims.std():,.0f}")
    print(f"  CV:   {y_claims.std()/y_claims.mean():.3f}  (true: {1/np.sqrt(k):.3f})")

    # 1. Gamma GLM (correct model)
    gamma_model = smf.glm(
        'claim_amount ~ age + veh_value + risk_score',
        data=df_ins,
        family=sm.families.Gamma(link=sm.families.links.Log())
    ).fit()

    # 2. OLS on raw target (wrong model)
    X_ins = df_ins[['age', 'veh_value', 'risk_score']].values
    ols_raw = LinearRegression().fit(X_ins, y_claims)

    # 3. OLS on log-transformed target
    ols_log = LinearRegression().fit(X_ins, np.log(y_claims))

    def rmse(y, yhat): return np.sqrt(np.mean((y - yhat)**2))
    def mae(y, yhat):  return np.mean(np.abs(y - yhat))

    pred_gamma = gamma_model.predict()
    pred_ols   = ols_raw.predict(X_ins)
    pred_log   = np.exp(ols_log.predict(X_ins))  # back-transform (biased)

    print(f"\nModel Comparison:")
    print(f"{'Model':30s}  {'RMSE':10s}  {'MAE':10s}  {'AIC':10s}")
    print("-" * 65)
    print(f"{'Gamma GLM (log link)':30s}  {rmse(y_claims, pred_gamma):10.2f}  "
          f"{mae(y_claims, pred_gamma):10.2f}  {gamma_model.aic:10.2f}")
    print(f"{'OLS (raw target)':30s}  {rmse(y_claims, pred_ols):10.2f}  "
          f"{mae(y_claims, pred_ols):10.2f}  {'N/A':10s}")
    print(f"{'OLS (log-y, back-transformed)':30s}  {rmse(y_claims, pred_log):10.2f}  "
          f"{mae(y_claims, pred_log):10.2f}  {'N/A':10s}")

    print("\nGamma GLM Coefficients:")
    print(gamma_model.summary().tables[1])

    # OLS can predict negative claims; Gamma always positive
    negative_preds = np.sum(pred_ols < 0)
    print(f"\nOLS negative predictions: {negative_preds} (impossible for claims)")
    print(f"Gamma negative predictions: {np.sum(pred_gamma < 0)} (impossible by construction)")

    return gamma_model


gamma_model = gamma_regression_demo()

Probit vs Logit for Binary Classification

Both logit and probit model $P(y=1|\mathbf{x})$ but use different link functions derived from different latent variable models.

Latent variable interpretation: suppose there is an unobserved continuous propensity $y^* = \mathbf{w}^\top\mathbf{x} + \epsilon$ , and we observe $y = \mathbf{1}[y^* > 0]$ .

If $\epsilon \sim \text{Logistic}(0, 1)$ : $P(y=1|\mathbf{x}) = \sigma(\mathbf{w}^\top\mathbf{x})$ → logit model
If $\epsilon \sim \mathcal{N}(0, 1)$ : $P(y=1|\mathbf{x}) = \Phi(\mathbf{w}^\top\mathbf{x})$ → probit model

The Logistic and Normal CDFs are very similar in the central region but differ in the tails - Logistic has heavier tails. In practice, predictions are nearly identical.

Property	Logit	Probit
Link function	$\log\frac{\mu}{1-\mu}$	$\Phi^{-1}(\mu)$
Inverse link	$\sigma(\eta) = \frac{1}{1+e^{-\eta}}$	$\Phi(\eta)$
Coefficient scale	Larger (~1.7× probit)	Smaller
Interpretation	Log-odds ratio	Latent z-score
Tail behavior	Heavier tails	Lighter tails
Tradition	Machine learning	Econometrics
Computational cost	Cheap (exp)	Moderate (erf/erfc)

The coefficient ratio $\hat{w}_\text{logit} \approx \frac{\pi}{\sqrt{3}} \hat{w}_\text{probit} \approx 1.81\, \hat{w}_\text{probit}$ because the logistic distribution has standard deviation $\pi/\sqrt{3}$ while the standard normal has standard deviation 1.

import numpy as np
import statsmodels.api as sm
from scipy import stats


def probit_vs_logit():
    """
    Fit probit and logit to the same data; compare coefficients and predictions.
    """
    np.random.seed(42)
    n = 800

    # True probit model
    X_ = np.random.randn(n, 3)
    w_true = np.array([0.5, -0.4, 0.8])
    latent = X_ @ w_true       # N(0,1) errors implied by probit
    p_true = stats.norm.cdf(latent)
    y      = np.random.binomial(1, p_true)

    Xa = sm.add_constant(X_)

    # Fit probit
    probit_fit = sm.Probit(y, Xa).fit(disp=0)

    # Fit logit
    logit_fit  = sm.Logit(y, Xa).fit(disp=0)

    print("Probit vs Logit Coefficient Comparison:")
    print(f"{'Param':10s}  {'True (probit)':14s}  {'Probit est':12s}  {'Logit est':12s}  {'Logit/Probit':12s}")
    print("-" * 65)
    names = ['intercept', 'x1', 'x2', 'x3']
    true  = [0.0] + list(w_true)

    for i, name in enumerate(names):
        pc   = probit_fit.params[i]
        lc   = logit_fit.params[i]
        r    = lc / pc if abs(pc) > 0.01 else float('nan')
        print(f"{name:10s}  {true[i]:14.3f}  {pc:12.4f}  {lc:12.4f}  {r:12.3f}")

    print(f"\nTheoretical ratio π/√3 ≈ {np.pi/np.sqrt(3):.3f}")

    # Compare predicted probabilities
    p_probit = probit_fit.predict()
    p_logit  = logit_fit.predict()
    max_diff  = np.max(np.abs(p_probit - p_logit))
    print(f"\nMax predicted probability difference: {max_diff:.4f}")
    print("Probit and logit give nearly identical predictions in practice.")

    return probit_fit, logit_fit


probit_fit, logit_fit = probit_vs_logit()

Deviance and Model Comparison

For GLMs, the equivalent of RSS and R² is deviance - a likelihood-ratio-based measure.

Deviance Definition

The deviance of a fitted GLM is:

$D = 2\left[\ell(\text{saturated model}) - \ell(\hat{\boldsymbol{w}})\right]$

The saturated model fits each observation perfectly ( $\hat\mu_i = y_i$ ) and represents the maximum achievable log-likelihood. Deviance measures how far your model falls short.

Poisson deviance:

$D = 2\sum_{i=1}^n \left[y_i\log\frac{y_i}{\hat\mu_i} - (y_i - \hat\mu_i)\right]$

Gamma deviance:

$D = 2\sum_{i=1}^n \left[\frac{y_i - \hat\mu_i}{\hat\mu_i} - \log\frac{y_i}{\hat\mu_i}\right]$

Bernoulli deviance (= 2 × cross-entropy):

$D = -2\sum_{i=1}^n \left[y_i\log\hat\mu_i + (1-y_i)\log(1-\hat\mu_i)\right]$

Null vs Residual Deviance

Null deviance: deviance of intercept-only model ( $\hat\mu_i = \bar{y}$ for all $i$ ) - baseline performance
Residual deviance: deviance of your fitted model - performance with features

Pseudo-R²: $R^2_D = 1 - D_\text{resid}/D_\text{null}$ (fraction of deviance explained)

AIC for GLM Model Selection

$\text{AIC} = -2\ell(\hat{\boldsymbol{\theta}}) + 2k$

where $k$ = number of estimated parameters. Lower AIC = better model. Use AIC to:

Compare models with different numbers of features (AIC penalizes complexity)
Compare different GLM families (Poisson vs Negative Binomial vs Gaussian)
Compare different link functions (log vs identity for Poisson)

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf


def glm_model_selection(df: pd.DataFrame, y_col: str, x_cols: list):
    """
    Compare GLM families on the same dataset using AIC, deviance, and pseudo-R².
    """
    formula = f"{y_col} ~ {' + '.join(x_cols)}"

    families = {
        'Poisson (log)':          sm.families.Poisson(sm.families.links.Log()),
        'NegBinomial (log)':      sm.families.NegativeBinomial(sm.families.links.Log()),
        'Gaussian (identity)':    sm.families.Gaussian(sm.families.links.identity()),
        'Gaussian (log)':         sm.families.Gaussian(sm.families.links.Log()),
    }

    results = []
    for name, family in families.items():
        try:
            fit = smf.glm(formula, data=df, family=family).fit(disp=0)
            pseudo_r2 = 1 - fit.deviance / fit.null_deviance
            results.append({
                'Model':        name,
                'AIC':          fit.aic,
                'Null Dev':     fit.null_deviance,
                'Resid Dev':    fit.deviance,
                'Pseudo R²':    pseudo_r2,
                'df_resid':     fit.df_resid,
            })
        except Exception as e:
            results.append({'Model': name, 'AIC': np.inf,
                            'Null Dev': np.nan, 'Resid Dev': np.nan,
                            'Pseudo R²': np.nan, 'df_resid': np.nan})

    results_df = pd.DataFrame(results).sort_values('AIC')
    print(f"\nGLM Model Comparison - Response: {y_col}")
    print(results_df.to_string(index=False, float_format='{:.2f}'.format))
    return results_df


# Simulate and compare on Poisson count data
np.random.seed(42)
n_demo   = 400
X_demo   = np.random.randn(n_demo, 3)
y_demo   = np.random.poisson(np.exp(1.5 + X_demo @ [0.4, -0.3, 0.5]))
df_demo  = pd.DataFrame({
    'y': y_demo, 'x1': X_demo[:, 0], 'x2': X_demo[:, 1], 'x3': X_demo[:, 2]
})

glm_model_selection(df_demo, 'y', ['x1', 'x2', 'x3'])

Production Notes

:::tip Offset terms for varying exposure When count observations have different exposure times (e.g., incidents per machine-hour, calls per employee), add an offset term: log(mu_i) = offset_i + w'x_i. In statsmodels: smf.glm(formula, ..., offset=np.log(exposure)). This ensures your model predicts rates, not raw counts, which is almost always the right quantity. :::

:::warning Check deviance/df before reporting A Poisson GLM reporting residual deviance/df = 8.3 is severely overdispersed. Coefficients are still consistent, but standard errors are too small (anti-conservative). Always report Pearson chi2/df from the fitted model. If > 2, switch to Negative Binomial or use quasi-Poisson (which rescales SEs by sqrt(dispersion) without changing point estimates). :::

:::danger Do not exponentiate Gamma GLM coefficients for interpretation For Poisson regression with log link, $e^{w_j}$ is the multiplicative rate ratio - a correct and meaningful interpretation. For Gamma regression with log link, $e^{w_j}$ is also the multiplicative effect on the expected value. But for Gamma regression with inverse link (canonical), $w_j$ represents the change in $1/\mu$ , which is rarely intuitive. Always specify which link function you used when reporting Gamma GLM coefficients. :::

:::warning Beta regression for proportions is not a standard GLM Beta regression (for responses strictly in $(0,1)$ ) is often discussed alongside GLMs but is not a standard exponential family GLM because the Beta distribution's parameterization by $(\mu, \phi)$ does not fit the canonical form. It requires a separate fitting procedure. Use betareg in R or statsmodels.othermod.BetaModel in Python. :::

YouTube Resources

Title	Channel	Why Watch
Generalized Linear Models - Complete Series	StatQuest	Visual, intuitive GLM intro covering Poisson and logistic
GLMs in Python with statsmodels	Data School	Practical coding walkthrough
IRLS Algorithm Explained	Mathematical Monk	Rigorous derivation of IRLS from Newton-Raphson
Count Data Regression: Poisson and Negative Binomial	ritvikmath	Overdispersion explained clearly
Gamma Regression for Skewed Data	Kevin P. Murphy lectures	When and why to use Gamma family

Interview Q&A

Q1: What are the three components of a GLM and what does each one do?

(1) Random component: specifies the probability distribution for the response $y$ . Must be a member of the exponential family (Gaussian, Bernoulli, Poisson, Gamma, etc.). This choice determines the support of $y$ (what values are possible) and the variance function $V(\mu)$ (how variance scales with the mean). (2) Systematic component: the linear predictor $\eta = \mathbf{w}^\top\mathbf{x}$ - always linear in the parameters. This is the same structure as OLS; GLMs keep linearity in parameter space while allowing non-linear mean-response relationships. (3) Link function $g$ : connects the mean $\mu = E[y|\mathbf{x}]$ to the linear predictor via $g(\mu) = \eta$ . The link ensures the mean stays in the valid range for the distribution - log link for positive means, logit for probabilities, identity for unrestricted means. The canonical link (log for Poisson, logit for Bernoulli, identity for Gaussian) makes the score equations simplest.

Q2: Why use Poisson regression instead of linear regression for count data?

Three reasons. First, support: Poisson regression uses the log link $\mu_i = e^{\mathbf{w}^\top\mathbf{x}_i}$ , guaranteeing $\mu_i > 0$ always. Linear regression can predict negative counts, which are impossible. Second, variance structure: Poisson assumes $\text{Var}(y_i) = \mu_i$ - variance grows with the mean. This reflects empirical reality for counts. Linear regression assumes constant variance (homoscedasticity), which leads to inefficient and miscalibrated predictions for count data. Third, coefficient interpretation: Poisson coefficients have a multiplicative interpretation ( $e^{w_j}$ = rate ratio), which is natural for rates and counts. Linear coefficients are additive, which can imply the same absolute increase regardless of baseline count - often unrealistic.

Q3: What is overdispersion, how do you detect it, and how do you handle it?

Overdispersion means the observed variance in count data exceeds the mean - violating Poisson's equidispersion assumption ( $\text{Var}(y) = \mu$ ). Detection: (1) Raw ratio $\text{Var}(y)/\bar{y}$ ; if much greater than 1, suspect overdispersion. (2) Pearson chi-squared statistic divided by residual degrees of freedom from the fitted Poisson: if $\chi^2_P/df \gg 1$ , the model is overdispersed. (3) Cameron-Trivedi test: auxiliary regression of $(y_i - \hat\mu_i)^2 - y_i$ on $\hat\mu_i^2$ . Consequences: Poisson standard errors are too small (anti-conservative), leading to falsely significant coefficients. Point estimates are still consistent under mild overdispersion. Handling: (1) Negative Binomial GLM - fits $\alpha$ explicitly, correct variance function $V(\mu) = \mu + \alpha\mu^2$ ; (2) Quasi-Poisson - keeps Poisson point estimates but scales SEs by $\sqrt{\chi^2_P/df}$ ; (3) Add omitted features to absorb heterogeneity.

Q4: Derive the IRLS update equation. Why is it quadratically convergent?

Start from Newton-Raphson: $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - (\nabla^2\ell)^{-1}\nabla\ell$ . For GLMs, the Hessian is $\nabla^2\ell = -\mathbf{X}^\top\mathbf{W}\mathbf{X}$ with diagonal weight matrix $W_{ii} = 1/[V(\mu_i) g'(\mu_i)^2]$ , and gradient $\nabla\ell = \mathbf{X}^\top\mathbf{D}(\mathbf{y} - \boldsymbol{\mu})$ . Substituting: $\mathbf{w}^{(t+1)} = (\mathbf{X}^\top\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{W}\mathbf{z}$ where $z_i = \eta_i + (y_i - \mu_i)g'(\mu_i)$ is the working response. This is a weighted least squares problem with weights $W$ and response $\mathbf{z}$ , solved freshly at each iteration. Quadratic convergence: Newton-Raphson converges quadratically because it fits a local quadratic approximation (second-order Taylor expansion) of the objective at each step. Near the MLE, the approximation is very accurate and each step approximately squares the error.

Q5: When would you use a Gamma GLM vs log-transforming the target and using OLS?

Both handle positive right-skewed continuous responses, but they make different assumptions and optimize different objectives. Log-OLS (OLS on $\log y$ ): assumes $\log(y_i) \sim \mathcal{N}(\eta_i, \sigma^2)$ , i.e., $y_i$ is lognormal. The fitted model minimizes squared errors in log-space. Back-transformation $e^{\hat\eta}$ estimates the median, not the mean ( $E[y] = e^{\mu + \sigma^2/2}$ for lognormal - a correction term is needed). Gamma GLM with log link: assumes $y_i \sim \text{Gamma}(\mu_i, \phi)$ , with $\text{Var}(y_i) = \phi\mu_i^2$ . The model directly minimizes the Gamma deviance and directly estimates the mean $E[y_i]$ - no back-transformation bias. Prefer Gamma GLM when: you want unbiased predictions of the mean; the coefficient-of-variation is roughly constant across observations; your loss function should be scale-invariant (proportional errors, not absolute); you need calibrated standard errors. Prefer log-OLS when: you genuinely believe the lognormal generative model; you need a simpler pipeline; you only care about relative rankings (median predictions), not absolute means.

Q6: How would you select among Poisson, Negative Binomial, and Gaussian families for the same count response?

Use a three-step process. Step 1 - domain check: counts are never negative, so Gaussian (which can predict negatives) is suspect unless counts are large and nearly symmetric. Step 2 - fit candidate models and compare AIC: lower AIC wins. AIC = $-2\ell + 2k$ , penalizing both poor fit and excess parameters. Negative Binomial has one extra parameter ( $\alpha$ ) over Poisson, so it needs to improve $-2\ell$ by at least 2 to win on AIC. Step 3 - residual diagnostics: (a) Pearson chi2/df should be near 1 for the winning model; (b) plot Pearson residuals vs fitted values - should show no trend; (c) QQ plot of quantile residuals (randomized for discrete distributions). If Poisson shows Pearson chi2/df ≈ 1 and Negative Binomial AIC is within 2, use Poisson for interpretability. If there is strong overdispersion, Negative Binomial is clearly better and you should report it. Never use Gaussian for count data with substantial mass at zero - the probability of negative predictions can be high.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Logistic Regression Decision Boundary demo on the EngineersOfAI Playground - no code required.

:::

The Real Interview Moment​

Why This Exists - The Failure of Linear Regression​

Historical Context​

The Exponential Family​

Exponential Family Members​

The Three GLM Components​

Component 1: Random Component (Distribution)​

Component 2: Systematic Component (Linear Predictor)​

Component 3: Link Function​

The GLM Gradient: A Unified Formula​

IRLS: The GLM Fitting Algorithm​

Derivation​

IRLS vs Gradient Descent​

Poisson Regression: Count Data​

The Model​

Overdispersion: When Poisson Fails​

Detection​

The Negative Binomial Model​

Gamma Regression: Positive Continuous Data​

Probit vs Logit for Binary Classification​

Deviance and Model Comparison​

Deviance Definition​

Null vs Residual Deviance​

AIC for GLM Model Selection​

Production Notes​

YouTube Resources​

Interview Q&A​