Skip to main content

Generalized Linear Models

The Real Interview Moment

The senior ML engineer leans back in her chair and asks: "Our revenue team predicted churn using linear regression. It worked okay, but predicted probabilities sometimes go above 1. They used logistic regression instead and that fixed it. Now the demand team predicts hourly ride counts and they get negative predictions. Logistic regression won't work here either. What framework unifies both problems?"

You answer: Generalized Linear Models.

The interviewer nods and follows up: "Great. Now Poisson regression is showing that our model's residual variance is nine times the mean. What does that tell you, and what do you do?"

This is a working knowledge question. The answer involves overdispersion, its causes, its consequences for inference, and the fix. Getting this right separates engineers who understand statistical machinery from those who just call .fit() and check accuracy.

Linear regression is a special case of a much larger family. The GLM framework - developed by Nelder and Wedderburn in 1972 - gives you a principled way to handle any response type: binary outcomes, counts, strictly positive measurements, proportions, or arbitrary continuous data. The key ingredients are always the same: choose the right distribution for your response, choose the right link function to connect your linear predictor to the response mean, and the IRLS algorithm handles the rest. Every model you already know - OLS, logistic regression, Poisson regression - is a GLM. Once you understand the framework, extending it to new response types costs nothing beyond identifying the correct exponential family member.

This lesson builds the GLM framework from first principles. It derives the exponential family form, explains the three GLM components, implements Poisson and Gamma regression from scratch and with statsmodels, covers overdispersion diagnosis and remediation, walks through the IRLS algorithm, and ends with a complete model-selection workflow using deviance and AIC.


Why This Exists - The Failure of Linear Regression

Ordinary linear regression assumes:

yi=wxi+ϵi,ϵiN(0,σ2)y_i = \mathbf{w}^\top\mathbf{x}_i + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2)

This works well when the response is continuous and unbounded. It fails systematically when:

  1. Binary response (y{0,1}y \in \{0,1\}): predictions can fall outside [0,1][0,1]; Gaussian residuals make no sense for binary data
  2. Count response (y{0,1,2,}y \in \{0, 1, 2, \ldots\}): predictions can be negative; variance grows with the mean (not constant)
  3. Positive continuous response (y>0y > 0): predictions can be negative; right-skewed distributions are common (costs, durations, sizes)
  4. Proportional response (y[0,1]y \in [0,1]): predictions can exit the unit interval

The ad hoc fix is transforming the target (log-transform for positive data, logit-transform for proportions). But transforming the target changes the noise structure: if log(y)N(η,σ2)\log(y) \sim \mathcal{N}(\eta, \sigma^2), then predictions of log(y)\log(y) are unbiased, but back-transformed predictions eη^e^{\hat{\eta}} are biased for E[y]E[y] (by Jensen's inequality). GLMs avoid this by modeling the mean directly through a link function, without transforming the response.


Historical Context

YearEvent
1972Nelder & Wedderburn publish "Generalized Linear Models" in JRSS-A
1972IRLS algorithm formalized for GLM fitting
1983McCullagh & Nelder write the canonical GLM textbook
1984R's GLM function becomes the reference implementation
1986Logistic and Poisson regression widely adopted in epidemiology and biostats
1995Negative Binomial extended to GLM framework for overdispersed counts
2000sStatsmodels (Python) implements complete GLM suite
2010sGLMs embedded in tree models (LightGBM, XGBoost support Poisson/Gamma objectives)

Nelder's central insight: all common regression models share the same estimation algorithm (IRLS) when expressed in exponential family form. Rather than deriving a custom algorithm for each distribution, one algorithm handles all of them.


The Exponential Family

Many probability distributions belong to the exponential family, which can be written in the form:

P(y;θ,ϕ)=exp(yθb(θ)a(ϕ)+c(y,ϕ))P(y; \theta, \phi) = \exp\left(\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right)

where:

  • θ\theta is the natural (canonical) parameter
  • ϕ\phi is the dispersion parameter (controls spread)
  • b(θ)b(\theta) is the cumulant function (log-normalizer, ensures valid distribution)
  • a(ϕ)a(\phi) is a dispersion scaling factor (usually ϕ/wi\phi/w_i for observation weight wi=1w_i = 1)
  • c(y,ϕ)c(y, \phi) is a normalizing function of the data

Two properties link the cumulant function to the distribution moments:

E[y]=μ=b(θ),Var(y)=a(ϕ)b(θ)=a(ϕ)V(μ)\mathbb{E}[y] = \mu = b'(\theta), \qquad \text{Var}(y) = a(\phi) \cdot b''(\theta) = a(\phi) \cdot V(\mu)

where V(μ)=b(θ)V(\mu) = b''(\theta) is the variance function - it describes how variance changes with the mean.

Exponential Family Members

Gaussian (Linear Regression):

P(yμ)=12πσ2exp((yμ)22σ2)P(y|\mu) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)

Natural parameter: θ=μ\theta = \mu, Cumulant: b(θ)=θ2/2b(\theta) = \theta^2/2, Variance function: V(μ)=1V(\mu) = 1 (constant)

Bernoulli (Logistic Regression):

P(yμ)=μy(1μ)1yP(y|\mu) = \mu^y(1-\mu)^{1-y}

Natural parameter: θ=logμ1μ\theta = \log\frac{\mu}{1-\mu} (log-odds), Cumulant: b(θ)=log(1+eθ)b(\theta) = \log(1 + e^\theta), Variance function: V(μ)=μ(1μ)V(\mu) = \mu(1-\mu)

Poisson (Count Regression):

P(yμ)=eμμyy!P(y|\mu) = \frac{e^{-\mu}\mu^y}{y!}

Natural parameter: θ=logμ\theta = \log\mu, Cumulant: b(θ)=eθb(\theta) = e^\theta, Variance function: V(μ)=μV(\mu) = \mu (equidispersion: mean = variance)

Gamma (Positive Continuous):

P(yμ,ϕ)=1Γ(1/ϕ)(yϕμ)1/ϕ1yexp(yϕμ)P(y|\mu, \phi) = \frac{1}{\Gamma(1/\phi)}\left(\frac{y}{\phi\mu}\right)^{1/\phi}\frac{1}{y}\exp\left(-\frac{y}{\phi\mu}\right)

Natural parameter: θ=1/μ\theta = -1/\mu, Cumulant: b(θ)=log(θ)b(\theta) = -\log(-\theta), Variance function: V(μ)=μ2V(\mu) = \mu^2 (coefficient of variation is constant)


The Three GLM Components

A Generalized Linear Model consists of three choices:

Component 1: Random Component (Distribution)

The response yiy_i follows a distribution from the exponential family. This choice determines:

  • The support of yy (what values are possible)
  • The variance function V(μ)V(\mu) (how variance scales with the mean)
  • The appropriate range for predictions

Component 2: Systematic Component (Linear Predictor)

The linear predictor is always:

ηi=wxi=w0+w1xi1++wdxid\eta_i = \mathbf{w}^\top\mathbf{x}_i = w_0 + w_1 x_{i1} + \ldots + w_d x_{id}

GLMs remain linear in parameters - the systematic component is the same as OLS. This is what makes them tractable and interpretable.

The link function gg connects the mean μi=E[yixi]\mu_i = \mathbb{E}[y_i | \mathbf{x}_i] to the linear predictor:

g(μi)=ηi=wxig(\mu_i) = \eta_i = \mathbf{w}^\top\mathbf{x}_i

Equivalently: μi=g1(wxi)\mu_i = g^{-1}(\mathbf{w}^\top\mathbf{x}_i)

The canonical link is the link that makes η=θ\eta = \theta (natural parameter = linear predictor). Using canonical links gives the cleanest score equations and simplest IRLS weights.

DistributionSupportCanonical Link g(μ)g(\mu)Inverse g1(η)g^{-1}(\eta)V(μ)V(\mu)Model name
Gaussian(,+)(-\infty, +\infty)Identity: μ\muη\eta1Linear regression
Bernoulli{0,1}\{0,1\}Logit: logμ1μ\log\frac{\mu}{1-\mu}σ(η)=11+eη\sigma(\eta) = \frac{1}{1+e^{-\eta}}μ(1μ)\mu(1-\mu)Logistic regression
Poisson{0,1,2,}\{0,1,2,\ldots\}Log: logμ\log\mueηe^\etaμ\muPoisson regression
Gamma(0,+)(0,+\infty)Inverse: 1/μ1/\mu1/η1/\etaμ2\mu^2Gamma regression
Neg. Binomial{0,1,2,}\{0,1,2,\ldots\}Log: logμ\log\mueηe^\etaμ+αμ2\mu + \alpha\mu^2NB regression

The GLM Gradient: A Unified Formula

A crucial property of exponential families: the gradient of the log-likelihood has the same structure for all GLMs:

w=i=1n(yiμi)V(μi)[g(μi)]2μiηixi=XD(yμ)\frac{\partial\ell}{\partial\mathbf{w}} = \sum_{i=1}^n \frac{(y_i - \mu_i)}{V(\mu_i) \cdot [g'(\mu_i)]^2} \cdot \frac{\partial\mu_i}{\partial\eta_i} \cdot \mathbf{x}_i = \mathbf{X}^\top \mathbf{D}(\mathbf{y} - \boldsymbol{\mu})

where D\mathbf{D} is a diagonal matrix of weights. The gradient is always a weighted sum of residuals (yiμi)(y_i - \mu_i) times features xi\mathbf{x}_i - the same structural form as OLS gradient, generalized.

For canonical links, this simplifies to:

w=X(yμ)\frac{\partial\ell}{\partial\mathbf{w}} = \mathbf{X}^\top(\mathbf{y} - \boldsymbol{\mu})

For Poisson (log link, canonical): w=X(yμ^)\nabla_\mathbf{w}\ell = \mathbf{X}^\top(\mathbf{y} - \hat{\boldsymbol{\mu}}) where μ^i=ewxi\hat{\mu}_i = e^{\mathbf{w}^\top\mathbf{x}_i}.


IRLS: The GLM Fitting Algorithm

GLMs are not fit by standard gradient descent. Instead, Newton-Raphson applied to the log-likelihood leads to Iteratively Reweighted Least Squares (IRLS).

Derivation

The Newton-Raphson update for maximizing (w)\ell(\mathbf{w}):

w(t+1)=w(t)[w2(w(t))]1w(w(t))\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \left[\nabla^2_\mathbf{w}\ell(\mathbf{w}^{(t)})\right]^{-1}\nabla_\mathbf{w}\ell(\mathbf{w}^{(t)})

The Hessian of the GLM log-likelihood equals:

w2=XWX\nabla^2_\mathbf{w}\ell = -\mathbf{X}^\top \mathbf{W} \mathbf{X}

where W=diag(w1,,wn)\mathbf{W} = \text{diag}(w_1, \ldots, w_n) with working weights:

wi=1V(μi)[g(μi)]2w_i = \frac{1}{V(\mu_i) \cdot [g'(\mu_i)]^2}

The Newton step becomes:

w(t+1)=(XW(t)X)1XW(t)z(t)\mathbf{w}^{(t+1)} = \left(\mathbf{X}^\top\mathbf{W}^{(t)}\mathbf{X}\right)^{-1}\mathbf{X}^\top\mathbf{W}^{(t)}\mathbf{z}^{(t)}

where z(t)\mathbf{z}^{(t)} is the adjusted dependent variable (working response):

zi(t)=η^i(t)+(yiμ^i(t))g(μ^i(t))z_i^{(t)} = \hat{\eta}_i^{(t)} + (y_i - \hat{\mu}_i^{(t)}) \cdot g'(\hat{\mu}_i^{(t)})

This is exactly a weighted least squares problem with response z(t)\mathbf{z}^{(t)} and weights W(t)\mathbf{W}^{(t)} - solved at each iteration, hence "iteratively reweighted least squares."

IRLS vs Gradient Descent

PropertyIRLSGradient Descent
ConvergenceQuadratic (Newton-Raphson)Linear
Iterations to convergence~5–10Hundreds to thousands
Per-iteration costO(nd2+d3)O(nd^2 + d^3) (matrix solve)O(nd)O(nd)
MemoryO(d2)O(d^2) for Gram matrixO(d)O(d)
Best forSmall-medium nn, well-conditionedVery large nn, high dd
Used bystatsmodels, R glm(), sklearn GLMOnline/streaming settings

For small-to-medium datasets (n ≤ 100k, d ≤ 10k), IRLS is dramatically faster due to quadratic convergence. For large-scale problems, gradient descent variants (Adam, Adagrad) are preferred.

import numpy as np
from scipy.special import expit # sigmoid


class PoissonGLMFromScratch:
"""
Poisson GLM with log link, fit by IRLS.
Model: log(mu_i) = w'x_i => mu_i = exp(w'x_i)
Variance function: V(mu) = mu
Link derivative: g'(mu) = 1/mu (log link)
Working weight: w_i = 1 / (V(mu_i) * g'(mu_i)^2) = mu_i
Working response: z_i = eta_i + (y_i - mu_i) / mu_i
"""

def __init__(self, max_iter: int = 25, tol: float = 1e-8):
self.max_iter = max_iter
self.tol = tol
self.w_ = None
self.ll_history_ = []

def _log_likelihood(self, y: np.ndarray, mu: np.ndarray) -> float:
# Poisson NLL: sum[ -mu + y*log(mu) ] (ignore log(y!) constant)
return np.sum(-mu + y * np.log(np.clip(mu, 1e-12, None)))

def fit(self, X: np.ndarray, y: np.ndarray, verbose: bool = True):
n, d = X.shape
Xa = np.column_stack([np.ones(n), X]) # add intercept

# Initialization: start from eta = log(y_bar)
eta = np.full(n, np.log(np.mean(y) + 1e-6))
mu = np.exp(eta)
w = np.linalg.lstsq(Xa, eta, rcond=None)[0]

for iteration in range(self.max_iter):
# --- Compute IRLS components ---
# Working weight for Poisson + log link: W_i = mu_i
W = mu # shape (n,)

# Working response: z_i = eta_i + (y_i - mu_i) * g'(mu_i)
# g'(mu) = d(log mu)/dmu = 1/mu
z = eta + (y - mu) / mu # shape (n,)

# --- Weighted least squares step ---
# Solve: (Xa' W Xa) w = Xa' W z
Xw = Xa * W[:, np.newaxis] # weighted X
A = Xw.T @ Xa # (d+1, d+1)
b = Xw.T @ z # (d+1,)

w_new = np.linalg.solve(A + 1e-10 * np.eye(d + 1), b)

# --- Update eta and mu ---
eta_new = Xa @ w_new
mu_new = np.exp(eta_new) # inverse log link

# --- Check convergence ---
ll = self._log_likelihood(y, mu_new)
self.ll_history_.append(ll)

delta_w = np.max(np.abs(w_new - w))
if verbose and (iteration % 5 == 0 or iteration < 5):
print(f" Iter {iteration+1:3d}: log-lik = {ll:.4f}, "
f"max|Δw| = {delta_w:.2e}")

w, eta, mu = w_new, eta_new, mu_new

if delta_w < self.tol:
if verbose:
print(f" Converged at iteration {iteration+1}")
break

self.w_ = w
self.w_coef_ = w[1:]
self.intercept_ = w[0]
return self

def predict(self, X: np.ndarray) -> np.ndarray:
n = X.shape[0]
Xa = np.column_stack([np.ones(n), X])
return np.exp(Xa @ self.w_)

def deviance(self, X: np.ndarray, y: np.ndarray) -> float:
mu = self.predict(X)
# Poisson deviance: 2 * sum[ y*log(y/mu) - (y - mu) ]
safe = np.clip(mu, 1e-10, None)
yc = np.clip(y, 1e-10, None)
return 2 * np.sum(y * np.log(yc / safe) - (y - safe))


# Demo
np.random.seed(42)
n = 400
X_ = np.random.randn(n, 3)
w_true = np.array([0.5, -0.3, 0.7])
eta_true = 2.0 + X_ @ w_true
y_pois = np.random.poisson(np.exp(eta_true))

print("IRLS Fitting Poisson GLM from scratch:")
poisson_glm = PoissonGLMFromScratch(max_iter=30, tol=1e-8)
poisson_glm.fit(X_, y_pois)

print(f"\nTrue intercept: 2.0, Estimated: {poisson_glm.intercept_:.4f}")
print(f"True w: {w_true}, Estimated: {poisson_glm.w_coef_}")
print(f"Residual Deviance: {poisson_glm.deviance(X_, y_pois):.3f}")

Poisson Regression: Count Data

The Model

For count response yi{0,1,2,}y_i \in \{0, 1, 2, \ldots\}:

μi=E[yixi]=ewxi\mu_i = E[y_i | \mathbf{x}_i] = e^{\mathbf{w}^\top\mathbf{x}_i}

The log link ensures μi>0\mu_i > 0 always. The loss function (Poisson negative log-likelihood):

(w)=i=1n[ewxiyiwxi]+const-\ell(\mathbf{w}) = \sum_{i=1}^n \left[e^{\mathbf{w}^\top\mathbf{x}_i} - y_i\mathbf{w}^\top\mathbf{x}_i\right] + \text{const}

Coefficient interpretation: A one-unit increase in feature jj multiplies the expected count by ewje^{w_j} (multiplicative, not additive). This is the key difference from linear regression.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf


def poisson_regression_demo():
"""
Fit Poisson GLM to simulated ride demand data.
Demonstrate coefficient interpretation, overdispersion check.
"""
np.random.seed(42)
n = 600

# Features
hour = np.random.randint(0, 24, n)
temp_c = np.random.normal(18, 7, n)
is_weekend = np.random.binomial(1, 0.3, n)
rain = np.random.binomial(1, 0.25, n)

# True log-linear model
hour_effect = 0.8 * np.sin(2 * np.pi * hour / 24)
log_mu = (2.8
+ 0.025 * temp_c
+ 0.3 * is_weekend
- 0.6 * rain
+ hour_effect)
y_rides = np.random.poisson(np.exp(log_mu))

df = pd.DataFrame({
'rides': y_rides,
'hour': hour,
'temp_c': temp_c,
'is_weekend': is_weekend,
'rain': rain,
'hour_sin': np.sin(2 * np.pi * hour / 24),
'hour_cos': np.cos(2 * np.pi * hour / 24),
})

# Fit Poisson GLM
formula = 'rides ~ temp_c + is_weekend + rain + hour_sin + hour_cos'
model = smf.glm(formula, data=df,
family=sm.families.Poisson()).fit()

print("Poisson GLM Summary:")
print(model.summary())

# Coefficient interpretation
print("\nMultiplicative Effects (exp(coef)):")
for name, coef in model.params.items():
if name == 'Intercept':
continue
effect = np.exp(coef)
pct = (effect - 1) * 100
sign = "+" if pct >= 0 else ""
print(f" {name:15s}: e^{coef:6.3f} = {effect:.3f} "
f"({sign}{pct:.1f}% change in expected rides per unit)")

# Overdispersion test
mean_y = df['rides'].mean()
var_y = df['rides'].var()
print(f"\nOverdispersion Check:")
print(f" Mean of y: {mean_y:.3f}")
print(f" Variance of y: {var_y:.3f}")
print(f" Variance/Mean: {var_y/mean_y:.3f}")

# Pearson chi-squared statistic / df
mu_hat = model.predict()
pearson_chi2 = np.sum((df['rides'] - mu_hat)**2 / mu_hat)
df_resid = model.df_resid
print(f" Pearson chi2 / df: {pearson_chi2/df_resid:.3f}"
f" (expect ≈1 if well-specified)")
if pearson_chi2 / df_resid > 2:
print(" WARNING: Possible overdispersion. Consider Negative Binomial.")

return df, model


df_rides, poisson_model = poisson_regression_demo()

Overdispersion: When Poisson Fails

The Poisson distribution assumes equidispersion: Var(y)=μ\text{Var}(y) = \mu. Real count data almost always exhibits overdispersion - variance exceeding the mean. Common causes:

  • Unobserved heterogeneity: Some individuals have systematically higher rates, not captured by features
  • Contagion / clustering: Events cluster (positive correlation within groups)
  • Zero inflation: Excess zeros from a mixture of structural zeros and Poisson process
  • Measurement aggregation: Counts measured over varying time periods

Detection

Three diagnostics for overdispersion in fitted Poisson models:

  1. Raw variance/mean ratio: If observed Var(y)/yˉ1\text{Var}(y)/\bar{y} \gg 1, likely overdispersed
  2. Pearson chi-squared: χP2=i(yiμ^i)2/μ^i\chi^2_P = \sum_i(y_i - \hat\mu_i)^2/\hat\mu_i. Under Poisson, χP2/df1\chi^2_P/df \approx 1. Ratio >2> 2 is concerning.
  3. Cameron-Trivedi test: Regress (yiμ^i)2yi(y_i - \hat\mu_i)^2 - y_i on μ^i2\hat\mu_i^2; significant positive slope indicates overdispersion

The Negative Binomial Model

The Negative Binomial adds a per-observation random effect (Gamma mixing):

yiλiPoisson(λi),λiGamma(1α,μiα)y_i | \lambda_i \sim \text{Poisson}(\lambda_i), \quad \lambda_i \sim \text{Gamma}\left(\frac{1}{\alpha}, \frac{\mu_i}{\alpha}\right)

Marginalizing over λi\lambda_i gives:

yiμi,αNegBin(μi,α)y_i | \mu_i, \alpha \sim \text{NegBin}(\mu_i, \alpha)

with variance function:

Var(yi)=μi+αμi2\text{Var}(y_i) = \mu_i + \alpha\mu_i^2

The dispersion parameter α0\alpha \geq 0: when α0\alpha \to 0, NegBin → Poisson. The quadratic variance function accommodates arbitrarily large overdispersion by estimating α\alpha from data.

import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd


def overdispersion_demo():
"""
Generate overdispersed count data, compare Poisson vs Negative Binomial.
"""
np.random.seed(42)
n = 500
X_ = np.random.randn(n, 2)
w_true = np.array([0.4, -0.3])

# Overdispersed: gamma mixing with shape=2 (alpha=0.5)
alpha_true = 0.5
mu_true = np.exp(1.5 + X_ @ w_true)
# Generate lambda from Gamma, then counts from Poisson(lambda)
lam = np.random.gamma(shape=1/alpha_true, scale=mu_true * alpha_true)
y = np.random.poisson(lam)

df_od = pd.DataFrame({'y': y, 'x1': X_[:, 0], 'x2': X_[:, 1]})

# Dispersion check
print(f"Overdispersion Diagnosis:")
print(f" Mean y: {y.mean():.2f}, Var y: {y.var():.2f}")
print(f" Raw Var/Mean ratio: {y.var()/y.mean():.2f}")

# Fit Poisson
pois_fit = smf.glm('y ~ x1 + x2', data=df_od,
family=sm.families.Poisson()).fit(disp=0)
mu_pois = pois_fit.predict()
pearson = np.sum((y - mu_pois)**2 / mu_pois)
print(f"\nPoisson GLM:")
print(f" AIC: {pois_fit.aic:.2f}")
print(f" Residual Deviance: {pois_fit.deviance:.2f}")
print(f" Pearson chi2 / df: {pearson/pois_fit.df_resid:.3f}")

# Fit Negative Binomial
nb_fit = smf.glm('y ~ x1 + x2', data=df_od,
family=sm.families.NegativeBinomial()).fit(disp=0)
mu_nb = nb_fit.predict()
pearson_nb = np.sum((y - mu_nb)**2 / mu_nb)
print(f"\nNegative Binomial GLM:")
print(f" AIC: {nb_fit.aic:.2f}")
print(f" Residual Deviance: {nb_fit.deviance:.2f}")
print(f" Pearson chi2 / df: {pearson_nb/nb_fit.df_resid:.3f}")
print(f" (Lower AIC + Pearson ≈ 1 confirms NegBin is the right model)")

# Quasi-Poisson as alternative (inflate SEs, don't change point estimates)
# In statsmodels, use Poisson with scale='X2' (Pearson chi2 dispersion)
# quasi_fit = smf.glm('y ~ x1 + x2', data=df_od,
# family=sm.families.Poisson()).fit(scale='X2')
# This gives same coefficients as Poisson but wider confidence intervals

return pois_fit, nb_fit


pois_fit, nb_fit = overdispersion_demo()

Gamma Regression: Positive Continuous Data

The Gamma distribution is appropriate for strictly positive, right-skewed continuous responses such as insurance claim amounts, trip durations, costs, or survival times.

Gamma GLM with log link (most common):

μi=E[yixi]=ewxi\mu_i = E[y_i | \mathbf{x}_i] = e^{\mathbf{w}^\top\mathbf{x}_i}

Variance function: Var(yi)=ϕμi2\text{Var}(y_i) = \phi\mu_i^2 - variance grows quadratically with the mean. This means the coefficient of variation CV=Var(yi)/μi=ϕ\text{CV} = \sqrt{\text{Var}(y_i)}/\mu_i = \sqrt{\phi} is constant across observations - a property of many financial and engineering quantities.

The Gamma NLL (loss function) with log link:

(w)=1ϕi=1n[yiμi+logμi]+const-\ell(\mathbf{w}) = \frac{1}{\phi}\sum_{i=1}^n \left[\frac{y_i}{\mu_i} + \log\mu_i\right] + \text{const}

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt


def gamma_regression_demo():
"""
Fit Gamma GLM to simulated insurance claim amounts.
Compare vs OLS on raw vs log-transformed target.
"""
np.random.seed(42)
n = 500

# Features: driver age, vehicle value, risk score
age = np.random.uniform(18, 75, n)
veh_value = np.random.exponential(20000, n) # in $
risk_score = np.random.uniform(0, 10, n)

# True Gamma model: log(E[claim]) = linear predictor
log_mu = (7.0
- 0.01 * age
+ 0.000005 * veh_value
+ 0.15 * risk_score)
mu_true = np.exp(log_mu)

# Gamma distributed claims: shape k=4 (phi=1/k=0.25)
k = 4
y_claims = np.random.gamma(k, mu_true / k) # shape=k, scale=mu/k

df_ins = pd.DataFrame({
'claim_amount': y_claims,
'age': age,
'veh_value': veh_value,
'risk_score': risk_score,
})

print(f"Claims summary:")
print(f" Mean: ${y_claims.mean():,.0f}")
print(f" Std: ${y_claims.std():,.0f}")
print(f" CV: {y_claims.std()/y_claims.mean():.3f} (true: {1/np.sqrt(k):.3f})")

# 1. Gamma GLM (correct model)
gamma_model = smf.glm(
'claim_amount ~ age + veh_value + risk_score',
data=df_ins,
family=sm.families.Gamma(link=sm.families.links.Log())
).fit()

# 2. OLS on raw target (wrong model)
X_ins = df_ins[['age', 'veh_value', 'risk_score']].values
ols_raw = LinearRegression().fit(X_ins, y_claims)

# 3. OLS on log-transformed target
ols_log = LinearRegression().fit(X_ins, np.log(y_claims))

def rmse(y, yhat): return np.sqrt(np.mean((y - yhat)**2))
def mae(y, yhat): return np.mean(np.abs(y - yhat))

pred_gamma = gamma_model.predict()
pred_ols = ols_raw.predict(X_ins)
pred_log = np.exp(ols_log.predict(X_ins)) # back-transform (biased)

print(f"\nModel Comparison:")
print(f"{'Model':30s} {'RMSE':10s} {'MAE':10s} {'AIC':10s}")
print("-" * 65)
print(f"{'Gamma GLM (log link)':30s} {rmse(y_claims, pred_gamma):10.2f} "
f"{mae(y_claims, pred_gamma):10.2f} {gamma_model.aic:10.2f}")
print(f"{'OLS (raw target)':30s} {rmse(y_claims, pred_ols):10.2f} "
f"{mae(y_claims, pred_ols):10.2f} {'N/A':10s}")
print(f"{'OLS (log-y, back-transformed)':30s} {rmse(y_claims, pred_log):10.2f} "
f"{mae(y_claims, pred_log):10.2f} {'N/A':10s}")

print("\nGamma GLM Coefficients:")
print(gamma_model.summary().tables[1])

# OLS can predict negative claims; Gamma always positive
negative_preds = np.sum(pred_ols < 0)
print(f"\nOLS negative predictions: {negative_preds} (impossible for claims)")
print(f"Gamma negative predictions: {np.sum(pred_gamma < 0)} (impossible by construction)")

return gamma_model


gamma_model = gamma_regression_demo()

Probit vs Logit for Binary Classification

Both logit and probit model P(y=1x)P(y=1|\mathbf{x}) but use different link functions derived from different latent variable models.

Latent variable interpretation: suppose there is an unobserved continuous propensity y=wx+ϵy^* = \mathbf{w}^\top\mathbf{x} + \epsilon, and we observe y=1[y>0]y = \mathbf{1}[y^* > 0].

  • If ϵLogistic(0,1)\epsilon \sim \text{Logistic}(0, 1): P(y=1x)=σ(wx)P(y=1|\mathbf{x}) = \sigma(\mathbf{w}^\top\mathbf{x})logit model
  • If ϵN(0,1)\epsilon \sim \mathcal{N}(0, 1): P(y=1x)=Φ(wx)P(y=1|\mathbf{x}) = \Phi(\mathbf{w}^\top\mathbf{x})probit model

The Logistic and Normal CDFs are very similar in the central region but differ in the tails - Logistic has heavier tails. In practice, predictions are nearly identical.

PropertyLogitProbit
Link functionlogμ1μ\log\frac{\mu}{1-\mu}Φ1(μ)\Phi^{-1}(\mu)
Inverse linkσ(η)=11+eη\sigma(\eta) = \frac{1}{1+e^{-\eta}}Φ(η)\Phi(\eta)
Coefficient scaleLarger (~1.7× probit)Smaller
InterpretationLog-odds ratioLatent z-score
Tail behaviorHeavier tailsLighter tails
TraditionMachine learningEconometrics
Computational costCheap (exp)Moderate (erf/erfc)

The coefficient ratio w^logitπ3w^probit1.81w^probit\hat{w}_\text{logit} \approx \frac{\pi}{\sqrt{3}} \hat{w}_\text{probit} \approx 1.81\, \hat{w}_\text{probit} because the logistic distribution has standard deviation π/3\pi/\sqrt{3} while the standard normal has standard deviation 1.

import numpy as np
import statsmodels.api as sm
from scipy import stats


def probit_vs_logit():
"""
Fit probit and logit to the same data; compare coefficients and predictions.
"""
np.random.seed(42)
n = 800

# True probit model
X_ = np.random.randn(n, 3)
w_true = np.array([0.5, -0.4, 0.8])
latent = X_ @ w_true # N(0,1) errors implied by probit
p_true = stats.norm.cdf(latent)
y = np.random.binomial(1, p_true)

Xa = sm.add_constant(X_)

# Fit probit
probit_fit = sm.Probit(y, Xa).fit(disp=0)

# Fit logit
logit_fit = sm.Logit(y, Xa).fit(disp=0)

print("Probit vs Logit Coefficient Comparison:")
print(f"{'Param':10s} {'True (probit)':14s} {'Probit est':12s} {'Logit est':12s} {'Logit/Probit':12s}")
print("-" * 65)
names = ['intercept', 'x1', 'x2', 'x3']
true = [0.0] + list(w_true)

for i, name in enumerate(names):
pc = probit_fit.params[i]
lc = logit_fit.params[i]
r = lc / pc if abs(pc) > 0.01 else float('nan')
print(f"{name:10s} {true[i]:14.3f} {pc:12.4f} {lc:12.4f} {r:12.3f}")

print(f"\nTheoretical ratio π/√3 ≈ {np.pi/np.sqrt(3):.3f}")

# Compare predicted probabilities
p_probit = probit_fit.predict()
p_logit = logit_fit.predict()
max_diff = np.max(np.abs(p_probit - p_logit))
print(f"\nMax predicted probability difference: {max_diff:.4f}")
print("Probit and logit give nearly identical predictions in practice.")

return probit_fit, logit_fit


probit_fit, logit_fit = probit_vs_logit()

Deviance and Model Comparison

For GLMs, the equivalent of RSS and R² is deviance - a likelihood-ratio-based measure.

Deviance Definition

The deviance of a fitted GLM is:

D=2[(saturated model)(w^)]D = 2\left[\ell(\text{saturated model}) - \ell(\hat{\boldsymbol{w}})\right]

The saturated model fits each observation perfectly (μ^i=yi\hat\mu_i = y_i) and represents the maximum achievable log-likelihood. Deviance measures how far your model falls short.

Poisson deviance:

D=2i=1n[yilogyiμ^i(yiμ^i)]D = 2\sum_{i=1}^n \left[y_i\log\frac{y_i}{\hat\mu_i} - (y_i - \hat\mu_i)\right]

Gamma deviance:

D=2i=1n[yiμ^iμ^ilogyiμ^i]D = 2\sum_{i=1}^n \left[\frac{y_i - \hat\mu_i}{\hat\mu_i} - \log\frac{y_i}{\hat\mu_i}\right]

Bernoulli deviance (= 2 × cross-entropy):

D=2i=1n[yilogμ^i+(1yi)log(1μ^i)]D = -2\sum_{i=1}^n \left[y_i\log\hat\mu_i + (1-y_i)\log(1-\hat\mu_i)\right]

Null vs Residual Deviance

  • Null deviance: deviance of intercept-only model (μ^i=yˉ\hat\mu_i = \bar{y} for all ii) - baseline performance
  • Residual deviance: deviance of your fitted model - performance with features

Pseudo-R²: RD2=1Dresid/DnullR^2_D = 1 - D_\text{resid}/D_\text{null} (fraction of deviance explained)

AIC for GLM Model Selection

AIC=2(θ^)+2k\text{AIC} = -2\ell(\hat{\boldsymbol{\theta}}) + 2k

where kk = number of estimated parameters. Lower AIC = better model. Use AIC to:

  • Compare models with different numbers of features (AIC penalizes complexity)
  • Compare different GLM families (Poisson vs Negative Binomial vs Gaussian)
  • Compare different link functions (log vs identity for Poisson)
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf


def glm_model_selection(df: pd.DataFrame, y_col: str, x_cols: list):
"""
Compare GLM families on the same dataset using AIC, deviance, and pseudo-R².
"""
formula = f"{y_col} ~ {' + '.join(x_cols)}"

families = {
'Poisson (log)': sm.families.Poisson(sm.families.links.Log()),
'NegBinomial (log)': sm.families.NegativeBinomial(sm.families.links.Log()),
'Gaussian (identity)': sm.families.Gaussian(sm.families.links.identity()),
'Gaussian (log)': sm.families.Gaussian(sm.families.links.Log()),
}

results = []
for name, family in families.items():
try:
fit = smf.glm(formula, data=df, family=family).fit(disp=0)
pseudo_r2 = 1 - fit.deviance / fit.null_deviance
results.append({
'Model': name,
'AIC': fit.aic,
'Null Dev': fit.null_deviance,
'Resid Dev': fit.deviance,
'Pseudo R²': pseudo_r2,
'df_resid': fit.df_resid,
})
except Exception as e:
results.append({'Model': name, 'AIC': np.inf,
'Null Dev': np.nan, 'Resid Dev': np.nan,
'Pseudo R²': np.nan, 'df_resid': np.nan})

results_df = pd.DataFrame(results).sort_values('AIC')
print(f"\nGLM Model Comparison - Response: {y_col}")
print(results_df.to_string(index=False, float_format='{:.2f}'.format))
return results_df


# Simulate and compare on Poisson count data
np.random.seed(42)
n_demo = 400
X_demo = np.random.randn(n_demo, 3)
y_demo = np.random.poisson(np.exp(1.5 + X_demo @ [0.4, -0.3, 0.5]))
df_demo = pd.DataFrame({
'y': y_demo, 'x1': X_demo[:, 0], 'x2': X_demo[:, 1], 'x3': X_demo[:, 2]
})

glm_model_selection(df_demo, 'y', ['x1', 'x2', 'x3'])

Production Notes

:::tip Offset terms for varying exposure When count observations have different exposure times (e.g., incidents per machine-hour, calls per employee), add an offset term: log(mu_i) = offset_i + w'x_i. In statsmodels: smf.glm(formula, ..., offset=np.log(exposure)). This ensures your model predicts rates, not raw counts, which is almost always the right quantity. :::

:::warning Check deviance/df before reporting A Poisson GLM reporting residual deviance/df = 8.3 is severely overdispersed. Coefficients are still consistent, but standard errors are too small (anti-conservative). Always report Pearson chi2/df from the fitted model. If > 2, switch to Negative Binomial or use quasi-Poisson (which rescales SEs by sqrt(dispersion) without changing point estimates). :::

:::danger Do not exponentiate Gamma GLM coefficients for interpretation For Poisson regression with log link, ewje^{w_j} is the multiplicative rate ratio - a correct and meaningful interpretation. For Gamma regression with log link, ewje^{w_j} is also the multiplicative effect on the expected value. But for Gamma regression with inverse link (canonical), wjw_j represents the change in 1/μ1/\mu, which is rarely intuitive. Always specify which link function you used when reporting Gamma GLM coefficients. :::

:::warning Beta regression for proportions is not a standard GLM Beta regression (for responses strictly in (0,1)(0,1)) is often discussed alongside GLMs but is not a standard exponential family GLM because the Beta distribution's parameterization by (μ,ϕ)(\mu, \phi) does not fit the canonical form. It requires a separate fitting procedure. Use betareg in R or statsmodels.othermod.BetaModel in Python. :::


YouTube Resources

TitleChannelWhy Watch
Generalized Linear Models - Complete SeriesStatQuestVisual, intuitive GLM intro covering Poisson and logistic
GLMs in Python with statsmodelsData SchoolPractical coding walkthrough
IRLS Algorithm ExplainedMathematical MonkRigorous derivation of IRLS from Newton-Raphson
Count Data Regression: Poisson and Negative BinomialritvikmathOverdispersion explained clearly
Gamma Regression for Skewed DataKevin P. Murphy lecturesWhen and why to use Gamma family

Interview Q&A

Q1: What are the three components of a GLM and what does each one do?

(1) Random component: specifies the probability distribution for the response yy. Must be a member of the exponential family (Gaussian, Bernoulli, Poisson, Gamma, etc.). This choice determines the support of yy (what values are possible) and the variance function V(μ)V(\mu) (how variance scales with the mean). (2) Systematic component: the linear predictor η=wx\eta = \mathbf{w}^\top\mathbf{x} - always linear in the parameters. This is the same structure as OLS; GLMs keep linearity in parameter space while allowing non-linear mean-response relationships. (3) Link function gg: connects the mean μ=E[yx]\mu = E[y|\mathbf{x}] to the linear predictor via g(μ)=ηg(\mu) = \eta. The link ensures the mean stays in the valid range for the distribution - log link for positive means, logit for probabilities, identity for unrestricted means. The canonical link (log for Poisson, logit for Bernoulli, identity for Gaussian) makes the score equations simplest.

Q2: Why use Poisson regression instead of linear regression for count data?

Three reasons. First, support: Poisson regression uses the log link μi=ewxi\mu_i = e^{\mathbf{w}^\top\mathbf{x}_i}, guaranteeing μi>0\mu_i > 0 always. Linear regression can predict negative counts, which are impossible. Second, variance structure: Poisson assumes Var(yi)=μi\text{Var}(y_i) = \mu_i - variance grows with the mean. This reflects empirical reality for counts. Linear regression assumes constant variance (homoscedasticity), which leads to inefficient and miscalibrated predictions for count data. Third, coefficient interpretation: Poisson coefficients have a multiplicative interpretation (ewje^{w_j} = rate ratio), which is natural for rates and counts. Linear coefficients are additive, which can imply the same absolute increase regardless of baseline count - often unrealistic.

Q3: What is overdispersion, how do you detect it, and how do you handle it?

Overdispersion means the observed variance in count data exceeds the mean - violating Poisson's equidispersion assumption (Var(y)=μ\text{Var}(y) = \mu). Detection: (1) Raw ratio Var(y)/yˉ\text{Var}(y)/\bar{y}; if much greater than 1, suspect overdispersion. (2) Pearson chi-squared statistic divided by residual degrees of freedom from the fitted Poisson: if χP2/df1\chi^2_P/df \gg 1, the model is overdispersed. (3) Cameron-Trivedi test: auxiliary regression of (yiμ^i)2yi(y_i - \hat\mu_i)^2 - y_i on μ^i2\hat\mu_i^2. Consequences: Poisson standard errors are too small (anti-conservative), leading to falsely significant coefficients. Point estimates are still consistent under mild overdispersion. Handling: (1) Negative Binomial GLM - fits α\alpha explicitly, correct variance function V(μ)=μ+αμ2V(\mu) = \mu + \alpha\mu^2; (2) Quasi-Poisson - keeps Poisson point estimates but scales SEs by χP2/df\sqrt{\chi^2_P/df}; (3) Add omitted features to absorb heterogeneity.

Q4: Derive the IRLS update equation. Why is it quadratically convergent?

Start from Newton-Raphson: w(t+1)=w(t)(2)1\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - (\nabla^2\ell)^{-1}\nabla\ell. For GLMs, the Hessian is 2=XWX\nabla^2\ell = -\mathbf{X}^\top\mathbf{W}\mathbf{X} with diagonal weight matrix Wii=1/[V(μi)g(μi)2]W_{ii} = 1/[V(\mu_i) g'(\mu_i)^2], and gradient =XD(yμ)\nabla\ell = \mathbf{X}^\top\mathbf{D}(\mathbf{y} - \boldsymbol{\mu}). Substituting: w(t+1)=(XWX)1XWz\mathbf{w}^{(t+1)} = (\mathbf{X}^\top\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{W}\mathbf{z} where zi=ηi+(yiμi)g(μi)z_i = \eta_i + (y_i - \mu_i)g'(\mu_i) is the working response. This is a weighted least squares problem with weights WW and response z\mathbf{z}, solved freshly at each iteration. Quadratic convergence: Newton-Raphson converges quadratically because it fits a local quadratic approximation (second-order Taylor expansion) of the objective at each step. Near the MLE, the approximation is very accurate and each step approximately squares the error.

Q5: When would you use a Gamma GLM vs log-transforming the target and using OLS?

Both handle positive right-skewed continuous responses, but they make different assumptions and optimize different objectives. Log-OLS (OLS on logy\log y): assumes log(yi)N(ηi,σ2)\log(y_i) \sim \mathcal{N}(\eta_i, \sigma^2), i.e., yiy_i is lognormal. The fitted model minimizes squared errors in log-space. Back-transformation eη^e^{\hat\eta} estimates the median, not the mean (E[y]=eμ+σ2/2E[y] = e^{\mu + \sigma^2/2} for lognormal - a correction term is needed). Gamma GLM with log link: assumes yiGamma(μi,ϕ)y_i \sim \text{Gamma}(\mu_i, \phi), with Var(yi)=ϕμi2\text{Var}(y_i) = \phi\mu_i^2. The model directly minimizes the Gamma deviance and directly estimates the mean E[yi]E[y_i] - no back-transformation bias. Prefer Gamma GLM when: you want unbiased predictions of the mean; the coefficient-of-variation is roughly constant across observations; your loss function should be scale-invariant (proportional errors, not absolute); you need calibrated standard errors. Prefer log-OLS when: you genuinely believe the lognormal generative model; you need a simpler pipeline; you only care about relative rankings (median predictions), not absolute means.

Q6: How would you select among Poisson, Negative Binomial, and Gaussian families for the same count response?

Use a three-step process. Step 1 - domain check: counts are never negative, so Gaussian (which can predict negatives) is suspect unless counts are large and nearly symmetric. Step 2 - fit candidate models and compare AIC: lower AIC wins. AIC = 2+2k-2\ell + 2k, penalizing both poor fit and excess parameters. Negative Binomial has one extra parameter (α\alpha) over Poisson, so it needs to improve 2-2\ell by at least 2 to win on AIC. Step 3 - residual diagnostics: (a) Pearson chi2/df should be near 1 for the winning model; (b) plot Pearson residuals vs fitted values - should show no trend; (c) QQ plot of quantile residuals (randomized for discrete distributions). If Poisson shows Pearson chi2/df ≈ 1 and Negative Binomial AIC is within 2, use Poisson for interpretability. If there is strong overdispersion, Negative Binomial is clearly better and you should report it. Never use Gaussian for count data with substantial mass at zero - the probability of negative predictions can be high.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Logistic Regression Decision Boundary demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.