Generalized Linear Models
The Real Interview Moment
The senior ML engineer leans back in her chair and asks: "Our revenue team predicted churn using linear regression. It worked okay, but predicted probabilities sometimes go above 1. They used logistic regression instead and that fixed it. Now the demand team predicts hourly ride counts and they get negative predictions. Logistic regression won't work here either. What framework unifies both problems?"
You answer: Generalized Linear Models.
The interviewer nods and follows up: "Great. Now Poisson regression is showing that our model's residual variance is nine times the mean. What does that tell you, and what do you do?"
This is a working knowledge question. The answer involves overdispersion, its causes, its consequences for inference, and the fix. Getting this right separates engineers who understand statistical machinery from those who just call .fit() and check accuracy.
Linear regression is a special case of a much larger family. The GLM framework - developed by Nelder and Wedderburn in 1972 - gives you a principled way to handle any response type: binary outcomes, counts, strictly positive measurements, proportions, or arbitrary continuous data. The key ingredients are always the same: choose the right distribution for your response, choose the right link function to connect your linear predictor to the response mean, and the IRLS algorithm handles the rest. Every model you already know - OLS, logistic regression, Poisson regression - is a GLM. Once you understand the framework, extending it to new response types costs nothing beyond identifying the correct exponential family member.
This lesson builds the GLM framework from first principles. It derives the exponential family form, explains the three GLM components, implements Poisson and Gamma regression from scratch and with statsmodels, covers overdispersion diagnosis and remediation, walks through the IRLS algorithm, and ends with a complete model-selection workflow using deviance and AIC.
Why This Exists - The Failure of Linear Regression
Ordinary linear regression assumes:
This works well when the response is continuous and unbounded. It fails systematically when:
- Binary response (): predictions can fall outside ; Gaussian residuals make no sense for binary data
- Count response (): predictions can be negative; variance grows with the mean (not constant)
- Positive continuous response (): predictions can be negative; right-skewed distributions are common (costs, durations, sizes)
- Proportional response (): predictions can exit the unit interval
The ad hoc fix is transforming the target (log-transform for positive data, logit-transform for proportions). But transforming the target changes the noise structure: if , then predictions of are unbiased, but back-transformed predictions are biased for (by Jensen's inequality). GLMs avoid this by modeling the mean directly through a link function, without transforming the response.
Historical Context
| Year | Event |
|---|---|
| 1972 | Nelder & Wedderburn publish "Generalized Linear Models" in JRSS-A |
| 1972 | IRLS algorithm formalized for GLM fitting |
| 1983 | McCullagh & Nelder write the canonical GLM textbook |
| 1984 | R's GLM function becomes the reference implementation |
| 1986 | Logistic and Poisson regression widely adopted in epidemiology and biostats |
| 1995 | Negative Binomial extended to GLM framework for overdispersed counts |
| 2000s | Statsmodels (Python) implements complete GLM suite |
| 2010s | GLMs embedded in tree models (LightGBM, XGBoost support Poisson/Gamma objectives) |
Nelder's central insight: all common regression models share the same estimation algorithm (IRLS) when expressed in exponential family form. Rather than deriving a custom algorithm for each distribution, one algorithm handles all of them.
The Exponential Family
Many probability distributions belong to the exponential family, which can be written in the form:
where:
- is the natural (canonical) parameter
- is the dispersion parameter (controls spread)
- is the cumulant function (log-normalizer, ensures valid distribution)
- is a dispersion scaling factor (usually for observation weight )
- is a normalizing function of the data
Two properties link the cumulant function to the distribution moments:
where is the variance function - it describes how variance changes with the mean.
Exponential Family Members
Gaussian (Linear Regression):
Natural parameter: , Cumulant: , Variance function: (constant)
Bernoulli (Logistic Regression):
Natural parameter: (log-odds), Cumulant: , Variance function:
Poisson (Count Regression):
Natural parameter: , Cumulant: , Variance function: (equidispersion: mean = variance)
Gamma (Positive Continuous):
Natural parameter: , Cumulant: , Variance function: (coefficient of variation is constant)
The Three GLM Components
A Generalized Linear Model consists of three choices:
Component 1: Random Component (Distribution)
The response follows a distribution from the exponential family. This choice determines:
- The support of (what values are possible)
- The variance function (how variance scales with the mean)
- The appropriate range for predictions
Component 2: Systematic Component (Linear Predictor)
The linear predictor is always:
GLMs remain linear in parameters - the systematic component is the same as OLS. This is what makes them tractable and interpretable.
Component 3: Link Function
The link function connects the mean to the linear predictor:
Equivalently:
The canonical link is the link that makes (natural parameter = linear predictor). Using canonical links gives the cleanest score equations and simplest IRLS weights.
| Distribution | Support | Canonical Link | Inverse | Model name | |
|---|---|---|---|---|---|
| Gaussian | Identity: | 1 | Linear regression | ||
| Bernoulli | Logit: | Logistic regression | |||
| Poisson | Log: | Poisson regression | |||
| Gamma | Inverse: | Gamma regression | |||
| Neg. Binomial | Log: | NB regression |
The GLM Gradient: A Unified Formula
A crucial property of exponential families: the gradient of the log-likelihood has the same structure for all GLMs:
where is a diagonal matrix of weights. The gradient is always a weighted sum of residuals times features - the same structural form as OLS gradient, generalized.
For canonical links, this simplifies to:
For Poisson (log link, canonical): where .
IRLS: The GLM Fitting Algorithm
GLMs are not fit by standard gradient descent. Instead, Newton-Raphson applied to the log-likelihood leads to Iteratively Reweighted Least Squares (IRLS).
Derivation
The Newton-Raphson update for maximizing :
The Hessian of the GLM log-likelihood equals:
where with working weights:
The Newton step becomes:
where is the adjusted dependent variable (working response):
This is exactly a weighted least squares problem with response and weights - solved at each iteration, hence "iteratively reweighted least squares."
IRLS vs Gradient Descent
| Property | IRLS | Gradient Descent |
|---|---|---|
| Convergence | Quadratic (Newton-Raphson) | Linear |
| Iterations to convergence | ~5–10 | Hundreds to thousands |
| Per-iteration cost | (matrix solve) | |
| Memory | for Gram matrix | |
| Best for | Small-medium , well-conditioned | Very large , high |
| Used by | statsmodels, R glm(), sklearn GLM | Online/streaming settings |
For small-to-medium datasets (n ≤ 100k, d ≤ 10k), IRLS is dramatically faster due to quadratic convergence. For large-scale problems, gradient descent variants (Adam, Adagrad) are preferred.
import numpy as np
from scipy.special import expit # sigmoid
class PoissonGLMFromScratch:
"""
Poisson GLM with log link, fit by IRLS.
Model: log(mu_i) = w'x_i => mu_i = exp(w'x_i)
Variance function: V(mu) = mu
Link derivative: g'(mu) = 1/mu (log link)
Working weight: w_i = 1 / (V(mu_i) * g'(mu_i)^2) = mu_i
Working response: z_i = eta_i + (y_i - mu_i) / mu_i
"""
def __init__(self, max_iter: int = 25, tol: float = 1e-8):
self.max_iter = max_iter
self.tol = tol
self.w_ = None
self.ll_history_ = []
def _log_likelihood(self, y: np.ndarray, mu: np.ndarray) -> float:
# Poisson NLL: sum[ -mu + y*log(mu) ] (ignore log(y!) constant)
return np.sum(-mu + y * np.log(np.clip(mu, 1e-12, None)))
def fit(self, X: np.ndarray, y: np.ndarray, verbose: bool = True):
n, d = X.shape
Xa = np.column_stack([np.ones(n), X]) # add intercept
# Initialization: start from eta = log(y_bar)
eta = np.full(n, np.log(np.mean(y) + 1e-6))
mu = np.exp(eta)
w = np.linalg.lstsq(Xa, eta, rcond=None)[0]
for iteration in range(self.max_iter):
# --- Compute IRLS components ---
# Working weight for Poisson + log link: W_i = mu_i
W = mu # shape (n,)
# Working response: z_i = eta_i + (y_i - mu_i) * g'(mu_i)
# g'(mu) = d(log mu)/dmu = 1/mu
z = eta + (y - mu) / mu # shape (n,)
# --- Weighted least squares step ---
# Solve: (Xa' W Xa) w = Xa' W z
Xw = Xa * W[:, np.newaxis] # weighted X
A = Xw.T @ Xa # (d+1, d+1)
b = Xw.T @ z # (d+1,)
w_new = np.linalg.solve(A + 1e-10 * np.eye(d + 1), b)
# --- Update eta and mu ---
eta_new = Xa @ w_new
mu_new = np.exp(eta_new) # inverse log link
# --- Check convergence ---
ll = self._log_likelihood(y, mu_new)
self.ll_history_.append(ll)
delta_w = np.max(np.abs(w_new - w))
if verbose and (iteration % 5 == 0 or iteration < 5):
print(f" Iter {iteration+1:3d}: log-lik = {ll:.4f}, "
f"max|Δw| = {delta_w:.2e}")
w, eta, mu = w_new, eta_new, mu_new
if delta_w < self.tol:
if verbose:
print(f" Converged at iteration {iteration+1}")
break
self.w_ = w
self.w_coef_ = w[1:]
self.intercept_ = w[0]
return self
def predict(self, X: np.ndarray) -> np.ndarray:
n = X.shape[0]
Xa = np.column_stack([np.ones(n), X])
return np.exp(Xa @ self.w_)
def deviance(self, X: np.ndarray, y: np.ndarray) -> float:
mu = self.predict(X)
# Poisson deviance: 2 * sum[ y*log(y/mu) - (y - mu) ]
safe = np.clip(mu, 1e-10, None)
yc = np.clip(y, 1e-10, None)
return 2 * np.sum(y * np.log(yc / safe) - (y - safe))
# Demo
np.random.seed(42)
n = 400
X_ = np.random.randn(n, 3)
w_true = np.array([0.5, -0.3, 0.7])
eta_true = 2.0 + X_ @ w_true
y_pois = np.random.poisson(np.exp(eta_true))
print("IRLS Fitting Poisson GLM from scratch:")
poisson_glm = PoissonGLMFromScratch(max_iter=30, tol=1e-8)
poisson_glm.fit(X_, y_pois)
print(f"\nTrue intercept: 2.0, Estimated: {poisson_glm.intercept_:.4f}")
print(f"True w: {w_true}, Estimated: {poisson_glm.w_coef_}")
print(f"Residual Deviance: {poisson_glm.deviance(X_, y_pois):.3f}")
Poisson Regression: Count Data
The Model
For count response :
The log link ensures always. The loss function (Poisson negative log-likelihood):
Coefficient interpretation: A one-unit increase in feature multiplies the expected count by (multiplicative, not additive). This is the key difference from linear regression.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
def poisson_regression_demo():
"""
Fit Poisson GLM to simulated ride demand data.
Demonstrate coefficient interpretation, overdispersion check.
"""
np.random.seed(42)
n = 600
# Features
hour = np.random.randint(0, 24, n)
temp_c = np.random.normal(18, 7, n)
is_weekend = np.random.binomial(1, 0.3, n)
rain = np.random.binomial(1, 0.25, n)
# True log-linear model
hour_effect = 0.8 * np.sin(2 * np.pi * hour / 24)
log_mu = (2.8
+ 0.025 * temp_c
+ 0.3 * is_weekend
- 0.6 * rain
+ hour_effect)
y_rides = np.random.poisson(np.exp(log_mu))
df = pd.DataFrame({
'rides': y_rides,
'hour': hour,
'temp_c': temp_c,
'is_weekend': is_weekend,
'rain': rain,
'hour_sin': np.sin(2 * np.pi * hour / 24),
'hour_cos': np.cos(2 * np.pi * hour / 24),
})
# Fit Poisson GLM
formula = 'rides ~ temp_c + is_weekend + rain + hour_sin + hour_cos'
model = smf.glm(formula, data=df,
family=sm.families.Poisson()).fit()
print("Poisson GLM Summary:")
print(model.summary())
# Coefficient interpretation
print("\nMultiplicative Effects (exp(coef)):")
for name, coef in model.params.items():
if name == 'Intercept':
continue
effect = np.exp(coef)
pct = (effect - 1) * 100
sign = "+" if pct >= 0 else ""
print(f" {name:15s}: e^{coef:6.3f} = {effect:.3f} "
f"({sign}{pct:.1f}% change in expected rides per unit)")
# Overdispersion test
mean_y = df['rides'].mean()
var_y = df['rides'].var()
print(f"\nOverdispersion Check:")
print(f" Mean of y: {mean_y:.3f}")
print(f" Variance of y: {var_y:.3f}")
print(f" Variance/Mean: {var_y/mean_y:.3f}")
# Pearson chi-squared statistic / df
mu_hat = model.predict()
pearson_chi2 = np.sum((df['rides'] - mu_hat)**2 / mu_hat)
df_resid = model.df_resid
print(f" Pearson chi2 / df: {pearson_chi2/df_resid:.3f}"
f" (expect ≈1 if well-specified)")
if pearson_chi2 / df_resid > 2:
print(" WARNING: Possible overdispersion. Consider Negative Binomial.")
return df, model
df_rides, poisson_model = poisson_regression_demo()
Overdispersion: When Poisson Fails
The Poisson distribution assumes equidispersion: . Real count data almost always exhibits overdispersion - variance exceeding the mean. Common causes:
- Unobserved heterogeneity: Some individuals have systematically higher rates, not captured by features
- Contagion / clustering: Events cluster (positive correlation within groups)
- Zero inflation: Excess zeros from a mixture of structural zeros and Poisson process
- Measurement aggregation: Counts measured over varying time periods
Detection
Three diagnostics for overdispersion in fitted Poisson models:
- Raw variance/mean ratio: If observed , likely overdispersed
- Pearson chi-squared: . Under Poisson, . Ratio is concerning.
- Cameron-Trivedi test: Regress on ; significant positive slope indicates overdispersion
The Negative Binomial Model
The Negative Binomial adds a per-observation random effect (Gamma mixing):
Marginalizing over gives:
with variance function:
The dispersion parameter : when , NegBin → Poisson. The quadratic variance function accommodates arbitrarily large overdispersion by estimating from data.
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
def overdispersion_demo():
"""
Generate overdispersed count data, compare Poisson vs Negative Binomial.
"""
np.random.seed(42)
n = 500
X_ = np.random.randn(n, 2)
w_true = np.array([0.4, -0.3])
# Overdispersed: gamma mixing with shape=2 (alpha=0.5)
alpha_true = 0.5
mu_true = np.exp(1.5 + X_ @ w_true)
# Generate lambda from Gamma, then counts from Poisson(lambda)
lam = np.random.gamma(shape=1/alpha_true, scale=mu_true * alpha_true)
y = np.random.poisson(lam)
df_od = pd.DataFrame({'y': y, 'x1': X_[:, 0], 'x2': X_[:, 1]})
# Dispersion check
print(f"Overdispersion Diagnosis:")
print(f" Mean y: {y.mean():.2f}, Var y: {y.var():.2f}")
print(f" Raw Var/Mean ratio: {y.var()/y.mean():.2f}")
# Fit Poisson
pois_fit = smf.glm('y ~ x1 + x2', data=df_od,
family=sm.families.Poisson()).fit(disp=0)
mu_pois = pois_fit.predict()
pearson = np.sum((y - mu_pois)**2 / mu_pois)
print(f"\nPoisson GLM:")
print(f" AIC: {pois_fit.aic:.2f}")
print(f" Residual Deviance: {pois_fit.deviance:.2f}")
print(f" Pearson chi2 / df: {pearson/pois_fit.df_resid:.3f}")
# Fit Negative Binomial
nb_fit = smf.glm('y ~ x1 + x2', data=df_od,
family=sm.families.NegativeBinomial()).fit(disp=0)
mu_nb = nb_fit.predict()
pearson_nb = np.sum((y - mu_nb)**2 / mu_nb)
print(f"\nNegative Binomial GLM:")
print(f" AIC: {nb_fit.aic:.2f}")
print(f" Residual Deviance: {nb_fit.deviance:.2f}")
print(f" Pearson chi2 / df: {pearson_nb/nb_fit.df_resid:.3f}")
print(f" (Lower AIC + Pearson ≈ 1 confirms NegBin is the right model)")
# Quasi-Poisson as alternative (inflate SEs, don't change point estimates)
# In statsmodels, use Poisson with scale='X2' (Pearson chi2 dispersion)
# quasi_fit = smf.glm('y ~ x1 + x2', data=df_od,
# family=sm.families.Poisson()).fit(scale='X2')
# This gives same coefficients as Poisson but wider confidence intervals
return pois_fit, nb_fit
pois_fit, nb_fit = overdispersion_demo()
Gamma Regression: Positive Continuous Data
The Gamma distribution is appropriate for strictly positive, right-skewed continuous responses such as insurance claim amounts, trip durations, costs, or survival times.
Gamma GLM with log link (most common):
Variance function: - variance grows quadratically with the mean. This means the coefficient of variation is constant across observations - a property of many financial and engineering quantities.
The Gamma NLL (loss function) with log link:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
def gamma_regression_demo():
"""
Fit Gamma GLM to simulated insurance claim amounts.
Compare vs OLS on raw vs log-transformed target.
"""
np.random.seed(42)
n = 500
# Features: driver age, vehicle value, risk score
age = np.random.uniform(18, 75, n)
veh_value = np.random.exponential(20000, n) # in $
risk_score = np.random.uniform(0, 10, n)
# True Gamma model: log(E[claim]) = linear predictor
log_mu = (7.0
- 0.01 * age
+ 0.000005 * veh_value
+ 0.15 * risk_score)
mu_true = np.exp(log_mu)
# Gamma distributed claims: shape k=4 (phi=1/k=0.25)
k = 4
y_claims = np.random.gamma(k, mu_true / k) # shape=k, scale=mu/k
df_ins = pd.DataFrame({
'claim_amount': y_claims,
'age': age,
'veh_value': veh_value,
'risk_score': risk_score,
})
print(f"Claims summary:")
print(f" Mean: ${y_claims.mean():,.0f}")
print(f" Std: ${y_claims.std():,.0f}")
print(f" CV: {y_claims.std()/y_claims.mean():.3f} (true: {1/np.sqrt(k):.3f})")
# 1. Gamma GLM (correct model)
gamma_model = smf.glm(
'claim_amount ~ age + veh_value + risk_score',
data=df_ins,
family=sm.families.Gamma(link=sm.families.links.Log())
).fit()
# 2. OLS on raw target (wrong model)
X_ins = df_ins[['age', 'veh_value', 'risk_score']].values
ols_raw = LinearRegression().fit(X_ins, y_claims)
# 3. OLS on log-transformed target
ols_log = LinearRegression().fit(X_ins, np.log(y_claims))
def rmse(y, yhat): return np.sqrt(np.mean((y - yhat)**2))
def mae(y, yhat): return np.mean(np.abs(y - yhat))
pred_gamma = gamma_model.predict()
pred_ols = ols_raw.predict(X_ins)
pred_log = np.exp(ols_log.predict(X_ins)) # back-transform (biased)
print(f"\nModel Comparison:")
print(f"{'Model':30s} {'RMSE':10s} {'MAE':10s} {'AIC':10s}")
print("-" * 65)
print(f"{'Gamma GLM (log link)':30s} {rmse(y_claims, pred_gamma):10.2f} "
f"{mae(y_claims, pred_gamma):10.2f} {gamma_model.aic:10.2f}")
print(f"{'OLS (raw target)':30s} {rmse(y_claims, pred_ols):10.2f} "
f"{mae(y_claims, pred_ols):10.2f} {'N/A':10s}")
print(f"{'OLS (log-y, back-transformed)':30s} {rmse(y_claims, pred_log):10.2f} "
f"{mae(y_claims, pred_log):10.2f} {'N/A':10s}")
print("\nGamma GLM Coefficients:")
print(gamma_model.summary().tables[1])
# OLS can predict negative claims; Gamma always positive
negative_preds = np.sum(pred_ols < 0)
print(f"\nOLS negative predictions: {negative_preds} (impossible for claims)")
print(f"Gamma negative predictions: {np.sum(pred_gamma < 0)} (impossible by construction)")
return gamma_model
gamma_model = gamma_regression_demo()
Probit vs Logit for Binary Classification
Both logit and probit model but use different link functions derived from different latent variable models.
Latent variable interpretation: suppose there is an unobserved continuous propensity , and we observe .
- If : → logit model
- If : → probit model
The Logistic and Normal CDFs are very similar in the central region but differ in the tails - Logistic has heavier tails. In practice, predictions are nearly identical.
| Property | Logit | Probit |
|---|---|---|
| Link function | ||
| Inverse link | ||
| Coefficient scale | Larger (~1.7× probit) | Smaller |
| Interpretation | Log-odds ratio | Latent z-score |
| Tail behavior | Heavier tails | Lighter tails |
| Tradition | Machine learning | Econometrics |
| Computational cost | Cheap (exp) | Moderate (erf/erfc) |
The coefficient ratio because the logistic distribution has standard deviation while the standard normal has standard deviation 1.
import numpy as np
import statsmodels.api as sm
from scipy import stats
def probit_vs_logit():
"""
Fit probit and logit to the same data; compare coefficients and predictions.
"""
np.random.seed(42)
n = 800
# True probit model
X_ = np.random.randn(n, 3)
w_true = np.array([0.5, -0.4, 0.8])
latent = X_ @ w_true # N(0,1) errors implied by probit
p_true = stats.norm.cdf(latent)
y = np.random.binomial(1, p_true)
Xa = sm.add_constant(X_)
# Fit probit
probit_fit = sm.Probit(y, Xa).fit(disp=0)
# Fit logit
logit_fit = sm.Logit(y, Xa).fit(disp=0)
print("Probit vs Logit Coefficient Comparison:")
print(f"{'Param':10s} {'True (probit)':14s} {'Probit est':12s} {'Logit est':12s} {'Logit/Probit':12s}")
print("-" * 65)
names = ['intercept', 'x1', 'x2', 'x3']
true = [0.0] + list(w_true)
for i, name in enumerate(names):
pc = probit_fit.params[i]
lc = logit_fit.params[i]
r = lc / pc if abs(pc) > 0.01 else float('nan')
print(f"{name:10s} {true[i]:14.3f} {pc:12.4f} {lc:12.4f} {r:12.3f}")
print(f"\nTheoretical ratio π/√3 ≈ {np.pi/np.sqrt(3):.3f}")
# Compare predicted probabilities
p_probit = probit_fit.predict()
p_logit = logit_fit.predict()
max_diff = np.max(np.abs(p_probit - p_logit))
print(f"\nMax predicted probability difference: {max_diff:.4f}")
print("Probit and logit give nearly identical predictions in practice.")
return probit_fit, logit_fit
probit_fit, logit_fit = probit_vs_logit()
Deviance and Model Comparison
For GLMs, the equivalent of RSS and R² is deviance - a likelihood-ratio-based measure.
Deviance Definition
The deviance of a fitted GLM is:
The saturated model fits each observation perfectly () and represents the maximum achievable log-likelihood. Deviance measures how far your model falls short.
Poisson deviance:
Gamma deviance:
Bernoulli deviance (= 2 × cross-entropy):
Null vs Residual Deviance
- Null deviance: deviance of intercept-only model ( for all ) - baseline performance
- Residual deviance: deviance of your fitted model - performance with features
Pseudo-R²: (fraction of deviance explained)
AIC for GLM Model Selection
where = number of estimated parameters. Lower AIC = better model. Use AIC to:
- Compare models with different numbers of features (AIC penalizes complexity)
- Compare different GLM families (Poisson vs Negative Binomial vs Gaussian)
- Compare different link functions (log vs identity for Poisson)
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
def glm_model_selection(df: pd.DataFrame, y_col: str, x_cols: list):
"""
Compare GLM families on the same dataset using AIC, deviance, and pseudo-R².
"""
formula = f"{y_col} ~ {' + '.join(x_cols)}"
families = {
'Poisson (log)': sm.families.Poisson(sm.families.links.Log()),
'NegBinomial (log)': sm.families.NegativeBinomial(sm.families.links.Log()),
'Gaussian (identity)': sm.families.Gaussian(sm.families.links.identity()),
'Gaussian (log)': sm.families.Gaussian(sm.families.links.Log()),
}
results = []
for name, family in families.items():
try:
fit = smf.glm(formula, data=df, family=family).fit(disp=0)
pseudo_r2 = 1 - fit.deviance / fit.null_deviance
results.append({
'Model': name,
'AIC': fit.aic,
'Null Dev': fit.null_deviance,
'Resid Dev': fit.deviance,
'Pseudo R²': pseudo_r2,
'df_resid': fit.df_resid,
})
except Exception as e:
results.append({'Model': name, 'AIC': np.inf,
'Null Dev': np.nan, 'Resid Dev': np.nan,
'Pseudo R²': np.nan, 'df_resid': np.nan})
results_df = pd.DataFrame(results).sort_values('AIC')
print(f"\nGLM Model Comparison - Response: {y_col}")
print(results_df.to_string(index=False, float_format='{:.2f}'.format))
return results_df
# Simulate and compare on Poisson count data
np.random.seed(42)
n_demo = 400
X_demo = np.random.randn(n_demo, 3)
y_demo = np.random.poisson(np.exp(1.5 + X_demo @ [0.4, -0.3, 0.5]))
df_demo = pd.DataFrame({
'y': y_demo, 'x1': X_demo[:, 0], 'x2': X_demo[:, 1], 'x3': X_demo[:, 2]
})
glm_model_selection(df_demo, 'y', ['x1', 'x2', 'x3'])
Production Notes
:::tip Offset terms for varying exposure
When count observations have different exposure times (e.g., incidents per machine-hour, calls per employee), add an offset term: log(mu_i) = offset_i + w'x_i. In statsmodels: smf.glm(formula, ..., offset=np.log(exposure)). This ensures your model predicts rates, not raw counts, which is almost always the right quantity.
:::
:::warning Check deviance/df before reporting A Poisson GLM reporting residual deviance/df = 8.3 is severely overdispersed. Coefficients are still consistent, but standard errors are too small (anti-conservative). Always report Pearson chi2/df from the fitted model. If > 2, switch to Negative Binomial or use quasi-Poisson (which rescales SEs by sqrt(dispersion) without changing point estimates). :::
:::danger Do not exponentiate Gamma GLM coefficients for interpretation For Poisson regression with log link, is the multiplicative rate ratio - a correct and meaningful interpretation. For Gamma regression with log link, is also the multiplicative effect on the expected value. But for Gamma regression with inverse link (canonical), represents the change in , which is rarely intuitive. Always specify which link function you used when reporting Gamma GLM coefficients. :::
:::warning Beta regression for proportions is not a standard GLM
Beta regression (for responses strictly in ) is often discussed alongside GLMs but is not a standard exponential family GLM because the Beta distribution's parameterization by does not fit the canonical form. It requires a separate fitting procedure. Use betareg in R or statsmodels.othermod.BetaModel in Python.
:::
YouTube Resources
| Title | Channel | Why Watch |
|---|---|---|
| Generalized Linear Models - Complete Series | StatQuest | Visual, intuitive GLM intro covering Poisson and logistic |
| GLMs in Python with statsmodels | Data School | Practical coding walkthrough |
| IRLS Algorithm Explained | Mathematical Monk | Rigorous derivation of IRLS from Newton-Raphson |
| Count Data Regression: Poisson and Negative Binomial | ritvikmath | Overdispersion explained clearly |
| Gamma Regression for Skewed Data | Kevin P. Murphy lectures | When and why to use Gamma family |
Interview Q&A
Q1: What are the three components of a GLM and what does each one do?
(1) Random component: specifies the probability distribution for the response . Must be a member of the exponential family (Gaussian, Bernoulli, Poisson, Gamma, etc.). This choice determines the support of (what values are possible) and the variance function (how variance scales with the mean). (2) Systematic component: the linear predictor - always linear in the parameters. This is the same structure as OLS; GLMs keep linearity in parameter space while allowing non-linear mean-response relationships. (3) Link function : connects the mean to the linear predictor via . The link ensures the mean stays in the valid range for the distribution - log link for positive means, logit for probabilities, identity for unrestricted means. The canonical link (log for Poisson, logit for Bernoulli, identity for Gaussian) makes the score equations simplest.
Q2: Why use Poisson regression instead of linear regression for count data?
Three reasons. First, support: Poisson regression uses the log link , guaranteeing always. Linear regression can predict negative counts, which are impossible. Second, variance structure: Poisson assumes - variance grows with the mean. This reflects empirical reality for counts. Linear regression assumes constant variance (homoscedasticity), which leads to inefficient and miscalibrated predictions for count data. Third, coefficient interpretation: Poisson coefficients have a multiplicative interpretation ( = rate ratio), which is natural for rates and counts. Linear coefficients are additive, which can imply the same absolute increase regardless of baseline count - often unrealistic.
Q3: What is overdispersion, how do you detect it, and how do you handle it?
Overdispersion means the observed variance in count data exceeds the mean - violating Poisson's equidispersion assumption (). Detection: (1) Raw ratio ; if much greater than 1, suspect overdispersion. (2) Pearson chi-squared statistic divided by residual degrees of freedom from the fitted Poisson: if , the model is overdispersed. (3) Cameron-Trivedi test: auxiliary regression of on . Consequences: Poisson standard errors are too small (anti-conservative), leading to falsely significant coefficients. Point estimates are still consistent under mild overdispersion. Handling: (1) Negative Binomial GLM - fits explicitly, correct variance function ; (2) Quasi-Poisson - keeps Poisson point estimates but scales SEs by ; (3) Add omitted features to absorb heterogeneity.
Q4: Derive the IRLS update equation. Why is it quadratically convergent?
Start from Newton-Raphson: . For GLMs, the Hessian is with diagonal weight matrix , and gradient . Substituting: where is the working response. This is a weighted least squares problem with weights and response , solved freshly at each iteration. Quadratic convergence: Newton-Raphson converges quadratically because it fits a local quadratic approximation (second-order Taylor expansion) of the objective at each step. Near the MLE, the approximation is very accurate and each step approximately squares the error.
Q5: When would you use a Gamma GLM vs log-transforming the target and using OLS?
Both handle positive right-skewed continuous responses, but they make different assumptions and optimize different objectives. Log-OLS (OLS on ): assumes , i.e., is lognormal. The fitted model minimizes squared errors in log-space. Back-transformation estimates the median, not the mean ( for lognormal - a correction term is needed). Gamma GLM with log link: assumes , with . The model directly minimizes the Gamma deviance and directly estimates the mean - no back-transformation bias. Prefer Gamma GLM when: you want unbiased predictions of the mean; the coefficient-of-variation is roughly constant across observations; your loss function should be scale-invariant (proportional errors, not absolute); you need calibrated standard errors. Prefer log-OLS when: you genuinely believe the lognormal generative model; you need a simpler pipeline; you only care about relative rankings (median predictions), not absolute means.
Q6: How would you select among Poisson, Negative Binomial, and Gaussian families for the same count response?
Use a three-step process. Step 1 - domain check: counts are never negative, so Gaussian (which can predict negatives) is suspect unless counts are large and nearly symmetric. Step 2 - fit candidate models and compare AIC: lower AIC wins. AIC = , penalizing both poor fit and excess parameters. Negative Binomial has one extra parameter () over Poisson, so it needs to improve by at least 2 to win on AIC. Step 3 - residual diagnostics: (a) Pearson chi2/df should be near 1 for the winning model; (b) plot Pearson residuals vs fitted values - should show no trend; (c) QQ plot of quantile residuals (randomized for discrete distributions). If Poisson shows Pearson chi2/df ≈ 1 and Negative Binomial AIC is within 2, use Poisson for interpretability. If there is strong overdispersion, Negative Binomial is clearly better and you should report it. Never use Gaussian for count data with substantial mass at zero - the probability of negative predictions can be high.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Logistic Regression Decision Boundary demo on the EngineersOfAI Playground - no code required.
:::
