Skip to main content

Logistic Regression Deep Dive

The Real Interview Moment

You are in a final round interview for an ML research engineer position. The interviewer puts a plot on the screen: a calibration curve for a fraud detection model. The x-axis shows "predicted probability," the y-axis shows "actual fraud rate." The curve bows upward - the model outputs 0.3 when the true fraud rate is 0.5.

"This model has 95% AUC-ROC. Our business team says it's unusable. Why? And what would you do about it?"

A candidate who only knows accuracy and AUC cannot answer this. A candidate who understands logistic regression from first principles - why it produces calibrated probabilities, what conditions break calibration, and how to fix it - can answer it in three minutes with a concrete plan involving Platt scaling, isotonic regression, and temperature scaling.

This lesson builds that depth. Logistic regression is deceptively simple on the surface. Most engineers know the sigmoid formula. Few can derive why cross-entropy is the right loss, prove why the decision boundary is a hyperplane, explain what "calibrated" means mathematically, or correctly handle 1:99 class imbalance without introducing subtle bugs.

Why Logistic Regression Exists

Before logistic regression, the standard approach for classification was linear discriminant analysis (LDA), which assumes Gaussian features and estimates class-conditional densities. The problem: LDA fails when features are not Gaussian, and it does not directly model P(y=1x)P(y=1|\mathbf{x}).

Logistic regression takes a different approach: directly model the conditional probability P(y=1x)P(y=1|\mathbf{x}) as a function of the features. Proposed in the 1950s by David Cox, logistic regression became the standard baseline for binary classification because it directly answers the question that matters: "given these features, what is the probability of the positive class?"

Historical Context

David Cox introduced logistic regression in 1958 in the journal Biometrika as a method for binary outcome modeling in clinical trials. The model was motivated by the logistic curve - an S-shaped function that had been used in population growth models since the 1830s (Verhulst 1838). The connection to maximum likelihood was formalized by statisticians in the 1960s-70s. By the 1980s, logistic regression was the standard baseline in epidemiology, social sciences, and later machine learning. Today it remains the default linear classifier and a required building block for understanding neural network outputs.

The Sigmoid Function: Derivation and Properties

Given input features xRd\mathbf{x} \in \mathbb{R}^d, we want to model P(y=1x)P(y = 1 | \mathbf{x}). A linear model wx\mathbf{w}^\top\mathbf{x} can produce any real value, but probabilities must be in [0,1][0, 1].

From Log-Odds to Sigmoid

The logit model assumes that the log-odds of the positive class are linear in the features:

logP(y=1x)P(y=0x)=wx\log\frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})} = \mathbf{w}^\top\mathbf{x}

Solving for P(y=1x)P(y=1|\mathbf{x}): let p=P(y=1x)p = P(y=1|\mathbf{x}) and z=wxz = \mathbf{w}^\top\mathbf{x}:

p1p=ez    p=ez1+ez=11+ez\frac{p}{1-p} = e^z \implies p = \frac{e^z}{1+e^z} = \frac{1}{1+e^{-z}}

This is the sigmoid function:

σ(z)=11+ez\sigma(z) = \frac{1}{1 + e^{-z}}

Key properties:

  • σ()=0\sigma(-\infty) = 0, σ(0)=0.5\sigma(0) = 0.5, σ(+)=1\sigma(+\infty) = 1 - valid probability range
  • Monotone increasing, smooth, differentiable everywhere
  • Derivative: σ(z)=σ(z)(1σ(z))\sigma'(z) = \sigma(z)(1 - \sigma(z)) - elegant recursive form
  • Numerically stable sigmoid: for large negative zz, use ez/(1+ez)e^z/(1+e^z) to avoid overflow

The logistic regression model is:

P(y=1x;w)=σ(wx)=11+ewx\boxed{P(y = 1 | \mathbf{x}; \mathbf{w}) = \sigma(\mathbf{w}^\top\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^\top\mathbf{x}}}}

Log-Odds Interpretation

The log-odds interpretation is crucial for coefficient interpretation. From the logit model:

logP(y=1x)P(y=0x)=w0+w1x1+w2x2++wdxd\log\frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d

Each coefficient wjw_j is the change in log-odds per unit increase in xjx_j (holding all others constant). More usefully: ewje^{w_j} is the odds ratio - a one-unit increase in xjx_j multiplies the odds of the positive class by ewje^{w_j}.

Interpretation example: In a fraud detection model with standardized features, wamount=0.5w_\text{amount} = 0.5 means a 1 standard deviation increase in transaction amount multiplies the fraud odds by e0.51.65e^{0.5} \approx 1.65 (65% increase in odds).

Cross-Entropy Loss: Derivation from Maximum Likelihood

The cross-entropy loss is not arbitrary - it is derived from the probabilistic model for binary outcomes.

Step 1: Write the Likelihood

For a binary label yi{0,1}y_i \in \{0, 1\}, the probability of the observed label given features and weights:

P(yixi;w)=y^iyi(1y^i)1yiP(y_i | \mathbf{x}_i; \mathbf{w}) = \hat{y}_i^{y_i}(1-\hat{y}_i)^{1-y_i}

where y^i=σ(wxi)\hat{y}_i = \sigma(\mathbf{w}^\top\mathbf{x}_i). Verify:

  • If yi=1y_i = 1: P=y^i1(1y^i)0=y^iP = \hat{y}_i^1 (1-\hat{y}_i)^0 = \hat{y}_i
  • If yi=0y_i = 0: P=y^i0(1y^i)1=1y^iP = \hat{y}_i^0 (1-\hat{y}_i)^1 = 1-\hat{y}_i

This is a Bernoulli distribution parameterized by y^i\hat{y}_i.

Step 2: Write the Log-Likelihood

Assuming nn i.i.d. observations, the joint likelihood is a product. Taking the log:

logL(w)=i=1n[yilogy^i+(1yi)log(1y^i)]\log \mathcal{L}(\mathbf{w}) = \sum_{i=1}^n \left[ y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i) \right]

Step 3: Maximize Log-Likelihood = Minimize NLL = Minimize Cross-Entropy

Minimizing the negative log-likelihood gives the binary cross-entropy loss:

J(w)=1ni=1n[yilogy^i+(1yi)log(1y^i)]\boxed{\mathcal{J}(\mathbf{w}) = -\frac{1}{n}\sum_{i=1}^n \left[ y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i) \right]}

This is not a choice - it is the statistically principled loss function derived from the Bernoulli likelihood model. Lesson 07 on MLE shows this is the same principle that derives OLS from Gaussian noise.

Why Not MSE for Classification?

Using JMSE=1n(yiy^i)2\mathcal{J}_{\text{MSE}} = \frac{1}{n}\sum (y_i - \hat{y}_i)^2 with sigmoid output is problematic:

The gradient of MSE + sigmoid:

JMSEwj=(y^iyi)σ(wxi)xij\frac{\partial \mathcal{J}_{\text{MSE}}}{\partial w_j} = (\hat{y}_i - y_i) \cdot \sigma'(\mathbf{w}^\top\mathbf{x}_i) \cdot x_{ij}

When wxi|\mathbf{w}^\top\mathbf{x}_i| is large (confident wrong prediction), σ(z)=σ(z)(1σ(z))0\sigma'(z) = \sigma(z)(1-\sigma(z)) \approx 0. The gradient vanishes where it is most needed. Cross-entropy avoids this:

JCEwj=1ni(y^iyi)xij\frac{\partial \mathcal{J}_{\text{CE}}}{\partial w_j} = \frac{1}{n}\sum_i (\hat{y}_i - y_i) x_{ij}

No sigmoid derivative - the gradient is simply the prediction error times the feature value.

Computing the Gradient

The gradient of cross-entropy loss is elegantly simple:

wJ=1nX(y^y)\nabla_\mathbf{w} \mathcal{J} = \frac{1}{n}\mathbf{X}^\top(\hat{\mathbf{y}} - \mathbf{y})

Proof for a single sample:

wj[ylogy^(1y)log(1y^)]\frac{\partial}{\partial w_j}[-y\log\hat{y} - (1-y)\log(1-\hat{y})]

=[yy^1y1y^]y^wj= -\left[\frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}}\right] \cdot \frac{\partial\hat{y}}{\partial w_j}

Using y^wj=σ(z)zzwj=y^(1y^)xj\frac{\partial\hat{y}}{\partial w_j} = \frac{\partial\sigma(z)}{\partial z}\cdot\frac{\partial z}{\partial w_j} = \hat{y}(1-\hat{y}) \cdot x_j:

=[y(1y^)(1y)y^y^(1y^)]y^(1y^)xj= -\left[\frac{y(1-\hat{y}) - (1-y)\hat{y}}{\hat{y}(1-\hat{y})}\right] \cdot \hat{y}(1-\hat{y}) \cdot x_j

=(yy^)xj=(y^y)xj= -(y - \hat{y})x_j = (\hat{y} - y)x_j

In matrix form: wJ=1nX(y^y)\nabla_\mathbf{w}\mathcal{J} = \frac{1}{n}\mathbf{X}^\top(\hat{\mathbf{y}} - \mathbf{y}) - same structure as linear regression's gradient.

Decision Boundary Geometry

The decision boundary is where P(y=1x)=0.5P(y=1|\mathbf{x}) = 0.5:

σ(wx+b)=0.5    wx+b=0\sigma(\mathbf{w}^\top\mathbf{x} + b) = 0.5 \implies \mathbf{w}^\top\mathbf{x} + b = 0

The decision boundary is always a linear hyperplane perpendicular to w\mathbf{w}:

  • In 2D: a line
  • In 3D: a plane
  • In dd-D: a (d1)(d-1)-dimensional hyperplane

Key geometric facts:

  • The vector w\mathbf{w} is the normal to the decision boundary, pointing toward the positive class
  • w\|\mathbf{w}\| controls the sharpness of the sigmoid - larger w\|\mathbf{w}\| → sharper transition
  • Logistic regression can never learn nonlinear boundaries (XOR, circles) without feature engineering

Full NumPy Implementation from Scratch

import numpy as np
from sklearn.datasets import load_breast_cancer, make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (classification_report, roc_auc_score,
roc_curve, precision_recall_curve, average_precision_score)
import matplotlib.pyplot as plt


class LogisticRegressionFromScratch:
"""
Binary logistic regression via gradient descent from scratch.
Includes numerically stable sigmoid, regularization option,
and full probability calibration support.
"""

def __init__(self, lr=0.1, n_iterations=1000, tolerance=1e-6,
lambda_reg=0.0, verbose=False):
self.lr = lr
self.n_iterations = n_iterations
self.tolerance = tolerance
self.lambda_reg = lambda_reg # L2 regularization strength
self.verbose = verbose
self.weights = None
self.bias = None
self.loss_history = []

def _sigmoid(self, z):
"""
Numerically stable sigmoid.
For large negative z, exp(-z) overflows; use exp(z)/(1+exp(z)) instead.
"""
return np.where(
z >= 0,
1 / (1 + np.exp(-z)),
np.exp(z) / (1 + np.exp(z))
)

def _cross_entropy(self, y, y_hat):
"""Binary cross-entropy with epsilon clipping to avoid log(0)."""
eps = 1e-15
y_hat = np.clip(y_hat, eps, 1 - eps)
ce = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
# Add L2 regularization term (don't regularize the bias)
if self.lambda_reg > 0:
ce += 0.5 * self.lambda_reg * np.sum(self.weights ** 2)
return ce

def fit(self, X, y):
n, d = X.shape
self.weights = np.zeros(d)
self.bias = 0.0

for t in range(self.n_iterations):
# Forward pass
z = X @ self.weights + self.bias
y_hat = self._sigmoid(z)

# Loss
loss = self._cross_entropy(y, y_hat)
self.loss_history.append(loss)

# Gradients: (1/n) X^T (ŷ - y) + λw (L2 regularization)
error = y_hat - y # shape (n,)
grad_w = (1 / n) * X.T @ error + self.lambda_reg * self.weights
grad_b = (1 / n) * error.sum()

grad_norm = np.linalg.norm(grad_w)
if self.verbose and t % 200 == 0:
print(f"Iter {t:5d}: loss={loss:.6f}, |∇|={grad_norm:.6f}")

if grad_norm < self.tolerance:
if self.verbose:
print(f"Converged at iteration {t}")
break

self.weights -= self.lr * grad_w
self.bias -= self.lr * grad_b

return self

def predict_proba(self, X):
"""Return probabilities for both classes: shape (n, 2)."""
z = X @ self.weights + self.bias
p1 = self._sigmoid(z)
return np.column_stack([1 - p1, p1])

def predict(self, X, threshold=0.5):
"""Return binary predictions using specified threshold."""
return (self.predict_proba(X)[:, 1] >= threshold).astype(int)

def score(self, X, y, threshold=0.5):
return float(np.mean(self.predict(X, threshold) == y))


# Load the breast cancer dataset
data = load_breast_cancer()
X_raw, y_raw = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw)

scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

model = LogisticRegressionFromScratch(
lr=0.1, n_iterations=2000, lambda_reg=0.01, verbose=True)
model.fit(X_train_sc, y_train)

y_prob = model.predict_proba(X_test_sc)[:, 1]
y_pred = model.predict(X_test_sc)

print(f"\nAccuracy: {model.score(X_test_sc, y_test):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
print(f"Avg Prec: {average_precision_score(y_test, y_prob):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

ROC and PR Curves

def plot_roc_and_pr(y_true, y_prob, model_name="Logistic Regression"):
"""Plot ROC curve and Precision-Recall curve side by side."""
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# ROC Curve
fpr, tpr, roc_thresholds = roc_curve(y_true, y_prob)
auc_roc = roc_auc_score(y_true, y_prob)

ax = axes[0]
ax.plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC (AUC={auc_roc:.3f})')
ax.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random (AUC=0.5)')
ax.fill_between(fpr, tpr, alpha=0.1, color='blue')
ax.set_xlabel("False Positive Rate (1 - Specificity)")
ax.set_ylabel("True Positive Rate (Sensitivity)")
ax.set_title(f"ROC Curve - {model_name}\n(Threshold-independent ranking quality)")
ax.legend()
ax.grid(True, alpha=0.3)

# Precision-Recall Curve
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_prob)
avg_prec = average_precision_score(y_true, y_prob)
baseline = y_true.mean() # random classifier baseline = prevalence

ax = axes[1]
ax.plot(recall, precision, 'g-', linewidth=2,
label=f'PR (AP={avg_prec:.3f})')
ax.axhline(baseline, color='k', linestyle='--', linewidth=1,
label=f'Random (P={baseline:.2f})')
ax.fill_between(recall, precision, alpha=0.1, color='green')
ax.set_xlabel("Recall (Sensitivity)")
ax.set_ylabel("Precision (Positive Predictive Value)")
ax.set_title(f"Precision-Recall Curve - {model_name}\n"
f"(Better for imbalanced datasets)")
ax.legend()
ax.grid(True, alpha=0.3)

# Annotate threshold at 0.5
idx_50 = np.argmin(np.abs(roc_thresholds - 0.5))
axes[0].scatter(fpr[idx_50], tpr[idx_50], s=100, color='red', zorder=5,
label=f'Threshold=0.5 (FPR={fpr[idx_50]:.2f}, TPR={tpr[idx_50]:.2f})')
axes[0].legend()

plt.tight_layout()
plt.savefig("roc_pr_curves.png", dpi=150)
plt.show()


plot_roc_and_pr(y_test, y_prob)

Multi-Class Classification

One-vs-Rest (OvR)

Train KK independent binary classifiers. Classifier kk treats class kk as positive and all others as negative.

Pros: Simple, parallelizable, interpretable per-class coefficients. Cons: Probabilities from KK classifiers are not calibrated against each other - they do not sum to 1.

Softmax (Multinomial Logistic Regression)

The softmax function generalizes sigmoid to KK classes:

P(y=kx)=ewkxj=1KewjxP(y = k | \mathbf{x}) = \frac{e^{\mathbf{w}_k^\top\mathbf{x}}}{\sum_{j=1}^K e^{\mathbf{w}_j^\top\mathbf{x}}}

Properties:

  • Outputs are in (0,1)(0, 1) and sum to 1 - valid probability distribution
  • The softmax is identifiable only up to a constant: adding cc to all wkx\mathbf{w}_k^\top\mathbf{x} does not change the output. Set wK=0\mathbf{w}_K = \mathbf{0} for identifiability.
  • For K=2K=2, softmax reduces to the binary sigmoid

The multi-class cross-entropy loss:

J=1ni=1nk=1KyiklogP(yi=kxi)\mathcal{J} = -\frac{1}{n}\sum_{i=1}^n \sum_{k=1}^K y_{ik} \log P(y_i = k | \mathbf{x}_i)

where yik=1y_{ik} = 1 if sample ii belongs to class kk (one-hot encoding).

class SoftmaxRegression:
"""
Multi-class logistic regression via softmax.
Equivalent to multinomial logistic regression (one-hot cross-entropy).
"""

def __init__(self, lr=0.1, n_iterations=1000, lambda_reg=0.01):
self.lr = lr
self.n_iterations = n_iterations
self.lambda_reg = lambda_reg
self.W = None # (d, K) weight matrix
self.b = None # (K,) bias vector
self.loss_history = []

def _softmax(self, Z):
"""
Numerically stable softmax: subtract max per row before exp.
Without this, exp() overflows for large logit values.
"""
Z_shifted = Z - Z.max(axis=1, keepdims=True)
exp_Z = np.exp(Z_shifted)
return exp_Z / exp_Z.sum(axis=1, keepdims=True)

def fit(self, X, y):
n, d = X.shape
K = len(np.unique(y))
self.W = np.zeros((d, K))
self.b = np.zeros(K)
self.classes_ = np.unique(y)

# One-hot encode labels
Y_onehot = np.eye(K)[y] # shape (n, K)

for t in range(self.n_iterations):
# Forward pass
Z = X @ self.W + self.b # logits, shape (n, K)
probs = self._softmax(Z) # probabilities, shape (n, K)

# Cross-entropy loss + L2 regularization
eps = 1e-15
loss = -np.mean(np.sum(Y_onehot * np.log(np.clip(probs, eps, 1)), axis=1))
loss += 0.5 * self.lambda_reg * np.sum(self.W ** 2)
self.loss_history.append(loss)

# Gradients: (1/n) X^T (P - Y) + λW
error = probs - Y_onehot # shape (n, K)
grad_W = (1 / n) * X.T @ error + self.lambda_reg * self.W
grad_b = (1 / n) * error.sum(axis=0)

self.W -= self.lr * grad_W
self.b -= self.lr * grad_b

return self

def predict_proba(self, X):
Z = X @ self.W + self.b
return self._softmax(Z)

def predict(self, X):
return self.predict_proba(X).argmax(axis=1)

def score(self, X, y):
return float(np.mean(self.predict(X) == y))


# Multi-class example on Iris
from sklearn.datasets import load_iris

iris = load_iris()
X_i, y_i = iris.data, iris.target
X_tr, X_te, y_tr, y_te = train_test_split(X_i, y_i, test_size=0.2,
random_state=42, stratify=y_i)
sc_i = StandardScaler()
X_tr_sc = sc_i.fit_transform(X_tr)
X_te_sc = sc_i.transform(X_te)

sm = SoftmaxRegression(lr=0.5, n_iterations=2000, lambda_reg=0.01)
sm.fit(X_tr_sc, y_tr)
print(f"\nSoftmax Regression Accuracy on Iris: {sm.score(X_te_sc, y_te):.4f}")

# Compare with OvR
from sklearn.linear_model import LogisticRegression as SKLogReg

ovr = SKLogReg(multi_class='ovr', C=10, max_iter=1000, random_state=42)
ovr.fit(X_tr_sc, y_tr)
print(f"OvR Logistic Regression Accuracy: {ovr.score(X_te_sc, y_te):.4f}")

Probability Calibration

A model is well-calibrated if among all predictions of probability pp, approximately fraction pp of them are actually positive. Formally:

P(y=1P^(y=1x)=p)=pp[0,1]P(y=1 | \hat{P}(y=1|\mathbf{x}) = p) = p \quad \forall p \in [0,1]

Expected Calibration Error (ECE)

The Expected Calibration Error (ECE) quantifies miscalibration. Partition the [0,1][0,1] probability range into MM bins of equal width. For each bin BmB_m:

ECE=m=1MBmnacc(Bm)conf(Bm)ECE = \sum_{m=1}^M \frac{|B_m|}{n} \left|\text{acc}(B_m) - \text{conf}(B_m)\right|

where:

  • Bm|B_m| = number of samples in bin mm
  • acc(Bm)=1BmiBm1[yi=y^i]\text{acc}(B_m) = \frac{1}{|B_m|}\sum_{i \in B_m} \mathbf{1}[y_i = \hat{y}_i] = fraction of correct predictions in bin
  • conf(Bm)=1BmiBmp^i\text{conf}(B_m) = \frac{1}{|B_m|}\sum_{i \in B_m} \hat{p}_i = average predicted probability in bin

A perfectly calibrated model has ECE = 0.

from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression as SKLogReg


def expected_calibration_error(y_true, y_prob, n_bins=10):
"""
Compute Expected Calibration Error (ECE).
Lower is better. Perfect calibration = ECE of 0.
"""
bin_edges = np.linspace(0, 1, n_bins + 1)
ece = 0.0
n = len(y_true)

for i in range(n_bins):
lo, hi = bin_edges[i], bin_edges[i + 1]
in_bin = (y_prob >= lo) & (y_prob < hi)

if in_bin.sum() == 0:
continue

bin_size = in_bin.sum()
acc = y_true[in_bin].mean() # fraction truly positive in bin
conf = y_prob[in_bin].mean() # average predicted probability in bin
ece += (bin_size / n) * abs(acc - conf)

return ece


# Evaluate calibration of our from-scratch model
ece_scratch = expected_calibration_error(y_test, y_prob)
print(f"\nECE (from-scratch logistic regression): {ece_scratch:.4f}")

# sklearn LogisticRegression
sk_lr = SKLogReg(C=100, max_iter=2000, random_state=42)
sk_lr.fit(X_train_sc, y_train)
sk_prob = sk_lr.predict_proba(X_test_sc)[:, 1]
ece_sk = expected_calibration_error(y_test, sk_prob)
print(f"ECE (sklearn LogisticRegression): {ece_sk:.4f}")

# Visualize calibration curve
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

ax = axes[0]
prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)
ax.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
ax.plot(prob_pred, prob_true, 's-', color='blue',
label=f'From-scratch (ECE={ece_scratch:.3f})')
prob_true_sk, prob_pred_sk = calibration_curve(y_test, sk_prob, n_bins=10)
ax.plot(prob_pred_sk, prob_true_sk, 's-', color='green',
label=f'sklearn (ECE={ece_sk:.3f})')
ax.set_xlabel("Mean Predicted Probability")
ax.set_ylabel("Fraction of Positives (Actual Rate)")
ax.set_title("Calibration Curve\n(closer to diagonal = better calibration)")
ax.legend()
ax.grid(True, alpha=0.3)

# Probability distribution histogram
ax = axes[1]
ax.hist(y_prob[y_test == 0], bins=30, alpha=0.5, color='blue',
label='Negative class', density=True)
ax.hist(y_prob[y_test == 1], bins=30, alpha=0.5, color='red',
label='Positive class', density=True)
ax.set_xlabel("Predicted P(y=1|x)")
ax.set_ylabel("Density")
ax.set_title("Probability Distribution by True Class\n"
"(good separation = good discrimination)")
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig("calibration_curves.png", dpi=150)
plt.show()


# Platt Scaling - recalibrate a poorly calibrated model
# Fit logistic regression on top of uncalibrated scores
from sklearn.linear_model import LogisticRegression as PlattLR

def platt_scaling(y_cal, raw_scores_cal, y_test, raw_scores_test):
"""
Platt scaling: fit a sigmoid (logistic regression) on top of raw scores
to recalibrate probabilities.
"""
platt = PlattLR(C=1e6, max_iter=1000)
platt.fit(raw_scores_cal.reshape(-1, 1), y_cal)
calibrated_probs = platt.predict_proba(raw_scores_test.reshape(-1, 1))[:, 1]
return calibrated_probs

# For this already-calibrated model the improvement will be minor,
# but the pattern applies to SVMs, gradient boosting, etc.
n_cal = len(y_test) // 2
platt_prob = platt_scaling(
y_test[:n_cal], y_prob[:n_cal],
y_test[n_cal:], y_prob[n_cal:]
)
ece_platt = expected_calibration_error(y_test[n_cal:], platt_prob)
print(f"\nECE after Platt scaling: {ece_platt:.4f}")
print("(Platt scaling is most useful for SVMs and gradient boosted trees)")

Handling Class Imbalance

With imbalanced classes (e.g., 1% fraud, 99% legitimate), the default threshold of 0.5 is usually wrong. The model outputs probabilities reflecting the class distribution, and simply classifying by majority class achieves 99% accuracy - useless.

Strategy 1: Class Weights

Upweight the minority class in the loss:

from sklearn.linear_model import LogisticRegression as SKLogReg
from sklearn.utils.class_weight import compute_class_weight

# Create severely imbalanced dataset (5% positive)
np.random.seed(42)
X_imb, y_imb = make_classification(
n_samples=5000, n_features=20, n_informative=10,
weights=[0.95, 0.05], # 95% negative, 5% positive
random_state=42
)
X_tr_i, X_te_i, y_tr_i, y_te_i = train_test_split(
X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb)

sc_imb = StandardScaler()
X_tr_i_sc = sc_imb.fit_transform(X_tr_i)
X_te_i_sc = sc_imb.transform(X_te_i)

# Compute balanced class weights: n_samples / (n_classes * count_per_class)
classes = np.unique(y_tr_i)
class_weights = compute_class_weight('balanced', classes=classes, y=y_tr_i)
weight_dict = dict(zip(classes, class_weights))
print(f"Class weights: {weight_dict}")

# Without class weights - likely predicts all zeros
lr_unweighted = SKLogReg(C=1.0, max_iter=1000, random_state=42)
lr_unweighted.fit(X_tr_i_sc, y_tr_i)

# With class weights - penalizes minority class errors more heavily
lr_weighted = SKLogReg(C=1.0, max_iter=1000, class_weight='balanced', random_state=42)
lr_weighted.fit(X_tr_i_sc, y_tr_i)

print("\nClass Imbalance: Effect of class_weight='balanced'")
print(f"{'Model':25s} | {'Accuracy':>10} | {'AUC-ROC':>8} | {'Recall-1':>10}")
print("-" * 60)

for name, m in [("Unweighted LR", lr_unweighted), ("Weighted LR", lr_weighted)]:
y_pred_i = m.predict(X_te_i_sc)
y_prob_i = m.predict_proba(X_te_i_sc)[:, 1]
acc = np.mean(y_pred_i == y_te_i)
auc = roc_auc_score(y_te_i, y_prob_i)
recall = np.sum((y_pred_i == 1) & (y_te_i == 1)) / np.sum(y_te_i == 1)
print(f"{name:25s} | {acc:>10.4f} | {auc:>8.4f} | {recall:>10.4f}")

Strategy 2: Threshold Tuning

The default threshold of 0.5 is arbitrary. Tune it based on the business cost of false positives vs false negatives:

def find_optimal_threshold(y_true, y_prob, metric='f1'):
"""
Find the probability threshold that optimizes the given metric.
Business context determines the right metric:
- F1: balanced precision/recall (most common)
- Precision: when false positives are very costly (fraud review queue)
- Recall: when false negatives are very costly (cancer screening)
"""
thresholds = np.arange(0.05, 0.95, 0.01)
scores = []

for thresh in thresholds:
y_pred = (y_prob >= thresh).astype(int)
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0

if metric == 'f1':
score = (2 * precision * recall / (precision + recall)
if (precision + recall) > 0 else 0)
elif metric == 'precision':
score = precision
elif metric == 'recall':
score = recall

scores.append(score)

best_idx = np.argmax(scores)
best_thresh = thresholds[best_idx]
best_score = scores[best_idx]

plt.figure(figsize=(8, 4))
plt.plot(thresholds, scores, 'b-', linewidth=2)
plt.axvline(best_thresh, color='r', linestyle='--',
label=f'Optimal threshold={best_thresh:.2f} ({metric}={best_score:.3f})')
plt.xlabel("Classification Threshold")
plt.ylabel(f"{metric.capitalize()} Score")
plt.title(f"Threshold Tuning: Optimize {metric.upper()}")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("threshold_tuning.png", dpi=150)
plt.show()

return best_thresh, best_score


y_prob_weighted = lr_weighted.predict_proba(X_te_i_sc)[:, 1]
best_thresh, best_f1 = find_optimal_threshold(y_te_i, y_prob_weighted, metric='f1')
print(f"\nOptimal threshold for F1: {best_thresh:.2f}, F1={best_f1:.4f}")

scikit-learn with Calibration

from sklearn.linear_model import LogisticRegression as SKLogReg
from sklearn.calibration import CalibratedClassifierCV

# sklearn LogisticRegression with L-BFGS solver (quasi-Newton, much faster than GD)
sk_lr = SKLogReg(C=10.0, max_iter=2000, solver='lbfgs', random_state=42)
sk_lr.fit(X_train_sc, y_train)
print(f"\nsklearn LogReg (lbfgs) Accuracy: {sk_lr.score(X_test_sc, y_test):.4f}")
print(f"sklearn ROC-AUC: {roc_auc_score(y_test, sk_lr.predict_proba(X_test_sc)[:, 1]):.4f}")

# Calibrated logistic regression using isotonic regression
# Useful for models that are not natively calibrated (e.g., SVMs, gradient boosting)
calib_lr = CalibratedClassifierCV(sk_lr, method='isotonic', cv=5)
calib_lr.fit(X_train_sc, y_train)
calib_probs = calib_lr.predict_proba(X_test_sc)[:, 1]
ece_calib = expected_calibration_error(y_test, calib_probs)
print(f"Calibrated LR ECE: {ece_calib:.4f}")

# Solver comparison
print("\nSolver Comparison:")
for solver in ['lbfgs', 'liblinear', 'saga']:
try:
m = SKLogReg(C=10.0, max_iter=2000, solver=solver, random_state=42)
m.fit(X_train_sc, y_train)
prob_s = m.predict_proba(X_test_sc)[:, 1]
print(f" {solver:12s}: Acc={m.score(X_test_sc, y_test):.4f}, "
f"AUC={roc_auc_score(y_test, prob_s):.4f}")
except Exception as e:
print(f" {solver:12s}: {e}")

:::tip Production solver choice sklearn's LogisticRegression uses LBFGS by default for small/medium datasets - it is a quasi-Newton method that converges much faster than gradient descent (quadratic vs linear convergence). For large datasets (n>105n > 10^5) or sparse features, use solver='saga' which supports all penalties (L1, L2, ElasticNet) and mini-batch training. For L1 penalty with small datasets, solver='liblinear' uses coordinate descent - very efficient for sparse data. :::

Common Mistakes

:::danger Using MSE loss for logistic regression Logistic regression with MSE loss is not logistic regression - it is a different model with different optimization properties. MSE + sigmoid has vanishing gradients when predictions are saturated, poor calibration, and a non-convex loss surface. Always use cross-entropy (NLL) for logistic regression. In PyTorch: use nn.BCEWithLogitsLoss() (which combines sigmoid and cross-entropy for numerical stability). In sklearn: LogisticRegression always uses cross-entropy. :::

:::danger Threshold = 0.5 with imbalanced classes The 0.5 threshold assumes both classes are equally important. For a 1:99 imbalanced dataset, predicting "negative" for everything achieves 99% accuracy - a useless model. Always tune the threshold based on the precision/recall trade-off that matches your business requirements. Use classification_report with the tuned threshold, and report precision and recall separately, not just accuracy. :::

:::warning Not checking calibration before using predicted probabilities for decisions Raw logistic regression probabilities are approximately calibrated when the model is correctly specified, but miscalibration occurs with class imbalance, distribution shift, or overfit models. Before using probabilities for cost-benefit calculations, risk scoring, or ranking, always plot the calibration curve and compute ECE. If ECE > 0.05 on the test set, apply Platt scaling or isotonic regression recalibration. :::

:::warning Interpreting coefficients without standardizing A coefficient of 2.3 for age (in years) and 0.001 for income (in dollars) cannot be compared directly. Standardize features before training if you want to compare coefficient magnitudes. Report both standardized beta coefficients (for comparison) and unstandardized coefficients (for the original unit interpretation). :::

YouTube Resources

ResourceChannelWhy Watch
Logistic Regression, Clearly Explained!StatQuest with Josh StarmerBest visual introduction - sigmoid, log-odds, MLE connection
Softmax and Cross-Entropy ExplainedAndrej KarpathyDeep derivation of softmax, cross-entropy, and gradients
Precision vs Recall vs F1 ScoreStatQuestWhen accuracy fails - ROC, PR curves, and threshold selection
Probability Calibration for ML ModelsScikit-learn talksECE, Platt scaling, isotonic regression - practical calibration
Handling Imbalanced DatasetsDeepLearning.AIClass weights, SMOTE, and threshold tuning - complete imbalance guide

Interview Q&A

Q1: Derive the cross-entropy loss from the logistic regression model. Why not use MSE?

Cross-entropy is derived from MLE under the Bernoulli likelihood. For binary label yiy_i: P(yixi;w)=y^iyi(1y^i)1yiP(y_i|\mathbf{x}_i;\mathbf{w}) = \hat{y}_i^{y_i}(1-\hat{y}_i)^{1-y_i}. Taking the log over nn samples: i[yilogy^i+(1yi)log(1y^i)]\sum_i[y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)]. Minimizing the negative log-likelihood gives cross-entropy. This is the statistically principled loss.

MSE with sigmoid suffers from vanishing gradients: the gradient (y^y)σ(z)xj(\hat{y}-y)\sigma'(z)x_j involves σ(z)=y^(1y^)0\sigma'(z) = \hat{y}(1-\hat{y}) \approx 0 for saturated predictions - exactly when the model is most wrong. Cross-entropy's gradient (y^y)xj(\hat{y}-y)x_j has no such term - it is large when the model is confidently wrong, driving fast correction.

Q2: What does a logistic regression coefficient mean?

For a binary classification model with standardized feature jj: ewje^{w_j} is the odds ratio - a 1 SD increase in feature jj multiplies the odds of the positive class by ewje^{w_j} (holding all other features constant). For wj=0.5w_j = 0.5: odds multiply by e0.51.65e^{0.5} \approx 1.65 (65% increase in odds, not 65% increase in probability). For unstandardized features: a one-unit increase in feature jj multiplies odds by ewje^{w_j}. Positive coefficients increase the probability; negative decrease it. The magnitude indicates strength; the sign indicates direction.

Q3: What is the decision boundary of logistic regression geometrically?

The decision boundary is the hyperplane {x:wx+b=0}\{\mathbf{x}: \mathbf{w}^\top\mathbf{x} + b = 0\} - a (d1)(d-1)-dimensional linear subspace in dd-dimensional feature space. In 2D it is a line; in 3D a plane. The vector w\mathbf{w} is perpendicular to this hyperplane and points toward the positive class. Logistic regression is inherently linear - it cannot learn XOR, circles, or any non-linear boundary without explicit feature engineering (polynomial features, interaction terms). This is the fundamental limitation that motivates kernel methods (Lesson 06) and neural networks.

Q4: How does softmax generalize sigmoid to multiple classes?

For binary classification, sigmoid outputs P(y=1x)=ez/(ez+e0)P(y=1|\mathbf{x}) = e^{z}/(e^z + e^0) - a ratio of two exponentials. Softmax generalizes: P(y=kx)=ewkx/j=1KewjxP(y=k|\mathbf{x}) = e^{\mathbf{w}_k^\top\mathbf{x}} / \sum_{j=1}^K e^{\mathbf{w}_j^\top\mathbf{x}}. Key properties: (1) outputs are in (0,1)(0,1) and sum to 1 - valid probability distribution; (2) softmax is invariant to adding a constant to all logits (set last class weight to zero for identifiability); (3) the gradient for each class kk is (p^kyk)x(\hat{p}_k - y_k)\mathbf{x} - same beautiful error-times-features form as binary; (4) with K=2K=2, softmax reduces to the binary sigmoid.

Q5: What is the Expected Calibration Error (ECE) and when is logistic regression miscalibrated?

ECE = m=1MBmnacc(Bm)conf(Bm)\sum_{m=1}^M \frac{|B_m|}{n}|\text{acc}(B_m) - \text{conf}(B_m)| - the weighted average absolute difference between predicted probabilities and actual positive rates across MM probability bins. Perfect calibration: ECE = 0. Logistic regression is theoretically well-calibrated when the linear model is correctly specified (true log-odds are linear in features). Miscalibration occurs when: (1) extreme class imbalance shifts probabilities away from the true base rate; (2) features are correlated with omitted variables; (3) distribution shift between train and test; (4) the model is overfit (training probabilities pushed to 0/1 extremes). Fix with Platt scaling (logistic regression on top of raw outputs), isotonic regression (non-parametric), or temperature scaling (TT in σ(z/T)\sigma(z/T) where T>1T > 1 softens overconfident predictions).

Q6: How do you handle a dataset with 1% positive examples for logistic regression?

Multi-step approach: (1) Class weights: class_weight='balanced' in sklearn weights minority class errors by nneg/npos=99×n_\text{neg}/n_\text{pos} = 99\times more heavily. This is equivalent to oversampling the minority class. (2) Threshold tuning: the default 0.5 threshold is wrong. Plot the PR curve and find the threshold that gives the precision/recall balance matching your business requirements - often around 0.1–0.2 for 1% prevalence. (3) Evaluation: accuracy is useless (99% by predicting all negative). Use precision@K, AUC-PR, and recall at your target precision point. (4) Data augmentation: SMOTE generates synthetic minority class samples in feature space, or use actual oversampling/undersampling. (5) Check calibration: ECE after class weighting ensures probabilities remain meaningful for threshold selection.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Logistic Regression Decision Boundary demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.