Logistic Regression Deep Dive
The Real Interview Moment
You are in a final round interview for an ML research engineer position. The interviewer puts a plot on the screen: a calibration curve for a fraud detection model. The x-axis shows "predicted probability," the y-axis shows "actual fraud rate." The curve bows upward - the model outputs 0.3 when the true fraud rate is 0.5.
"This model has 95% AUC-ROC. Our business team says it's unusable. Why? And what would you do about it?"
A candidate who only knows accuracy and AUC cannot answer this. A candidate who understands logistic regression from first principles - why it produces calibrated probabilities, what conditions break calibration, and how to fix it - can answer it in three minutes with a concrete plan involving Platt scaling, isotonic regression, and temperature scaling.
This lesson builds that depth. Logistic regression is deceptively simple on the surface. Most engineers know the sigmoid formula. Few can derive why cross-entropy is the right loss, prove why the decision boundary is a hyperplane, explain what "calibrated" means mathematically, or correctly handle 1:99 class imbalance without introducing subtle bugs.
Why Logistic Regression Exists
Before logistic regression, the standard approach for classification was linear discriminant analysis (LDA), which assumes Gaussian features and estimates class-conditional densities. The problem: LDA fails when features are not Gaussian, and it does not directly model .
Logistic regression takes a different approach: directly model the conditional probability as a function of the features. Proposed in the 1950s by David Cox, logistic regression became the standard baseline for binary classification because it directly answers the question that matters: "given these features, what is the probability of the positive class?"
Historical Context
David Cox introduced logistic regression in 1958 in the journal Biometrika as a method for binary outcome modeling in clinical trials. The model was motivated by the logistic curve - an S-shaped function that had been used in population growth models since the 1830s (Verhulst 1838). The connection to maximum likelihood was formalized by statisticians in the 1960s-70s. By the 1980s, logistic regression was the standard baseline in epidemiology, social sciences, and later machine learning. Today it remains the default linear classifier and a required building block for understanding neural network outputs.
The Sigmoid Function: Derivation and Properties
Given input features , we want to model . A linear model can produce any real value, but probabilities must be in .
From Log-Odds to Sigmoid
The logit model assumes that the log-odds of the positive class are linear in the features:
Solving for : let and :
This is the sigmoid function:
Key properties:
- , , - valid probability range
- Monotone increasing, smooth, differentiable everywhere
- Derivative: - elegant recursive form
- Numerically stable sigmoid: for large negative , use to avoid overflow
The logistic regression model is:
Log-Odds Interpretation
The log-odds interpretation is crucial for coefficient interpretation. From the logit model:
Each coefficient is the change in log-odds per unit increase in (holding all others constant). More usefully: is the odds ratio - a one-unit increase in multiplies the odds of the positive class by .
Interpretation example: In a fraud detection model with standardized features, means a 1 standard deviation increase in transaction amount multiplies the fraud odds by (65% increase in odds).
Cross-Entropy Loss: Derivation from Maximum Likelihood
The cross-entropy loss is not arbitrary - it is derived from the probabilistic model for binary outcomes.
Step 1: Write the Likelihood
For a binary label , the probability of the observed label given features and weights:
where . Verify:
- If : ✓
- If : ✓
This is a Bernoulli distribution parameterized by .
Step 2: Write the Log-Likelihood
Assuming i.i.d. observations, the joint likelihood is a product. Taking the log:
Step 3: Maximize Log-Likelihood = Minimize NLL = Minimize Cross-Entropy
Minimizing the negative log-likelihood gives the binary cross-entropy loss:
This is not a choice - it is the statistically principled loss function derived from the Bernoulli likelihood model. Lesson 07 on MLE shows this is the same principle that derives OLS from Gaussian noise.
Why Not MSE for Classification?
Using with sigmoid output is problematic:
The gradient of MSE + sigmoid:
When is large (confident wrong prediction), . The gradient vanishes where it is most needed. Cross-entropy avoids this:
No sigmoid derivative - the gradient is simply the prediction error times the feature value.
Computing the Gradient
The gradient of cross-entropy loss is elegantly simple:
Proof for a single sample:
Using :
In matrix form: - same structure as linear regression's gradient.
Decision Boundary Geometry
The decision boundary is where :
The decision boundary is always a linear hyperplane perpendicular to :
- In 2D: a line
- In 3D: a plane
- In -D: a -dimensional hyperplane
Key geometric facts:
- The vector is the normal to the decision boundary, pointing toward the positive class
- controls the sharpness of the sigmoid - larger → sharper transition
- Logistic regression can never learn nonlinear boundaries (XOR, circles) without feature engineering
Full NumPy Implementation from Scratch
import numpy as np
from sklearn.datasets import load_breast_cancer, make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (classification_report, roc_auc_score,
roc_curve, precision_recall_curve, average_precision_score)
import matplotlib.pyplot as plt
class LogisticRegressionFromScratch:
"""
Binary logistic regression via gradient descent from scratch.
Includes numerically stable sigmoid, regularization option,
and full probability calibration support.
"""
def __init__(self, lr=0.1, n_iterations=1000, tolerance=1e-6,
lambda_reg=0.0, verbose=False):
self.lr = lr
self.n_iterations = n_iterations
self.tolerance = tolerance
self.lambda_reg = lambda_reg # L2 regularization strength
self.verbose = verbose
self.weights = None
self.bias = None
self.loss_history = []
def _sigmoid(self, z):
"""
Numerically stable sigmoid.
For large negative z, exp(-z) overflows; use exp(z)/(1+exp(z)) instead.
"""
return np.where(
z >= 0,
1 / (1 + np.exp(-z)),
np.exp(z) / (1 + np.exp(z))
)
def _cross_entropy(self, y, y_hat):
"""Binary cross-entropy with epsilon clipping to avoid log(0)."""
eps = 1e-15
y_hat = np.clip(y_hat, eps, 1 - eps)
ce = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
# Add L2 regularization term (don't regularize the bias)
if self.lambda_reg > 0:
ce += 0.5 * self.lambda_reg * np.sum(self.weights ** 2)
return ce
def fit(self, X, y):
n, d = X.shape
self.weights = np.zeros(d)
self.bias = 0.0
for t in range(self.n_iterations):
# Forward pass
z = X @ self.weights + self.bias
y_hat = self._sigmoid(z)
# Loss
loss = self._cross_entropy(y, y_hat)
self.loss_history.append(loss)
# Gradients: (1/n) X^T (ŷ - y) + λw (L2 regularization)
error = y_hat - y # shape (n,)
grad_w = (1 / n) * X.T @ error + self.lambda_reg * self.weights
grad_b = (1 / n) * error.sum()
grad_norm = np.linalg.norm(grad_w)
if self.verbose and t % 200 == 0:
print(f"Iter {t:5d}: loss={loss:.6f}, |∇|={grad_norm:.6f}")
if grad_norm < self.tolerance:
if self.verbose:
print(f"Converged at iteration {t}")
break
self.weights -= self.lr * grad_w
self.bias -= self.lr * grad_b
return self
def predict_proba(self, X):
"""Return probabilities for both classes: shape (n, 2)."""
z = X @ self.weights + self.bias
p1 = self._sigmoid(z)
return np.column_stack([1 - p1, p1])
def predict(self, X, threshold=0.5):
"""Return binary predictions using specified threshold."""
return (self.predict_proba(X)[:, 1] >= threshold).astype(int)
def score(self, X, y, threshold=0.5):
return float(np.mean(self.predict(X, threshold) == y))
# Load the breast cancer dataset
data = load_breast_cancer()
X_raw, y_raw = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
model = LogisticRegressionFromScratch(
lr=0.1, n_iterations=2000, lambda_reg=0.01, verbose=True)
model.fit(X_train_sc, y_train)
y_prob = model.predict_proba(X_test_sc)[:, 1]
y_pred = model.predict(X_test_sc)
print(f"\nAccuracy: {model.score(X_test_sc, y_test):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
print(f"Avg Prec: {average_precision_score(y_test, y_prob):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))
ROC and PR Curves
def plot_roc_and_pr(y_true, y_prob, model_name="Logistic Regression"):
"""Plot ROC curve and Precision-Recall curve side by side."""
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# ROC Curve
fpr, tpr, roc_thresholds = roc_curve(y_true, y_prob)
auc_roc = roc_auc_score(y_true, y_prob)
ax = axes[0]
ax.plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC (AUC={auc_roc:.3f})')
ax.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random (AUC=0.5)')
ax.fill_between(fpr, tpr, alpha=0.1, color='blue')
ax.set_xlabel("False Positive Rate (1 - Specificity)")
ax.set_ylabel("True Positive Rate (Sensitivity)")
ax.set_title(f"ROC Curve - {model_name}\n(Threshold-independent ranking quality)")
ax.legend()
ax.grid(True, alpha=0.3)
# Precision-Recall Curve
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_prob)
avg_prec = average_precision_score(y_true, y_prob)
baseline = y_true.mean() # random classifier baseline = prevalence
ax = axes[1]
ax.plot(recall, precision, 'g-', linewidth=2,
label=f'PR (AP={avg_prec:.3f})')
ax.axhline(baseline, color='k', linestyle='--', linewidth=1,
label=f'Random (P={baseline:.2f})')
ax.fill_between(recall, precision, alpha=0.1, color='green')
ax.set_xlabel("Recall (Sensitivity)")
ax.set_ylabel("Precision (Positive Predictive Value)")
ax.set_title(f"Precision-Recall Curve - {model_name}\n"
f"(Better for imbalanced datasets)")
ax.legend()
ax.grid(True, alpha=0.3)
# Annotate threshold at 0.5
idx_50 = np.argmin(np.abs(roc_thresholds - 0.5))
axes[0].scatter(fpr[idx_50], tpr[idx_50], s=100, color='red', zorder=5,
label=f'Threshold=0.5 (FPR={fpr[idx_50]:.2f}, TPR={tpr[idx_50]:.2f})')
axes[0].legend()
plt.tight_layout()
plt.savefig("roc_pr_curves.png", dpi=150)
plt.show()
plot_roc_and_pr(y_test, y_prob)
Multi-Class Classification
One-vs-Rest (OvR)
Train independent binary classifiers. Classifier treats class as positive and all others as negative.
Pros: Simple, parallelizable, interpretable per-class coefficients. Cons: Probabilities from classifiers are not calibrated against each other - they do not sum to 1.
Softmax (Multinomial Logistic Regression)
The softmax function generalizes sigmoid to classes:
Properties:
- Outputs are in and sum to 1 - valid probability distribution
- The softmax is identifiable only up to a constant: adding to all does not change the output. Set for identifiability.
- For , softmax reduces to the binary sigmoid
The multi-class cross-entropy loss:
where if sample belongs to class (one-hot encoding).
class SoftmaxRegression:
"""
Multi-class logistic regression via softmax.
Equivalent to multinomial logistic regression (one-hot cross-entropy).
"""
def __init__(self, lr=0.1, n_iterations=1000, lambda_reg=0.01):
self.lr = lr
self.n_iterations = n_iterations
self.lambda_reg = lambda_reg
self.W = None # (d, K) weight matrix
self.b = None # (K,) bias vector
self.loss_history = []
def _softmax(self, Z):
"""
Numerically stable softmax: subtract max per row before exp.
Without this, exp() overflows for large logit values.
"""
Z_shifted = Z - Z.max(axis=1, keepdims=True)
exp_Z = np.exp(Z_shifted)
return exp_Z / exp_Z.sum(axis=1, keepdims=True)
def fit(self, X, y):
n, d = X.shape
K = len(np.unique(y))
self.W = np.zeros((d, K))
self.b = np.zeros(K)
self.classes_ = np.unique(y)
# One-hot encode labels
Y_onehot = np.eye(K)[y] # shape (n, K)
for t in range(self.n_iterations):
# Forward pass
Z = X @ self.W + self.b # logits, shape (n, K)
probs = self._softmax(Z) # probabilities, shape (n, K)
# Cross-entropy loss + L2 regularization
eps = 1e-15
loss = -np.mean(np.sum(Y_onehot * np.log(np.clip(probs, eps, 1)), axis=1))
loss += 0.5 * self.lambda_reg * np.sum(self.W ** 2)
self.loss_history.append(loss)
# Gradients: (1/n) X^T (P - Y) + λW
error = probs - Y_onehot # shape (n, K)
grad_W = (1 / n) * X.T @ error + self.lambda_reg * self.W
grad_b = (1 / n) * error.sum(axis=0)
self.W -= self.lr * grad_W
self.b -= self.lr * grad_b
return self
def predict_proba(self, X):
Z = X @ self.W + self.b
return self._softmax(Z)
def predict(self, X):
return self.predict_proba(X).argmax(axis=1)
def score(self, X, y):
return float(np.mean(self.predict(X) == y))
# Multi-class example on Iris
from sklearn.datasets import load_iris
iris = load_iris()
X_i, y_i = iris.data, iris.target
X_tr, X_te, y_tr, y_te = train_test_split(X_i, y_i, test_size=0.2,
random_state=42, stratify=y_i)
sc_i = StandardScaler()
X_tr_sc = sc_i.fit_transform(X_tr)
X_te_sc = sc_i.transform(X_te)
sm = SoftmaxRegression(lr=0.5, n_iterations=2000, lambda_reg=0.01)
sm.fit(X_tr_sc, y_tr)
print(f"\nSoftmax Regression Accuracy on Iris: {sm.score(X_te_sc, y_te):.4f}")
# Compare with OvR
from sklearn.linear_model import LogisticRegression as SKLogReg
ovr = SKLogReg(multi_class='ovr', C=10, max_iter=1000, random_state=42)
ovr.fit(X_tr_sc, y_tr)
print(f"OvR Logistic Regression Accuracy: {ovr.score(X_te_sc, y_te):.4f}")
Probability Calibration
A model is well-calibrated if among all predictions of probability , approximately fraction of them are actually positive. Formally:
Expected Calibration Error (ECE)
The Expected Calibration Error (ECE) quantifies miscalibration. Partition the probability range into bins of equal width. For each bin :
where:
- = number of samples in bin
- = fraction of correct predictions in bin
- = average predicted probability in bin
A perfectly calibrated model has ECE = 0.
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression as SKLogReg
def expected_calibration_error(y_true, y_prob, n_bins=10):
"""
Compute Expected Calibration Error (ECE).
Lower is better. Perfect calibration = ECE of 0.
"""
bin_edges = np.linspace(0, 1, n_bins + 1)
ece = 0.0
n = len(y_true)
for i in range(n_bins):
lo, hi = bin_edges[i], bin_edges[i + 1]
in_bin = (y_prob >= lo) & (y_prob < hi)
if in_bin.sum() == 0:
continue
bin_size = in_bin.sum()
acc = y_true[in_bin].mean() # fraction truly positive in bin
conf = y_prob[in_bin].mean() # average predicted probability in bin
ece += (bin_size / n) * abs(acc - conf)
return ece
# Evaluate calibration of our from-scratch model
ece_scratch = expected_calibration_error(y_test, y_prob)
print(f"\nECE (from-scratch logistic regression): {ece_scratch:.4f}")
# sklearn LogisticRegression
sk_lr = SKLogReg(C=100, max_iter=2000, random_state=42)
sk_lr.fit(X_train_sc, y_train)
sk_prob = sk_lr.predict_proba(X_test_sc)[:, 1]
ece_sk = expected_calibration_error(y_test, sk_prob)
print(f"ECE (sklearn LogisticRegression): {ece_sk:.4f}")
# Visualize calibration curve
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
ax = axes[0]
prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)
ax.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
ax.plot(prob_pred, prob_true, 's-', color='blue',
label=f'From-scratch (ECE={ece_scratch:.3f})')
prob_true_sk, prob_pred_sk = calibration_curve(y_test, sk_prob, n_bins=10)
ax.plot(prob_pred_sk, prob_true_sk, 's-', color='green',
label=f'sklearn (ECE={ece_sk:.3f})')
ax.set_xlabel("Mean Predicted Probability")
ax.set_ylabel("Fraction of Positives (Actual Rate)")
ax.set_title("Calibration Curve\n(closer to diagonal = better calibration)")
ax.legend()
ax.grid(True, alpha=0.3)
# Probability distribution histogram
ax = axes[1]
ax.hist(y_prob[y_test == 0], bins=30, alpha=0.5, color='blue',
label='Negative class', density=True)
ax.hist(y_prob[y_test == 1], bins=30, alpha=0.5, color='red',
label='Positive class', density=True)
ax.set_xlabel("Predicted P(y=1|x)")
ax.set_ylabel("Density")
ax.set_title("Probability Distribution by True Class\n"
"(good separation = good discrimination)")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("calibration_curves.png", dpi=150)
plt.show()
# Platt Scaling - recalibrate a poorly calibrated model
# Fit logistic regression on top of uncalibrated scores
from sklearn.linear_model import LogisticRegression as PlattLR
def platt_scaling(y_cal, raw_scores_cal, y_test, raw_scores_test):
"""
Platt scaling: fit a sigmoid (logistic regression) on top of raw scores
to recalibrate probabilities.
"""
platt = PlattLR(C=1e6, max_iter=1000)
platt.fit(raw_scores_cal.reshape(-1, 1), y_cal)
calibrated_probs = platt.predict_proba(raw_scores_test.reshape(-1, 1))[:, 1]
return calibrated_probs
# For this already-calibrated model the improvement will be minor,
# but the pattern applies to SVMs, gradient boosting, etc.
n_cal = len(y_test) // 2
platt_prob = platt_scaling(
y_test[:n_cal], y_prob[:n_cal],
y_test[n_cal:], y_prob[n_cal:]
)
ece_platt = expected_calibration_error(y_test[n_cal:], platt_prob)
print(f"\nECE after Platt scaling: {ece_platt:.4f}")
print("(Platt scaling is most useful for SVMs and gradient boosted trees)")
Handling Class Imbalance
With imbalanced classes (e.g., 1% fraud, 99% legitimate), the default threshold of 0.5 is usually wrong. The model outputs probabilities reflecting the class distribution, and simply classifying by majority class achieves 99% accuracy - useless.
Strategy 1: Class Weights
Upweight the minority class in the loss:
from sklearn.linear_model import LogisticRegression as SKLogReg
from sklearn.utils.class_weight import compute_class_weight
# Create severely imbalanced dataset (5% positive)
np.random.seed(42)
X_imb, y_imb = make_classification(
n_samples=5000, n_features=20, n_informative=10,
weights=[0.95, 0.05], # 95% negative, 5% positive
random_state=42
)
X_tr_i, X_te_i, y_tr_i, y_te_i = train_test_split(
X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb)
sc_imb = StandardScaler()
X_tr_i_sc = sc_imb.fit_transform(X_tr_i)
X_te_i_sc = sc_imb.transform(X_te_i)
# Compute balanced class weights: n_samples / (n_classes * count_per_class)
classes = np.unique(y_tr_i)
class_weights = compute_class_weight('balanced', classes=classes, y=y_tr_i)
weight_dict = dict(zip(classes, class_weights))
print(f"Class weights: {weight_dict}")
# Without class weights - likely predicts all zeros
lr_unweighted = SKLogReg(C=1.0, max_iter=1000, random_state=42)
lr_unweighted.fit(X_tr_i_sc, y_tr_i)
# With class weights - penalizes minority class errors more heavily
lr_weighted = SKLogReg(C=1.0, max_iter=1000, class_weight='balanced', random_state=42)
lr_weighted.fit(X_tr_i_sc, y_tr_i)
print("\nClass Imbalance: Effect of class_weight='balanced'")
print(f"{'Model':25s} | {'Accuracy':>10} | {'AUC-ROC':>8} | {'Recall-1':>10}")
print("-" * 60)
for name, m in [("Unweighted LR", lr_unweighted), ("Weighted LR", lr_weighted)]:
y_pred_i = m.predict(X_te_i_sc)
y_prob_i = m.predict_proba(X_te_i_sc)[:, 1]
acc = np.mean(y_pred_i == y_te_i)
auc = roc_auc_score(y_te_i, y_prob_i)
recall = np.sum((y_pred_i == 1) & (y_te_i == 1)) / np.sum(y_te_i == 1)
print(f"{name:25s} | {acc:>10.4f} | {auc:>8.4f} | {recall:>10.4f}")
Strategy 2: Threshold Tuning
The default threshold of 0.5 is arbitrary. Tune it based on the business cost of false positives vs false negatives:
def find_optimal_threshold(y_true, y_prob, metric='f1'):
"""
Find the probability threshold that optimizes the given metric.
Business context determines the right metric:
- F1: balanced precision/recall (most common)
- Precision: when false positives are very costly (fraud review queue)
- Recall: when false negatives are very costly (cancer screening)
"""
thresholds = np.arange(0.05, 0.95, 0.01)
scores = []
for thresh in thresholds:
y_pred = (y_prob >= thresh).astype(int)
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
if metric == 'f1':
score = (2 * precision * recall / (precision + recall)
if (precision + recall) > 0 else 0)
elif metric == 'precision':
score = precision
elif metric == 'recall':
score = recall
scores.append(score)
best_idx = np.argmax(scores)
best_thresh = thresholds[best_idx]
best_score = scores[best_idx]
plt.figure(figsize=(8, 4))
plt.plot(thresholds, scores, 'b-', linewidth=2)
plt.axvline(best_thresh, color='r', linestyle='--',
label=f'Optimal threshold={best_thresh:.2f} ({metric}={best_score:.3f})')
plt.xlabel("Classification Threshold")
plt.ylabel(f"{metric.capitalize()} Score")
plt.title(f"Threshold Tuning: Optimize {metric.upper()}")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("threshold_tuning.png", dpi=150)
plt.show()
return best_thresh, best_score
y_prob_weighted = lr_weighted.predict_proba(X_te_i_sc)[:, 1]
best_thresh, best_f1 = find_optimal_threshold(y_te_i, y_prob_weighted, metric='f1')
print(f"\nOptimal threshold for F1: {best_thresh:.2f}, F1={best_f1:.4f}")
scikit-learn with Calibration
from sklearn.linear_model import LogisticRegression as SKLogReg
from sklearn.calibration import CalibratedClassifierCV
# sklearn LogisticRegression with L-BFGS solver (quasi-Newton, much faster than GD)
sk_lr = SKLogReg(C=10.0, max_iter=2000, solver='lbfgs', random_state=42)
sk_lr.fit(X_train_sc, y_train)
print(f"\nsklearn LogReg (lbfgs) Accuracy: {sk_lr.score(X_test_sc, y_test):.4f}")
print(f"sklearn ROC-AUC: {roc_auc_score(y_test, sk_lr.predict_proba(X_test_sc)[:, 1]):.4f}")
# Calibrated logistic regression using isotonic regression
# Useful for models that are not natively calibrated (e.g., SVMs, gradient boosting)
calib_lr = CalibratedClassifierCV(sk_lr, method='isotonic', cv=5)
calib_lr.fit(X_train_sc, y_train)
calib_probs = calib_lr.predict_proba(X_test_sc)[:, 1]
ece_calib = expected_calibration_error(y_test, calib_probs)
print(f"Calibrated LR ECE: {ece_calib:.4f}")
# Solver comparison
print("\nSolver Comparison:")
for solver in ['lbfgs', 'liblinear', 'saga']:
try:
m = SKLogReg(C=10.0, max_iter=2000, solver=solver, random_state=42)
m.fit(X_train_sc, y_train)
prob_s = m.predict_proba(X_test_sc)[:, 1]
print(f" {solver:12s}: Acc={m.score(X_test_sc, y_test):.4f}, "
f"AUC={roc_auc_score(y_test, prob_s):.4f}")
except Exception as e:
print(f" {solver:12s}: {e}")
:::tip Production solver choice
sklearn's LogisticRegression uses LBFGS by default for small/medium datasets - it is a quasi-Newton method that converges much faster than gradient descent (quadratic vs linear convergence). For large datasets () or sparse features, use solver='saga' which supports all penalties (L1, L2, ElasticNet) and mini-batch training. For L1 penalty with small datasets, solver='liblinear' uses coordinate descent - very efficient for sparse data.
:::
Common Mistakes
:::danger Using MSE loss for logistic regression
Logistic regression with MSE loss is not logistic regression - it is a different model with different optimization properties. MSE + sigmoid has vanishing gradients when predictions are saturated, poor calibration, and a non-convex loss surface. Always use cross-entropy (NLL) for logistic regression. In PyTorch: use nn.BCEWithLogitsLoss() (which combines sigmoid and cross-entropy for numerical stability). In sklearn: LogisticRegression always uses cross-entropy.
:::
:::danger Threshold = 0.5 with imbalanced classes
The 0.5 threshold assumes both classes are equally important. For a 1:99 imbalanced dataset, predicting "negative" for everything achieves 99% accuracy - a useless model. Always tune the threshold based on the precision/recall trade-off that matches your business requirements. Use classification_report with the tuned threshold, and report precision and recall separately, not just accuracy.
:::
:::warning Not checking calibration before using predicted probabilities for decisions Raw logistic regression probabilities are approximately calibrated when the model is correctly specified, but miscalibration occurs with class imbalance, distribution shift, or overfit models. Before using probabilities for cost-benefit calculations, risk scoring, or ranking, always plot the calibration curve and compute ECE. If ECE > 0.05 on the test set, apply Platt scaling or isotonic regression recalibration. :::
:::warning Interpreting coefficients without standardizing A coefficient of 2.3 for age (in years) and 0.001 for income (in dollars) cannot be compared directly. Standardize features before training if you want to compare coefficient magnitudes. Report both standardized beta coefficients (for comparison) and unstandardized coefficients (for the original unit interpretation). :::
YouTube Resources
| Resource | Channel | Why Watch |
|---|---|---|
| Logistic Regression, Clearly Explained! | StatQuest with Josh Starmer | Best visual introduction - sigmoid, log-odds, MLE connection |
| Softmax and Cross-Entropy Explained | Andrej Karpathy | Deep derivation of softmax, cross-entropy, and gradients |
| Precision vs Recall vs F1 Score | StatQuest | When accuracy fails - ROC, PR curves, and threshold selection |
| Probability Calibration for ML Models | Scikit-learn talks | ECE, Platt scaling, isotonic regression - practical calibration |
| Handling Imbalanced Datasets | DeepLearning.AI | Class weights, SMOTE, and threshold tuning - complete imbalance guide |
Interview Q&A
Q1: Derive the cross-entropy loss from the logistic regression model. Why not use MSE?
Cross-entropy is derived from MLE under the Bernoulli likelihood. For binary label : . Taking the log over samples: . Minimizing the negative log-likelihood gives cross-entropy. This is the statistically principled loss.
MSE with sigmoid suffers from vanishing gradients: the gradient involves for saturated predictions - exactly when the model is most wrong. Cross-entropy's gradient has no such term - it is large when the model is confidently wrong, driving fast correction.
Q2: What does a logistic regression coefficient mean?
For a binary classification model with standardized feature : is the odds ratio - a 1 SD increase in feature multiplies the odds of the positive class by (holding all other features constant). For : odds multiply by (65% increase in odds, not 65% increase in probability). For unstandardized features: a one-unit increase in feature multiplies odds by . Positive coefficients increase the probability; negative decrease it. The magnitude indicates strength; the sign indicates direction.
Q3: What is the decision boundary of logistic regression geometrically?
The decision boundary is the hyperplane - a -dimensional linear subspace in -dimensional feature space. In 2D it is a line; in 3D a plane. The vector is perpendicular to this hyperplane and points toward the positive class. Logistic regression is inherently linear - it cannot learn XOR, circles, or any non-linear boundary without explicit feature engineering (polynomial features, interaction terms). This is the fundamental limitation that motivates kernel methods (Lesson 06) and neural networks.
Q4: How does softmax generalize sigmoid to multiple classes?
For binary classification, sigmoid outputs - a ratio of two exponentials. Softmax generalizes: . Key properties: (1) outputs are in and sum to 1 - valid probability distribution; (2) softmax is invariant to adding a constant to all logits (set last class weight to zero for identifiability); (3) the gradient for each class is - same beautiful error-times-features form as binary; (4) with , softmax reduces to the binary sigmoid.
Q5: What is the Expected Calibration Error (ECE) and when is logistic regression miscalibrated?
ECE = - the weighted average absolute difference between predicted probabilities and actual positive rates across probability bins. Perfect calibration: ECE = 0. Logistic regression is theoretically well-calibrated when the linear model is correctly specified (true log-odds are linear in features). Miscalibration occurs when: (1) extreme class imbalance shifts probabilities away from the true base rate; (2) features are correlated with omitted variables; (3) distribution shift between train and test; (4) the model is overfit (training probabilities pushed to 0/1 extremes). Fix with Platt scaling (logistic regression on top of raw outputs), isotonic regression (non-parametric), or temperature scaling ( in where softens overconfident predictions).
Q6: How do you handle a dataset with 1% positive examples for logistic regression?
Multi-step approach: (1) Class weights: class_weight='balanced' in sklearn weights minority class errors by more heavily. This is equivalent to oversampling the minority class. (2) Threshold tuning: the default 0.5 threshold is wrong. Plot the PR curve and find the threshold that gives the precision/recall balance matching your business requirements - often around 0.1–0.2 for 1% prevalence. (3) Evaluation: accuracy is useless (99% by predicting all negative). Use precision@K, AUC-PR, and recall at your target precision point. (4) Data augmentation: SMOTE generates synthetic minority class samples in feature space, or use actual oversampling/undersampling. (5) Check calibration: ECE after class weighting ensures probabilities remain meaningful for threshold selection.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Logistic Regression Decision Boundary demo on the EngineersOfAI Playground - no code required.
:::
