What is logistic regression?

Master logistic regression from first principles - sigmoid derivation, log-likelihood to cross-entropy, decision boundary geometry, softmax multiclass, probability calibration with ECE, class imbalance handling, and full NumPy implementation.

How does classification work in practice?

Logistic Regression Deep Dive covers logistic regression, classification, sigmoid function from first principles with code examples. Free lesson at https://engineersofai.com/docs/ml/linear-models/logistic-regression-deep-dive

What is the difference between logistic regression and sigmoid function?

See the full breakdown at https://engineersofai.com/docs/ml/linear-models/logistic-regression-deep-dive

Logistic Regression Deep Dive

The Real Interview Moment

You are in a final round interview for an ML research engineer position. The interviewer puts a plot on the screen: a calibration curve for a fraud detection model. The x-axis shows "predicted probability," the y-axis shows "actual fraud rate." The curve bows upward - the model outputs 0.3 when the true fraud rate is 0.5.

"This model has 95% AUC-ROC. Our business team says it's unusable. Why? And what would you do about it?"

A candidate who only knows accuracy and AUC cannot answer this. A candidate who understands logistic regression from first principles - why it produces calibrated probabilities, what conditions break calibration, and how to fix it - can answer it in three minutes with a concrete plan involving Platt scaling, isotonic regression, and temperature scaling.

This lesson builds that depth. Logistic regression is deceptively simple on the surface. Most engineers know the sigmoid formula. Few can derive why cross-entropy is the right loss, prove why the decision boundary is a hyperplane, explain what "calibrated" means mathematically, or correctly handle 1:99 class imbalance without introducing subtle bugs.

Why Logistic Regression Exists

Before logistic regression, the standard approach for classification was linear discriminant analysis (LDA), which assumes Gaussian features and estimates class-conditional densities. The problem: LDA fails when features are not Gaussian, and it does not directly model $P(y=1|\mathbf{x})$ .

Logistic regression takes a different approach: directly model the conditional probability $P(y=1|\mathbf{x})$ as a function of the features. Proposed in the 1950s by David Cox, logistic regression became the standard baseline for binary classification because it directly answers the question that matters: "given these features, what is the probability of the positive class?"

Historical Context

David Cox introduced logistic regression in 1958 in the journal Biometrika as a method for binary outcome modeling in clinical trials. The model was motivated by the logistic curve - an S-shaped function that had been used in population growth models since the 1830s (Verhulst 1838). The connection to maximum likelihood was formalized by statisticians in the 1960s-70s. By the 1980s, logistic regression was the standard baseline in epidemiology, social sciences, and later machine learning. Today it remains the default linear classifier and a required building block for understanding neural network outputs.

The Sigmoid Function: Derivation and Properties

Given input features $\mathbf{x} \in \mathbb{R}^d$ , we want to model $P(y = 1 | \mathbf{x})$ . A linear model $\mathbf{w}^\top\mathbf{x}$ can produce any real value, but probabilities must be in $[0, 1]$ .

From Log-Odds to Sigmoid

The logit model assumes that the log-odds of the positive class are linear in the features:

$\log\frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})} = \mathbf{w}^\top\mathbf{x}$

Solving for $P(y=1|\mathbf{x})$ : let $p = P(y=1|\mathbf{x})$ and $z = \mathbf{w}^\top\mathbf{x}$ :

$\frac{p}{1-p} = e^z \implies p = \frac{e^z}{1+e^z} = \frac{1}{1+e^{-z}}$

This is the sigmoid function:

$\sigma(z) = \frac{1}{1 + e^{-z}}$

Key properties:

$\sigma(-\infty) = 0$ , $\sigma(0) = 0.5$ , $\sigma(+\infty) = 1$ - valid probability range
Monotone increasing, smooth, differentiable everywhere
Derivative: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ - elegant recursive form
Numerically stable sigmoid: for large negative $z$ , use $e^z/(1+e^z)$ to avoid overflow

The logistic regression model is:

$\boxed{P(y = 1 | \mathbf{x}; \mathbf{w}) = \sigma(\mathbf{w}^\top\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^\top\mathbf{x}}}}$

Log-Odds Interpretation

The log-odds interpretation is crucial for coefficient interpretation. From the logit model:

$\log\frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d$

Each coefficient $w_j$ is the change in log-odds per unit increase in $x_j$ (holding all others constant). More usefully: $e^{w_j}$ is the odds ratio - a one-unit increase in $x_j$ multiplies the odds of the positive class by $e^{w_j}$ .

Interpretation example: In a fraud detection model with standardized features, $w_\text{amount} = 0.5$ means a 1 standard deviation increase in transaction amount multiplies the fraud odds by $e^{0.5} \approx 1.65$ (65% increase in odds).

Cross-Entropy Loss: Derivation from Maximum Likelihood

The cross-entropy loss is not arbitrary - it is derived from the probabilistic model for binary outcomes.

Step 1: Write the Likelihood

For a binary label $y_i \in \{0, 1\}$ , the probability of the observed label given features and weights:

$P(y_i | \mathbf{x}_i; \mathbf{w}) = \hat{y}_i^{y_i}(1-\hat{y}_i)^{1-y_i}$

where $\hat{y}_i = \sigma(\mathbf{w}^\top\mathbf{x}_i)$ . Verify:

If $y_i = 1$ : $P = \hat{y}_i^1 (1-\hat{y}_i)^0 = \hat{y}_i$ ✓
If $y_i = 0$ : $P = \hat{y}_i^0 (1-\hat{y}_i)^1 = 1-\hat{y}_i$ ✓

This is a Bernoulli distribution parameterized by $\hat{y}_i$ .

Step 2: Write the Log-Likelihood

Assuming $n$ i.i.d. observations, the joint likelihood is a product. Taking the log:

$\log \mathcal{L}(\mathbf{w}) = \sum_{i=1}^n \left[ y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i) \right]$

Step 3: Maximize Log-Likelihood = Minimize NLL = Minimize Cross-Entropy

Minimizing the negative log-likelihood gives the binary cross-entropy loss:

$\boxed{\mathcal{J}(\mathbf{w}) = -\frac{1}{n}\sum_{i=1}^n \left[ y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i) \right]}$

This is not a choice - it is the statistically principled loss function derived from the Bernoulli likelihood model. Lesson 07 on MLE shows this is the same principle that derives OLS from Gaussian noise.

Why Not MSE for Classification?

Using $\mathcal{J}_{\text{MSE}} = \frac{1}{n}\sum (y_i - \hat{y}_i)^2$ with sigmoid output is problematic:

The gradient of MSE + sigmoid:

$\frac{\partial \mathcal{J}_{\text{MSE}}}{\partial w_j} = (\hat{y}_i - y_i) \cdot \sigma'(\mathbf{w}^\top\mathbf{x}_i) \cdot x_{ij}$

When $|\mathbf{w}^\top\mathbf{x}_i|$ is large (confident wrong prediction), $\sigma'(z) = \sigma(z)(1-\sigma(z)) \approx 0$ . The gradient vanishes where it is most needed. Cross-entropy avoids this:

$\frac{\partial \mathcal{J}_{\text{CE}}}{\partial w_j} = \frac{1}{n}\sum_i (\hat{y}_i - y_i) x_{ij}$

No sigmoid derivative - the gradient is simply the prediction error times the feature value.

Computing the Gradient

The gradient of cross-entropy loss is elegantly simple:

$\nabla_\mathbf{w} \mathcal{J} = \frac{1}{n}\mathbf{X}^\top(\hat{\mathbf{y}} - \mathbf{y})$

Proof for a single sample:

$\frac{\partial}{\partial w_j}[-y\log\hat{y} - (1-y)\log(1-\hat{y})]$

$= -\left[\frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}}\right] \cdot \frac{\partial\hat{y}}{\partial w_j}$

Using $\frac{\partial\hat{y}}{\partial w_j} = \frac{\partial\sigma(z)}{\partial z}\cdot\frac{\partial z}{\partial w_j} = \hat{y}(1-\hat{y}) \cdot x_j$ :

$= -\left[\frac{y(1-\hat{y}) - (1-y)\hat{y}}{\hat{y}(1-\hat{y})}\right] \cdot \hat{y}(1-\hat{y}) \cdot x_j$

$= -(y - \hat{y})x_j = (\hat{y} - y)x_j$

In matrix form: $\nabla_\mathbf{w}\mathcal{J} = \frac{1}{n}\mathbf{X}^\top(\hat{\mathbf{y}} - \mathbf{y})$ - same structure as linear regression's gradient.

Decision Boundary Geometry

The decision boundary is where $P(y=1|\mathbf{x}) = 0.5$ :

$\sigma(\mathbf{w}^\top\mathbf{x} + b) = 0.5 \implies \mathbf{w}^\top\mathbf{x} + b = 0$

The decision boundary is always a linear hyperplane perpendicular to $\mathbf{w}$ :

In 2D: a line
In 3D: a plane
In $d$ -D: a $(d-1)$ -dimensional hyperplane

Key geometric facts:

The vector $\mathbf{w}$ is the normal to the decision boundary, pointing toward the positive class
$\|\mathbf{w}\|$ controls the sharpness of the sigmoid - larger $\|\mathbf{w}\|$ → sharper transition
Logistic regression can never learn nonlinear boundaries (XOR, circles) without feature engineering

Full NumPy Implementation from Scratch

import numpy as np
from sklearn.datasets import load_breast_cancer, make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (classification_report, roc_auc_score,
                              roc_curve, precision_recall_curve, average_precision_score)
import matplotlib.pyplot as plt


class LogisticRegressionFromScratch:
    """
    Binary logistic regression via gradient descent from scratch.
    Includes numerically stable sigmoid, regularization option,
    and full probability calibration support.
    """

    def __init__(self, lr=0.1, n_iterations=1000, tolerance=1e-6,
                 lambda_reg=0.0, verbose=False):
        self.lr = lr
        self.n_iterations = n_iterations
        self.tolerance = tolerance
        self.lambda_reg = lambda_reg  # L2 regularization strength
        self.verbose = verbose
        self.weights = None
        self.bias = None
        self.loss_history = []

    def _sigmoid(self, z):
        """
        Numerically stable sigmoid.
        For large negative z, exp(-z) overflows; use exp(z)/(1+exp(z)) instead.
        """
        return np.where(
            z >= 0,
            1 / (1 + np.exp(-z)),
            np.exp(z) / (1 + np.exp(z))
        )

    def _cross_entropy(self, y, y_hat):
        """Binary cross-entropy with epsilon clipping to avoid log(0)."""
        eps = 1e-15
        y_hat = np.clip(y_hat, eps, 1 - eps)
        ce = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
        # Add L2 regularization term (don't regularize the bias)
        if self.lambda_reg > 0:
            ce += 0.5 * self.lambda_reg * np.sum(self.weights ** 2)
        return ce

    def fit(self, X, y):
        n, d = X.shape
        self.weights = np.zeros(d)
        self.bias = 0.0

        for t in range(self.n_iterations):
            # Forward pass
            z = X @ self.weights + self.bias
            y_hat = self._sigmoid(z)

            # Loss
            loss = self._cross_entropy(y, y_hat)
            self.loss_history.append(loss)

            # Gradients: (1/n) X^T (ŷ - y) + λw (L2 regularization)
            error = y_hat - y  # shape (n,)
            grad_w = (1 / n) * X.T @ error + self.lambda_reg * self.weights
            grad_b = (1 / n) * error.sum()

            grad_norm = np.linalg.norm(grad_w)
            if self.verbose and t % 200 == 0:
                print(f"Iter {t:5d}: loss={loss:.6f}, |∇|={grad_norm:.6f}")

            if grad_norm < self.tolerance:
                if self.verbose:
                    print(f"Converged at iteration {t}")
                break

            self.weights -= self.lr * grad_w
            self.bias    -= self.lr * grad_b

        return self

    def predict_proba(self, X):
        """Return probabilities for both classes: shape (n, 2)."""
        z = X @ self.weights + self.bias
        p1 = self._sigmoid(z)
        return np.column_stack([1 - p1, p1])

    def predict(self, X, threshold=0.5):
        """Return binary predictions using specified threshold."""
        return (self.predict_proba(X)[:, 1] >= threshold).astype(int)

    def score(self, X, y, threshold=0.5):
        return float(np.mean(self.predict(X, threshold) == y))


# Load the breast cancer dataset
data = load_breast_cancer()
X_raw, y_raw = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw)

scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

model = LogisticRegressionFromScratch(
    lr=0.1, n_iterations=2000, lambda_reg=0.01, verbose=True)
model.fit(X_train_sc, y_train)

y_prob = model.predict_proba(X_test_sc)[:, 1]
y_pred = model.predict(X_test_sc)

print(f"\nAccuracy:  {model.score(X_test_sc, y_test):.4f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_prob):.4f}")
print(f"Avg Prec:  {average_precision_score(y_test, y_prob):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

ROC and PR Curves

def plot_roc_and_pr(y_true, y_prob, model_name="Logistic Regression"):
    """Plot ROC curve and Precision-Recall curve side by side."""
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # ROC Curve
    fpr, tpr, roc_thresholds = roc_curve(y_true, y_prob)
    auc_roc = roc_auc_score(y_true, y_prob)

    ax = axes[0]
    ax.plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC (AUC={auc_roc:.3f})')
    ax.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random (AUC=0.5)')
    ax.fill_between(fpr, tpr, alpha=0.1, color='blue')
    ax.set_xlabel("False Positive Rate (1 - Specificity)")
    ax.set_ylabel("True Positive Rate (Sensitivity)")
    ax.set_title(f"ROC Curve - {model_name}\n(Threshold-independent ranking quality)")
    ax.legend()
    ax.grid(True, alpha=0.3)

    # Precision-Recall Curve
    precision, recall, pr_thresholds = precision_recall_curve(y_true, y_prob)
    avg_prec = average_precision_score(y_true, y_prob)
    baseline = y_true.mean()  # random classifier baseline = prevalence

    ax = axes[1]
    ax.plot(recall, precision, 'g-', linewidth=2,
            label=f'PR (AP={avg_prec:.3f})')
    ax.axhline(baseline, color='k', linestyle='--', linewidth=1,
               label=f'Random (P={baseline:.2f})')
    ax.fill_between(recall, precision, alpha=0.1, color='green')
    ax.set_xlabel("Recall (Sensitivity)")
    ax.set_ylabel("Precision (Positive Predictive Value)")
    ax.set_title(f"Precision-Recall Curve - {model_name}\n"
                 f"(Better for imbalanced datasets)")
    ax.legend()
    ax.grid(True, alpha=0.3)

    # Annotate threshold at 0.5
    idx_50 = np.argmin(np.abs(roc_thresholds - 0.5))
    axes[0].scatter(fpr[idx_50], tpr[idx_50], s=100, color='red', zorder=5,
                    label=f'Threshold=0.5 (FPR={fpr[idx_50]:.2f}, TPR={tpr[idx_50]:.2f})')
    axes[0].legend()

    plt.tight_layout()
    plt.savefig("roc_pr_curves.png", dpi=150)
    plt.show()


plot_roc_and_pr(y_test, y_prob)

Multi-Class Classification

One-vs-Rest (OvR)

Train $K$ independent binary classifiers. Classifier $k$ treats class $k$ as positive and all others as negative.

Pros: Simple, parallelizable, interpretable per-class coefficients. Cons: Probabilities from $K$ classifiers are not calibrated against each other - they do not sum to 1.

Softmax (Multinomial Logistic Regression)

The softmax function generalizes sigmoid to $K$ classes:

$P(y = k | \mathbf{x}) = \frac{e^{\mathbf{w}_k^\top\mathbf{x}}}{\sum_{j=1}^K e^{\mathbf{w}_j^\top\mathbf{x}}}$

Properties:

Outputs are in $(0, 1)$ and sum to 1 - valid probability distribution
The softmax is identifiable only up to a constant: adding $c$ to all $\mathbf{w}_k^\top\mathbf{x}$ does not change the output. Set $\mathbf{w}_K = \mathbf{0}$ for identifiability.
For $K=2$ , softmax reduces to the binary sigmoid

The multi-class cross-entropy loss:

$\mathcal{J} = -\frac{1}{n}\sum_{i=1}^n \sum_{k=1}^K y_{ik} \log P(y_i = k | \mathbf{x}_i)$

where $y_{ik} = 1$ if sample $i$ belongs to class $k$ (one-hot encoding).

class SoftmaxRegression:
    """
    Multi-class logistic regression via softmax.
    Equivalent to multinomial logistic regression (one-hot cross-entropy).
    """

    def __init__(self, lr=0.1, n_iterations=1000, lambda_reg=0.01):
        self.lr = lr
        self.n_iterations = n_iterations
        self.lambda_reg = lambda_reg
        self.W = None  # (d, K) weight matrix
        self.b = None  # (K,) bias vector
        self.loss_history = []

    def _softmax(self, Z):
        """
        Numerically stable softmax: subtract max per row before exp.
        Without this, exp() overflows for large logit values.
        """
        Z_shifted = Z - Z.max(axis=1, keepdims=True)
        exp_Z = np.exp(Z_shifted)
        return exp_Z / exp_Z.sum(axis=1, keepdims=True)

    def fit(self, X, y):
        n, d = X.shape
        K = len(np.unique(y))
        self.W = np.zeros((d, K))
        self.b = np.zeros(K)
        self.classes_ = np.unique(y)

        # One-hot encode labels
        Y_onehot = np.eye(K)[y]  # shape (n, K)

        for t in range(self.n_iterations):
            # Forward pass
            Z = X @ self.W + self.b       # logits, shape (n, K)
            probs = self._softmax(Z)       # probabilities, shape (n, K)

            # Cross-entropy loss + L2 regularization
            eps = 1e-15
            loss = -np.mean(np.sum(Y_onehot * np.log(np.clip(probs, eps, 1)), axis=1))
            loss += 0.5 * self.lambda_reg * np.sum(self.W ** 2)
            self.loss_history.append(loss)

            # Gradients: (1/n) X^T (P - Y) + λW
            error = probs - Y_onehot      # shape (n, K)
            grad_W = (1 / n) * X.T @ error + self.lambda_reg * self.W
            grad_b = (1 / n) * error.sum(axis=0)

            self.W -= self.lr * grad_W
            self.b -= self.lr * grad_b

        return self

    def predict_proba(self, X):
        Z = X @ self.W + self.b
        return self._softmax(Z)

    def predict(self, X):
        return self.predict_proba(X).argmax(axis=1)

    def score(self, X, y):
        return float(np.mean(self.predict(X) == y))


# Multi-class example on Iris
from sklearn.datasets import load_iris

iris = load_iris()
X_i, y_i = iris.data, iris.target
X_tr, X_te, y_tr, y_te = train_test_split(X_i, y_i, test_size=0.2,
                                            random_state=42, stratify=y_i)
sc_i = StandardScaler()
X_tr_sc = sc_i.fit_transform(X_tr)
X_te_sc = sc_i.transform(X_te)

sm = SoftmaxRegression(lr=0.5, n_iterations=2000, lambda_reg=0.01)
sm.fit(X_tr_sc, y_tr)
print(f"\nSoftmax Regression Accuracy on Iris: {sm.score(X_te_sc, y_te):.4f}")

# Compare with OvR
from sklearn.linear_model import LogisticRegression as SKLogReg

ovr = SKLogReg(multi_class='ovr', C=10, max_iter=1000, random_state=42)
ovr.fit(X_tr_sc, y_tr)
print(f"OvR Logistic Regression Accuracy:   {ovr.score(X_te_sc, y_te):.4f}")

Probability Calibration

A model is well-calibrated if among all predictions of probability $p$ , approximately fraction $p$ of them are actually positive. Formally:

$P(y=1 | \hat{P}(y=1|\mathbf{x}) = p) = p \quad \forall p \in [0,1]$

Expected Calibration Error (ECE)

The Expected Calibration Error (ECE) quantifies miscalibration. Partition the $[0,1]$ probability range into $M$ bins of equal width. For each bin $B_m$ :

$ECE = \sum_{m=1}^M \frac{|B_m|}{n} \left|\text{acc}(B_m) - \text{conf}(B_m)\right|$

where:

$|B_m|$ = number of samples in bin $m$
$\text{acc}(B_m) = \frac{1}{|B_m|}\sum_{i \in B_m} \mathbf{1}[y_i = \hat{y}_i]$ = fraction of correct predictions in bin
$\text{conf}(B_m) = \frac{1}{|B_m|}\sum_{i \in B_m} \hat{p}_i$ = average predicted probability in bin

A perfectly calibrated model has ECE = 0.

from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression as SKLogReg


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """
    Compute Expected Calibration Error (ECE).
    Lower is better. Perfect calibration = ECE of 0.
    """
    bin_edges = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    n = len(y_true)

    for i in range(n_bins):
        lo, hi = bin_edges[i], bin_edges[i + 1]
        in_bin = (y_prob >= lo) & (y_prob < hi)

        if in_bin.sum() == 0:
            continue

        bin_size = in_bin.sum()
        acc = y_true[in_bin].mean()   # fraction truly positive in bin
        conf = y_prob[in_bin].mean()  # average predicted probability in bin
        ece += (bin_size / n) * abs(acc - conf)

    return ece


# Evaluate calibration of our from-scratch model
ece_scratch = expected_calibration_error(y_test, y_prob)
print(f"\nECE (from-scratch logistic regression): {ece_scratch:.4f}")

# sklearn LogisticRegression
sk_lr = SKLogReg(C=100, max_iter=2000, random_state=42)
sk_lr.fit(X_train_sc, y_train)
sk_prob = sk_lr.predict_proba(X_test_sc)[:, 1]
ece_sk = expected_calibration_error(y_test, sk_prob)
print(f"ECE (sklearn LogisticRegression):       {ece_sk:.4f}")

# Visualize calibration curve
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

ax = axes[0]
prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)
ax.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
ax.plot(prob_pred, prob_true, 's-', color='blue',
        label=f'From-scratch (ECE={ece_scratch:.3f})')
prob_true_sk, prob_pred_sk = calibration_curve(y_test, sk_prob, n_bins=10)
ax.plot(prob_pred_sk, prob_true_sk, 's-', color='green',
        label=f'sklearn (ECE={ece_sk:.3f})')
ax.set_xlabel("Mean Predicted Probability")
ax.set_ylabel("Fraction of Positives (Actual Rate)")
ax.set_title("Calibration Curve\n(closer to diagonal = better calibration)")
ax.legend()
ax.grid(True, alpha=0.3)

# Probability distribution histogram
ax = axes[1]
ax.hist(y_prob[y_test == 0], bins=30, alpha=0.5, color='blue',
        label='Negative class', density=True)
ax.hist(y_prob[y_test == 1], bins=30, alpha=0.5, color='red',
        label='Positive class', density=True)
ax.set_xlabel("Predicted P(y=1|x)")
ax.set_ylabel("Density")
ax.set_title("Probability Distribution by True Class\n"
             "(good separation = good discrimination)")
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig("calibration_curves.png", dpi=150)
plt.show()


# Platt Scaling - recalibrate a poorly calibrated model
# Fit logistic regression on top of uncalibrated scores
from sklearn.linear_model import LogisticRegression as PlattLR

def platt_scaling(y_cal, raw_scores_cal, y_test, raw_scores_test):
    """
    Platt scaling: fit a sigmoid (logistic regression) on top of raw scores
    to recalibrate probabilities.
    """
    platt = PlattLR(C=1e6, max_iter=1000)
    platt.fit(raw_scores_cal.reshape(-1, 1), y_cal)
    calibrated_probs = platt.predict_proba(raw_scores_test.reshape(-1, 1))[:, 1]
    return calibrated_probs

# For this already-calibrated model the improvement will be minor,
# but the pattern applies to SVMs, gradient boosting, etc.
n_cal = len(y_test) // 2
platt_prob = platt_scaling(
    y_test[:n_cal], y_prob[:n_cal],
    y_test[n_cal:], y_prob[n_cal:]
)
ece_platt = expected_calibration_error(y_test[n_cal:], platt_prob)
print(f"\nECE after Platt scaling: {ece_platt:.4f}")
print("(Platt scaling is most useful for SVMs and gradient boosted trees)")

Handling Class Imbalance

With imbalanced classes (e.g., 1% fraud, 99% legitimate), the default threshold of 0.5 is usually wrong. The model outputs probabilities reflecting the class distribution, and simply classifying by majority class achieves 99% accuracy - useless.

Strategy 1: Class Weights

Upweight the minority class in the loss:

from sklearn.linear_model import LogisticRegression as SKLogReg
from sklearn.utils.class_weight import compute_class_weight

# Create severely imbalanced dataset (5% positive)
np.random.seed(42)
X_imb, y_imb = make_classification(
    n_samples=5000, n_features=20, n_informative=10,
    weights=[0.95, 0.05],  # 95% negative, 5% positive
    random_state=42
)
X_tr_i, X_te_i, y_tr_i, y_te_i = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb)

sc_imb = StandardScaler()
X_tr_i_sc = sc_imb.fit_transform(X_tr_i)
X_te_i_sc  = sc_imb.transform(X_te_i)

# Compute balanced class weights: n_samples / (n_classes * count_per_class)
classes = np.unique(y_tr_i)
class_weights = compute_class_weight('balanced', classes=classes, y=y_tr_i)
weight_dict = dict(zip(classes, class_weights))
print(f"Class weights: {weight_dict}")

# Without class weights - likely predicts all zeros
lr_unweighted = SKLogReg(C=1.0, max_iter=1000, random_state=42)
lr_unweighted.fit(X_tr_i_sc, y_tr_i)

# With class weights - penalizes minority class errors more heavily
lr_weighted = SKLogReg(C=1.0, max_iter=1000, class_weight='balanced', random_state=42)
lr_weighted.fit(X_tr_i_sc, y_tr_i)

print("\nClass Imbalance: Effect of class_weight='balanced'")
print(f"{'Model':25s} | {'Accuracy':>10} | {'AUC-ROC':>8} | {'Recall-1':>10}")
print("-" * 60)

for name, m in [("Unweighted LR", lr_unweighted), ("Weighted LR", lr_weighted)]:
    y_pred_i = m.predict(X_te_i_sc)
    y_prob_i = m.predict_proba(X_te_i_sc)[:, 1]
    acc = np.mean(y_pred_i == y_te_i)
    auc = roc_auc_score(y_te_i, y_prob_i)
    recall = np.sum((y_pred_i == 1) & (y_te_i == 1)) / np.sum(y_te_i == 1)
    print(f"{name:25s} | {acc:>10.4f} | {auc:>8.4f} | {recall:>10.4f}")

Strategy 2: Threshold Tuning

The default threshold of 0.5 is arbitrary. Tune it based on the business cost of false positives vs false negatives:

def find_optimal_threshold(y_true, y_prob, metric='f1'):
    """
    Find the probability threshold that optimizes the given metric.
    Business context determines the right metric:
    - F1: balanced precision/recall (most common)
    - Precision: when false positives are very costly (fraud review queue)
    - Recall: when false negatives are very costly (cancer screening)
    """
    thresholds = np.arange(0.05, 0.95, 0.01)
    scores = []

    for thresh in thresholds:
        y_pred = (y_prob >= thresh).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))

        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall    = tp / (tp + fn) if (tp + fn) > 0 else 0

        if metric == 'f1':
            score = (2 * precision * recall / (precision + recall)
                     if (precision + recall) > 0 else 0)
        elif metric == 'precision':
            score = precision
        elif metric == 'recall':
            score = recall

        scores.append(score)

    best_idx = np.argmax(scores)
    best_thresh = thresholds[best_idx]
    best_score = scores[best_idx]

    plt.figure(figsize=(8, 4))
    plt.plot(thresholds, scores, 'b-', linewidth=2)
    plt.axvline(best_thresh, color='r', linestyle='--',
                label=f'Optimal threshold={best_thresh:.2f} ({metric}={best_score:.3f})')
    plt.xlabel("Classification Threshold")
    plt.ylabel(f"{metric.capitalize()} Score")
    plt.title(f"Threshold Tuning: Optimize {metric.upper()}")
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig("threshold_tuning.png", dpi=150)
    plt.show()

    return best_thresh, best_score


y_prob_weighted = lr_weighted.predict_proba(X_te_i_sc)[:, 1]
best_thresh, best_f1 = find_optimal_threshold(y_te_i, y_prob_weighted, metric='f1')
print(f"\nOptimal threshold for F1: {best_thresh:.2f}, F1={best_f1:.4f}")

scikit-learn with Calibration

from sklearn.linear_model import LogisticRegression as SKLogReg
from sklearn.calibration import CalibratedClassifierCV

# sklearn LogisticRegression with L-BFGS solver (quasi-Newton, much faster than GD)
sk_lr = SKLogReg(C=10.0, max_iter=2000, solver='lbfgs', random_state=42)
sk_lr.fit(X_train_sc, y_train)
print(f"\nsklearn LogReg (lbfgs) Accuracy: {sk_lr.score(X_test_sc, y_test):.4f}")
print(f"sklearn ROC-AUC: {roc_auc_score(y_test, sk_lr.predict_proba(X_test_sc)[:, 1]):.4f}")

# Calibrated logistic regression using isotonic regression
# Useful for models that are not natively calibrated (e.g., SVMs, gradient boosting)
calib_lr = CalibratedClassifierCV(sk_lr, method='isotonic', cv=5)
calib_lr.fit(X_train_sc, y_train)
calib_probs = calib_lr.predict_proba(X_test_sc)[:, 1]
ece_calib = expected_calibration_error(y_test, calib_probs)
print(f"Calibrated LR ECE: {ece_calib:.4f}")

# Solver comparison
print("\nSolver Comparison:")
for solver in ['lbfgs', 'liblinear', 'saga']:
    try:
        m = SKLogReg(C=10.0, max_iter=2000, solver=solver, random_state=42)
        m.fit(X_train_sc, y_train)
        prob_s = m.predict_proba(X_test_sc)[:, 1]
        print(f"  {solver:12s}: Acc={m.score(X_test_sc, y_test):.4f}, "
              f"AUC={roc_auc_score(y_test, prob_s):.4f}")
    except Exception as e:
        print(f"  {solver:12s}: {e}")

:::tip Production solver choice sklearn's LogisticRegression uses LBFGS by default for small/medium datasets - it is a quasi-Newton method that converges much faster than gradient descent (quadratic vs linear convergence). For large datasets ( $n > 10^5$ ) or sparse features, use solver='saga' which supports all penalties (L1, L2, ElasticNet) and mini-batch training. For L1 penalty with small datasets, solver='liblinear' uses coordinate descent - very efficient for sparse data. :::

Common Mistakes

:::danger Using MSE loss for logistic regression Logistic regression with MSE loss is not logistic regression - it is a different model with different optimization properties. MSE + sigmoid has vanishing gradients when predictions are saturated, poor calibration, and a non-convex loss surface. Always use cross-entropy (NLL) for logistic regression. In PyTorch: use nn.BCEWithLogitsLoss() (which combines sigmoid and cross-entropy for numerical stability). In sklearn: LogisticRegression always uses cross-entropy. :::

:::danger Threshold = 0.5 with imbalanced classes The 0.5 threshold assumes both classes are equally important. For a 1:99 imbalanced dataset, predicting "negative" for everything achieves 99% accuracy - a useless model. Always tune the threshold based on the precision/recall trade-off that matches your business requirements. Use classification_report with the tuned threshold, and report precision and recall separately, not just accuracy. :::

:::warning Not checking calibration before using predicted probabilities for decisions Raw logistic regression probabilities are approximately calibrated when the model is correctly specified, but miscalibration occurs with class imbalance, distribution shift, or overfit models. Before using probabilities for cost-benefit calculations, risk scoring, or ranking, always plot the calibration curve and compute ECE. If ECE > 0.05 on the test set, apply Platt scaling or isotonic regression recalibration. :::

:::warning Interpreting coefficients without standardizing A coefficient of 2.3 for age (in years) and 0.001 for income (in dollars) cannot be compared directly. Standardize features before training if you want to compare coefficient magnitudes. Report both standardized beta coefficients (for comparison) and unstandardized coefficients (for the original unit interpretation). :::

YouTube Resources

Resource	Channel	Why Watch
Logistic Regression, Clearly Explained!	StatQuest with Josh Starmer	Best visual introduction - sigmoid, log-odds, MLE connection
Softmax and Cross-Entropy Explained	Andrej Karpathy	Deep derivation of softmax, cross-entropy, and gradients
Precision vs Recall vs F1 Score	StatQuest	When accuracy fails - ROC, PR curves, and threshold selection
Probability Calibration for ML Models	Scikit-learn talks	ECE, Platt scaling, isotonic regression - practical calibration
Handling Imbalanced Datasets	DeepLearning.AI	Class weights, SMOTE, and threshold tuning - complete imbalance guide

Interview Q&A

Q1: Derive the cross-entropy loss from the logistic regression model. Why not use MSE?

Cross-entropy is derived from MLE under the Bernoulli likelihood. For binary label $y_i$ : $P(y_i|\mathbf{x}_i;\mathbf{w}) = \hat{y}_i^{y_i}(1-\hat{y}_i)^{1-y_i}$ . Taking the log over $n$ samples: $\sum_i[y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)]$ . Minimizing the negative log-likelihood gives cross-entropy. This is the statistically principled loss.

MSE with sigmoid suffers from vanishing gradients: the gradient $(\hat{y}-y)\sigma'(z)x_j$ involves $\sigma'(z) = \hat{y}(1-\hat{y}) \approx 0$ for saturated predictions - exactly when the model is most wrong. Cross-entropy's gradient $(\hat{y}-y)x_j$ has no such term - it is large when the model is confidently wrong, driving fast correction.

Q2: What does a logistic regression coefficient mean?

For a binary classification model with standardized feature $j$ : $e^{w_j}$ is the odds ratio - a 1 SD increase in feature $j$ multiplies the odds of the positive class by $e^{w_j}$ (holding all other features constant). For $w_j = 0.5$ : odds multiply by $e^{0.5} \approx 1.65$ (65% increase in odds, not 65% increase in probability). For unstandardized features: a one-unit increase in feature $j$ multiplies odds by $e^{w_j}$ . Positive coefficients increase the probability; negative decrease it. The magnitude indicates strength; the sign indicates direction.

Q3: What is the decision boundary of logistic regression geometrically?

The decision boundary is the hyperplane $\{\mathbf{x}: \mathbf{w}^\top\mathbf{x} + b = 0\}$ - a $(d-1)$ -dimensional linear subspace in $d$ -dimensional feature space. In 2D it is a line; in 3D a plane. The vector $\mathbf{w}$ is perpendicular to this hyperplane and points toward the positive class. Logistic regression is inherently linear - it cannot learn XOR, circles, or any non-linear boundary without explicit feature engineering (polynomial features, interaction terms). This is the fundamental limitation that motivates kernel methods (Lesson 06) and neural networks.

Q4: How does softmax generalize sigmoid to multiple classes?

For binary classification, sigmoid outputs $P(y=1|\mathbf{x}) = e^{z}/(e^z + e^0)$ - a ratio of two exponentials. Softmax generalizes: $P(y=k|\mathbf{x}) = e^{\mathbf{w}_k^\top\mathbf{x}} / \sum_{j=1}^K e^{\mathbf{w}_j^\top\mathbf{x}}$ . Key properties: (1) outputs are in $(0,1)$ and sum to 1 - valid probability distribution; (2) softmax is invariant to adding a constant to all logits (set last class weight to zero for identifiability); (3) the gradient for each class $k$ is $(\hat{p}_k - y_k)\mathbf{x}$ - same beautiful error-times-features form as binary; (4) with $K=2$ , softmax reduces to the binary sigmoid.

Q5: What is the Expected Calibration Error (ECE) and when is logistic regression miscalibrated?

ECE = $\sum_{m=1}^M \frac{|B_m|}{n}|\text{acc}(B_m) - \text{conf}(B_m)|$ - the weighted average absolute difference between predicted probabilities and actual positive rates across $M$ probability bins. Perfect calibration: ECE = 0. Logistic regression is theoretically well-calibrated when the linear model is correctly specified (true log-odds are linear in features). Miscalibration occurs when: (1) extreme class imbalance shifts probabilities away from the true base rate; (2) features are correlated with omitted variables; (3) distribution shift between train and test; (4) the model is overfit (training probabilities pushed to 0/1 extremes). Fix with Platt scaling (logistic regression on top of raw outputs), isotonic regression (non-parametric), or temperature scaling ( $T$ in $\sigma(z/T)$ where $T > 1$ softens overconfident predictions).

Q6: How do you handle a dataset with 1% positive examples for logistic regression?

Multi-step approach: (1) Class weights: class_weight='balanced' in sklearn weights minority class errors by $n_\text{neg}/n_\text{pos} = 99\times$ more heavily. This is equivalent to oversampling the minority class. (2) Threshold tuning: the default 0.5 threshold is wrong. Plot the PR curve and find the threshold that gives the precision/recall balance matching your business requirements - often around 0.1–0.2 for 1% prevalence. (3) Evaluation: accuracy is useless (99% by predicting all negative). Use precision@K, AUC-PR, and recall at your target precision point. (4) Data augmentation: SMOTE generates synthetic minority class samples in feature space, or use actual oversampling/undersampling. (5) Check calibration: ECE after class weighting ensures probabilities remain meaningful for threshold selection.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Logistic Regression Decision Boundary demo on the EngineersOfAI Playground - no code required.

:::

The Real Interview Moment​

Why Logistic Regression Exists​

Historical Context​

The Sigmoid Function: Derivation and Properties​

From Log-Odds to Sigmoid​

Log-Odds Interpretation​

Cross-Entropy Loss: Derivation from Maximum Likelihood​

Step 1: Write the Likelihood​

Step 2: Write the Log-Likelihood​

Step 3: Maximize Log-Likelihood = Minimize NLL = Minimize Cross-Entropy​

Why Not MSE for Classification?​

Computing the Gradient​

Decision Boundary Geometry​

Full NumPy Implementation from Scratch​

ROC and PR Curves​

Multi-Class Classification​

One-vs-Rest (OvR)​

Softmax (Multinomial Logistic Regression)​

Probability Calibration​

Expected Calibration Error (ECE)​

Handling Class Imbalance​

Strategy 1: Class Weights​

Strategy 2: Threshold Tuning​

scikit-learn with Calibration​

Common Mistakes​

YouTube Resources​

Interview Q&A​