Python ML Coding - Implement Algorithms From Scratch

Reading time: ~50 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Engineer, Applied Scientist

The Real Interview Moment

You are in an Anthropic Research Engineer interview. The interviewer says: "Implement logistic regression from scratch. No sklearn. Just NumPy. Include gradient descent, prediction, and show me the training loop." You have seen logistic regression a hundred times in courses, but now you are writing it under pressure. Do you remember the gradient formula? Is it X.T @ (predictions - labels) or (predictions - labels).T @ X? Do you divide by batch size? Do you apply sigmoid before or after the dot product?

This is the ultimate test of understanding. Anyone can call sklearn.LogisticRegression().fit(X, y). But implementing the algorithm from scratch - deriving the gradient, managing the shapes, handling numerical stability - proves you understand what the library is actually doing. When your model misbehaves in production, this is the knowledge that lets you debug it.

Every ML-focused interview at research labs (Anthropic, OpenAI, DeepMind, FAIR) and top companies (Google Brain, Meta AI) includes at least one "implement from scratch" question. Candidates who can write clean, correct implementations with proper vectorization get "strong hire." Candidates who get lost in the shapes or cannot remember the gradient get "no hire."

What You Will Master

Implement linear regression with closed-form solution and gradient descent
Implement logistic regression with binary cross-entropy and sigmoid
Implement a decision tree classifier with information gain
Implement K-nearest neighbors with vectorized distance
Implement K-means clustering with convergence detection
Implement PCA via eigendecomposition and SVD
Implement gradient descent variants: batch, mini-batch, SGD
Implement a neural network forward and backward pass
Implement train-test split and k-fold cross-validation
Explain every step - shapes, gradients, and design decisions

Self-Assessment: Where Are You Now?

Skill	1 - Cannot	2 - Vaguely	3 - Can Code	4 - Can Derive + Code	5 - Can Teach	Your Score
Linear regression (closed-form + GD)						___
Logistic regression from scratch						___
Decision tree with info gain						___
K-NN prediction						___
K-means clustering						___
PCA / dimensionality reduction						___
Neural network forward/backward						___
Cross-validation implementation						___

Target: All 4s and 5s before your interview.

Implementation 1 - Linear Regression

Closed-Form Solution (Normal Equation)

import numpy as np

class LinearRegressionClosedForm:
    """
    Linear regression using the normal equation.
    w* = (X^T X)^(-1) X^T y
    """

    def __init__(self):
        self.weights = None  # Includes bias as last element

    def fit(self, X, y):
        """
        Fit the model using the normal equation.

        Args:
            X: np.ndarray of shape (n_samples, n_features)
            y: np.ndarray of shape (n_samples,)
        """
        # Add bias column (column of ones)
        X_b = np.column_stack([X, np.ones(X.shape[0])])

        # Normal equation: w = (X^T X)^(-1) X^T y
        # Use lstsq for numerical stability (avoids inverting X^T X)
        self.weights, _, _, _ = np.linalg.lstsq(X_b, y, rcond=None)

    def predict(self, X):
        """Predict target values."""
        X_b = np.column_stack([X, np.ones(X.shape[0])])
        return X_b @ self.weights

60-Second Answer

"Linear regression minimizes squared error. The closed-form solution is w = (X^T X)^(-1) X^T y, which I implement using np.linalg.lstsq for numerical stability. For large datasets, I use gradient descent instead - the gradient of MSE with respect to w is (2/n) * X^T @ (X @ w - y). I add a bias term by appending a column of ones to X. The closed-form is O(n*d^2 + d^3), so it becomes impractical when d exceeds ~10,000 features."

Gradient Descent Solution

class LinearRegressionGD:
    """
    Linear regression using gradient descent.
    Loss: L = (1/2n) * ||Xw - y||^2
    Gradient: dL/dw = (1/n) * X^T @ (Xw - y)
    """

    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.weights = None
        self.bias = None
        self.loss_history = []

    def fit(self, X, y):
        """
        Fit the model using gradient descent.

        Args:
            X: np.ndarray of shape (n_samples, n_features)
            y: np.ndarray of shape (n_samples,)
        """
        n_samples, n_features = X.shape

        # Initialize weights to small random values
        self.weights = np.random.randn(n_features) * 0.01
        self.bias = 0.0

        for i in range(self.n_iter):
            # Forward pass: compute predictions
            y_pred = X @ self.weights + self.bias  # Shape: (n_samples,)

            # Compute loss (MSE)
            error = y_pred - y  # Shape: (n_samples,)
            loss = np.mean(error ** 2) / 2
            self.loss_history.append(loss)

            # Compute gradients
            dw = (1 / n_samples) * (X.T @ error)  # Shape: (n_features,)
            db = (1 / n_samples) * np.sum(error)   # Scalar

            # Update weights
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

    def predict(self, X):
        return X @ self.weights + self.bias

Common Trap

The gradient of MSE loss is (1/n) * X.T @ (predictions - targets), not (1/n) * X.T @ (targets - predictions). Getting the sign wrong means you do gradient ascent instead of descent - your loss will increase. This is the single most common mistake in "implement from scratch" interviews. Double-check: the gradient points in the direction of increasing loss, so you subtract it.

Implementation 2 - Logistic Regression

class LogisticRegression:
    """
    Binary logistic regression using gradient descent.
    Model: p(y=1|x) = sigmoid(x @ w + b)
    Loss: Binary cross-entropy = -(1/n) * sum(y*log(p) + (1-y)*log(1-p))
    Gradient: dL/dw = (1/n) * X^T @ (sigmoid(Xw+b) - y)
    """

    def __init__(self, learning_rate=0.01, n_iterations=1000,
                 reg_lambda=0.0):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.reg_lambda = reg_lambda  # L2 regularization strength
        self.weights = None
        self.bias = None
        self.loss_history = []

    @staticmethod
    def _sigmoid(z):
        """Numerically stable sigmoid."""
        # Clip to prevent overflow in exp
        z = np.clip(z, -500, 500)
        return 1.0 / (1.0 + np.exp(-z))

    def fit(self, X, y):
        """
        Fit the model.

        Args:
            X: np.ndarray of shape (n_samples, n_features)
            y: np.ndarray of shape (n_samples,) with values 0 or 1
        """
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0.0

        for i in range(self.n_iter):
            # Forward pass
            z = X @ self.weights + self.bias     # Shape: (n_samples,)
            y_pred = self._sigmoid(z)            # Shape: (n_samples,)

            # Compute binary cross-entropy loss
            eps = 1e-12  # Prevent log(0)
            loss = -(1 / n_samples) * np.sum(
                y * np.log(y_pred + eps) +
                (1 - y) * np.log(1 - y_pred + eps)
            )
            # Add L2 regularization term
            loss += (self.reg_lambda / (2 * n_samples)) * np.sum(self.weights ** 2)
            self.loss_history.append(loss)

            # Compute gradients
            # Key insight: gradient has the same form as linear regression
            # but with sigmoid(Xw) instead of Xw
            error = y_pred - y  # Shape: (n_samples,)
            dw = (1 / n_samples) * (X.T @ error)  # Shape: (n_features,)
            dw += (self.reg_lambda / n_samples) * self.weights  # L2 gradient
            db = (1 / n_samples) * np.sum(error)

            # Update
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

    def predict_proba(self, X):
        """Return probability of class 1."""
        return self._sigmoid(X @ self.weights + self.bias)

    def predict(self, X, threshold=0.5):
        """Return binary predictions."""
        return (self.predict_proba(X) >= threshold).astype(int)

Logistic Regression Training Loop

Instant Rejection

Never confuse the logistic regression gradient with the linear regression gradient. The logistic regression gradient is X.T @ (sigmoid(Xw) - y), NOT X.T @ (Xw - y). Both have the form X.T @ (prediction - target), but the "prediction" is different - sigmoid output for logistic, raw linear output for linear. If you write the wrong one, the interviewer knows you are memorizing rather than deriving.

Implementation 3 - Decision Tree

class DecisionTreeClassifier:
    """
    Decision tree using information gain (entropy-based).
    """

    class Node:
        def __init__(self, feature=None, threshold=None, left=None,
                     right=None, value=None):
            self.feature = feature        # Feature index to split on
            self.threshold = threshold    # Threshold value
            self.left = left              # Left child (<=threshold)
            self.right = right            # Right child (>threshold)
            self.value = value            # Class prediction (leaf node)

    def __init__(self, max_depth=10, min_samples_split=2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.root = None

    @staticmethod
    def _entropy(y):
        """Compute entropy of a label distribution."""
        if len(y) == 0:
            return 0
        proportions = np.bincount(y) / len(y)
        proportions = proportions[proportions > 0]  # Remove zeros
        return -np.sum(proportions * np.log2(proportions))

    def _information_gain(self, y, left_mask):
        """Compute information gain from a split."""
        parent_entropy = self._entropy(y)

        left_y = y[left_mask]
        right_y = y[~left_mask]

        if len(left_y) == 0 or len(right_y) == 0:
            return 0

        n = len(y)
        child_entropy = (
            (len(left_y) / n) * self._entropy(left_y) +
            (len(right_y) / n) * self._entropy(right_y)
        )

        return parent_entropy - child_entropy

    def _best_split(self, X, y):
        """Find the best feature and threshold to split on."""
        best_gain = -1
        best_feature = None
        best_threshold = None

        n_features = X.shape[1]

        for feature in range(n_features):
            # Get unique sorted thresholds (midpoints between unique values)
            values = np.unique(X[:, feature])
            thresholds = (values[:-1] + values[1:]) / 2

            for threshold in thresholds:
                left_mask = X[:, feature] <= threshold
                gain = self._information_gain(y, left_mask)

                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature
                    best_threshold = threshold

        return best_feature, best_threshold, best_gain

    def _build_tree(self, X, y, depth=0):
        """Recursively build the decision tree."""
        n_samples = len(y)
        n_classes = len(np.unique(y))

        # Stopping conditions
        if (depth >= self.max_depth or
            n_classes == 1 or
            n_samples < self.min_samples_split):
            # Leaf node: predict the majority class
            leaf_value = np.argmax(np.bincount(y))
            return self.Node(value=leaf_value)

        # Find the best split
        feature, threshold, gain = self._best_split(X, y)

        if gain <= 0:
            leaf_value = np.argmax(np.bincount(y))
            return self.Node(value=leaf_value)

        # Split the data
        left_mask = X[:, feature] <= threshold
        left_child = self._build_tree(X[left_mask], y[left_mask], depth + 1)
        right_child = self._build_tree(X[~left_mask], y[~left_mask], depth + 1)

        return self.Node(feature=feature, threshold=threshold,
                        left=left_child, right=right_child)

    def fit(self, X, y):
        """Build the tree."""
        self.root = self._build_tree(X, y.astype(int))

    def _predict_single(self, x, node):
        """Predict for a single sample."""
        if node.value is not None:
            return node.value
        if x[node.feature] <= node.threshold:
            return self._predict_single(x, node.left)
        return self._predict_single(x, node.right)

    def predict(self, X):
        """Predict for multiple samples."""
        return np.array([self._predict_single(x, self.root) for x in X])

Company Variation

Google and Meta rarely ask you to implement a full decision tree in an interview - it is too long. They may ask you to implement just the _information_gain or _best_split function. Research labs (Anthropic, DeepMind) may ask for the full tree or ask you to extend it (e.g., add Gini impurity, handle continuous vs categorical features, implement pruning). The key is knowing the algorithm well enough to implement any part on demand.

Implementation 4 - K-Nearest Neighbors

class KNNClassifier:
    """
    K-nearest neighbors classifier using vectorized distance computation.
    """

    def __init__(self, k=5):
        self.k = k
        self.X_train = None
        self.y_train = None

    def fit(self, X, y):
        """Store training data (lazy learner)."""
        self.X_train = X
        self.y_train = y.astype(int)

    def predict(self, X):
        """
        Predict labels for query points.

        Args:
            X: np.ndarray of shape (n_query, n_features)
        Returns:
            predictions: np.ndarray of shape (n_query,)
        """
        # Compute pairwise distances using the expansion trick
        # ||a-b||^2 = ||a||^2 + ||b||^2 - 2*a.b
        X_sq = np.sum(X ** 2, axis=1)[:, None]          # (n_query, 1)
        Xt_sq = np.sum(self.X_train ** 2, axis=1)[None, :]  # (1, n_train)
        cross = X @ self.X_train.T                        # (n_query, n_train)
        distances = np.sqrt(np.maximum(X_sq + Xt_sq - 2 * cross, 0))

        # Find k nearest neighbors (O(n) partial sort per query)
        knn_indices = np.argpartition(distances, self.k, axis=1)[:, :self.k]

        # Get labels of nearest neighbors
        knn_labels = self.y_train[knn_indices]  # (n_query, k)

        # Majority vote
        n_classes = self.y_train.max() + 1
        predictions = np.zeros(X.shape[0], dtype=int)
        for i in range(X.shape[0]):
            counts = np.bincount(knn_labels[i], minlength=n_classes)
            predictions[i] = np.argmax(counts)

        return predictions

Implementation 5 - K-Means Clustering

class KMeans:
    """
    K-means clustering with k-means++ initialization.
    """

    def __init__(self, n_clusters=3, max_iter=100, tol=1e-4, random_state=42):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.tol = tol
        self.rng = np.random.default_rng(random_state)
        self.centroids = None
        self.labels = None
        self.inertia = None  # Sum of squared distances to nearest centroid

    def _init_centroids_plusplus(self, X):
        """K-means++ initialization: spread out initial centroids."""
        n_samples = X.shape[0]
        centroids = np.zeros((self.n_clusters, X.shape[1]))

        # First centroid: random sample
        centroids[0] = X[self.rng.integers(n_samples)]

        for k in range(1, self.n_clusters):
            # Compute distance from each point to nearest existing centroid
            distances = np.min(
                np.sum((X[:, None, :] - centroids[None, :k, :]) ** 2, axis=2),
                axis=1
            )
            # Sample next centroid with probability proportional to distance^2
            probs = distances / distances.sum()
            centroids[k] = X[self.rng.choice(n_samples, p=probs)]

        return centroids

    def fit(self, X):
        """
        Fit K-means to data.

        Args:
            X: np.ndarray of shape (n_samples, n_features)
        """
        # Initialize centroids using k-means++
        self.centroids = self._init_centroids_plusplus(X)

        for iteration in range(self.max_iter):
            # Step 1: Assign each point to nearest centroid
            # Distances: (n_samples, n_clusters)
            distances = np.sqrt(np.sum(
                (X[:, None, :] - self.centroids[None, :, :]) ** 2,
                axis=2
            ))
            self.labels = np.argmin(distances, axis=1)

            # Step 2: Update centroids to mean of assigned points
            new_centroids = np.zeros_like(self.centroids)
            for k in range(self.n_clusters):
                cluster_points = X[self.labels == k]
                if len(cluster_points) > 0:
                    new_centroids[k] = cluster_points.mean(axis=0)
                else:
                    # Empty cluster: reinitialize to random point
                    new_centroids[k] = X[self.rng.integers(X.shape[0])]

            # Step 3: Check convergence
            centroid_shift = np.sqrt(np.sum((new_centroids - self.centroids) ** 2))
            self.centroids = new_centroids

            if centroid_shift < self.tol:
                break

        # Compute inertia (sum of squared distances to nearest centroid)
        self.inertia = sum(
            np.sum((X[self.labels == k] - self.centroids[k]) ** 2)
            for k in range(self.n_clusters)
        )

        return self

    def predict(self, X):
        """Assign new points to nearest centroid."""
        distances = np.sqrt(np.sum(
            (X[:, None, :] - self.centroids[None, :, :]) ** 2,
            axis=2
        ))
        return np.argmin(distances, axis=1)

Implementation 6 - PCA

class PCA:
    """
    Principal Component Analysis via eigendecomposition.
    """

    def __init__(self, n_components):
        self.n_components = n_components
        self.components = None     # Top-k eigenvectors (k, d)
        self.mean = None           # Feature means (d,)
        self.explained_variance = None
        self.explained_variance_ratio = None

    def fit(self, X):
        """
        Fit PCA: compute principal components.

        Args:
            X: np.ndarray of shape (n_samples, n_features)
        """
        n_samples = X.shape[0]

        # Step 1: Center the data
        self.mean = X.mean(axis=0)
        X_centered = X - self.mean

        # Step 2: Compute covariance matrix
        # cov = (1/(n-1)) * X_centered^T @ X_centered
        cov_matrix = (X_centered.T @ X_centered) / (n_samples - 1)

        # Step 3: Eigendecomposition (eigh for symmetric matrices)
        eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

        # Step 4: Sort by eigenvalue (descending)
        # eigh returns in ascending order, so reverse
        idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idx]
        eigenvectors = eigenvectors[:, idx]

        # Step 5: Select top-k components
        self.components = eigenvectors[:, :self.n_components].T  # Shape: (k, d)
        self.explained_variance = eigenvalues[:self.n_components]
        self.explained_variance_ratio = (
            self.explained_variance / eigenvalues.sum()
        )

        return self

    def transform(self, X):
        """Project data onto principal components."""
        X_centered = X - self.mean
        return X_centered @ self.components.T  # (n, d) @ (d, k) = (n, k)

    def fit_transform(self, X):
        """Fit and transform in one step."""
        self.fit(X)
        return self.transform(X)

    def inverse_transform(self, X_reduced):
        """Reconstruct data from reduced representation."""
        return X_reduced @ self.components + self.mean  # (n, k) @ (k, d) + (d,)

Implementation 7 - Gradient Descent Variants

class GradientDescentOptimizer:
    """
    Gradient descent variants: batch, mini-batch, SGD with momentum.
    """

    def __init__(self, learning_rate=0.01, momentum=0.0, batch_size=None):
        self.lr = learning_rate
        self.momentum = momentum
        self.batch_size = batch_size  # None = full batch
        self.velocity_w = None
        self.velocity_b = None

    def step(self, X, y, weights, bias, loss_fn, grad_fn):
        """
        Perform one step of gradient descent.

        Args:
            X, y: data
            weights: current weights
            bias: current bias
            loss_fn: function(X, y, w, b) -> scalar loss
            grad_fn: function(X, y, w, b) -> (dw, db)
        Returns:
            updated weights, bias, loss
        """
        n_samples = X.shape[0]

        if self.batch_size is None or self.batch_size >= n_samples:
            # Full batch gradient descent
            batch_X, batch_y = X, y
        else:
            # Mini-batch: randomly sample a batch
            indices = np.random.choice(n_samples, self.batch_size, replace=False)
            batch_X, batch_y = X[indices], y[indices]

        # Compute gradients on the batch
        dw, db = grad_fn(batch_X, batch_y, weights, bias)

        # Initialize velocity if needed
        if self.velocity_w is None:
            self.velocity_w = np.zeros_like(weights)
            self.velocity_b = 0.0

        # Update with momentum
        self.velocity_w = self.momentum * self.velocity_w - self.lr * dw
        self.velocity_b = self.momentum * self.velocity_b - self.lr * db

        weights = weights + self.velocity_w
        bias = bias + self.velocity_b

        # Compute loss on full data for monitoring
        loss = loss_fn(X, y, weights, bias)

        return weights, bias, loss

Gradient Descent Comparison

Variant	Batch Size	Gradient Noise	Convergence	Memory	When to Use
Batch GD	Full dataset	None (exact gradient)	Smooth, slow	High	Small datasets
Mini-batch GD	32-256	Moderate	Noisy but fast	Moderate	Standard choice
SGD	1	High	Very noisy, fast	Low	Online learning
SGD + Momentum	1-256	Moderate	Accelerated	Moderate	Deep learning default

Implementation 8 - Neural Network (Forward and Backward Pass)

class NeuralNetwork:
    """
    Simple feedforward neural network with backpropagation.
    Architecture: input -> hidden (ReLU) -> output (softmax)
    """

    def __init__(self, input_size, hidden_size, output_size, learning_rate=0.01):
        self.lr = learning_rate

        # Xavier initialization
        self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(2.0 / input_size)
        self.b1 = np.zeros(hidden_size)
        self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(2.0 / hidden_size)
        self.b2 = np.zeros(output_size)

        # Cache for backward pass
        self.cache = {}

    def _relu(self, z):
        return np.maximum(z, 0)

    def _relu_grad(self, z):
        return (z > 0).astype(float)

    def _softmax(self, z):
        shifted = z - z.max(axis=1, keepdims=True)
        exp_z = np.exp(shifted)
        return exp_z / exp_z.sum(axis=1, keepdims=True)

    def forward(self, X):
        """
        Forward pass.

        Args:
            X: np.ndarray of shape (batch_size, input_size)
        Returns:
            probabilities: np.ndarray of shape (batch_size, output_size)
        """
        # Layer 1: linear + ReLU
        z1 = X @ self.W1 + self.b1              # (batch, hidden)
        a1 = self._relu(z1)                      # (batch, hidden)

        # Layer 2: linear + softmax
        z2 = a1 @ self.W2 + self.b2              # (batch, output)
        a2 = self._softmax(z2)                    # (batch, output)

        # Cache for backward pass
        self.cache = {'X': X, 'z1': z1, 'a1': a1, 'z2': z2, 'a2': a2}

        return a2

    def backward(self, y_true):
        """
        Backward pass: compute gradients using chain rule.

        Args:
            y_true: np.ndarray of shape (batch_size,) - integer labels
        """
        batch_size = y_true.shape[0]
        X = self.cache['X']
        z1 = self.cache['z1']
        a1 = self.cache['a1']
        a2 = self.cache['a2']  # softmax output

        # Convert labels to one-hot
        one_hot = np.zeros_like(a2)
        one_hot[np.arange(batch_size), y_true] = 1

        # Output layer gradient
        # For softmax + cross-entropy, the gradient simplifies to:
        # dL/dz2 = a2 - one_hot (probability - target)
        dz2 = (a2 - one_hot) / batch_size  # (batch, output)

        # Gradients for W2, b2
        dW2 = a1.T @ dz2                   # (hidden, output)
        db2 = np.sum(dz2, axis=0)          # (output,)

        # Hidden layer gradient (backpropagate through ReLU)
        da1 = dz2 @ self.W2.T              # (batch, hidden)
        dz1 = da1 * self._relu_grad(z1)    # (batch, hidden) - element-wise

        # Gradients for W1, b1
        dW1 = X.T @ dz1                    # (input, hidden)
        db1 = np.sum(dz1, axis=0)          # (hidden,)

        # Update weights
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1

    def compute_loss(self, y_pred, y_true):
        """Cross-entropy loss."""
        batch_size = y_true.shape[0]
        eps = 1e-12
        log_probs = -np.log(y_pred[np.arange(batch_size), y_true] + eps)
        return np.mean(log_probs)

    def train(self, X, y, epochs=100, verbose=True):
        """Training loop."""
        for epoch in range(epochs):
            # Forward pass
            y_pred = self.forward(X)
            loss = self.compute_loss(y_pred, y)

            # Backward pass
            self.backward(y)

            if verbose and epoch % 10 == 0:
                accuracy = np.mean(np.argmax(y_pred, axis=1) == y)
                print(f"Epoch {epoch}: loss={loss:.4f}, accuracy={accuracy:.4f}")

Neural Network Forward and Backward Pass

Instant Rejection

Two fatal errors in neural network implementations: (1) Forgetting to divide dz2 by batch_size (your gradients scale with batch size, causing explosive updates). (2) Computing dz1 = da1 * self._relu(z1) instead of da1 * self._relu_grad(z1) - applying the function instead of its derivative. The ReLU derivative is 1 for z > 0 and 0 for z <= 0, not the ReLU function itself.

Implementation 9 - Train-Test Split

def train_test_split(X, y, test_size=0.2, random_state=None, stratify=None):
    """
    Split data into training and test sets.

    Args:
        X: features (n_samples, n_features)
        y: labels (n_samples,)
        test_size: fraction of data for testing
        random_state: for reproducibility
        stratify: if provided, maintain class proportions
    Returns:
        X_train, X_test, y_train, y_test
    """
    rng = np.random.default_rng(random_state)
    n_samples = X.shape[0]

    if stratify is not None:
        # Stratified split: maintain class proportions
        train_indices = []
        test_indices = []

        for cls in np.unique(stratify):
            cls_indices = np.where(stratify == cls)[0]
            rng.shuffle(cls_indices)
            n_test = int(len(cls_indices) * test_size)
            test_indices.extend(cls_indices[:n_test])
            train_indices.extend(cls_indices[n_test:])

        train_indices = np.array(train_indices)
        test_indices = np.array(test_indices)
        rng.shuffle(train_indices)
        rng.shuffle(test_indices)
    else:
        # Random split
        indices = np.arange(n_samples)
        rng.shuffle(indices)
        n_test = int(n_samples * test_size)
        test_indices = indices[:n_test]
        train_indices = indices[n_test:]

    return X[train_indices], X[test_indices], y[train_indices], y[test_indices]

Implementation 10 - K-Fold Cross-Validation

def k_fold_cross_validation(X, y, model_class, model_params, k=5,
                            metric_fn=None, random_state=42):
    """
    K-fold cross-validation.

    Args:
        X: features (n_samples, n_features)
        y: labels (n_samples,)
        model_class: class with fit() and predict() methods
        model_params: dict of parameters for model_class
        k: number of folds
        metric_fn: function(y_true, y_pred) -> score (default: accuracy)
        random_state: for reproducibility
    Returns:
        scores: list of k scores
        mean_score: average score
        std_score: standard deviation of scores
    """
    if metric_fn is None:
        metric_fn = lambda y_true, y_pred: np.mean(y_true == y_pred)

    rng = np.random.default_rng(random_state)
    n_samples = X.shape[0]

    # Shuffle indices
    indices = np.arange(n_samples)
    rng.shuffle(indices)

    # Split into k folds
    fold_size = n_samples // k
    scores = []

    for i in range(k):
        # Define test fold
        test_start = i * fold_size
        test_end = test_start + fold_size if i < k - 1 else n_samples
        test_idx = indices[test_start:test_end]

        # Training fold: everything except the test fold
        train_idx = np.concatenate([indices[:test_start], indices[test_end:]])

        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]

        # Train model
        model = model_class(**model_params)
        model.fit(X_train, y_train)

        # Evaluate
        y_pred = model.predict(X_test)
        score = metric_fn(y_test, y_pred)
        scores.append(score)

    return {
        'scores': scores,
        'mean': np.mean(scores),
        'std': np.std(scores)
    }

# Usage:
# results = k_fold_cross_validation(
#     X, y,
#     model_class=KNNClassifier,
#     model_params={'k': 5},
#     k=5
# )
# print(f"Accuracy: {results['mean']:.3f} +/- {results['std']:.3f}")

Practice Problems

Problem 1: Implement Softmax Regression (Multi-Class Logistic Regression)

Extend binary logistic regression to handle multiple classes using softmax and cross-entropy loss.

Hint 1 - Direction

Replace sigmoid with softmax. Weights become a matrix (n_features, n_classes). The gradient is X.T @ (softmax_output - one_hot_labels).

Hint 2 - Shapes

X: (n, d), W: (d, c), bias: (c,). Z = X @ W + b has shape (n, c). Softmax of Z gives probabilities (n, c).

Hint 3 - Full Solution

class SoftmaxRegression:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.W = None
        self.b = None

    def _softmax(self, z):
        shifted = z - z.max(axis=1, keepdims=True)
        exp_z = np.exp(shifted)
        return exp_z / exp_z.sum(axis=1, keepdims=True)

    def fit(self, X, y):
        n_samples, n_features = X.shape
        n_classes = len(np.unique(y))

        self.W = np.random.randn(n_features, n_classes) * 0.01
        self.b = np.zeros(n_classes)

        # One-hot encode labels
        Y_one_hot = np.zeros((n_samples, n_classes))
        Y_one_hot[np.arange(n_samples), y] = 1

        for _ in range(self.n_iter):
            z = X @ self.W + self.b
            probs = self._softmax(z)

            dW = (1 / n_samples) * X.T @ (probs - Y_one_hot)
            db = (1 / n_samples) * np.sum(probs - Y_one_hot, axis=0)

            self.W -= self.lr * dW
            self.b -= self.lr * db

    def predict(self, X):
        z = X @ self.W + self.b
        probs = self._softmax(z)
        return np.argmax(probs, axis=1)

Problem 2: Add L2 Regularization to the Neural Network

Modify the neural network backward pass to include L2 weight decay.

Hint 1 - Direction

L2 regularization adds (lambda/2) * sum(w^2) to the loss. The gradient contribution is lambda * w. Add this to the weight gradients (not bias gradients).

Hint 2 - Where to modify

After computing dW1 and dW2, add reg_lambda * W1 and reg_lambda * W2 respectively. Do not regularize biases.

Hint 3 - Full Solution

# In the backward method, after computing dW1 and dW2:

def backward_with_l2(self, y_true, reg_lambda=0.01):
    # ... (same gradient computation as before) ...

    # Add L2 regularization gradient (weight decay)
    dW2 += reg_lambda * self.W2
    dW1 += reg_lambda * self.W1
    # Note: do NOT add regularization to db1, db2

    # Update (same as before)
    self.W2 -= self.lr * dW2
    self.b2 -= self.lr * db2
    self.W1 -= self.lr * dW1
    self.b1 -= self.lr * db1

# The loss should also include the regularization term:
def compute_loss_with_l2(self, y_pred, y_true, reg_lambda=0.01):
    ce_loss = self.compute_loss(y_pred, y_true)
    l2_loss = (reg_lambda / 2) * (
        np.sum(self.W1 ** 2) + np.sum(self.W2 ** 2)
    )
    return ce_loss + l2_loss

Key insight: L2 regularization is equivalent to adding a Gaussian prior on the weights (MAP estimation). The effect is weight shrinkage - large weights are penalized. This reduces overfitting by preventing the network from relying too heavily on any single feature or connection.

Problem 3: Implement Stratified K-Fold

Modify the k-fold implementation to ensure each fold has approximately the same class distribution.

Hint 1 - Direction

For each class, distribute its samples across the k folds as evenly as possible.

Hint 2 - Algorithm

Group indices by class. For each class, shuffle and split into k groups. Then merge fold i from each class into fold i of the final split.

Hint 3 - Full Solution

def stratified_k_fold(X, y, k=5, random_state=42):
    """
    Generate stratified k-fold splits.

    Yields:
        (train_indices, test_indices) for each fold
    """
    rng = np.random.default_rng(random_state)
    classes = np.unique(y)

    # For each class, get shuffled indices
    class_indices = {}
    for cls in classes:
        idx = np.where(y == cls)[0]
        rng.shuffle(idx)
        class_indices[cls] = idx

    # Distribute each class's samples across folds
    folds = [[] for _ in range(k)]
    for cls in classes:
        idx = class_indices[cls]
        fold_sizes = [len(idx) // k] * k
        # Distribute remainder
        for i in range(len(idx) % k):
            fold_sizes[i] += 1

        current = 0
        for fold_idx in range(k):
            folds[fold_idx].extend(idx[current:current + fold_sizes[fold_idx]])
            current += fold_sizes[fold_idx]

    # Yield train/test splits
    for i in range(k):
        test_idx = np.array(folds[i])
        train_idx = np.concatenate([np.array(folds[j]) for j in range(k) if j != i])
        rng.shuffle(train_idx)
        rng.shuffle(test_idx)
        yield train_idx, test_idx

Interview Cheat Sheet

Algorithm	Key Formula	Common Mistake	Time Complexity
Linear Regression (OLS)	w = (X^T X)^(-1) X^T y	Inverting X^T X instead of using lstsq	O(nd^2 + d^3)
Linear Regression (GD)	w -= lr * (1/n) * X^T @ (Xw - y)	Wrong sign on gradient	O(nd) per step
Logistic Regression	w -= lr * (1/n) * X^T @ (sigmoid(Xw) - y)	Forgetting sigmoid in gradient	O(nd) per step
Decision Tree	Split on max info gain	Not handling empty leaf nodes	O(n * d * n_thresholds * depth)
K-NN	argpartition for top-k	Full sort instead of partial sort	O(nd) distance + O(n) partition
K-Means	Assign, update centroids, repeat	Random init instead of k-means++	O(nkd) per iteration
PCA	Eigenvectors of covariance matrix	Forgetting to center data first	O(d^3) for eigendecomp
Neural Net (forward)	z = Wx + b, a = activation(z)	Wrong shapes in matrix multiply	O(batch * layer_dims)
Neural Net (backward)	dW = a_prev^T @ dz	Applying activation instead of its derivative	Same as forward
Cross-validation	k folds, train on k-1, test on 1	Not stratifying for imbalanced classes	k * training time

Spaced Repetition Checkpoints

Day 0 - Initial Learning

Read all implementations and verify the shapes at each step
Implement linear regression (both forms) from memory
Implement logistic regression from memory
Complete the self-assessment

Day 3 - First Recall

Implement K-NN with vectorized distance computation from memory
Implement K-means with k-means++ initialization from memory
Derive the logistic regression gradient on paper

Day 7 - Connections

Implement the neural network forward and backward pass from memory
Implement PCA and explain its connection to SVD
Implement cross-validation and explain why stratification matters

Day 14 - Application

Implement softmax regression from memory under timed conditions (15 min)
Add L2 regularization to logistic regression and explain the effect on the gradient
Implement a complete training pipeline: split, train, evaluate, cross-validate

Day 21 - Mock Interview

Have someone name an algorithm - implement it from scratch in 20 minutes
Explain the gradient derivation for logistic regression on a whiteboard
Implement the neural network backward pass and explain each shape

Key Takeaways

Implementing from scratch proves understanding. Anyone can call sklearn.fit(). Writing the gradient computation, managing the shapes, and handling numerical stability proves you understand what the library does internally - which is what lets you debug production models.
Shapes are everything. Every "implement from scratch" bug is a shape error. X is (n, d). Weights are (d,) for regression, (d, c) for multi-class. Gradients have the same shape as the parameters they update. If you track shapes at every step, you will not make errors.
The gradient pattern is universal. For all linear-model losses, the gradient has the form (1/n) * X.T @ (prediction - target). The only thing that changes is how "prediction" is computed (raw linear for regression, sigmoid for logistic, softmax for multi-class).
Numerical stability is not optional. Clip sigmoid inputs, subtract max before softmax, add epsilon before log. These are not academic concerns - they determine whether your implementation works on real data or produces NaN.

Next Steps

Continue to Complexity Analysis to master time and space complexity analysis for ML algorithms - the framework for reasoning about computational efficiency in interviews.

The Real Interview Moment​

What You Will Master​

Self-Assessment: Where Are You Now?​

Implementation 1 - Linear Regression​

Closed-Form Solution (Normal Equation)​

Gradient Descent Solution​

Implementation 2 - Logistic Regression​

Implementation 3 - Decision Tree​

Implementation 4 - K-Nearest Neighbors​

Implementation 5 - K-Means Clustering​

Implementation 6 - PCA​

Implementation 7 - Gradient Descent Variants​

Gradient Descent Comparison​

Implementation 8 - Neural Network (Forward and Backward Pass)​

Implementation 9 - Train-Test Split​

Implementation 10 - K-Fold Cross-Validation​

Practice Problems​

Problem 1: Implement Softmax Regression (Multi-Class Logistic Regression)​

Problem 2: Add L2 Regularization to the Neural Network​

Problem 3: Implement Stratified K-Fold​

Interview Cheat Sheet​

Spaced Repetition Checkpoints​

Day 0 - Initial Learning​

Day 3 - First Recall​

Day 7 - Connections​

Day 14 - Application​

Day 21 - Mock Interview​

Key Takeaways​

Next Steps​

The Real Interview Moment

What You Will Master

Self-Assessment: Where Are You Now?

Implementation 1 - Linear Regression

Closed-Form Solution (Normal Equation)

Gradient Descent Solution

Implementation 2 - Logistic Regression

Implementation 3 - Decision Tree

Implementation 4 - K-Nearest Neighbors

Implementation 5 - K-Means Clustering

Implementation 6 - PCA

Implementation 7 - Gradient Descent Variants

Gradient Descent Comparison

Implementation 8 - Neural Network (Forward and Backward Pass)

Implementation 9 - Train-Test Split

Implementation 10 - K-Fold Cross-Validation

Practice Problems

Problem 1: Implement Softmax Regression (Multi-Class Logistic Regression)

Problem 2: Add L2 Regularization to the Neural Network

Problem 3: Implement Stratified K-Fold

Interview Cheat Sheet

Spaced Repetition Checkpoints

Day 0 - Initial Learning

Day 3 - First Recall

Day 7 - Connections

Day 14 - Application

Day 21 - Mock Interview

Key Takeaways

Next Steps