Python ML Coding - Implement Algorithms From Scratch
Reading time: ~50 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Engineer, Applied Scientist
The Real Interview Moment
You are in an Anthropic Research Engineer interview. The interviewer says: "Implement logistic regression from scratch. No sklearn. Just NumPy. Include gradient descent, prediction, and show me the training loop." You have seen logistic regression a hundred times in courses, but now you are writing it under pressure. Do you remember the gradient formula? Is it X.T @ (predictions - labels) or (predictions - labels).T @ X? Do you divide by batch size? Do you apply sigmoid before or after the dot product?
This is the ultimate test of understanding. Anyone can call sklearn.LogisticRegression().fit(X, y). But implementing the algorithm from scratch - deriving the gradient, managing the shapes, handling numerical stability - proves you understand what the library is actually doing. When your model misbehaves in production, this is the knowledge that lets you debug it.
Every ML-focused interview at research labs (Anthropic, OpenAI, DeepMind, FAIR) and top companies (Google Brain, Meta AI) includes at least one "implement from scratch" question. Candidates who can write clean, correct implementations with proper vectorization get "strong hire." Candidates who get lost in the shapes or cannot remember the gradient get "no hire."
What You Will Master
- Implement linear regression with closed-form solution and gradient descent
- Implement logistic regression with binary cross-entropy and sigmoid
- Implement a decision tree classifier with information gain
- Implement K-nearest neighbors with vectorized distance
- Implement K-means clustering with convergence detection
- Implement PCA via eigendecomposition and SVD
- Implement gradient descent variants: batch, mini-batch, SGD
- Implement a neural network forward and backward pass
- Implement train-test split and k-fold cross-validation
- Explain every step - shapes, gradients, and design decisions
Self-Assessment: Where Are You Now?
| Skill | 1 - Cannot | 2 - Vaguely | 3 - Can Code | 4 - Can Derive + Code | 5 - Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Linear regression (closed-form + GD) | ___ | |||||
| Logistic regression from scratch | ___ | |||||
| Decision tree with info gain | ___ | |||||
| K-NN prediction | ___ | |||||
| K-means clustering | ___ | |||||
| PCA / dimensionality reduction | ___ | |||||
| Neural network forward/backward | ___ | |||||
| Cross-validation implementation | ___ |
Target: All 4s and 5s before your interview.
Implementation 1 - Linear Regression
Closed-Form Solution (Normal Equation)
import numpy as np
class LinearRegressionClosedForm:
"""
Linear regression using the normal equation.
w* = (X^T X)^(-1) X^T y
"""
def __init__(self):
self.weights = None # Includes bias as last element
def fit(self, X, y):
"""
Fit the model using the normal equation.
Args:
X: np.ndarray of shape (n_samples, n_features)
y: np.ndarray of shape (n_samples,)
"""
# Add bias column (column of ones)
X_b = np.column_stack([X, np.ones(X.shape[0])])
# Normal equation: w = (X^T X)^(-1) X^T y
# Use lstsq for numerical stability (avoids inverting X^T X)
self.weights, _, _, _ = np.linalg.lstsq(X_b, y, rcond=None)
def predict(self, X):
"""Predict target values."""
X_b = np.column_stack([X, np.ones(X.shape[0])])
return X_b @ self.weights
"Linear regression minimizes squared error. The closed-form solution is w = (X^T X)^(-1) X^T y, which I implement using np.linalg.lstsq for numerical stability. For large datasets, I use gradient descent instead - the gradient of MSE with respect to w is (2/n) * X^T @ (X @ w - y). I add a bias term by appending a column of ones to X. The closed-form is O(n*d^2 + d^3), so it becomes impractical when d exceeds ~10,000 features."
Gradient Descent Solution
class LinearRegressionGD:
"""
Linear regression using gradient descent.
Loss: L = (1/2n) * ||Xw - y||^2
Gradient: dL/dw = (1/n) * X^T @ (Xw - y)
"""
def __init__(self, learning_rate=0.01, n_iterations=1000):
self.lr = learning_rate
self.n_iter = n_iterations
self.weights = None
self.bias = None
self.loss_history = []
def fit(self, X, y):
"""
Fit the model using gradient descent.
Args:
X: np.ndarray of shape (n_samples, n_features)
y: np.ndarray of shape (n_samples,)
"""
n_samples, n_features = X.shape
# Initialize weights to small random values
self.weights = np.random.randn(n_features) * 0.01
self.bias = 0.0
for i in range(self.n_iter):
# Forward pass: compute predictions
y_pred = X @ self.weights + self.bias # Shape: (n_samples,)
# Compute loss (MSE)
error = y_pred - y # Shape: (n_samples,)
loss = np.mean(error ** 2) / 2
self.loss_history.append(loss)
# Compute gradients
dw = (1 / n_samples) * (X.T @ error) # Shape: (n_features,)
db = (1 / n_samples) * np.sum(error) # Scalar
# Update weights
self.weights -= self.lr * dw
self.bias -= self.lr * db
def predict(self, X):
return X @ self.weights + self.bias
The gradient of MSE loss is (1/n) * X.T @ (predictions - targets), not (1/n) * X.T @ (targets - predictions). Getting the sign wrong means you do gradient ascent instead of descent - your loss will increase. This is the single most common mistake in "implement from scratch" interviews. Double-check: the gradient points in the direction of increasing loss, so you subtract it.
Implementation 2 - Logistic Regression
class LogisticRegression:
"""
Binary logistic regression using gradient descent.
Model: p(y=1|x) = sigmoid(x @ w + b)
Loss: Binary cross-entropy = -(1/n) * sum(y*log(p) + (1-y)*log(1-p))
Gradient: dL/dw = (1/n) * X^T @ (sigmoid(Xw+b) - y)
"""
def __init__(self, learning_rate=0.01, n_iterations=1000,
reg_lambda=0.0):
self.lr = learning_rate
self.n_iter = n_iterations
self.reg_lambda = reg_lambda # L2 regularization strength
self.weights = None
self.bias = None
self.loss_history = []
@staticmethod
def _sigmoid(z):
"""Numerically stable sigmoid."""
# Clip to prevent overflow in exp
z = np.clip(z, -500, 500)
return 1.0 / (1.0 + np.exp(-z))
def fit(self, X, y):
"""
Fit the model.
Args:
X: np.ndarray of shape (n_samples, n_features)
y: np.ndarray of shape (n_samples,) with values 0 or 1
"""
n_samples, n_features = X.shape
self.weights = np.zeros(n_features)
self.bias = 0.0
for i in range(self.n_iter):
# Forward pass
z = X @ self.weights + self.bias # Shape: (n_samples,)
y_pred = self._sigmoid(z) # Shape: (n_samples,)
# Compute binary cross-entropy loss
eps = 1e-12 # Prevent log(0)
loss = -(1 / n_samples) * np.sum(
y * np.log(y_pred + eps) +
(1 - y) * np.log(1 - y_pred + eps)
)
# Add L2 regularization term
loss += (self.reg_lambda / (2 * n_samples)) * np.sum(self.weights ** 2)
self.loss_history.append(loss)
# Compute gradients
# Key insight: gradient has the same form as linear regression
# but with sigmoid(Xw) instead of Xw
error = y_pred - y # Shape: (n_samples,)
dw = (1 / n_samples) * (X.T @ error) # Shape: (n_features,)
dw += (self.reg_lambda / n_samples) * self.weights # L2 gradient
db = (1 / n_samples) * np.sum(error)
# Update
self.weights -= self.lr * dw
self.bias -= self.lr * db
def predict_proba(self, X):
"""Return probability of class 1."""
return self._sigmoid(X @ self.weights + self.bias)
def predict(self, X, threshold=0.5):
"""Return binary predictions."""
return (self.predict_proba(X) >= threshold).astype(int)
Never confuse the logistic regression gradient with the linear regression gradient. The logistic regression gradient is X.T @ (sigmoid(Xw) - y), NOT X.T @ (Xw - y). Both have the form X.T @ (prediction - target), but the "prediction" is different - sigmoid output for logistic, raw linear output for linear. If you write the wrong one, the interviewer knows you are memorizing rather than deriving.
Implementation 3 - Decision Tree
class DecisionTreeClassifier:
"""
Decision tree using information gain (entropy-based).
"""
class Node:
def __init__(self, feature=None, threshold=None, left=None,
right=None, value=None):
self.feature = feature # Feature index to split on
self.threshold = threshold # Threshold value
self.left = left # Left child (<=threshold)
self.right = right # Right child (>threshold)
self.value = value # Class prediction (leaf node)
def __init__(self, max_depth=10, min_samples_split=2):
self.max_depth = max_depth
self.min_samples_split = min_samples_split
self.root = None
@staticmethod
def _entropy(y):
"""Compute entropy of a label distribution."""
if len(y) == 0:
return 0
proportions = np.bincount(y) / len(y)
proportions = proportions[proportions > 0] # Remove zeros
return -np.sum(proportions * np.log2(proportions))
def _information_gain(self, y, left_mask):
"""Compute information gain from a split."""
parent_entropy = self._entropy(y)
left_y = y[left_mask]
right_y = y[~left_mask]
if len(left_y) == 0 or len(right_y) == 0:
return 0
n = len(y)
child_entropy = (
(len(left_y) / n) * self._entropy(left_y) +
(len(right_y) / n) * self._entropy(right_y)
)
return parent_entropy - child_entropy
def _best_split(self, X, y):
"""Find the best feature and threshold to split on."""
best_gain = -1
best_feature = None
best_threshold = None
n_features = X.shape[1]
for feature in range(n_features):
# Get unique sorted thresholds (midpoints between unique values)
values = np.unique(X[:, feature])
thresholds = (values[:-1] + values[1:]) / 2
for threshold in thresholds:
left_mask = X[:, feature] <= threshold
gain = self._information_gain(y, left_mask)
if gain > best_gain:
best_gain = gain
best_feature = feature
best_threshold = threshold
return best_feature, best_threshold, best_gain
def _build_tree(self, X, y, depth=0):
"""Recursively build the decision tree."""
n_samples = len(y)
n_classes = len(np.unique(y))
# Stopping conditions
if (depth >= self.max_depth or
n_classes == 1 or
n_samples < self.min_samples_split):
# Leaf node: predict the majority class
leaf_value = np.argmax(np.bincount(y))
return self.Node(value=leaf_value)
# Find the best split
feature, threshold, gain = self._best_split(X, y)
if gain <= 0:
leaf_value = np.argmax(np.bincount(y))
return self.Node(value=leaf_value)
# Split the data
left_mask = X[:, feature] <= threshold
left_child = self._build_tree(X[left_mask], y[left_mask], depth + 1)
right_child = self._build_tree(X[~left_mask], y[~left_mask], depth + 1)
return self.Node(feature=feature, threshold=threshold,
left=left_child, right=right_child)
def fit(self, X, y):
"""Build the tree."""
self.root = self._build_tree(X, y.astype(int))
def _predict_single(self, x, node):
"""Predict for a single sample."""
if node.value is not None:
return node.value
if x[node.feature] <= node.threshold:
return self._predict_single(x, node.left)
return self._predict_single(x, node.right)
def predict(self, X):
"""Predict for multiple samples."""
return np.array([self._predict_single(x, self.root) for x in X])
Google and Meta rarely ask you to implement a full decision tree in an interview - it is too long. They may ask you to implement just the _information_gain or _best_split function. Research labs (Anthropic, DeepMind) may ask for the full tree or ask you to extend it (e.g., add Gini impurity, handle continuous vs categorical features, implement pruning). The key is knowing the algorithm well enough to implement any part on demand.
Implementation 4 - K-Nearest Neighbors
class KNNClassifier:
"""
K-nearest neighbors classifier using vectorized distance computation.
"""
def __init__(self, k=5):
self.k = k
self.X_train = None
self.y_train = None
def fit(self, X, y):
"""Store training data (lazy learner)."""
self.X_train = X
self.y_train = y.astype(int)
def predict(self, X):
"""
Predict labels for query points.
Args:
X: np.ndarray of shape (n_query, n_features)
Returns:
predictions: np.ndarray of shape (n_query,)
"""
# Compute pairwise distances using the expansion trick
# ||a-b||^2 = ||a||^2 + ||b||^2 - 2*a.b
X_sq = np.sum(X ** 2, axis=1)[:, None] # (n_query, 1)
Xt_sq = np.sum(self.X_train ** 2, axis=1)[None, :] # (1, n_train)
cross = X @ self.X_train.T # (n_query, n_train)
distances = np.sqrt(np.maximum(X_sq + Xt_sq - 2 * cross, 0))
# Find k nearest neighbors (O(n) partial sort per query)
knn_indices = np.argpartition(distances, self.k, axis=1)[:, :self.k]
# Get labels of nearest neighbors
knn_labels = self.y_train[knn_indices] # (n_query, k)
# Majority vote
n_classes = self.y_train.max() + 1
predictions = np.zeros(X.shape[0], dtype=int)
for i in range(X.shape[0]):
counts = np.bincount(knn_labels[i], minlength=n_classes)
predictions[i] = np.argmax(counts)
return predictions
Implementation 5 - K-Means Clustering
class KMeans:
"""
K-means clustering with k-means++ initialization.
"""
def __init__(self, n_clusters=3, max_iter=100, tol=1e-4, random_state=42):
self.n_clusters = n_clusters
self.max_iter = max_iter
self.tol = tol
self.rng = np.random.default_rng(random_state)
self.centroids = None
self.labels = None
self.inertia = None # Sum of squared distances to nearest centroid
def _init_centroids_plusplus(self, X):
"""K-means++ initialization: spread out initial centroids."""
n_samples = X.shape[0]
centroids = np.zeros((self.n_clusters, X.shape[1]))
# First centroid: random sample
centroids[0] = X[self.rng.integers(n_samples)]
for k in range(1, self.n_clusters):
# Compute distance from each point to nearest existing centroid
distances = np.min(
np.sum((X[:, None, :] - centroids[None, :k, :]) ** 2, axis=2),
axis=1
)
# Sample next centroid with probability proportional to distance^2
probs = distances / distances.sum()
centroids[k] = X[self.rng.choice(n_samples, p=probs)]
return centroids
def fit(self, X):
"""
Fit K-means to data.
Args:
X: np.ndarray of shape (n_samples, n_features)
"""
# Initialize centroids using k-means++
self.centroids = self._init_centroids_plusplus(X)
for iteration in range(self.max_iter):
# Step 1: Assign each point to nearest centroid
# Distances: (n_samples, n_clusters)
distances = np.sqrt(np.sum(
(X[:, None, :] - self.centroids[None, :, :]) ** 2,
axis=2
))
self.labels = np.argmin(distances, axis=1)
# Step 2: Update centroids to mean of assigned points
new_centroids = np.zeros_like(self.centroids)
for k in range(self.n_clusters):
cluster_points = X[self.labels == k]
if len(cluster_points) > 0:
new_centroids[k] = cluster_points.mean(axis=0)
else:
# Empty cluster: reinitialize to random point
new_centroids[k] = X[self.rng.integers(X.shape[0])]
# Step 3: Check convergence
centroid_shift = np.sqrt(np.sum((new_centroids - self.centroids) ** 2))
self.centroids = new_centroids
if centroid_shift < self.tol:
break
# Compute inertia (sum of squared distances to nearest centroid)
self.inertia = sum(
np.sum((X[self.labels == k] - self.centroids[k]) ** 2)
for k in range(self.n_clusters)
)
return self
def predict(self, X):
"""Assign new points to nearest centroid."""
distances = np.sqrt(np.sum(
(X[:, None, :] - self.centroids[None, :, :]) ** 2,
axis=2
))
return np.argmin(distances, axis=1)
Implementation 6 - PCA
class PCA:
"""
Principal Component Analysis via eigendecomposition.
"""
def __init__(self, n_components):
self.n_components = n_components
self.components = None # Top-k eigenvectors (k, d)
self.mean = None # Feature means (d,)
self.explained_variance = None
self.explained_variance_ratio = None
def fit(self, X):
"""
Fit PCA: compute principal components.
Args:
X: np.ndarray of shape (n_samples, n_features)
"""
n_samples = X.shape[0]
# Step 1: Center the data
self.mean = X.mean(axis=0)
X_centered = X - self.mean
# Step 2: Compute covariance matrix
# cov = (1/(n-1)) * X_centered^T @ X_centered
cov_matrix = (X_centered.T @ X_centered) / (n_samples - 1)
# Step 3: Eigendecomposition (eigh for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
# Step 4: Sort by eigenvalue (descending)
# eigh returns in ascending order, so reverse
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
# Step 5: Select top-k components
self.components = eigenvectors[:, :self.n_components].T # Shape: (k, d)
self.explained_variance = eigenvalues[:self.n_components]
self.explained_variance_ratio = (
self.explained_variance / eigenvalues.sum()
)
return self
def transform(self, X):
"""Project data onto principal components."""
X_centered = X - self.mean
return X_centered @ self.components.T # (n, d) @ (d, k) = (n, k)
def fit_transform(self, X):
"""Fit and transform in one step."""
self.fit(X)
return self.transform(X)
def inverse_transform(self, X_reduced):
"""Reconstruct data from reduced representation."""
return X_reduced @ self.components + self.mean # (n, k) @ (k, d) + (d,)
Implementation 7 - Gradient Descent Variants
class GradientDescentOptimizer:
"""
Gradient descent variants: batch, mini-batch, SGD with momentum.
"""
def __init__(self, learning_rate=0.01, momentum=0.0, batch_size=None):
self.lr = learning_rate
self.momentum = momentum
self.batch_size = batch_size # None = full batch
self.velocity_w = None
self.velocity_b = None
def step(self, X, y, weights, bias, loss_fn, grad_fn):
"""
Perform one step of gradient descent.
Args:
X, y: data
weights: current weights
bias: current bias
loss_fn: function(X, y, w, b) -> scalar loss
grad_fn: function(X, y, w, b) -> (dw, db)
Returns:
updated weights, bias, loss
"""
n_samples = X.shape[0]
if self.batch_size is None or self.batch_size >= n_samples:
# Full batch gradient descent
batch_X, batch_y = X, y
else:
# Mini-batch: randomly sample a batch
indices = np.random.choice(n_samples, self.batch_size, replace=False)
batch_X, batch_y = X[indices], y[indices]
# Compute gradients on the batch
dw, db = grad_fn(batch_X, batch_y, weights, bias)
# Initialize velocity if needed
if self.velocity_w is None:
self.velocity_w = np.zeros_like(weights)
self.velocity_b = 0.0
# Update with momentum
self.velocity_w = self.momentum * self.velocity_w - self.lr * dw
self.velocity_b = self.momentum * self.velocity_b - self.lr * db
weights = weights + self.velocity_w
bias = bias + self.velocity_b
# Compute loss on full data for monitoring
loss = loss_fn(X, y, weights, bias)
return weights, bias, loss
Gradient Descent Comparison
| Variant | Batch Size | Gradient Noise | Convergence | Memory | When to Use |
|---|---|---|---|---|---|
| Batch GD | Full dataset | None (exact gradient) | Smooth, slow | High | Small datasets |
| Mini-batch GD | 32-256 | Moderate | Noisy but fast | Moderate | Standard choice |
| SGD | 1 | High | Very noisy, fast | Low | Online learning |
| SGD + Momentum | 1-256 | Moderate | Accelerated | Moderate | Deep learning default |
Implementation 8 - Neural Network (Forward and Backward Pass)
class NeuralNetwork:
"""
Simple feedforward neural network with backpropagation.
Architecture: input -> hidden (ReLU) -> output (softmax)
"""
def __init__(self, input_size, hidden_size, output_size, learning_rate=0.01):
self.lr = learning_rate
# Xavier initialization
self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(2.0 / input_size)
self.b1 = np.zeros(hidden_size)
self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(2.0 / hidden_size)
self.b2 = np.zeros(output_size)
# Cache for backward pass
self.cache = {}
def _relu(self, z):
return np.maximum(z, 0)
def _relu_grad(self, z):
return (z > 0).astype(float)
def _softmax(self, z):
shifted = z - z.max(axis=1, keepdims=True)
exp_z = np.exp(shifted)
return exp_z / exp_z.sum(axis=1, keepdims=True)
def forward(self, X):
"""
Forward pass.
Args:
X: np.ndarray of shape (batch_size, input_size)
Returns:
probabilities: np.ndarray of shape (batch_size, output_size)
"""
# Layer 1: linear + ReLU
z1 = X @ self.W1 + self.b1 # (batch, hidden)
a1 = self._relu(z1) # (batch, hidden)
# Layer 2: linear + softmax
z2 = a1 @ self.W2 + self.b2 # (batch, output)
a2 = self._softmax(z2) # (batch, output)
# Cache for backward pass
self.cache = {'X': X, 'z1': z1, 'a1': a1, 'z2': z2, 'a2': a2}
return a2
def backward(self, y_true):
"""
Backward pass: compute gradients using chain rule.
Args:
y_true: np.ndarray of shape (batch_size,) - integer labels
"""
batch_size = y_true.shape[0]
X = self.cache['X']
z1 = self.cache['z1']
a1 = self.cache['a1']
a2 = self.cache['a2'] # softmax output
# Convert labels to one-hot
one_hot = np.zeros_like(a2)
one_hot[np.arange(batch_size), y_true] = 1
# Output layer gradient
# For softmax + cross-entropy, the gradient simplifies to:
# dL/dz2 = a2 - one_hot (probability - target)
dz2 = (a2 - one_hot) / batch_size # (batch, output)
# Gradients for W2, b2
dW2 = a1.T @ dz2 # (hidden, output)
db2 = np.sum(dz2, axis=0) # (output,)
# Hidden layer gradient (backpropagate through ReLU)
da1 = dz2 @ self.W2.T # (batch, hidden)
dz1 = da1 * self._relu_grad(z1) # (batch, hidden) - element-wise
# Gradients for W1, b1
dW1 = X.T @ dz1 # (input, hidden)
db1 = np.sum(dz1, axis=0) # (hidden,)
# Update weights
self.W2 -= self.lr * dW2
self.b2 -= self.lr * db2
self.W1 -= self.lr * dW1
self.b1 -= self.lr * db1
def compute_loss(self, y_pred, y_true):
"""Cross-entropy loss."""
batch_size = y_true.shape[0]
eps = 1e-12
log_probs = -np.log(y_pred[np.arange(batch_size), y_true] + eps)
return np.mean(log_probs)
def train(self, X, y, epochs=100, verbose=True):
"""Training loop."""
for epoch in range(epochs):
# Forward pass
y_pred = self.forward(X)
loss = self.compute_loss(y_pred, y)
# Backward pass
self.backward(y)
if verbose and epoch % 10 == 0:
accuracy = np.mean(np.argmax(y_pred, axis=1) == y)
print(f"Epoch {epoch}: loss={loss:.4f}, accuracy={accuracy:.4f}")
Two fatal errors in neural network implementations: (1) Forgetting to divide dz2 by batch_size (your gradients scale with batch size, causing explosive updates). (2) Computing dz1 = da1 * self._relu(z1) instead of da1 * self._relu_grad(z1) - applying the function instead of its derivative. The ReLU derivative is 1 for z > 0 and 0 for z <= 0, not the ReLU function itself.
Implementation 9 - Train-Test Split
def train_test_split(X, y, test_size=0.2, random_state=None, stratify=None):
"""
Split data into training and test sets.
Args:
X: features (n_samples, n_features)
y: labels (n_samples,)
test_size: fraction of data for testing
random_state: for reproducibility
stratify: if provided, maintain class proportions
Returns:
X_train, X_test, y_train, y_test
"""
rng = np.random.default_rng(random_state)
n_samples = X.shape[0]
if stratify is not None:
# Stratified split: maintain class proportions
train_indices = []
test_indices = []
for cls in np.unique(stratify):
cls_indices = np.where(stratify == cls)[0]
rng.shuffle(cls_indices)
n_test = int(len(cls_indices) * test_size)
test_indices.extend(cls_indices[:n_test])
train_indices.extend(cls_indices[n_test:])
train_indices = np.array(train_indices)
test_indices = np.array(test_indices)
rng.shuffle(train_indices)
rng.shuffle(test_indices)
else:
# Random split
indices = np.arange(n_samples)
rng.shuffle(indices)
n_test = int(n_samples * test_size)
test_indices = indices[:n_test]
train_indices = indices[n_test:]
return X[train_indices], X[test_indices], y[train_indices], y[test_indices]
Implementation 10 - K-Fold Cross-Validation
def k_fold_cross_validation(X, y, model_class, model_params, k=5,
metric_fn=None, random_state=42):
"""
K-fold cross-validation.
Args:
X: features (n_samples, n_features)
y: labels (n_samples,)
model_class: class with fit() and predict() methods
model_params: dict of parameters for model_class
k: number of folds
metric_fn: function(y_true, y_pred) -> score (default: accuracy)
random_state: for reproducibility
Returns:
scores: list of k scores
mean_score: average score
std_score: standard deviation of scores
"""
if metric_fn is None:
metric_fn = lambda y_true, y_pred: np.mean(y_true == y_pred)
rng = np.random.default_rng(random_state)
n_samples = X.shape[0]
# Shuffle indices
indices = np.arange(n_samples)
rng.shuffle(indices)
# Split into k folds
fold_size = n_samples // k
scores = []
for i in range(k):
# Define test fold
test_start = i * fold_size
test_end = test_start + fold_size if i < k - 1 else n_samples
test_idx = indices[test_start:test_end]
# Training fold: everything except the test fold
train_idx = np.concatenate([indices[:test_start], indices[test_end:]])
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
# Train model
model = model_class(**model_params)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
score = metric_fn(y_test, y_pred)
scores.append(score)
return {
'scores': scores,
'mean': np.mean(scores),
'std': np.std(scores)
}
# Usage:
# results = k_fold_cross_validation(
# X, y,
# model_class=KNNClassifier,
# model_params={'k': 5},
# k=5
# )
# print(f"Accuracy: {results['mean']:.3f} +/- {results['std']:.3f}")
Practice Problems
Problem 1: Implement Softmax Regression (Multi-Class Logistic Regression)
Extend binary logistic regression to handle multiple classes using softmax and cross-entropy loss.
Hint 1 - Direction
Replace sigmoid with softmax. Weights become a matrix (n_features, n_classes). The gradient is X.T @ (softmax_output - one_hot_labels).
Hint 2 - Shapes
X: (n, d), W: (d, c), bias: (c,). Z = X @ W + b has shape (n, c). Softmax of Z gives probabilities (n, c).
Hint 3 - Full Solution
class SoftmaxRegression:
def __init__(self, learning_rate=0.01, n_iterations=1000):
self.lr = learning_rate
self.n_iter = n_iterations
self.W = None
self.b = None
def _softmax(self, z):
shifted = z - z.max(axis=1, keepdims=True)
exp_z = np.exp(shifted)
return exp_z / exp_z.sum(axis=1, keepdims=True)
def fit(self, X, y):
n_samples, n_features = X.shape
n_classes = len(np.unique(y))
self.W = np.random.randn(n_features, n_classes) * 0.01
self.b = np.zeros(n_classes)
# One-hot encode labels
Y_one_hot = np.zeros((n_samples, n_classes))
Y_one_hot[np.arange(n_samples), y] = 1
for _ in range(self.n_iter):
z = X @ self.W + self.b
probs = self._softmax(z)
dW = (1 / n_samples) * X.T @ (probs - Y_one_hot)
db = (1 / n_samples) * np.sum(probs - Y_one_hot, axis=0)
self.W -= self.lr * dW
self.b -= self.lr * db
def predict(self, X):
z = X @ self.W + self.b
probs = self._softmax(z)
return np.argmax(probs, axis=1)
Problem 2: Add L2 Regularization to the Neural Network
Modify the neural network backward pass to include L2 weight decay.
Hint 1 - Direction
L2 regularization adds (lambda/2) * sum(w^2) to the loss. The gradient contribution is lambda * w. Add this to the weight gradients (not bias gradients).
Hint 2 - Where to modify
After computing dW1 and dW2, add reg_lambda * W1 and reg_lambda * W2 respectively. Do not regularize biases.
Hint 3 - Full Solution
# In the backward method, after computing dW1 and dW2:
def backward_with_l2(self, y_true, reg_lambda=0.01):
# ... (same gradient computation as before) ...
# Add L2 regularization gradient (weight decay)
dW2 += reg_lambda * self.W2
dW1 += reg_lambda * self.W1
# Note: do NOT add regularization to db1, db2
# Update (same as before)
self.W2 -= self.lr * dW2
self.b2 -= self.lr * db2
self.W1 -= self.lr * dW1
self.b1 -= self.lr * db1
# The loss should also include the regularization term:
def compute_loss_with_l2(self, y_pred, y_true, reg_lambda=0.01):
ce_loss = self.compute_loss(y_pred, y_true)
l2_loss = (reg_lambda / 2) * (
np.sum(self.W1 ** 2) + np.sum(self.W2 ** 2)
)
return ce_loss + l2_loss
Key insight: L2 regularization is equivalent to adding a Gaussian prior on the weights (MAP estimation). The effect is weight shrinkage - large weights are penalized. This reduces overfitting by preventing the network from relying too heavily on any single feature or connection.
Problem 3: Implement Stratified K-Fold
Modify the k-fold implementation to ensure each fold has approximately the same class distribution.
Hint 1 - Direction
For each class, distribute its samples across the k folds as evenly as possible.
Hint 2 - Algorithm
Group indices by class. For each class, shuffle and split into k groups. Then merge fold i from each class into fold i of the final split.
Hint 3 - Full Solution
def stratified_k_fold(X, y, k=5, random_state=42):
"""
Generate stratified k-fold splits.
Yields:
(train_indices, test_indices) for each fold
"""
rng = np.random.default_rng(random_state)
classes = np.unique(y)
# For each class, get shuffled indices
class_indices = {}
for cls in classes:
idx = np.where(y == cls)[0]
rng.shuffle(idx)
class_indices[cls] = idx
# Distribute each class's samples across folds
folds = [[] for _ in range(k)]
for cls in classes:
idx = class_indices[cls]
fold_sizes = [len(idx) // k] * k
# Distribute remainder
for i in range(len(idx) % k):
fold_sizes[i] += 1
current = 0
for fold_idx in range(k):
folds[fold_idx].extend(idx[current:current + fold_sizes[fold_idx]])
current += fold_sizes[fold_idx]
# Yield train/test splits
for i in range(k):
test_idx = np.array(folds[i])
train_idx = np.concatenate([np.array(folds[j]) for j in range(k) if j != i])
rng.shuffle(train_idx)
rng.shuffle(test_idx)
yield train_idx, test_idx
Interview Cheat Sheet
| Algorithm | Key Formula | Common Mistake | Time Complexity |
|---|---|---|---|
| Linear Regression (OLS) | w = (X^T X)^(-1) X^T y | Inverting X^T X instead of using lstsq | O(nd^2 + d^3) |
| Linear Regression (GD) | w -= lr * (1/n) * X^T @ (Xw - y) | Wrong sign on gradient | O(nd) per step |
| Logistic Regression | w -= lr * (1/n) * X^T @ (sigmoid(Xw) - y) | Forgetting sigmoid in gradient | O(nd) per step |
| Decision Tree | Split on max info gain | Not handling empty leaf nodes | O(n * d * n_thresholds * depth) |
| K-NN | argpartition for top-k | Full sort instead of partial sort | O(nd) distance + O(n) partition |
| K-Means | Assign, update centroids, repeat | Random init instead of k-means++ | O(nkd) per iteration |
| PCA | Eigenvectors of covariance matrix | Forgetting to center data first | O(d^3) for eigendecomp |
| Neural Net (forward) | z = Wx + b, a = activation(z) | Wrong shapes in matrix multiply | O(batch * layer_dims) |
| Neural Net (backward) | dW = a_prev^T @ dz | Applying activation instead of its derivative | Same as forward |
| Cross-validation | k folds, train on k-1, test on 1 | Not stratifying for imbalanced classes | k * training time |
Spaced Repetition Checkpoints
Day 0 - Initial Learning
- Read all implementations and verify the shapes at each step
- Implement linear regression (both forms) from memory
- Implement logistic regression from memory
- Complete the self-assessment
Day 3 - First Recall
- Implement K-NN with vectorized distance computation from memory
- Implement K-means with k-means++ initialization from memory
- Derive the logistic regression gradient on paper
Day 7 - Connections
- Implement the neural network forward and backward pass from memory
- Implement PCA and explain its connection to SVD
- Implement cross-validation and explain why stratification matters
Day 14 - Application
- Implement softmax regression from memory under timed conditions (15 min)
- Add L2 regularization to logistic regression and explain the effect on the gradient
- Implement a complete training pipeline: split, train, evaluate, cross-validate
Day 21 - Mock Interview
- Have someone name an algorithm - implement it from scratch in 20 minutes
- Explain the gradient derivation for logistic regression on a whiteboard
- Implement the neural network backward pass and explain each shape
Key Takeaways
-
Implementing from scratch proves understanding. Anyone can call
sklearn.fit(). Writing the gradient computation, managing the shapes, and handling numerical stability proves you understand what the library does internally - which is what lets you debug production models. -
Shapes are everything. Every "implement from scratch" bug is a shape error. X is (n, d). Weights are (d,) for regression, (d, c) for multi-class. Gradients have the same shape as the parameters they update. If you track shapes at every step, you will not make errors.
-
The gradient pattern is universal. For all linear-model losses, the gradient has the form
(1/n) * X.T @ (prediction - target). The only thing that changes is how "prediction" is computed (raw linear for regression, sigmoid for logistic, softmax for multi-class). -
Numerical stability is not optional. Clip sigmoid inputs, subtract max before softmax, add epsilon before log. These are not academic concerns - they determine whether your implementation works on real data or produces NaN.
Next Steps
Continue to Complexity Analysis to master time and space complexity analysis for ML algorithms - the framework for reasoning about computational efficiency in interviews.
