Skip to main content

Python ML Coding - Implement Algorithms From Scratch

Reading time: ~50 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Engineer, Applied Scientist

The Real Interview Moment

You are in an Anthropic Research Engineer interview. The interviewer says: "Implement logistic regression from scratch. No sklearn. Just NumPy. Include gradient descent, prediction, and show me the training loop." You have seen logistic regression a hundred times in courses, but now you are writing it under pressure. Do you remember the gradient formula? Is it X.T @ (predictions - labels) or (predictions - labels).T @ X? Do you divide by batch size? Do you apply sigmoid before or after the dot product?

This is the ultimate test of understanding. Anyone can call sklearn.LogisticRegression().fit(X, y). But implementing the algorithm from scratch - deriving the gradient, managing the shapes, handling numerical stability - proves you understand what the library is actually doing. When your model misbehaves in production, this is the knowledge that lets you debug it.

Every ML-focused interview at research labs (Anthropic, OpenAI, DeepMind, FAIR) and top companies (Google Brain, Meta AI) includes at least one "implement from scratch" question. Candidates who can write clean, correct implementations with proper vectorization get "strong hire." Candidates who get lost in the shapes or cannot remember the gradient get "no hire."

What You Will Master

  • Implement linear regression with closed-form solution and gradient descent
  • Implement logistic regression with binary cross-entropy and sigmoid
  • Implement a decision tree classifier with information gain
  • Implement K-nearest neighbors with vectorized distance
  • Implement K-means clustering with convergence detection
  • Implement PCA via eigendecomposition and SVD
  • Implement gradient descent variants: batch, mini-batch, SGD
  • Implement a neural network forward and backward pass
  • Implement train-test split and k-fold cross-validation
  • Explain every step - shapes, gradients, and design decisions

Self-Assessment: Where Are You Now?

Skill1 - Cannot2 - Vaguely3 - Can Code4 - Can Derive + Code5 - Can TeachYour Score
Linear regression (closed-form + GD)___
Logistic regression from scratch___
Decision tree with info gain___
K-NN prediction___
K-means clustering___
PCA / dimensionality reduction___
Neural network forward/backward___
Cross-validation implementation___

Target: All 4s and 5s before your interview.

Implementation 1 - Linear Regression

Closed-Form Solution (Normal Equation)

import numpy as np

class LinearRegressionClosedForm:
"""
Linear regression using the normal equation.
w* = (X^T X)^(-1) X^T y
"""

def __init__(self):
self.weights = None # Includes bias as last element

def fit(self, X, y):
"""
Fit the model using the normal equation.

Args:
X: np.ndarray of shape (n_samples, n_features)
y: np.ndarray of shape (n_samples,)
"""
# Add bias column (column of ones)
X_b = np.column_stack([X, np.ones(X.shape[0])])

# Normal equation: w = (X^T X)^(-1) X^T y
# Use lstsq for numerical stability (avoids inverting X^T X)
self.weights, _, _, _ = np.linalg.lstsq(X_b, y, rcond=None)

def predict(self, X):
"""Predict target values."""
X_b = np.column_stack([X, np.ones(X.shape[0])])
return X_b @ self.weights
60-Second Answer

"Linear regression minimizes squared error. The closed-form solution is w = (X^T X)^(-1) X^T y, which I implement using np.linalg.lstsq for numerical stability. For large datasets, I use gradient descent instead - the gradient of MSE with respect to w is (2/n) * X^T @ (X @ w - y). I add a bias term by appending a column of ones to X. The closed-form is O(n*d^2 + d^3), so it becomes impractical when d exceeds ~10,000 features."

Gradient Descent Solution

class LinearRegressionGD:
"""
Linear regression using gradient descent.
Loss: L = (1/2n) * ||Xw - y||^2
Gradient: dL/dw = (1/n) * X^T @ (Xw - y)
"""

def __init__(self, learning_rate=0.01, n_iterations=1000):
self.lr = learning_rate
self.n_iter = n_iterations
self.weights = None
self.bias = None
self.loss_history = []

def fit(self, X, y):
"""
Fit the model using gradient descent.

Args:
X: np.ndarray of shape (n_samples, n_features)
y: np.ndarray of shape (n_samples,)
"""
n_samples, n_features = X.shape

# Initialize weights to small random values
self.weights = np.random.randn(n_features) * 0.01
self.bias = 0.0

for i in range(self.n_iter):
# Forward pass: compute predictions
y_pred = X @ self.weights + self.bias # Shape: (n_samples,)

# Compute loss (MSE)
error = y_pred - y # Shape: (n_samples,)
loss = np.mean(error ** 2) / 2
self.loss_history.append(loss)

# Compute gradients
dw = (1 / n_samples) * (X.T @ error) # Shape: (n_features,)
db = (1 / n_samples) * np.sum(error) # Scalar

# Update weights
self.weights -= self.lr * dw
self.bias -= self.lr * db

def predict(self, X):
return X @ self.weights + self.bias
Common Trap

The gradient of MSE loss is (1/n) * X.T @ (predictions - targets), not (1/n) * X.T @ (targets - predictions). Getting the sign wrong means you do gradient ascent instead of descent - your loss will increase. This is the single most common mistake in "implement from scratch" interviews. Double-check: the gradient points in the direction of increasing loss, so you subtract it.

Implementation 2 - Logistic Regression

class LogisticRegression:
"""
Binary logistic regression using gradient descent.
Model: p(y=1|x) = sigmoid(x @ w + b)
Loss: Binary cross-entropy = -(1/n) * sum(y*log(p) + (1-y)*log(1-p))
Gradient: dL/dw = (1/n) * X^T @ (sigmoid(Xw+b) - y)
"""

def __init__(self, learning_rate=0.01, n_iterations=1000,
reg_lambda=0.0):
self.lr = learning_rate
self.n_iter = n_iterations
self.reg_lambda = reg_lambda # L2 regularization strength
self.weights = None
self.bias = None
self.loss_history = []

@staticmethod
def _sigmoid(z):
"""Numerically stable sigmoid."""
# Clip to prevent overflow in exp
z = np.clip(z, -500, 500)
return 1.0 / (1.0 + np.exp(-z))

def fit(self, X, y):
"""
Fit the model.

Args:
X: np.ndarray of shape (n_samples, n_features)
y: np.ndarray of shape (n_samples,) with values 0 or 1
"""
n_samples, n_features = X.shape
self.weights = np.zeros(n_features)
self.bias = 0.0

for i in range(self.n_iter):
# Forward pass
z = X @ self.weights + self.bias # Shape: (n_samples,)
y_pred = self._sigmoid(z) # Shape: (n_samples,)

# Compute binary cross-entropy loss
eps = 1e-12 # Prevent log(0)
loss = -(1 / n_samples) * np.sum(
y * np.log(y_pred + eps) +
(1 - y) * np.log(1 - y_pred + eps)
)
# Add L2 regularization term
loss += (self.reg_lambda / (2 * n_samples)) * np.sum(self.weights ** 2)
self.loss_history.append(loss)

# Compute gradients
# Key insight: gradient has the same form as linear regression
# but with sigmoid(Xw) instead of Xw
error = y_pred - y # Shape: (n_samples,)
dw = (1 / n_samples) * (X.T @ error) # Shape: (n_features,)
dw += (self.reg_lambda / n_samples) * self.weights # L2 gradient
db = (1 / n_samples) * np.sum(error)

# Update
self.weights -= self.lr * dw
self.bias -= self.lr * db

def predict_proba(self, X):
"""Return probability of class 1."""
return self._sigmoid(X @ self.weights + self.bias)

def predict(self, X, threshold=0.5):
"""Return binary predictions."""
return (self.predict_proba(X) >= threshold).astype(int)

Logistic Regression Training Loop

Instant Rejection

Never confuse the logistic regression gradient with the linear regression gradient. The logistic regression gradient is X.T @ (sigmoid(Xw) - y), NOT X.T @ (Xw - y). Both have the form X.T @ (prediction - target), but the "prediction" is different - sigmoid output for logistic, raw linear output for linear. If you write the wrong one, the interviewer knows you are memorizing rather than deriving.

Implementation 3 - Decision Tree

class DecisionTreeClassifier:
"""
Decision tree using information gain (entropy-based).
"""

class Node:
def __init__(self, feature=None, threshold=None, left=None,
right=None, value=None):
self.feature = feature # Feature index to split on
self.threshold = threshold # Threshold value
self.left = left # Left child (<=threshold)
self.right = right # Right child (>threshold)
self.value = value # Class prediction (leaf node)

def __init__(self, max_depth=10, min_samples_split=2):
self.max_depth = max_depth
self.min_samples_split = min_samples_split
self.root = None

@staticmethod
def _entropy(y):
"""Compute entropy of a label distribution."""
if len(y) == 0:
return 0
proportions = np.bincount(y) / len(y)
proportions = proportions[proportions > 0] # Remove zeros
return -np.sum(proportions * np.log2(proportions))

def _information_gain(self, y, left_mask):
"""Compute information gain from a split."""
parent_entropy = self._entropy(y)

left_y = y[left_mask]
right_y = y[~left_mask]

if len(left_y) == 0 or len(right_y) == 0:
return 0

n = len(y)
child_entropy = (
(len(left_y) / n) * self._entropy(left_y) +
(len(right_y) / n) * self._entropy(right_y)
)

return parent_entropy - child_entropy

def _best_split(self, X, y):
"""Find the best feature and threshold to split on."""
best_gain = -1
best_feature = None
best_threshold = None

n_features = X.shape[1]

for feature in range(n_features):
# Get unique sorted thresholds (midpoints between unique values)
values = np.unique(X[:, feature])
thresholds = (values[:-1] + values[1:]) / 2

for threshold in thresholds:
left_mask = X[:, feature] <= threshold
gain = self._information_gain(y, left_mask)

if gain > best_gain:
best_gain = gain
best_feature = feature
best_threshold = threshold

return best_feature, best_threshold, best_gain

def _build_tree(self, X, y, depth=0):
"""Recursively build the decision tree."""
n_samples = len(y)
n_classes = len(np.unique(y))

# Stopping conditions
if (depth >= self.max_depth or
n_classes == 1 or
n_samples < self.min_samples_split):
# Leaf node: predict the majority class
leaf_value = np.argmax(np.bincount(y))
return self.Node(value=leaf_value)

# Find the best split
feature, threshold, gain = self._best_split(X, y)

if gain <= 0:
leaf_value = np.argmax(np.bincount(y))
return self.Node(value=leaf_value)

# Split the data
left_mask = X[:, feature] <= threshold
left_child = self._build_tree(X[left_mask], y[left_mask], depth + 1)
right_child = self._build_tree(X[~left_mask], y[~left_mask], depth + 1)

return self.Node(feature=feature, threshold=threshold,
left=left_child, right=right_child)

def fit(self, X, y):
"""Build the tree."""
self.root = self._build_tree(X, y.astype(int))

def _predict_single(self, x, node):
"""Predict for a single sample."""
if node.value is not None:
return node.value
if x[node.feature] <= node.threshold:
return self._predict_single(x, node.left)
return self._predict_single(x, node.right)

def predict(self, X):
"""Predict for multiple samples."""
return np.array([self._predict_single(x, self.root) for x in X])
Company Variation

Google and Meta rarely ask you to implement a full decision tree in an interview - it is too long. They may ask you to implement just the _information_gain or _best_split function. Research labs (Anthropic, DeepMind) may ask for the full tree or ask you to extend it (e.g., add Gini impurity, handle continuous vs categorical features, implement pruning). The key is knowing the algorithm well enough to implement any part on demand.

Implementation 4 - K-Nearest Neighbors

class KNNClassifier:
"""
K-nearest neighbors classifier using vectorized distance computation.
"""

def __init__(self, k=5):
self.k = k
self.X_train = None
self.y_train = None

def fit(self, X, y):
"""Store training data (lazy learner)."""
self.X_train = X
self.y_train = y.astype(int)

def predict(self, X):
"""
Predict labels for query points.

Args:
X: np.ndarray of shape (n_query, n_features)
Returns:
predictions: np.ndarray of shape (n_query,)
"""
# Compute pairwise distances using the expansion trick
# ||a-b||^2 = ||a||^2 + ||b||^2 - 2*a.b
X_sq = np.sum(X ** 2, axis=1)[:, None] # (n_query, 1)
Xt_sq = np.sum(self.X_train ** 2, axis=1)[None, :] # (1, n_train)
cross = X @ self.X_train.T # (n_query, n_train)
distances = np.sqrt(np.maximum(X_sq + Xt_sq - 2 * cross, 0))

# Find k nearest neighbors (O(n) partial sort per query)
knn_indices = np.argpartition(distances, self.k, axis=1)[:, :self.k]

# Get labels of nearest neighbors
knn_labels = self.y_train[knn_indices] # (n_query, k)

# Majority vote
n_classes = self.y_train.max() + 1
predictions = np.zeros(X.shape[0], dtype=int)
for i in range(X.shape[0]):
counts = np.bincount(knn_labels[i], minlength=n_classes)
predictions[i] = np.argmax(counts)

return predictions

Implementation 5 - K-Means Clustering

class KMeans:
"""
K-means clustering with k-means++ initialization.
"""

def __init__(self, n_clusters=3, max_iter=100, tol=1e-4, random_state=42):
self.n_clusters = n_clusters
self.max_iter = max_iter
self.tol = tol
self.rng = np.random.default_rng(random_state)
self.centroids = None
self.labels = None
self.inertia = None # Sum of squared distances to nearest centroid

def _init_centroids_plusplus(self, X):
"""K-means++ initialization: spread out initial centroids."""
n_samples = X.shape[0]
centroids = np.zeros((self.n_clusters, X.shape[1]))

# First centroid: random sample
centroids[0] = X[self.rng.integers(n_samples)]

for k in range(1, self.n_clusters):
# Compute distance from each point to nearest existing centroid
distances = np.min(
np.sum((X[:, None, :] - centroids[None, :k, :]) ** 2, axis=2),
axis=1
)
# Sample next centroid with probability proportional to distance^2
probs = distances / distances.sum()
centroids[k] = X[self.rng.choice(n_samples, p=probs)]

return centroids

def fit(self, X):
"""
Fit K-means to data.

Args:
X: np.ndarray of shape (n_samples, n_features)
"""
# Initialize centroids using k-means++
self.centroids = self._init_centroids_plusplus(X)

for iteration in range(self.max_iter):
# Step 1: Assign each point to nearest centroid
# Distances: (n_samples, n_clusters)
distances = np.sqrt(np.sum(
(X[:, None, :] - self.centroids[None, :, :]) ** 2,
axis=2
))
self.labels = np.argmin(distances, axis=1)

# Step 2: Update centroids to mean of assigned points
new_centroids = np.zeros_like(self.centroids)
for k in range(self.n_clusters):
cluster_points = X[self.labels == k]
if len(cluster_points) > 0:
new_centroids[k] = cluster_points.mean(axis=0)
else:
# Empty cluster: reinitialize to random point
new_centroids[k] = X[self.rng.integers(X.shape[0])]

# Step 3: Check convergence
centroid_shift = np.sqrt(np.sum((new_centroids - self.centroids) ** 2))
self.centroids = new_centroids

if centroid_shift < self.tol:
break

# Compute inertia (sum of squared distances to nearest centroid)
self.inertia = sum(
np.sum((X[self.labels == k] - self.centroids[k]) ** 2)
for k in range(self.n_clusters)
)

return self

def predict(self, X):
"""Assign new points to nearest centroid."""
distances = np.sqrt(np.sum(
(X[:, None, :] - self.centroids[None, :, :]) ** 2,
axis=2
))
return np.argmin(distances, axis=1)

Implementation 6 - PCA

class PCA:
"""
Principal Component Analysis via eigendecomposition.
"""

def __init__(self, n_components):
self.n_components = n_components
self.components = None # Top-k eigenvectors (k, d)
self.mean = None # Feature means (d,)
self.explained_variance = None
self.explained_variance_ratio = None

def fit(self, X):
"""
Fit PCA: compute principal components.

Args:
X: np.ndarray of shape (n_samples, n_features)
"""
n_samples = X.shape[0]

# Step 1: Center the data
self.mean = X.mean(axis=0)
X_centered = X - self.mean

# Step 2: Compute covariance matrix
# cov = (1/(n-1)) * X_centered^T @ X_centered
cov_matrix = (X_centered.T @ X_centered) / (n_samples - 1)

# Step 3: Eigendecomposition (eigh for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Step 4: Sort by eigenvalue (descending)
# eigh returns in ascending order, so reverse
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

# Step 5: Select top-k components
self.components = eigenvectors[:, :self.n_components].T # Shape: (k, d)
self.explained_variance = eigenvalues[:self.n_components]
self.explained_variance_ratio = (
self.explained_variance / eigenvalues.sum()
)

return self

def transform(self, X):
"""Project data onto principal components."""
X_centered = X - self.mean
return X_centered @ self.components.T # (n, d) @ (d, k) = (n, k)

def fit_transform(self, X):
"""Fit and transform in one step."""
self.fit(X)
return self.transform(X)

def inverse_transform(self, X_reduced):
"""Reconstruct data from reduced representation."""
return X_reduced @ self.components + self.mean # (n, k) @ (k, d) + (d,)

Implementation 7 - Gradient Descent Variants

class GradientDescentOptimizer:
"""
Gradient descent variants: batch, mini-batch, SGD with momentum.
"""

def __init__(self, learning_rate=0.01, momentum=0.0, batch_size=None):
self.lr = learning_rate
self.momentum = momentum
self.batch_size = batch_size # None = full batch
self.velocity_w = None
self.velocity_b = None

def step(self, X, y, weights, bias, loss_fn, grad_fn):
"""
Perform one step of gradient descent.

Args:
X, y: data
weights: current weights
bias: current bias
loss_fn: function(X, y, w, b) -> scalar loss
grad_fn: function(X, y, w, b) -> (dw, db)
Returns:
updated weights, bias, loss
"""
n_samples = X.shape[0]

if self.batch_size is None or self.batch_size >= n_samples:
# Full batch gradient descent
batch_X, batch_y = X, y
else:
# Mini-batch: randomly sample a batch
indices = np.random.choice(n_samples, self.batch_size, replace=False)
batch_X, batch_y = X[indices], y[indices]

# Compute gradients on the batch
dw, db = grad_fn(batch_X, batch_y, weights, bias)

# Initialize velocity if needed
if self.velocity_w is None:
self.velocity_w = np.zeros_like(weights)
self.velocity_b = 0.0

# Update with momentum
self.velocity_w = self.momentum * self.velocity_w - self.lr * dw
self.velocity_b = self.momentum * self.velocity_b - self.lr * db

weights = weights + self.velocity_w
bias = bias + self.velocity_b

# Compute loss on full data for monitoring
loss = loss_fn(X, y, weights, bias)

return weights, bias, loss

Gradient Descent Comparison

VariantBatch SizeGradient NoiseConvergenceMemoryWhen to Use
Batch GDFull datasetNone (exact gradient)Smooth, slowHighSmall datasets
Mini-batch GD32-256ModerateNoisy but fastModerateStandard choice
SGD1HighVery noisy, fastLowOnline learning
SGD + Momentum1-256ModerateAcceleratedModerateDeep learning default

Implementation 8 - Neural Network (Forward and Backward Pass)

class NeuralNetwork:
"""
Simple feedforward neural network with backpropagation.
Architecture: input -> hidden (ReLU) -> output (softmax)
"""

def __init__(self, input_size, hidden_size, output_size, learning_rate=0.01):
self.lr = learning_rate

# Xavier initialization
self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(2.0 / input_size)
self.b1 = np.zeros(hidden_size)
self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(2.0 / hidden_size)
self.b2 = np.zeros(output_size)

# Cache for backward pass
self.cache = {}

def _relu(self, z):
return np.maximum(z, 0)

def _relu_grad(self, z):
return (z > 0).astype(float)

def _softmax(self, z):
shifted = z - z.max(axis=1, keepdims=True)
exp_z = np.exp(shifted)
return exp_z / exp_z.sum(axis=1, keepdims=True)

def forward(self, X):
"""
Forward pass.

Args:
X: np.ndarray of shape (batch_size, input_size)
Returns:
probabilities: np.ndarray of shape (batch_size, output_size)
"""
# Layer 1: linear + ReLU
z1 = X @ self.W1 + self.b1 # (batch, hidden)
a1 = self._relu(z1) # (batch, hidden)

# Layer 2: linear + softmax
z2 = a1 @ self.W2 + self.b2 # (batch, output)
a2 = self._softmax(z2) # (batch, output)

# Cache for backward pass
self.cache = {'X': X, 'z1': z1, 'a1': a1, 'z2': z2, 'a2': a2}

return a2

def backward(self, y_true):
"""
Backward pass: compute gradients using chain rule.

Args:
y_true: np.ndarray of shape (batch_size,) - integer labels
"""
batch_size = y_true.shape[0]
X = self.cache['X']
z1 = self.cache['z1']
a1 = self.cache['a1']
a2 = self.cache['a2'] # softmax output

# Convert labels to one-hot
one_hot = np.zeros_like(a2)
one_hot[np.arange(batch_size), y_true] = 1

# Output layer gradient
# For softmax + cross-entropy, the gradient simplifies to:
# dL/dz2 = a2 - one_hot (probability - target)
dz2 = (a2 - one_hot) / batch_size # (batch, output)

# Gradients for W2, b2
dW2 = a1.T @ dz2 # (hidden, output)
db2 = np.sum(dz2, axis=0) # (output,)

# Hidden layer gradient (backpropagate through ReLU)
da1 = dz2 @ self.W2.T # (batch, hidden)
dz1 = da1 * self._relu_grad(z1) # (batch, hidden) - element-wise

# Gradients for W1, b1
dW1 = X.T @ dz1 # (input, hidden)
db1 = np.sum(dz1, axis=0) # (hidden,)

# Update weights
self.W2 -= self.lr * dW2
self.b2 -= self.lr * db2
self.W1 -= self.lr * dW1
self.b1 -= self.lr * db1

def compute_loss(self, y_pred, y_true):
"""Cross-entropy loss."""
batch_size = y_true.shape[0]
eps = 1e-12
log_probs = -np.log(y_pred[np.arange(batch_size), y_true] + eps)
return np.mean(log_probs)

def train(self, X, y, epochs=100, verbose=True):
"""Training loop."""
for epoch in range(epochs):
# Forward pass
y_pred = self.forward(X)
loss = self.compute_loss(y_pred, y)

# Backward pass
self.backward(y)

if verbose and epoch % 10 == 0:
accuracy = np.mean(np.argmax(y_pred, axis=1) == y)
print(f"Epoch {epoch}: loss={loss:.4f}, accuracy={accuracy:.4f}")

Neural Network Forward and Backward Pass

Instant Rejection

Two fatal errors in neural network implementations: (1) Forgetting to divide dz2 by batch_size (your gradients scale with batch size, causing explosive updates). (2) Computing dz1 = da1 * self._relu(z1) instead of da1 * self._relu_grad(z1) - applying the function instead of its derivative. The ReLU derivative is 1 for z > 0 and 0 for z <= 0, not the ReLU function itself.

Implementation 9 - Train-Test Split

def train_test_split(X, y, test_size=0.2, random_state=None, stratify=None):
"""
Split data into training and test sets.

Args:
X: features (n_samples, n_features)
y: labels (n_samples,)
test_size: fraction of data for testing
random_state: for reproducibility
stratify: if provided, maintain class proportions
Returns:
X_train, X_test, y_train, y_test
"""
rng = np.random.default_rng(random_state)
n_samples = X.shape[0]

if stratify is not None:
# Stratified split: maintain class proportions
train_indices = []
test_indices = []

for cls in np.unique(stratify):
cls_indices = np.where(stratify == cls)[0]
rng.shuffle(cls_indices)
n_test = int(len(cls_indices) * test_size)
test_indices.extend(cls_indices[:n_test])
train_indices.extend(cls_indices[n_test:])

train_indices = np.array(train_indices)
test_indices = np.array(test_indices)
rng.shuffle(train_indices)
rng.shuffle(test_indices)
else:
# Random split
indices = np.arange(n_samples)
rng.shuffle(indices)
n_test = int(n_samples * test_size)
test_indices = indices[:n_test]
train_indices = indices[n_test:]

return X[train_indices], X[test_indices], y[train_indices], y[test_indices]

Implementation 10 - K-Fold Cross-Validation

def k_fold_cross_validation(X, y, model_class, model_params, k=5,
metric_fn=None, random_state=42):
"""
K-fold cross-validation.

Args:
X: features (n_samples, n_features)
y: labels (n_samples,)
model_class: class with fit() and predict() methods
model_params: dict of parameters for model_class
k: number of folds
metric_fn: function(y_true, y_pred) -> score (default: accuracy)
random_state: for reproducibility
Returns:
scores: list of k scores
mean_score: average score
std_score: standard deviation of scores
"""
if metric_fn is None:
metric_fn = lambda y_true, y_pred: np.mean(y_true == y_pred)

rng = np.random.default_rng(random_state)
n_samples = X.shape[0]

# Shuffle indices
indices = np.arange(n_samples)
rng.shuffle(indices)

# Split into k folds
fold_size = n_samples // k
scores = []

for i in range(k):
# Define test fold
test_start = i * fold_size
test_end = test_start + fold_size if i < k - 1 else n_samples
test_idx = indices[test_start:test_end]

# Training fold: everything except the test fold
train_idx = np.concatenate([indices[:test_start], indices[test_end:]])

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Train model
model = model_class(**model_params)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
score = metric_fn(y_test, y_pred)
scores.append(score)

return {
'scores': scores,
'mean': np.mean(scores),
'std': np.std(scores)
}

# Usage:
# results = k_fold_cross_validation(
# X, y,
# model_class=KNNClassifier,
# model_params={'k': 5},
# k=5
# )
# print(f"Accuracy: {results['mean']:.3f} +/- {results['std']:.3f}")

Practice Problems

Problem 1: Implement Softmax Regression (Multi-Class Logistic Regression)

Extend binary logistic regression to handle multiple classes using softmax and cross-entropy loss.

Hint 1 - Direction

Replace sigmoid with softmax. Weights become a matrix (n_features, n_classes). The gradient is X.T @ (softmax_output - one_hot_labels).

Hint 2 - Shapes

X: (n, d), W: (d, c), bias: (c,). Z = X @ W + b has shape (n, c). Softmax of Z gives probabilities (n, c).

Hint 3 - Full Solution
class SoftmaxRegression:
def __init__(self, learning_rate=0.01, n_iterations=1000):
self.lr = learning_rate
self.n_iter = n_iterations
self.W = None
self.b = None

def _softmax(self, z):
shifted = z - z.max(axis=1, keepdims=True)
exp_z = np.exp(shifted)
return exp_z / exp_z.sum(axis=1, keepdims=True)

def fit(self, X, y):
n_samples, n_features = X.shape
n_classes = len(np.unique(y))

self.W = np.random.randn(n_features, n_classes) * 0.01
self.b = np.zeros(n_classes)

# One-hot encode labels
Y_one_hot = np.zeros((n_samples, n_classes))
Y_one_hot[np.arange(n_samples), y] = 1

for _ in range(self.n_iter):
z = X @ self.W + self.b
probs = self._softmax(z)

dW = (1 / n_samples) * X.T @ (probs - Y_one_hot)
db = (1 / n_samples) * np.sum(probs - Y_one_hot, axis=0)

self.W -= self.lr * dW
self.b -= self.lr * db

def predict(self, X):
z = X @ self.W + self.b
probs = self._softmax(z)
return np.argmax(probs, axis=1)

Problem 2: Add L2 Regularization to the Neural Network

Modify the neural network backward pass to include L2 weight decay.

Hint 1 - Direction

L2 regularization adds (lambda/2) * sum(w^2) to the loss. The gradient contribution is lambda * w. Add this to the weight gradients (not bias gradients).

Hint 2 - Where to modify

After computing dW1 and dW2, add reg_lambda * W1 and reg_lambda * W2 respectively. Do not regularize biases.

Hint 3 - Full Solution
# In the backward method, after computing dW1 and dW2:

def backward_with_l2(self, y_true, reg_lambda=0.01):
# ... (same gradient computation as before) ...

# Add L2 regularization gradient (weight decay)
dW2 += reg_lambda * self.W2
dW1 += reg_lambda * self.W1
# Note: do NOT add regularization to db1, db2

# Update (same as before)
self.W2 -= self.lr * dW2
self.b2 -= self.lr * db2
self.W1 -= self.lr * dW1
self.b1 -= self.lr * db1

# The loss should also include the regularization term:
def compute_loss_with_l2(self, y_pred, y_true, reg_lambda=0.01):
ce_loss = self.compute_loss(y_pred, y_true)
l2_loss = (reg_lambda / 2) * (
np.sum(self.W1 ** 2) + np.sum(self.W2 ** 2)
)
return ce_loss + l2_loss

Key insight: L2 regularization is equivalent to adding a Gaussian prior on the weights (MAP estimation). The effect is weight shrinkage - large weights are penalized. This reduces overfitting by preventing the network from relying too heavily on any single feature or connection.

Problem 3: Implement Stratified K-Fold

Modify the k-fold implementation to ensure each fold has approximately the same class distribution.

Hint 1 - Direction

For each class, distribute its samples across the k folds as evenly as possible.

Hint 2 - Algorithm

Group indices by class. For each class, shuffle and split into k groups. Then merge fold i from each class into fold i of the final split.

Hint 3 - Full Solution
def stratified_k_fold(X, y, k=5, random_state=42):
"""
Generate stratified k-fold splits.

Yields:
(train_indices, test_indices) for each fold
"""
rng = np.random.default_rng(random_state)
classes = np.unique(y)

# For each class, get shuffled indices
class_indices = {}
for cls in classes:
idx = np.where(y == cls)[0]
rng.shuffle(idx)
class_indices[cls] = idx

# Distribute each class's samples across folds
folds = [[] for _ in range(k)]
for cls in classes:
idx = class_indices[cls]
fold_sizes = [len(idx) // k] * k
# Distribute remainder
for i in range(len(idx) % k):
fold_sizes[i] += 1

current = 0
for fold_idx in range(k):
folds[fold_idx].extend(idx[current:current + fold_sizes[fold_idx]])
current += fold_sizes[fold_idx]

# Yield train/test splits
for i in range(k):
test_idx = np.array(folds[i])
train_idx = np.concatenate([np.array(folds[j]) for j in range(k) if j != i])
rng.shuffle(train_idx)
rng.shuffle(test_idx)
yield train_idx, test_idx

Interview Cheat Sheet

AlgorithmKey FormulaCommon MistakeTime Complexity
Linear Regression (OLS)w = (X^T X)^(-1) X^T yInverting X^T X instead of using lstsqO(nd^2 + d^3)
Linear Regression (GD)w -= lr * (1/n) * X^T @ (Xw - y)Wrong sign on gradientO(nd) per step
Logistic Regressionw -= lr * (1/n) * X^T @ (sigmoid(Xw) - y)Forgetting sigmoid in gradientO(nd) per step
Decision TreeSplit on max info gainNot handling empty leaf nodesO(n * d * n_thresholds * depth)
K-NNargpartition for top-kFull sort instead of partial sortO(nd) distance + O(n) partition
K-MeansAssign, update centroids, repeatRandom init instead of k-means++O(nkd) per iteration
PCAEigenvectors of covariance matrixForgetting to center data firstO(d^3) for eigendecomp
Neural Net (forward)z = Wx + b, a = activation(z)Wrong shapes in matrix multiplyO(batch * layer_dims)
Neural Net (backward)dW = a_prev^T @ dzApplying activation instead of its derivativeSame as forward
Cross-validationk folds, train on k-1, test on 1Not stratifying for imbalanced classesk * training time

Spaced Repetition Checkpoints

Day 0 - Initial Learning

  • Read all implementations and verify the shapes at each step
  • Implement linear regression (both forms) from memory
  • Implement logistic regression from memory
  • Complete the self-assessment

Day 3 - First Recall

  • Implement K-NN with vectorized distance computation from memory
  • Implement K-means with k-means++ initialization from memory
  • Derive the logistic regression gradient on paper

Day 7 - Connections

  • Implement the neural network forward and backward pass from memory
  • Implement PCA and explain its connection to SVD
  • Implement cross-validation and explain why stratification matters

Day 14 - Application

  • Implement softmax regression from memory under timed conditions (15 min)
  • Add L2 regularization to logistic regression and explain the effect on the gradient
  • Implement a complete training pipeline: split, train, evaluate, cross-validate

Day 21 - Mock Interview

  • Have someone name an algorithm - implement it from scratch in 20 minutes
  • Explain the gradient derivation for logistic regression on a whiteboard
  • Implement the neural network backward pass and explain each shape

Key Takeaways

  1. Implementing from scratch proves understanding. Anyone can call sklearn.fit(). Writing the gradient computation, managing the shapes, and handling numerical stability proves you understand what the library does internally - which is what lets you debug production models.

  2. Shapes are everything. Every "implement from scratch" bug is a shape error. X is (n, d). Weights are (d,) for regression, (d, c) for multi-class. Gradients have the same shape as the parameters they update. If you track shapes at every step, you will not make errors.

  3. The gradient pattern is universal. For all linear-model losses, the gradient has the form (1/n) * X.T @ (prediction - target). The only thing that changes is how "prediction" is computed (raw linear for regression, sigmoid for logistic, softmax for multi-class).

  4. Numerical stability is not optional. Clip sigmoid inputs, subtract max before softmax, add epsilon before log. These are not academic concerns - they determine whether your implementation works on real data or produces NaN.

Next Steps

Continue to Complexity Analysis to master time and space complexity analysis for ML algorithms - the framework for reasoning about computational efficiency in interviews.

© 2026 EngineersOfAI. All rights reserved.