
Supervised, Unsupervised, and Reinforcement Learning

Reading time: ~22 minutes | Level: ML Foundations | Role: MLE, ML Engineer, Data Scientist, Research Engineer

A computer vision team was building a defect detection system for a semiconductor fab. They had 500,000 images but only 2,000 labeled as "defective" or "clean" - labeling required an expert who cost $150/hour. The team had three options they debated for two weeks:

Option A (Supervised): Train a classifier on just the 2,000 labeled images. The model trained fine but generalized poorly to new defect types the annotator had not seen.

Option B (Unsupervised first, supervised second): Use all 500,000 images to learn visual representations via self-supervised learning (SimCLR: a contrastive objective that trains the model to recognize two augmented views of the same image as a matching pair). Then fine-tune on the 2,000 labeled examples. The representation learning exposed the model to vastly more visual structure. Accuracy improved from 71% to 89%.

Option C (Active learning loop): Use the self-supervised model to find the images where the model was most uncertain, prioritize those for labeling, and iterate. 300 strategically selected labels outperformed 2,000 randomly selected labels.

The right answer was a combination of B and C - and knowing which paradigm to use, and when to combine them, is precisely what this lesson teaches.

What You Will Learn

  • Supervised learning: labeled data, loss functions, and ERM
  • Unsupervised learning: structure discovery without labels
  • Semi-supervised and self-supervised learning - why they dominate modern ML
  • Reinforcement learning: reward signals, exploration, and when it applies
  • Algorithmic classification by paradigm with data requirements and use cases
  • How to choose the right paradigm before writing any code

Part 1 - Supervised Learning

The Setup

Supervised learning is the most common paradigm. You have a labeled dataset:

$$\mathcal{D} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$$

Where $x_i \in \mathcal{X}$ are inputs (features) and $y_i \in \mathcal{Y}$ are labels (targets). You learn a function $f: \mathcal{X} \to \mathcal{Y}$ that predicts $y$ from $x$ on unseen inputs.

The learning proceeds by minimizing a loss function over the training set - Empirical Risk Minimization (ERM):

$$\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(f_\theta(x_i), y_i)$$
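As a minimal sketch of ERM, assume a one-parameter linear model $f_\theta(x) = \theta x$ with squared-error loss. The model, the toy data, and the learning rate are illustrative assumptions, not part of the lesson's setup. The empirical risk and a gradient-descent minimizer can then be written in plain Python:

```python
# Toy ERM: fit f_theta(x) = theta * x by minimizing the mean squared error.
# Data, model, and hyperparameters are illustrative assumptions.

def empirical_risk(theta, xs, ys):
    """Average squared-error loss over the training set."""
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_erm(xs, ys, lr=0.01, n_steps=500):
    """Gradient descent on the empirical risk, starting from theta = 0."""
    theta = 0.0
    n = len(xs)
    for _ in range(n_steps):
        # d/dtheta of (1/n) * sum (theta * x - y)^2
        grad = sum(2 * (theta * x - y) * x for x, y in zip(xs, ys)) / n
        theta -= lr * grad
    return theta

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # roughly y = 2x with small noise
theta_hat = fit_erm(xs, ys)
print(f"theta_hat = {theta_hat:.3f}, risk = {empirical_risk(theta_hat, xs, ys):.4f}")
```

Swapping the squared error for another loss changes what `theta_hat` is optimized for, which is why the choice of loss matters as much as the choice of model.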

Types of supervised learning

Classification: $y \in \{0, 1, \ldots, K-1\}$ - predict a discrete class

  • Binary: spam/not spam, fraud/not fraud
  • Multiclass: digit recognition (10 classes), ImageNet (1,000 classes)
  • Multilabel: document tagging (multiple tags per document)

Regression: $y \in \mathbb{R}$ - predict a continuous value

  • House price prediction, demand forecasting, temperature prediction
  • Time-to-event prediction (survival analysis)

Structured prediction: $y$ is a complex structure

  • Sequence labeling (NER: "Apple → ORG", "New York → LOC")
  • Machine translation (sequence-to-sequence)
  • Object detection (bounding boxes + class labels per image)

The role of the loss function

The loss function encodes what "good prediction" means. Choosing the wrong loss is a design error that no amount of model tuning will fix:

| Task | Loss Function | Why |
|---|---|---|
| Binary classification | Binary cross-entropy | Measures log-likelihood under Bernoulli |
| Multiclass classification | Categorical cross-entropy | Measures log-likelihood under Categorical |
| Regression | MSE (L2 loss) | Optimal under Gaussian noise assumption |
| Regression with outliers | MAE (L1 loss) or Huber | Robust to heavy-tailed noise |
| Ranking | Pairwise ranking loss | Optimizes relative ordering |
| Sequence generation | Per-token cross-entropy | Decomposes joint probability by chain rule |

import numpy as np
import torch
import torch.nn as nn

# Classification: cross-entropy loss
# y_true is class indices, y_pred is logits
y_true = torch.tensor([0, 2, 1, 0]) # class indices
y_pred_logits = torch.randn(4, 3) # raw logits for 3 classes

ce_loss = nn.CrossEntropyLoss()
loss_val = ce_loss(y_pred_logits, y_true)
print(f"Cross-entropy loss: {loss_val.item():.4f}")

# Regression: MSE loss
y_true_reg = torch.tensor([2.3, 0.1, -1.4, 3.7])
y_pred_reg = torch.tensor([2.1, 0.5, -1.2, 3.9])

mse_loss = nn.MSELoss()
print(f"MSE loss: {mse_loss(y_pred_reg, y_true_reg).item():.4f}")

# sklearn: supervised learning baseline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = GradientBoostingClassifier(n_estimators=100, max_depth=3)
clf.fit(X_train, y_train)
print(f"Test accuracy: {accuracy_score(y_test, clf.predict(X_test)):.3f}")
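To make the "Regression with outliers" row of the loss table concrete, here is a small sketch comparing MSE, MAE, and Huber loss on the same residuals with and without one large outlier. The residual values and the Huber threshold `delta=1.0` are illustrative assumptions.

```python
# Compare how three regression losses react to a single large residual.
# Residuals and delta are made-up values for illustration.

def mse(residuals):
    return sum(r ** 2 for r in residuals) / len(residuals)

def mae(residuals):
    return sum(abs(r) for r in residuals) / len(residuals)

def huber(residuals, delta=1.0):
    """Quadratic for small residuals, linear beyond delta."""
    total = 0.0
    for r in residuals:
        if abs(r) <= delta:
            total += 0.5 * r ** 2
        else:
            total += delta * (abs(r) - 0.5 * delta)
    return total / len(residuals)

clean = [0.2, -0.1, 0.3, -0.2]
with_outlier = clean + [10.0]

for name, loss_fn in [("MSE", mse), ("MAE", mae), ("Huber", huber)]:
    print(f"{name}: clean={loss_fn(clean):.4f}, with outlier={loss_fn(with_outlier):.4f}")
```

The outlier's contribution grows quadratically under MSE but only linearly under MAE and Huber, so one bad point dominates the MSE objective far more than the other two.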

What supervised learning requires

  • Labeled data: $n$ labeled $(x, y)$ pairs. Typically hundreds to millions are needed, depending on task complexity.
  • IID assumption: Training examples are independently drawn from the same distribution as test examples. Violating this (temporal ordering, group structure) requires special handling (Lesson 10).
  • Label quality: Noisy labels directly degrade model performance. As a rough illustration, 20% random label noise can drop accuracy by 10–15%, though the exact cost depends on the model and dataset.

:::warning The labeling bottleneck
In most real projects, getting labeled data is the primary constraint - not model complexity, not compute. This is why semi-supervised and self-supervised learning are critically important: they are strategies to reduce your dependence on expensive labeled data.
:::

Part 2 - Unsupervised Learning

The Setup

Unsupervised learning has no labels. You observe $\{x_1, x_2, \ldots, x_n\}$ and want to discover structure in the data distribution $P(x)$ without being told what to look for.

This is fundamentally harder than supervised learning: there is no objective label to optimize toward, and "what counts as structure" is problem-dependent.

The three main unsupervised tasks

1. Clustering: Group similar observations together

Find $K$ groups such that within-group observations are similar and between-group observations are different.

from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler
import numpy as np

# Customer segmentation: group customers by behavior
# No labels - we don't know how many segments exist or what they are
np.random.seed(42)
n_customers = 1000
customer_features = np.column_stack([
    np.random.randn(n_customers),  # recency (days since last purchase)
    np.random.randn(n_customers),  # frequency (purchases per month)
    np.random.randn(n_customers),  # monetary (avg spend)
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(customer_features)

# K-Means: assumes spherical clusters of similar size
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
segments = kmeans.fit_predict(X_scaled)
print(f"Cluster sizes: {np.bincount(segments)}")

# DBSCAN: density-based - finds clusters of arbitrary shape, handles noise
dbscan = DBSCAN(eps=0.5, min_samples=5)
segments_db = dbscan.fit_predict(X_scaled)
n_clusters = len(set(segments_db)) - (1 if -1 in segments_db else 0)
n_noise = np.sum(segments_db == -1)
print(f"DBSCAN: {n_clusters} clusters, {n_noise} noise points")
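The classic procedure behind K-Means is Lloyd's algorithm: alternate between assigning points to their nearest centroid and moving each centroid to the mean of its assigned points. The following is a dependency-free sketch under simple assumptions (2-D points, fixed initial centroids, a fixed number of iterations); it is meant to show the mechanics, not to replace `sklearn.cluster.KMeans`.

```python
# Minimal Lloyd's algorithm for 2-D points.
# Initial centroids and the toy data are illustrative assumptions.

def lloyd_kmeans(points, centroids, n_iters=10):
    """Alternate assignment and centroid-update steps."""
    clusters = []
    for _ in range(n_iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its points.
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c  # an empty cluster keeps its previous centroid
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, clusters = lloyd_kmeans(points, centroids=[(0.0, 0.0), (1.0, 1.0)])
print("Centroids:", centroids)
print("Cluster sizes:", [len(cl) for cl in clusters])
```

Because each centroid is a plain mean under Euclidean distance, the procedure favors compact, roughly spherical groups, which is the assumption noted in the K-Means comment above.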

2. Density estimation: Model the probability distribution $P(x)$

Learn the probability density of the data. Useful for:

  • Anomaly detection (low-density regions = anomalies)
  • Generative modeling (sample from learned distribution)
  • Compression and coding theory
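As a sketch of density estimation used for anomaly detection, the snippet below fits a one-dimensional Gaussian to "normal" sensor readings and flags new readings whose log-density falls below the lowest score seen in training. The Gaussian model, the readings, and the thresholding rule are all illustrative assumptions; real systems usually need richer density models.

```python
import math

# Fit a 1-D Gaussian to normal data, then flag low-density points.
# The data and the threshold rule are illustrative assumptions.

def fit_gaussian(xs):
    """Maximum-likelihood mean and variance."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, var

def log_density(x, mu, var):
    """Log of the Gaussian probability density at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

normal_readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
mu, var = fit_gaussian(normal_readings)

# Threshold: the lowest log-density observed among the training readings.
threshold = min(log_density(x, mu, var) for x in normal_readings)

def is_anomaly(x):
    return log_density(x, mu, var) < threshold

for reading in [10.05, 12.0]:
    print(f"reading={reading}: anomaly={is_anomaly(reading)}")
```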

3. Dimensionality reduction: Find a lower-dimensional representation

Map $x \in \mathbb{R}^d$ to $z \in \mathbb{R}^k$ where $k \ll d$, preserving important structure.

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import numpy as np

# Dimensionality reduction for visualization and compression
np.random.seed(42)
X_highd = np.random.randn(500, 100) # 500 samples, 100-dimensional

# PCA: linear, preserves maximum variance
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X_highd)
variance_explained = pca.explained_variance_ratio_.cumsum()
print(f"PCA 10 components explain {variance_explained[-1]*100:.1f}% of variance")

# t-SNE: nonlinear, preserves local structure - great for visualization
# Warning: t-SNE is for visualization only, not for downstream ML tasks
# It does not preserve global structure or distances
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_2d = tsne.fit_transform(X_highd) # shape: (500, 2)
print(f"t-SNE output shape: {X_2d.shape}") # (500, 2) - for plotting

When unsupervised learning is the right choice

| Use Case | Algorithm | Why |
|---|---|---|
| Customer segmentation | K-Means, DBSCAN | No ground truth "correct" segments exist |
| Anomaly detection | Isolation Forest, Autoencoder | Normal data is plentiful; anomalies are rare and undefined |
| Topic modeling | LDA | No pre-existing topic taxonomy |
| Data compression | PCA, Autoencoder | Reduce storage/compute without labels |
| Exploratory data analysis | t-SNE, UMAP | Understand data structure before deciding what to predict |
| Pretraining (see Part 3) | SimCLR, BERT | Unlabeled data vastly outnumbers labeled data |

Part 3 - Semi-Supervised and Self-Supervised Learning

This is where most of the important recent progress in ML has happened. Both approaches bridge supervised and unsupervised learning, and they are hugely important in practice.

Semi-Supervised Learning

Setting: You have a small labeled set $\mathcal{D}_L = \{(x_i, y_i)\}_{i=1}^{l}$ and a large unlabeled set $\mathcal{D}_U = \{x_j\}_{j=1}^{u}$, where $u \gg l$.

The key assumption: the unlabeled data $\mathcal{D}_U$ contains information about the structure of $P(x)$ that helps learn $P(y \mid x)$.

Approaches:

Label propagation: If two points are close in input space, they likely share a label. Propagate labels from labeled to nearby unlabeled points.

Pseudo-labeling: Train on labeled data, use the trained model to generate "pseudo-labels" for unlabeled data, retrain on the combination.

Consistency regularization: Train the model so that differently augmented versions of the same unlabeled input receive the same prediction.

import numpy as np
from sklearn.semi_supervised import LabelPropagation, LabelSpreading
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Semi-supervised learning: 80 labeled, 720 unlabeled training points
# (1,000 samples, 80/20 train/test split, then 10% of the training labels kept)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Mask 90% of training labels as "unknown" (-1 is the convention in sklearn)
n_labeled = int(0.1 * len(y_train))
y_partial = y_train.copy()
rng = np.random.default_rng(42)  # seeded so the labeled subset is reproducible
unlabeled_idx = rng.choice(len(y_train), size=len(y_train) - n_labeled, replace=False)
y_partial[unlabeled_idx] = -1 # -1 = unlabeled in sklearn convention

print(f"Labeled: {np.sum(y_partial != -1)}, Unlabeled: {np.sum(y_partial == -1)}")

# Label Propagation: spread labels through the graph of similar points
lp = LabelPropagation(kernel='knn', n_neighbors=7, max_iter=1000)
lp.fit(X_train, y_partial)

# Compare against supervised-only (using only the labeled examples)
from sklearn.linear_model import LogisticRegression
labeled_mask = y_partial != -1
lr_supervised = LogisticRegression(max_iter=1000)
lr_supervised.fit(X_train[labeled_mask], y_train[labeled_mask])

print(f"Supervised only ({n_labeled} labels): {accuracy_score(y_test, lr_supervised.predict(X_test)):.3f}")
print(f"Label Propagation: {accuracy_score(y_test, lp.predict(X_test)):.3f}")
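Pseudo-labeling, described in the approaches above, can be sketched without any library. The example assumes a one-dimensional nearest-centroid classifier and a hypothetical confidence rule (`min_margin`, the gap between the distances to the two class centroids); both are stand-ins chosen for brevity, not a recommended production setup.

```python
# Pseudo-labeling with a 1-D nearest-centroid classifier.
# The classifier, data, and min_margin threshold are illustrative assumptions.

def centroids_from(labeled):
    """Class means from (x, label) pairs; labels are 0 or 1."""
    means = {}
    for label in (0, 1):
        values = [x for x, y in labeled if y == label]
        means[label] = sum(values) / len(values)
    return means

def predict_with_margin(x, means):
    """Return (predicted label, confidence margin)."""
    d0, d1 = abs(x - means[0]), abs(x - means[1])
    return (0 if d0 < d1 else 1), abs(d0 - d1)

labeled = [(0.0, 0), (1.0, 0), (9.0, 1), (10.0, 1)]
unlabeled = [0.5, 1.5, 4.9, 5.2, 8.5, 9.5]
min_margin = 3.0  # hypothetical confidence threshold

# Step 1: train on labeled data only.
means = centroids_from(labeled)

# Step 2: keep only confident predictions as pseudo-labels.
pseudo = []
for x in unlabeled:
    label, margin = predict_with_margin(x, means)
    if margin >= min_margin:
        pseudo.append((x, label))

# Step 3: retrain on labeled + pseudo-labeled data.
means_retrained = centroids_from(labeled + pseudo)
print("Pseudo-labels:", pseudo)
print("Retrained centroids:", means_retrained)
```

The ambiguous points near the decision boundary (4.9 and 5.2) are left unlabeled. Filtering by confidence matters because wrong pseudo-labels feed the model's own mistakes back into training.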

Self-Supervised Learning

Self-supervised learning is the most important idea in modern ML. It creates surrogate supervised tasks from the structure of unlabeled data - no human labels needed.

The mechanism: Define a pretext task where the "label" comes from the data itself:

  • Predict the next word (GPT)
  • Predict masked words (BERT)
  • Predict whether two image crops came from the same image (SimCLR)
  • Predict the rotation applied to an image (RotNet)
# Conceptual illustration of masked language modeling (BERT-style)

sentence = "The model learned to [MASK] embeddings from text."
# BERT's pretext task: predict the masked word

# Input to model:
# ["The", "model", "learned", "to", "[MASK]", "embeddings", "from", "text"]

# Target (from the data itself - no human label needed):
# position 4 → "generate"

# The model is forced to understand:
# - Grammar (verb follows "to")
# - Semantics (what verbs make sense with "embeddings")
# - Context (the whole sentence constrains the answer)

# After pretraining on 3.3B words, the learned representations
# are useful for dozens of downstream tasks:
# - Sentiment analysis
# - Named entity recognition
# - Question answering
# - Document classification
# All with minimal labeled fine-tuning data
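The idea that the "label" comes from the data itself can be shown by generating training pairs from raw text. This sketch builds next-word pairs (GPT-style) and one masked-word pair (BERT-style); whitespace tokenization and the helper names are simplifying assumptions, since real models use subword tokenizers and mask positions at random.

```python
# Build self-supervised training pairs from an unlabeled sentence.
# Whitespace tokenization and helper names are illustrative assumptions.

def next_token_pairs(tokens):
    """(context, target) pairs for next-word prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_pair(tokens, position, mask_token="[MASK]"):
    """Hide one token; the hidden token becomes the training target."""
    masked = list(tokens)
    target = masked[position]
    masked[position] = mask_token
    return masked, target

tokens = "The model learned to generate embeddings from text".split()

pairs = next_token_pairs(tokens)
print(f"{len(pairs)} next-word examples; first: {pairs[0]}")

masked_input, target = masked_pair(tokens, position=4)
print("Masked input:", masked_input)
print("Target:", target)
```

No annotator was involved: every target is a token that was already present in the text.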

Why self-supervised learning matters for engineers:

  1. The internet is labeled data: Every web page, every image with a caption, every codebase with comments - all of this becomes training data for self-supervised models. The scale of available data grows by orders of magnitude.

  2. The pretrain-then-fine-tune paradigm has won: In NLP, vision, speech, and code - the dominant approach is pretrain a large model with self-supervised objectives, then fine-tune on your small labeled dataset. Understanding why requires understanding self-supervised learning.

  3. Transfer learning: Self-supervised pretraining produces general representations. For your specific problem (defect detection, medical imaging, financial time-series), you fine-tune a pretrained model on your few hundred labeled examples. This approach routinely matches or exceeds training from scratch with 10-100x more labeled data.

Part 4 - Reinforcement Learning

The Setup

Reinforcement learning (RL) is fundamentally different from supervised and unsupervised learning. There is no fixed dataset. Instead:

  • An agent takes actions $a_t$ in an environment
  • The environment returns a state $s_{t+1}$ and a reward $r_t$
  • The agent's goal: learn a policy $\pi(a \mid s)$ that maximizes cumulative reward

$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$

Where $\gamma \in [0, 1)$ is the discount factor (future rewards are worth less than immediate rewards).
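For a finite episode the infinite sum truncates at the last reward, and the return can be computed with a single backward pass using $G_t = r_t + \gamma G_{t+1}$. A minimal sketch, with the reward sequences and $\gamma$ chosen only for illustration:

```python
# Discounted return of a finite episode, computed backward.
# Reward sequences and gamma are illustrative assumptions.

def discounted_return(rewards, gamma=0.9):
    """G_0 = sum over k of gamma^k * r_k."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

delayed = [0.0, 0.0, 1.0]    # reward arrives only at the end
immediate = [1.0, 0.0, 0.0]  # same reward, received at once

print(f"delayed:   {discounted_return(delayed):.2f}")
print(f"immediate: {discounted_return(immediate):.2f}")
```

The same reward is worth less the later it arrives, which is exactly what the discount factor encodes.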

import numpy as np

# RL concepts illustrated with a simple multi-armed bandit
# No environment library needed - shows core exploration/exploitation tradeoff

class MultiArmedBandit:
    """Simple bandit with k arms, each with a fixed mean reward."""
    def __init__(self, k: int = 10, seed: int = 42):
        np.random.seed(seed)
        self.k = k
        self.true_means = np.random.randn(k)  # true reward means (unknown to agent)

    def pull(self, arm: int) -> float:
        """Pull an arm, get noisy reward."""
        return self.true_means[arm] + np.random.randn()

def epsilon_greedy(bandit, n_steps=1000, epsilon=0.1):
    """Epsilon-greedy policy: exploit with prob 1-ε, explore with prob ε."""
    k = bandit.k
    q_estimates = np.zeros(k)  # estimated mean reward for each arm
    n_pulls = np.zeros(k)      # number of pulls for each arm
    total_reward = 0

    for t in range(n_steps):
        if np.random.random() < epsilon:
            arm = np.random.randint(k)    # EXPLORE: random arm
        else:
            arm = np.argmax(q_estimates)  # EXPLOIT: best known arm

        reward = bandit.pull(arm)
        n_pulls[arm] += 1
        # Online update of mean estimate
        q_estimates[arm] += (reward - q_estimates[arm]) / n_pulls[arm]
        total_reward += reward

    return total_reward, q_estimates

bandit = MultiArmedBandit(k=10)
best_possible = np.max(bandit.true_means) * 1000  # expected reward of always pulling the best arm

for eps in [0.0, 0.01, 0.1, 0.3]:
    total, estimates = epsilon_greedy(bandit, n_steps=1000, epsilon=eps)
    print(f"ε={eps:.2f}: total_reward={total:.1f} (best_possible={best_possible:.1f})")

Exploration vs. Exploitation

The central challenge in RL is the exploration-exploitation tradeoff:

  • Exploitation: use the best action currently known → maximize short-term reward
  • Exploration: try new actions → potentially discover better long-term rewards

This tradeoff has no perfect solution - it is a fundamental tension. Key strategies:

| Strategy | Idea | When to use |
|---|---|---|
| ε-greedy | Random exploration with probability ε | Simple baseline; works best with a decaying ε schedule |
| UCB (Upper Confidence Bound) | Explore actions with high uncertainty | When statistical bounds are tractable |
| Thompson Sampling | Sample from posterior, pick best | Bayesian setting, strong performance |
| Entropy bonus | Reward agent for diverse behavior | Avoid local optima in complex environments |
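As a sketch of the UCB row, the snippet below implements the UCB1 rule on a Gaussian bandit: pick the arm with the highest estimate plus an uncertainty bonus that shrinks as the arm is pulled more often. The arm means, unit-variance noise, the constant `c=2.0`, and the seed are illustrative assumptions.

```python
import math
import random

# UCB1 on a Gaussian bandit: estimate + uncertainty bonus.
# Arm means, noise level, c, and seed are illustrative assumptions.

def ucb1(true_means, n_steps=2000, c=2.0, seed=0):
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    estimates = [0.0] * k
    for t in range(1, n_steps + 1):
        if t <= k:
            arm = t - 1  # pull every arm once so each count is nonzero
        else:
            # Bonus is large for rarely pulled arms and shrinks with more pulls.
            arm = max(
                range(k),
                key=lambda a: estimates[a] + math.sqrt(c * math.log(t) / counts[a]),
            )
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

counts, estimates = ucb1([0.0, 0.5, 2.0])
print("Pulls per arm:", counts)
print("Estimated means:", [round(e, 2) for e in estimates])
```

Unlike ε-greedy, exploration here is directed: an arm is revisited only while its uncertainty bonus could still make it the best candidate.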

Deep RL in practice

Modern RL uses deep neural networks as function approximators for the policy $\pi_\theta(a \mid s)$ and value function $V_\theta(s)$:

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Simple policy network for continuous state, discrete action spaces."""

    def __init__(self, state_dim: int, n_actions: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        """Returns logits over actions given state."""
        return self.net(state)

    def get_action(self, state: torch.Tensor) -> tuple:
        """Sample action from policy distribution."""
        logits = self.forward(state)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action.item(), log_prob

# Policy gradient (REINFORCE) update
def policy_gradient_update(policy, optimizer, log_probs, rewards, gamma=0.99):
    """Compute discounted returns and update policy via REINFORCE."""
    # Compute discounted returns, working backward from the final step
    returns = []
    G = 0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns, dtype=torch.float32)
    if len(returns) > 1:  # std of a single value is undefined (NaN)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalize

    # Policy gradient loss: -log_prob * G (maximize expected return)
    policy_loss = []
    for log_prob, G in zip(log_probs, returns):
        policy_loss.append(-log_prob * G)

    loss = torch.stack(policy_loss).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

When RL is the right choice

RL is powerful but expensive. Use it when:

| Condition | Examples |
|---|---|
| Decisions in a sequence with delayed rewards | Game playing, robotics, dialogue systems |
| No labeled data, but a reward signal exists | AlphaGo (win/loss), trading (P&L) |
| The environment is simulated and cheap to run | Video games, physics simulators |
| You need to optimize for a non-differentiable objective | RLHF - optimize for human preference ratings |

Do NOT reach for RL when:

  • You have labeled data - supervised learning is cheaper and more reliable
  • The reward is sparse and hard to define precisely
  • The real-world environment is expensive to interact with (manufacturing, clinical trials)
  • You need fast results - RL is notoriously sample-inefficient

:::note RLHF: Why RL Matters for LLMs
Reinforcement Learning from Human Feedback (RLHF) is a central technique for aligning assistants such as ChatGPT and Claude to follow instructions and avoid harmful outputs. The reward signal comes from human preference comparisons (this response is better than that response). The overall pipeline is a hybrid: self-supervised pretraining, supervised fine-tuning, then RL for alignment. Understanding the RL paradigm is essential for working with modern LLMs.
:::

Part 5 - Full Paradigm Comparison

| Paradigm | Data Required | Learns | Examples | When to Use |
|---|---|---|---|---|
| Supervised | Labeled $(x, y)$ pairs | $P(y \mid x)$ | Logistic reg, GBT, CNN classifier | Labeled data available; clear prediction target |
| Unsupervised | Unlabeled $x$ only | $P(x)$, structure | K-Means, PCA, Autoencoder | No labels; discover structure; reduce dimensionality |
| Semi-supervised | Small labeled + large unlabeled | $P(y \mid x)$ using $P(x)$ | Label Propagation, Pseudo-labeling | Labels expensive; unlabeled data abundant |
| Self-supervised | Unlabeled $x$ only (task from data) | General representations | BERT, GPT, SimCLR | Pretrain at scale; fine-tune on small labeled set |
| Reinforcement | Reward signal from environment | Policy $\pi(a \mid s)$ | Q-Learning, PPO, AlphaGo | Sequential decisions; reward but no labels; simulation available |

Decision flowchart for paradigm selection

Do you have labeled (x, y) pairs for your task?
│
├── Yes → Is your labeled set large enough for your model?
│         │
│         ├── Yes → SUPERVISED LEARNING
│         │
│         └── No → Do you have unlabeled data?
│                  │
│                  ├── Yes → SEMI-SUPERVISED or
│                  │         SELF-SUPERVISED PRETRAIN + fine-tune
│                  │
│                  └── No → Get more data. Too few labels to train on,
│                           and nothing to pretrain on.
│
└── No → Is your goal to discover structure in the data?
         │
         ├── Yes → UNSUPERVISED LEARNING (clustering, density, dim reduction)
         │
         └── No → Is your problem about sequential decisions with rewards?
                  │
                  ├── Yes → REINFORCEMENT LEARNING
                  │
                  └── No → Reconsider problem framing.
                           You probably need labels.
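The flowchart above can also be expressed as a small function, which is handy for checklists or design-review templates. The function name, argument names, and return strings are hypothetical; the branching simply mirrors the questions in the flowchart.

```python
# The paradigm-selection flowchart as code.
# Function and argument names are hypothetical, chosen for readability.

def choose_paradigm(has_labels, enough_labels=False, has_unlabeled=False,
                    wants_structure=False, sequential_with_rewards=False):
    if has_labels:
        if enough_labels:
            return "supervised"
        if has_unlabeled:
            return "semi-supervised or self-supervised pretrain + fine-tune"
        return "get more data"
    if wants_structure:
        return "unsupervised"
    if sequential_with_rewards:
        return "reinforcement"
    return "reconsider problem framing"

# The semiconductor fab from the introduction: 2,000 labels, 500,000 images.
print(choose_paradigm(has_labels=True, enough_labels=False, has_unlabeled=True))
```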

Part 6 - The Pretrain-Then-Fine-Tune Paradigm

This is the dominant ML paradigm of 2020–present. You need to understand it.

Step 1 - Pretraining: Train a large model on a massive unlabeled dataset using a self-supervised objective. The model learns general-purpose representations.

Step 2 - Fine-tuning: Take the pretrained model, add a task-specific head, and train on a small labeled dataset. The pretrained representations accelerate learning and improve generalization.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Example: DistilBERT, a distilled version of BERT. BERT itself was pretrained on
# 3.3B words with a self-supervised masked language modeling objective.
# Illustrative figures for sentiment analysis with 1,000 labeled examples:
# ~92% accuracy after fine-tuning vs. ~70% when training from scratch.
# Actual numbers depend on the dataset and setup.

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load pretrained model, add a 2-class classification head
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2  # positive / negative sentiment
)

# In fine-tuning:
# - The pretrained transformer weights start at their pretrained values
# - The new classification head is randomly initialized
# - We train the whole thing on our labeled data with a small learning rate
# - The pretrained knowledge transfers; we only need to adapt it

def encode_text(texts, max_length=128):
    return tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors='pt'
    )

# Fine-tuning would proceed as follows (each batch must include a "labels"
# entry so that outputs.loss is populated):
# for batch in fine_tune_dataloader:
#     optimizer.zero_grad()
#     outputs = model(**batch)
#     loss = outputs.loss
#     loss.backward()
#     optimizer.step()

Why this works: The pretrained model already understands language (or images, or code). Fine-tuning is cheap because we are not learning language from scratch - we are applying language understanding to a specific task.

:::tip Role-specific note
Research Engineer: Self-supervised learning is an active research area - contrastive learning (SimCLR, MoCo), masked autoencoders (MAE), CLIP, DINO. Understanding the objective function of each is key to evaluating new architectures.

MLE: In practice, almost every production NLP/vision system starts with a pretrained model. Your skill is in choosing the right pretrained model, fine-tuning strategy (full fine-tuning vs. LoRA vs. prompt tuning), and evaluation.

ML Engineer: The pretrain-fine-tune paradigm has infrastructure implications - storing large pretrained models, version-controlled fine-tuned checkpoints, serving infrastructure that handles 7B+ parameter models. Plan for this.
:::

Interview Questions

Q1: What is the difference between semi-supervised and self-supervised learning? Give a production example of each.

Semi-supervised learning uses a combination of a small labeled set and a large unlabeled set. The labeled set provides direct supervision; the unlabeled set helps learn the structure of $P(x)$, which constrains and improves the learned $P(y \mid x)$.

Production example: Medical imaging company has 500 labeled X-rays (expensive radiologist annotations) and 50,000 unlabeled X-rays. They use label propagation or pseudo-labeling to leverage the unlabeled X-rays during classifier training.

Self-supervised learning creates a supervised pretext task from the structure of unlabeled data itself - no human labels needed. The "label" is derived from the data (next word, masked word, image rotation, contrastive pair).

Production example: Google trained BERT on 3.3 billion words from Wikipedia and BooksCorpus using masked language modeling - masking random words and predicting them. No human labeled any of this data. The resulting model is then fine-tuned on tiny labeled datasets for search, question answering, and entity recognition.

Key distinction: Semi-supervised assumes you have some human labels. Self-supervised assumes you have zero human labels and creates its own supervision from data structure. Self-supervised is more powerful at scale (it uses the entire internet as training data), but it requires designing the right pretext task.

Q2: Explain the exploration-exploitation tradeoff in reinforcement learning. How does ε-greedy address it, and what are its limitations?

The tradeoff: An RL agent must balance exploiting its current best knowledge (maximize immediate reward) against exploring unfamiliar actions (potentially discover higher long-term rewards). Pure exploitation never discovers better strategies. Pure exploration never converges to using good strategies.

ε-greedy approach: With probability ε, take a random action (explore). With probability 1-ε, take the current best-known action (exploit).

if random() < ε:
    action = random_action()   # explore
else:
    action = argmax(Q[state])  # exploit

Strengths: Simple, tunable, empirically effective in many environments.

Limitations:

  1. ε is a hyperparameter that requires tuning: Too high → always random, never converges. Too low → never explores, gets stuck in local optima.
  2. Uniform exploration is inefficient: ε-greedy treats all actions equally during exploration. It wastes time re-exploring actions already known to be bad.
  3. No uncertainty modeling: ε-greedy does not track which actions are uncertain vs. well-estimated. UCB (Upper Confidence Bound) is more principled - explore actions where the estimate is uncertain, not just randomly.
  4. Fixed ε does not adapt: A decaying ε schedule (start high, decay over time) performs much better in practice because you should explore more early and exploit more late.
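The decaying schedule mentioned in the last limitation can be sketched as a linear anneal. The start value, end value, and decay horizon below are hypothetical defaults; exponential decay is an equally common choice.

```python
# Linearly decaying epsilon schedule: explore early, exploit late.
# eps_start, eps_end, and decay_steps are hypothetical defaults.

def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=1000):
    """Anneal epsilon from eps_start to eps_end, then hold it at eps_end."""
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)

for step in [0, 250, 500, 1000, 5000]:
    print(f"step={step}: epsilon={decayed_epsilon(step):.3f}")
```

Keeping a small floor (`eps_end`) rather than decaying to zero preserves some exploration in case reward estimates are still off late in training.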

Better alternatives in practice: UCB for bandit settings, PPO/SAC for complex state-action spaces, Thompson Sampling when a Bayesian model is tractable.

Q3: Why does the pretrain-then-fine-tune paradigm outperform training from scratch on small labeled datasets?

The pretrain-then-fine-tune paradigm wins because of inductive bias and parameter initialization:

1. Better initialization: A randomly initialized model starts far from a useful solution in the loss landscape. A pretrained model starts near a good solution for language/vision tasks - one that has already compressed millions of examples into its weights. Fine-tuning is a small adjustment, not a large search.

2. Learned representations are transferable: Self-supervised pretraining (masked language modeling, contrastive vision) forces the model to learn general representations that capture syntax, semantics, spatial relationships, etc. These representations are useful across many downstream tasks.

3. Data efficiency: Training from scratch on 1,000 labeled sentiment examples requires the model to simultaneously learn "what language is" and "what sentiment means." Fine-tuning a BERT model only requires learning "what sentiment means" - the language understanding is pre-loaded.

4. Implicit regularization: Pretrained weights act as a prior. The model is regularized toward the pretraining distribution, which often corresponds to realistic, well-structured features. This reduces overfitting on small fine-tuning datasets.

Empirical magnitude: On GLUE-style benchmarks, BERT fine-tuned on about 1,000 examples can match or exceed LSTM models trained from scratch on roughly 100,000 examples for many tasks. Exact figures vary by task, but a data-efficiency gain on the order of 100x is practically significant.

Q4: When would you use clustering as opposed to a supervised classifier, even if labeled data were available?

Several scenarios favor clustering over supervised classification even with labeled data available:

1. The true cluster structure does not match your labels: If your labels are coarse (e.g., "good customer" / "bad customer") but the data has rich substructure, clustering can reveal segment-specific patterns that your labels suppress. You might find that "bad customers" fall into three distinct groups requiring different interventions.

2. You need to discover unknown unknowns: A supervised classifier can only predict the classes it was trained on. Clustering can discover new groups you did not know existed - emerging customer segments, new fraud patterns, novel defect types.

3. Labels are expensive and the problem is exploratory: In early-stage projects, clustering lets you understand your data before committing to a labeling taxonomy. Labeling 50,000 examples as K=5 clusters that your team designed might miss the actual structure.

4. Anomaly detection without known anomaly classes: If you know what normal looks like but do not have a labeled set of anomalies, density-based clustering (DBSCAN) or Isolation Forest can flag outliers without supervised labels.

5. Unsupervised compression or preprocessing: PCA/K-Means for dimensionality reduction or feature compression before supervised training - the unsupervised step is not replacing supervised learning, it is making it more efficient.

The key principle: labeled data tells you what you already know. Clustering tells you what you do not know. Use both.

Q5: What is RLHF (Reinforcement Learning from Human Feedback) and why is it used to align language models?

RLHF is a three-stage process for aligning LLM behavior with human preferences:

Stage 1 - Supervised Fine-Tuning (SFT): Start with a pretrained LLM. Fine-tune it on high-quality human-written demonstrations of desired behavior (helpful, honest responses). This teaches the model the format and style of good responses.

Stage 2 - Reward Model Training: Collect human preference data: show pairs of model responses and ask humans which is better. Train a separate reward model $R_\phi(x, y)$ to predict which response humans prefer - this model assigns a scalar score to (prompt, response) pairs.

Stage 3 - RL Fine-tuning (PPO): Use the reward model as a reward signal to fine-tune the LLM policy via Proximal Policy Optimization (PPO). The LLM learns to generate responses that score highly under the reward model, while a KL-divergence penalty prevents it from drifting too far from the SFT model.
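Two quantities from Stages 2 and 3 can be written down compactly. The sketch assumes the commonly used pairwise (Bradley-Terry style) loss for the reward model and a per-sample log-probability difference as the KL penalty estimate. The function names, the coefficient `beta`, and the numeric inputs are hypothetical, and real implementations operate on batches of token-level log-probabilities.

```python
import math

# Stage 2: pairwise preference loss. Stage 3: KL-penalized reward.
# Names, beta, and the example numbers are illustrative assumptions.

def preference_loss(score_chosen, score_rejected):
    """-log sigmoid(score_chosen - score_rejected)."""
    diff = score_chosen - score_rejected
    return math.log(1.0 + math.exp(-diff))

def penalized_reward(reward_score, logp_policy, logp_sft, beta=0.1):
    """Reward-model score minus a penalty for drifting from the SFT model."""
    return reward_score - beta * (logp_policy - logp_sft)

# The reward model is penalized when it ranks the rejected response higher.
print(f"correct ranking: {preference_loss(2.0, 0.0):.4f}")
print(f"wrong ranking:   {preference_loss(0.0, 2.0):.4f}")

# A response the policy now finds much more likely than the SFT model did
# has part of its reward taken away.
print(f"penalized reward: {penalized_reward(1.0, logp_policy=-5.0, logp_sft=-7.0):.2f}")
```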

Why RL is needed (not just supervised learning): Human preferences are not a fixed labeled dataset - the reward is a signal that depends on what the model generates. We want the model to optimize for "what humans prefer" in a way that generalizes, which is a sequential decision problem amenable to RL. If you just collected more demonstrations, you would miss the "optimization against the reward model" step that pushes toward better-than-human-average responses.

Practical limitation: RLHF is brittle. The reward model can be gamed (reward hacking - the LLM learns to produce responses that score highly on the reward model but are not actually good). This is an active research area, with DPO (Direct Preference Optimization) emerging as a simpler alternative that avoids explicit RL.

Key Takeaways

  • Supervised learning requires labeled $(x, y)$ pairs; the loss function encodes what "good" means; everything is Empirical Risk Minimization
  • Unsupervised learning discovers structure in $P(x)$ without labels - useful for clustering, density estimation, and dimensionality reduction
  • Semi-supervised learning uses a small labeled set plus a large unlabeled set; self-supervised learning creates labels from the data itself with no human annotation
  • Reinforcement learning optimizes a policy $\pi(a \mid s)$ to maximize cumulative reward - the right tool for sequential decisions with delayed rewards, not for problems with labeled data
  • The pretrain-then-fine-tune paradigm dominates modern NLP and vision: self-supervised pretraining provides general representations, labeled fine-tuning adapts them to specific tasks
  • Choosing the right paradigm before writing code is one of the highest-leverage decisions in an ML project



© 2026 EngineersOfAI. All rights reserved.