Supervised, Unsupervised, and Reinforcement Learning
Reading time: ~22 minutes | Level: ML Foundations | Role: MLE, ML Engineer, Data Scientist, Research Engineer
A computer vision team was building a defect detection system for a semiconductor fab. They had 500,000 images but only 2,000 labeled as "defective" or "clean" - labeling required an expert who cost $150/hour. The team had three options they debated for two weeks:
Option A (Supervised): Train a classifier on just the 2,000 labeled images. The model trained fine but generalized poorly to new defect types the annotator had not seen.
Option B (Unsupervised first, supervised second): Use all 500,000 images to learn visual representations via self-supervised learning (SimCLR - predict whether two augmented views come from the same image). Then fine-tune on the 2,000 labeled examples. The representation learning exposed the model to vastly more visual structure. Accuracy improved from 71% to 89%.
Option C (Active learning loop): Use the self-supervised model to find the images where the model was most uncertain, prioritize those for labeling, and iterate. 300 strategically selected labels outperformed 2,000 randomly selected labels.
The right answer was a combination of B and C - and knowing which paradigm to use, and when to combine them, is precisely what this lesson teaches.
What You Will Learn
- Supervised learning: labeled data, loss functions, and ERM
- Unsupervised learning: structure discovery without labels
- Semi-supervised and self-supervised learning - why they dominate modern ML
- Reinforcement learning: reward signals, exploration, and when it applies
- Algorithmic classification by paradigm with data requirements and use cases
- How to choose the right paradigm before writing any code
Part 1 - Supervised Learning
The Setup
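Supervised learning is the most common paradigm. You have a labeled dataset:
$$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$$
Where $x_i$ are inputs (features) and $y_i$ are labels (targets). You learn a function $f_\theta$ that predicts $y$ from $x$ on unseen inputs.
The learning proceeds by minimizing a loss function $\mathcal{L}$ over the training set - Empirical Risk Minimization (ERM):
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\big(f_\theta(x_i), y_i\big)$$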
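As a minimal sketch of empirical risk minimization, the snippet below fits a one-feature linear model by gradient descent on the mean squared error. The synthetic data, learning rate, and step count are assumptions chosen for illustration, not values from this lesson.

```python
import numpy as np

# Synthetic regression data (an assumption for illustration): y = 3x + 1 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 1.0 + 0.1 * rng.normal(size=200)

def empirical_risk(w, b, X, y):
    """Mean squared-error loss averaged over the training set."""
    preds = X[:, 0] * w + b
    return np.mean((preds - y) ** 2)

# Minimize the empirical risk with plain gradient descent
w, b, lr = 0.0, 0.0, 0.1
initial_risk = empirical_risk(w, b, X, y)
for _ in range(200):
    residual = X[:, 0] * w + b - y
    w -= lr * 2.0 * np.mean(residual * X[:, 0])  # d(risk)/dw
    b -= lr * 2.0 * np.mean(residual)            # d(risk)/db
final_risk = empirical_risk(w, b, X, y)

print(f"risk: {initial_risk:.3f} -> {final_risk:.4f}, w={w:.2f}, b={b:.2f}")
```

The learned `w` and `b` should land close to the generating values 3 and 1; the remaining risk is roughly the variance of the injected noise.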
Types of supervised learning
Classification: $y \in \{1, \dots, K\}$ - predict a discrete class
- Binary: spam/not spam, fraud/not fraud
- Multiclass: digit recognition (10 classes), ImageNet (1,000 classes)
- Multilabel: document tagging (multiple tags per document)
Regression: $y \in \mathbb{R}$ - predict a continuous value
- House price prediction, demand forecasting, temperature prediction
- Time-to-event prediction (survival analysis)
Structured prediction: $y$ is a complex structure
- Sequence labeling (NER: "Apple → ORG", "New York → LOC")
- Machine translation (sequence-to-sequence)
- Object detection (bounding boxes + class labels per image)
The role of the loss function
The loss function encodes what "good prediction" means. Choosing the wrong loss is a design error that no amount of model tuning will fix:
| Task | Loss Function | Why |
|---|---|---|
| Binary classification | Binary cross-entropy | Measures log-likelihood under Bernoulli |
| Multiclass classification | Categorical cross-entropy | Measures log-likelihood under Categorical |
| Regression | MSE (L2 loss) | Optimal under Gaussian noise assumption |
| Regression with outliers | MAE (L1 loss) or Huber | Robust to heavy-tailed noise |
| Ranking | Pairwise ranking loss | Optimizes relative ordering |
| Sequence generation | Per-token cross-entropy | Decomposes joint probability by chain rule |
import numpy as np
import torch
import torch.nn as nn
# Classification: cross-entropy loss
# y_true is class indices, y_pred is logits
y_true = torch.tensor([0, 2, 1, 0]) # class indices
y_pred_logits = torch.randn(4, 3) # raw logits for 3 classes
ce_loss = nn.CrossEntropyLoss()
loss_val = ce_loss(y_pred_logits, y_true)
print(f"Cross-entropy loss: {loss_val.item():.4f}")
# Regression: MSE loss
y_true_reg = torch.tensor([2.3, 0.1, -1.4, 3.7])
y_pred_reg = torch.tensor([2.1, 0.5, -1.2, 3.9])
mse_loss = nn.MSELoss()
print(f"MSE loss: {mse_loss(y_pred_reg, y_true_reg).item():.4f}")
# sklearn: supervised learning baseline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = GradientBoostingClassifier(n_estimators=100, max_depth=3)
clf.fit(X_train, y_train)
print(f"Test accuracy: {accuracy_score(y_test, clf.predict(X_test)):.3f}")
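The loss table lists Huber and MAE as robust alternatives to MSE. A small NumPy sketch of why: the same predictions are scored against clean targets and against targets with one corrupted value. The `delta=1.0` threshold and the corrupted value of 30.0 are assumptions for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    err = np.abs(y_true - y_pred)
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return np.mean(np.where(err <= delta, quadratic, linear))

y_true = np.array([2.3, 0.1, -1.4, 3.7])
y_pred = np.array([2.1, 0.5, -1.2, 3.9])
y_outlier = y_true.copy()
y_outlier[-1] = 30.0  # one corrupted target (hypothetical)

for name, fn in [("MSE", mse), ("MAE", mae), ("Huber", huber)]:
    print(f"{name}: clean={fn(y_true, y_pred):.3f}, "
          f"with outlier={fn(y_outlier, y_pred):.3f}")
```

A single bad target inflates MSE far more than MAE or Huber, because the squared term grows quadratically with the error while the other two grow linearly.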
What supervised learning requires
- Labeled data: (x, y) pairs. Typically need hundreds to millions depending on complexity.
- IID assumption: Training examples are independently drawn from the same distribution as test examples. Violating this (temporal ordering, group structure) requires special handling (Lesson 10).
- Label quality: Noisy labels directly degrade model performance. As a rough guide, 20% random label noise can drop accuracy by 10–15%, though the exact impact depends on the model and dataset.
:::warning The labeling bottleneck In most real projects, getting labeled data is the primary constraint - not model complexity, not compute. This is why semi-supervised and self-supervised learning are critically important: they are strategies to reduce your dependence on expensive labeled data. :::
Part 2 - Unsupervised Learning
The Setup
Unsupervised learning has no labels. You observe only inputs $\{x_i\}_{i=1}^{n}$ and want to discover structure in the data distribution $p(x)$ without being told what to look for.
This is fundamentally harder than supervised learning: there is no objective label to optimize toward, and "what counts as structure" is problem-dependent.
The three main unsupervised tasks
1. Clustering: Group similar observations together
Find groups such that within-group observations are similar and between-group observations are different.
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler
import numpy as np
# Customer segmentation: group customers by behavior
# No labels - we don't know how many segments exist or what they are
np.random.seed(42)
n_customers = 1000
customer_features = np.column_stack([
np.random.randn(n_customers), # recency (days since last purchase)
np.random.randn(n_customers), # frequency (purchases per month)
np.random.randn(n_customers), # monetary (avg spend)
])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(customer_features)
# K-Means: assumes spherical clusters of similar size
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
segments = kmeans.fit_predict(X_scaled)
print(f"Cluster sizes: {np.bincount(segments)}")
# DBSCAN: density-based - finds clusters of arbitrary shape, handles noise
dbscan = DBSCAN(eps=0.5, min_samples=5)
segments_db = dbscan.fit_predict(X_scaled)
n_clusters = len(set(segments_db)) - (1 if -1 in segments_db else 0)
n_noise = np.sum(segments_db == -1)
print(f"DBSCAN: {n_clusters} clusters, {n_noise} noise points")
2. Density estimation: Model the probability distribution
Learn the probability density $p(x)$ of the data. Useful for:
- Anomaly detection (low-density regions = anomalies)
- Generative modeling (sample from learned distribution)
- Compression and coding theory
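A minimal sketch of density-based anomaly detection, assuming the "normal" data is well described by a single multivariate Gaussian (a strong simplification; mixtures or kernel density estimates are common in practice). The 1% threshold and the query points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))  # "normal" behavior
X_query = np.array([[0.1, -0.2],    # typical point
                    [6.0, 6.0]])    # far from the bulk of the data

# Fit a single multivariate Gaussian by maximum likelihood
mu = X_normal.mean(axis=0)
cov = np.cov(X_normal, rowvar=False)
cov_inv = np.linalg.inv(cov)
_, logdet = np.linalg.slogdet(cov)

def log_density(X):
    """Log of the fitted Gaussian density at each row of X."""
    diff = X - mu
    mahal = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    d = X.shape[1]
    return -0.5 * (mahal + logdet + d * np.log(2 * np.pi))

# Flag anything less likely than the lowest 1% of training points
threshold = np.percentile(log_density(X_normal), 1)
is_anomaly = log_density(X_query) < threshold
print(is_anomaly)
```

The threshold is set from the training data itself, so no anomaly labels are needed; only the choice of percentile encodes how aggressive the detector is.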
3. Dimensionality reduction: Find a lower-dimensional representation
Map $x \in \mathbb{R}^d$ to $z \in \mathbb{R}^k$ where $k \ll d$, preserving important structure.
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import numpy as np
# Dimensionality reduction for visualization and compression
np.random.seed(42)
X_highd = np.random.randn(500, 100) # 500 samples, 100-dimensional
# PCA: linear, preserves maximum variance
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X_highd)
variance_explained = pca.explained_variance_ratio_.cumsum()
print(f"PCA 10 components explain {variance_explained[-1]*100:.1f}% of variance")
# t-SNE: nonlinear, preserves local structure - great for visualization
# Warning: t-SNE is for visualization only, not for downstream ML tasks
# It does not preserve global structure or distances
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_2d = tsne.fit_transform(X_highd) # shape: (500, 2)
print(f"t-SNE output shape: {X_2d.shape}") # (500, 2) - for plotting
When unsupervised learning is the right choice
| Use Case | Algorithm | Why |
|---|---|---|
| Customer segmentation | K-Means, DBSCAN | No ground truth "correct" segments exist |
| Anomaly detection | Isolation Forest, Autoencoder | Normal data is plentiful; anomalies are rare and undefined |
| Topic modeling | LDA | No pre-existing topic taxonomy |
| Data compression | PCA, Autoencoder | Reduce storage/compute without labels |
| Exploratory data analysis | t-SNE, UMAP | Understand data structure before deciding what to predict |
| Pretraining (see Part 3) | SimCLR, BERT | Unlabeled data vastly outnumbers labeled data |
Part 3 - Semi-Supervised and Self-Supervised Learning
This is where most of the important recent progress in ML has happened. Both approaches bridge supervised and unsupervised learning, and they are hugely important in practice.
Semi-Supervised Learning
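Setting: You have a small labeled set $\mathcal{D}_L = \{(x_i, y_i)\}_{i=1}^{l}$ and a large unlabeled set $\mathcal{D}_U = \{x_j\}_{j=1}^{u}$, where $u \gg l$.
The key assumption: the unlabeled data contains information about the structure of $p(x)$ that helps learn $p(y \mid x)$.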
Approaches:
Label propagation: If two points are close in input space, they likely share a label. Propagate labels from labeled to nearby unlabeled points.
Pseudo-labeling: Train on labeled data, use the trained model to generate "pseudo-labels" for unlabeled data, retrain on the combination.
Consistency regularization: Train the model so that augmented versions of the same unlabeled input receive the same prediction.
import numpy as np
from sklearn.semi_supervised import LabelPropagation, LabelSpreading
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Semi-supervised learning: 100 labeled, 900 unlabeled
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Mask 90% of training labels as "unknown" (-1 is the convention in sklearn)
n_labeled = int(0.1 * len(y_train))
y_partial = y_train.copy()
unlabeled_idx = np.random.choice(len(y_train), size=len(y_train) - n_labeled, replace=False)
y_partial[unlabeled_idx] = -1 # -1 = unlabeled in sklearn convention
print(f"Labeled: {np.sum(y_partial != -1)}, Unlabeled: {np.sum(y_partial == -1)}")
# Label Propagation: spread labels through the graph of similar points
lp = LabelPropagation(kernel='knn', n_neighbors=7, max_iter=1000)
lp.fit(X_train, y_partial)
# Compare against supervised-only (using only the labeled examples)
from sklearn.linear_model import LogisticRegression
labeled_mask = y_partial != -1
lr_supervised = LogisticRegression(max_iter=1000)
lr_supervised.fit(X_train[labeled_mask], y_train[labeled_mask])
print(f"Supervised only ({n_labeled} labels): {accuracy_score(y_test, lr_supervised.predict(X_test)):.3f}")
print(f"Label Propagation: {accuracy_score(y_test, lp.predict(X_test)):.3f}")
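Pseudo-labeling, described in the approaches above, can be sketched in three steps: train on the labeled subset, label the unlabeled points the model is confident about, then retrain on both. The nearest-centroid "model", the toy two-blob data, and the 0.95 confidence cutoff are all assumptions made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two Gaussian blobs; only 10 points per class keep their labels (toy assumption)
X = np.vstack([rng.normal(-2.0, 1.0, size=(500, 2)),
               rng.normal(2.0, 1.0, size=(500, 2))])
y = np.repeat([0, 1], 500)
labeled_idx = np.concatenate([rng.choice(500, 10, replace=False),
                              500 + rng.choice(500, 10, replace=False)])
is_labeled = np.zeros(len(y), dtype=bool)
is_labeled[labeled_idx] = True

def fit_centroids(X, y):
    """Nearest-centroid 'model': one mean vector per class."""
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def predict_proba(centroids, X):
    """Softmax over negative squared distances to each centroid."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

# Step 1: train on the labeled subset only
centroids = fit_centroids(X[is_labeled], y[is_labeled])

# Step 2: pseudo-label unlabeled points the model is confident about
proba = predict_proba(centroids, X[~is_labeled])
confident = proba.max(axis=1) >= 0.95
X_pseudo = X[~is_labeled][confident]
y_pseudo = proba.argmax(axis=1)[confident]

# Step 3: retrain on labeled + pseudo-labeled data
X_aug = np.vstack([X[is_labeled], X_pseudo])
y_aug = np.concatenate([y[is_labeled], y_pseudo])
centroids_aug = fit_centroids(X_aug, y_aug)

acc = (predict_proba(centroids_aug, X).argmax(axis=1) == y).mean()
print(f"pseudo-labeled {confident.sum()} of {(~is_labeled).sum()} points, accuracy={acc:.3f}")
```

The confidence cutoff matters: accepting low-confidence pseudo-labels feeds the model's own mistakes back into training, which is the main failure mode of this approach.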
Self-Supervised Learning
Self-supervised learning is arguably the most important idea in modern ML. It creates surrogate supervised tasks from the structure of unlabeled data - no human labels needed.
The mechanism: Define a pretext task where the "label" comes from the data itself:
- Predict the next word (GPT)
- Predict masked words (BERT)
- Predict whether two image crops came from the same image (SimCLR)
- Predict the rotation applied to an image (RotNet)
# Conceptual illustration of masked language modeling (BERT-style)
sentence = "The model learned to [MASK] embeddings from text."
# BERT's pretext task: predict the masked word
# Input to model:
# ["The", "model", "learned", "to", "[MASK]", "embeddings", "from", "text"]
# Target (from the data itself - no human label needed):
# position 4 → "generate"
# The model is forced to understand:
# - Grammar (verb follows "to")
# - Semantics (what verbs make sense with "embeddings")
# - Context (the whole sentence constrains the answer)
# After pretraining on 3.3B words, the learned representations
# are useful for dozens of downstream tasks:
# - Sentiment analysis
# - Named entity recognition
# - Question answering
# - Document classification
# All with minimal labeled fine-tuning data
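The same "labels from the data itself" idea applies to images. The sketch below builds a rotation-prediction pretext dataset in the spirit of RotNet: each unlabeled image is rotated by a random multiple of 90 degrees, and the rotation index becomes the target. The helper name `make_rotation_task` and the random stand-in images are hypothetical.

```python
import numpy as np

def make_rotation_task(images, rng):
    """Turn unlabeled square images into (rotated_image, rotation_class) pairs.

    The class k in {0, 1, 2, 3} means the image was rotated by k * 90 degrees.
    """
    ks = rng.integers(0, 4, size=len(images))
    rotated = np.stack([np.rot90(img, k=k) for img, k in zip(images, ks)])
    return rotated, ks

rng = np.random.default_rng(0)
unlabeled = rng.random((8, 16, 16))   # 8 stand-in grayscale "images"
x_pretext, y_pretext = make_rotation_task(unlabeled, rng)

print(x_pretext.shape, y_pretext)
```

A classifier trained on `(x_pretext, y_pretext)` never sees a human label, yet it has to learn something about object orientation and shape to solve the task.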
Why self-supervised learning matters for engineers:
- The internet is labeled data: Every web page, every image with a caption, every codebase with comments - all of this becomes training data for self-supervised models. The scale of available data grows by orders of magnitude.
- The pretrain-then-fine-tune paradigm has won: In NLP, vision, speech, and code, the dominant approach is to pretrain a large model with self-supervised objectives, then fine-tune on your small labeled dataset. Understanding why requires understanding self-supervised learning.
- Transfer learning: Self-supervised pretraining produces general representations. For your specific problem (defect detection, medical imaging, financial time-series), you fine-tune a pretrained model on your few hundred labeled examples. This approach routinely matches or exceeds training from scratch with 10-100x more labeled data.
Part 4 - Reinforcement Learning
The Setup
Reinforcement learning (RL) is fundamentally different from supervised and unsupervised learning. There is no fixed dataset. Instead:
- An agent takes actions $a_t$ in an environment
- The environment returns a state $s_{t+1}$ and a reward $r_t$
- The agent's goal: learn a policy $\pi(a \mid s)$ that maximizes cumulative reward
$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$
Where $\gamma \in [0, 1)$ is the discount factor (future rewards are worth less than immediate rewards).
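To make the discount factor concrete, here is a small worked example of the discounted return for a finite reward sequence. The reward values are made up; the backward recursion is one standard way to compute the sum.

```python
def discounted_return(rewards, gamma):
    """G_0 = r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a finite episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g   # fold in one more step of the future
    return g

rewards = [1.0, 1.0, 1.0]
for gamma in (1.0, 0.9, 0.0):
    print(f"gamma={gamma}: G_0={discounted_return(rewards, gamma):.2f}")
```

With `gamma=0.9` the return is 1 + 0.9 + 0.81 = 2.71; with `gamma=0.0` only the immediate reward counts, and with `gamma=1.0` every step counts equally.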
import numpy as np
# RL concepts illustrated with a simple multi-armed bandit
# No environment library needed - shows core exploration/exploitation tradeoff
class MultiArmedBandit:
    """Simple bandit with k arms, each with a fixed mean reward."""

    def __init__(self, k: int = 10, seed: int = 42):
        np.random.seed(seed)
        self.k = k
        self.true_means = np.random.randn(k)  # true reward means (unknown to agent)

    def pull(self, arm: int) -> float:
        """Pull an arm, get noisy reward."""
        return self.true_means[arm] + np.random.randn()

def epsilon_greedy(bandit, n_steps=1000, epsilon=0.1):
    """Epsilon-greedy policy: exploit with prob 1-ε, explore with prob ε."""
    k = bandit.k
    q_estimates = np.zeros(k)  # estimated mean reward for each arm
    n_pulls = np.zeros(k)      # number of pulls for each arm
    total_reward = 0
    for t in range(n_steps):
        if np.random.random() < epsilon:
            arm = np.random.randint(k)    # EXPLORE: random arm
        else:
            arm = np.argmax(q_estimates)  # EXPLOIT: best known arm
        reward = bandit.pull(arm)
        n_pulls[arm] += 1
        # Online update of mean estimate
        q_estimates[arm] += (reward - q_estimates[arm]) / n_pulls[arm]
        total_reward += reward
    return total_reward, q_estimates

bandit = MultiArmedBandit(k=10)
best_possible = np.max(bandit.true_means) * 1000  # expected reward of always pulling the best arm
for eps in [0.0, 0.01, 0.1, 0.3]:
    total, estimates = epsilon_greedy(bandit, n_steps=1000, epsilon=eps)
    print(f"ε={eps:.2f}: total_reward={total:.1f} (best_possible={best_possible:.1f})")
Exploration vs. Exploitation
The central challenge in RL is the exploration-exploitation tradeoff:
- Exploitation: use the best action currently known → maximize short-term reward
- Exploration: try new actions → potentially discover better long-term rewards
This tradeoff has no perfect solution - it is a fundamental tension. Key strategies:
| Strategy | Idea | When to use |
|---|---|---|
| ε-greedy | Random exploration with probability ε | Simple baseline; usually works better with a decaying ε schedule |
| UCB (Upper Confidence Bound) | Explore actions with high uncertainty | When statistical bounds are tractable |
| Thompson Sampling | Sample from posterior, pick best | Bayesian setting, strong performance |
| Entropy bonus | Reward agent for diverse behavior | Avoid local optima in complex environments |
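The UCB row in the table can be sketched as follows. This is a UCB1-style rule on a Gaussian bandit; the exploration constant `c=2.0`, the arm means, and the step count are assumptions, and the classical UCB1 analysis assumes bounded rewards, so treat this as an illustration of the idea and not a tuned implementation.

```python
import numpy as np

def ucb1(true_means, n_steps=2000, c=2.0, seed=0):
    """UCB1-style arm selection on a Gaussian bandit; returns pull counts per arm."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    q = np.zeros(k)   # running mean reward per arm
    n = np.zeros(k)   # pull counts per arm
    for t in range(n_steps):
        if t < k:
            arm = t   # pull every arm once so the bonus is defined
        else:
            bonus = np.sqrt(c * np.log(t) / n)   # large for rarely pulled arms
            arm = int(np.argmax(q + bonus))
        reward = true_means[arm] + rng.normal()
        n[arm] += 1
        q[arm] += (reward - q[arm]) / n[arm]
    return n

true_means = np.array([0.0, 0.5, 2.0])   # hypothetical arm means
pulls = ucb1(true_means)
print(pulls)
```

Unlike ε-greedy, exploration here is directed: an arm is retried only while its uncertainty bonus is large enough that it could plausibly be the best, so clearly bad arms are abandoned quickly.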
Deep RL in practice
Modern RL uses deep neural networks as function approximators for the policy $\pi_\theta(a \mid s)$ and value function $V_\phi(s)$:
import torch
import torch.nn as nn
class PolicyNetwork(nn.Module):
    """Simple policy network for continuous state, discrete action spaces."""

    def __init__(self, state_dim: int, n_actions: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        """Returns logits over actions given state."""
        return self.net(state)

    def get_action(self, state: torch.Tensor) -> tuple:
        """Sample action from policy distribution."""
        logits = self.forward(state)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action.item(), log_prob

# Policy gradient (REINFORCE) update
def policy_gradient_update(policy, optimizer, log_probs, rewards, gamma=0.99):
    """Compute discounted returns and update policy via REINFORCE."""
    # Compute discounted returns
    returns = []
    G = 0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns, dtype=torch.float32)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalize
    # Policy gradient loss: -log_prob * G (maximize expected return)
    policy_loss = []
    for log_prob, G in zip(log_probs, returns):
        policy_loss.append(-log_prob * G)
    loss = torch.stack(policy_loss).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
When RL is the right choice
RL is powerful but expensive. Use it when:
| Condition | Examples |
|---|---|
| Decisions in a sequence with delayed rewards | Game playing, robotics, dialogue systems |
| No labeled data, but a reward signal exists | AlphaGo (win/loss), trading (P&L) |
| The environment is simulated and cheap to run | Video games, physics simulators |
| You need to optimize for a non-differentiable objective | RLHF - optimize for human preference ratings |
Do NOT reach for RL when:
- You have labeled data - supervised learning is cheaper and more reliable
- The reward is sparse and hard to define precisely
- The real-world environment is expensive to interact with (manufacturing, clinical trials)
- You need fast results - RL is notoriously sample-inefficient
:::note RLHF: Why RL Matters for LLMs Reinforcement Learning from Human Feedback (RLHF) is a core technique used to align assistants such as ChatGPT and Claude to follow instructions and avoid harmful outputs. The reward signal comes from human preference comparisons (this response is better than that response). This is a hybrid: self-supervised pretraining, supervised fine-tuning, then RL for alignment. Understanding the RL paradigm is essential for working with modern LLMs. :::
Part 5 - Full Paradigm Comparison
| Paradigm | Data Required | Learns | Examples | When to Use |
|---|---|---|---|---|
| Supervised | Labeled (x, y) pairs | $f: x \to y$ | Logistic reg, GBT, CNN classifier | Labeled data available; clear prediction target |
| Unsupervised | Unlabeled only | $p(x)$, structure | K-Means, PCA, Autoencoder | No labels; discover structure; reduce dimensionality |
| Semi-supervised | Small labeled + large unlabeled | $f: x \to y$ using $p(x)$ | Label Propagation, Pseudo-labeling | Labels expensive; unlabeled data abundant |
| Self-supervised | Unlabeled only (task from data) | General representations | BERT, GPT, SimCLR | Pretrain at scale; fine-tune on small labeled set |
| Reinforcement | Reward signal from environment | Policy $\pi(a \mid s)$ | Q-Learning, PPO, AlphaGo | Sequential decisions; reward but no labels; simulation available |
Decision flowchart for paradigm selection
Do you have labeled (x, y) pairs for your task?
│
├── Yes → Is your labeled set large enough for your model?
│         │
│         ├── Yes → SUPERVISED LEARNING
│         │
│         └── No → Do you have unlabeled data?
│                  │
│                  ├── Yes → SEMI-SUPERVISED or
│                  │         SELF-SUPERVISED PRETRAIN + fine-tune
│                  │
│                  └── No → Get more data. No labels → no supervised learning.
│
└── No → Is your goal to discover structure in the data?
         │
         ├── Yes → UNSUPERVISED LEARNING (clustering, density, dim reduction)
         │
         └── No → Is your problem about sequential decisions with rewards?
                  │
                  ├── Yes → REINFORCEMENT LEARNING
                  │
                  └── No → Reconsider problem framing.
                           You probably need labels.
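The flowchart can also be written as a small function, which is handy for sanity-checking a project brief. The function name, argument names, and return strings are hypothetical; the branching mirrors the flowchart one-to-one.

```python
def choose_paradigm(has_labels, labels_sufficient=False, has_unlabeled=False,
                    goal_is_structure=False, sequential_with_rewards=False):
    """Walk the decision flowchart and return a recommendation string."""
    if has_labels:
        if labels_sufficient:
            return "supervised"
        if has_unlabeled:
            return "semi-supervised or self-supervised pretrain + fine-tune"
        return "get more data"
    if goal_is_structure:
        return "unsupervised"
    if sequential_with_rewards:
        return "reinforcement"
    return "reconsider problem framing"

# The defect-detection scenario: few labels, plenty of unlabeled images
print(choose_paradigm(has_labels=True, labels_sufficient=False, has_unlabeled=True))
```

Real projects often combine branches (as the opening defect-detection story did), so the output is a starting point for discussion, not a final answer.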
Part 6 - The Pretrain-Then-Fine-Tune Paradigm
This is the dominant ML paradigm of 2020–present. You need to understand it.
Step 1 - Pretraining: Train a large model on a massive unlabeled dataset using a self-supervised objective. The model learns general-purpose representations.
Step 2 - Fine-tuning: Take the pretrained model, add a task-specific head, and train on a small labeled dataset. The pretrained representations accelerate learning and improve generalization.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Example: DistilBERT, a distilled version of BERT (BERT was pretrained on 3.3B words with a self-supervised masked language model objective)
# Fine-tuned on a sentiment analysis dataset with 1,000 labeled examples
# Achieves ~92% accuracy - training from scratch on 1,000 labels gives ~70%
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load pretrained model, add a 2-class classification head
model = AutoModelForSequenceClassification.from_pretrained(
model_name,
num_labels=2 # positive / negative sentiment
)
# In fine-tuning:
# - The pretrained transformer weights start at their pretrained values
# - The new classification head is randomly initialized
# - We train the whole thing on our labeled data with a small learning rate
# - The pretrained knowledge transfers; we only need to adapt it
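def encode_text(texts, max_length=128):
    """Tokenize a list of strings into padded PyTorch tensors."""
    return tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors='pt'
    )
# Fine-tuning would proceed as:
# for batch in fine_tune_dataloader:
#     optimizer.zero_grad()      # clear gradients from the previous step
#     outputs = model(**batch)   # batch must include `labels` for outputs.loss to be set
#     loss = outputs.loss
#     loss.backward()
#     optimizer.step()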
Why this works: The pretrained model already understands language (or images, or code). Fine-tuning is cheap because we are not learning language from scratch - we are applying language understanding to a specific task.
:::tip Role-specific note Research Engineer: Self-supervised learning is an active research area - contrastive learning (SimCLR, MoCo), masked autoencoders (MAE), CLIP, DINO. Understanding the objective function of each is key to evaluating new architectures.
MLE: In practice, almost every production NLP/vision system starts with a pretrained model. Your skill is in choosing the right pretrained model, fine-tuning strategy (full fine-tuning vs. LoRA vs. prompt tuning), and evaluation.
ML Engineer: The pretrain-fine-tune paradigm has infrastructure implications - storing large pretrained models, version-controlled fine-tuned checkpoints, serving infrastructure that handles 7B+ parameter models. Plan for this. :::
Interview Questions
Q1: What is the difference between semi-supervised and self-supervised learning? Give a production example of each.
Semi-supervised learning uses a combination of a small labeled set and a large unlabeled set. The labeled set provides direct supervision; the unlabeled set helps learn the structure of $p(x)$, which constrains and improves the learned $p(y \mid x)$.
Production example: Medical imaging company has 500 labeled X-rays (expensive radiologist annotations) and 50,000 unlabeled X-rays. They use label propagation or pseudo-labeling to leverage the unlabeled X-rays during classifier training.
Self-supervised learning creates a supervised pretext task from the structure of unlabeled data itself - no human labels needed. The "label" is derived from the data (next word, masked word, image rotation, contrastive pair).
Production example: Google trained BERT on 3.3 billion words from Wikipedia and BooksCorpus using masked language modeling - masking random words and predicting them. No human labeled any of this data. The resulting model is then fine-tuned on tiny labeled datasets for search, question answering, and entity recognition.
Key distinction: Semi-supervised assumes you have some human labels. Self-supervised assumes you have zero human labels and creates its own supervision from data structure. Self-supervised is more powerful at scale (it uses the entire internet as training data), but it requires designing the right pretext task.
Q2: Explain the exploration-exploitation tradeoff in reinforcement learning. How does ε-greedy address it, and what are its limitations?
The tradeoff: An RL agent must balance exploiting its current best knowledge (maximize immediate reward) against exploring unfamiliar actions (potentially discover higher long-term rewards). Pure exploitation never discovers better strategies. Pure exploration never converges to using good strategies.
ε-greedy approach: With probability ε, take a random action (explore). With probability 1-ε, take the current best-known action (exploit).
if random() < ε:
    action = random_action() # explore
else:
    action = argmax(Q[state]) # exploit
Strengths: Simple, tunable, empirically effective in many environments.
Limitations:
- ε is a hyperparameter that requires tuning: Too high → always random, never converges. Too low → never explores, gets stuck in local optima.
- Uniform exploration is inefficient: ε-greedy treats all actions equally during exploration. It wastes time re-exploring actions already known to be bad.
- No uncertainty modeling: ε-greedy does not track which actions are uncertain vs. well-estimated. UCB (Upper Confidence Bound) is more principled - explore actions where the estimate is uncertain, not just randomly.
- Fixed ε does not adapt: A decaying ε schedule (start high, decay over time) performs much better in practice because you should explore more early and exploit more late.
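A decaying ε schedule is straightforward to implement. The sketch below uses linear annealing; the start value, end value, and number of decay steps are assumptions, and exponential decay is an equally common choice.

```python
def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=1000):
    """Linearly anneal epsilon from eps_start to eps_end over decay_steps."""
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)

for step in (0, 500, 1000, 5000):
    print(step, round(decayed_epsilon(step), 3))
```

Keeping a small floor (`eps_end`) instead of decaying to zero lets the agent keep sampling occasionally, which helps if the environment changes over time.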
Better alternatives in practice: UCB for bandit settings, PPO/SAC for complex state-action spaces, Thompson Sampling when a Bayesian model is tractable.
Q3: Why does the pretrain-then-fine-tune paradigm outperform training from scratch on small labeled datasets?
The pretrain-then-fine-tune paradigm wins because of inductive bias and parameter initialization:
1. Better initialization: A randomly initialized model starts far from a useful solution in the loss landscape. A pretrained model starts near a good solution for language/vision tasks - one that has already compressed millions of examples into its weights. Fine-tuning is a small adjustment, not a large search.
2. Learned representations are transferable: Self-supervised pretraining (masked language modeling, contrastive vision) forces the model to learn general representations that capture syntax, semantics, spatial relationships, etc. These representations are useful across many downstream tasks.
3. Data efficiency: Training from scratch on 1,000 labeled sentiment examples requires the model to simultaneously learn "what language is" and "what sentiment means." Fine-tuning a BERT model only requires learning "what sentiment means" - the language understanding is pre-loaded.
4. Implicit regularization: Pretrained weights act as a prior. The model is regularized toward the pretraining distribution, which often corresponds to realistic, well-structured features. This reduces overfitting on small fine-tuning datasets.
Empirical magnitude: On GLUE benchmarks, BERT fine-tuned on 1,000 examples matches or exceeds LSTM models trained from scratch on 100,000 examples for many tasks. The 100x data efficiency is real and practically significant.
Q4: When would you use clustering as opposed to a supervised classifier, even if labeled data were available?
Several scenarios favor clustering over supervised classification even with labeled data available:
1. The true cluster structure does not match your labels: If your labels are coarse (e.g., "good customer" / "bad customer") but the data has rich substructure, clustering can reveal segment-specific patterns that your labels suppress. You might find that "bad customers" fall into three distinct groups requiring different interventions.
2. You need to discover unknown unknowns: A supervised classifier can only predict the classes it was trained on. Clustering can discover new groups you did not know existed - emerging customer segments, new fraud patterns, novel defect types.
3. Labels are expensive and the problem is exploratory: In early-stage projects, clustering lets you understand your data before committing to a labeling taxonomy. Labeling 50,000 examples into K=5 classes that your team designed up front might miss the actual structure.
4. Anomaly detection without known anomaly classes: If you know what normal looks like but do not have a labeled set of anomalies, density-based clustering (DBSCAN) or Isolation Forest can flag outliers without supervised labels.
5. Unsupervised compression or preprocessing: PCA/K-Means for dimensionality reduction or feature compression before supervised training - the unsupervised step is not replacing supervised learning, it is making it more efficient.
The key principle: labeled data tells you what you already know. Clustering tells you what you do not know. Use both.
Q5: What is RLHF (Reinforcement Learning from Human Feedback) and why is it used to align language models?
RLHF is a three-stage process for aligning LLM behavior with human preferences:
Stage 1 - Supervised Fine-Tuning (SFT): Start with a pretrained LLM. Fine-tune it on high-quality human-written demonstrations of desired behavior (helpful, honest responses). This teaches the model the format and style of good responses.
Stage 2 - Reward Model Training: Collect human preference data: show pairs of model responses and ask humans which is better. Train a separate reward model to predict which response humans prefer - this model assigns a scalar score to (prompt, response) pairs.
Stage 3 - RL Fine-tuning (PPO): Use the reward model as a reward signal to fine-tune the LLM policy via Proximal Policy Optimization (PPO). The LLM learns to generate responses that score highly under the reward model, while a KL-divergence penalty prevents it from drifting too far from the SFT model.
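The reward model in Stage 2 is commonly trained with a pairwise preference loss of the Bradley-Terry form, $-\log \sigma(r_{\text{chosen}} - r_{\text{rejected}})$. The sketch below computes that loss for made-up scalar scores; in a real system the scores would come from a neural reward model, and the exact loss used by any given lab is an assumption here.

```python
import numpy as np

def preference_loss(r_chosen, r_rejected):
    """Mean of -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    margin = np.asarray(r_chosen) - np.asarray(r_rejected)
    # -log(sigmoid(m)) == log(1 + exp(-m)) == logaddexp(0, -m)
    return np.mean(np.logaddexp(0.0, -margin))

# Hypothetical reward-model scores for three (chosen, rejected) response pairs
good = preference_loss([2.0, 1.5, 3.0], [0.0, -1.0, 0.5])   # ranks pairs correctly
bad = preference_loss([0.0, -1.0, 0.5], [2.0, 1.5, 3.0])    # ranks pairs incorrectly
print(f"correct ranking loss={good:.3f}, reversed ranking loss={bad:.3f}")
```

The loss depends only on the score difference within each pair, so the reward model learns a relative ordering of responses and the absolute scale of its outputs is not pinned down.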
Why RL is needed (not just supervised learning): Human preferences are not a fixed labeled dataset - the reward is a signal that depends on what the model generates. We want the model to optimize for "what humans prefer" in a way that generalizes, which is a sequential decision problem amenable to RL. If you just collected more demonstrations, you would miss the "optimization against the reward model" step that pushes toward better-than-human-average responses.
Practical limitation: RLHF is brittle. The reward model can be gamed (reward hacking - the LLM learns to produce responses that score highly on the reward model but are not actually good). This is an active research area, with DPO (Direct Preference Optimization) emerging as a simpler alternative that avoids explicit RL.
Key Takeaways
- Supervised learning requires labeled $(x, y)$ pairs; the loss function encodes what "good" means; training is Empirical Risk Minimization
- Unsupervised learning discovers structure in $p(x)$ without labels - useful for clustering, density estimation, and dimensionality reduction
- Semi-supervised learning uses a small labeled set plus a large unlabeled set; self-supervised learning creates labels from the data itself with no human annotation
- Reinforcement learning optimizes a policy to maximize cumulative reward - the right tool for sequential decisions with delayed rewards, not for problems with labeled data
- The pretrain-then-fine-tune paradigm dominates modern NLP and vision: self-supervised pretraining provides general representations, labeled fine-tuning adapts them to specific tasks
- Choosing the right paradigm before writing code is one of the highest-leverage decisions in an ML project
