

Active Learning

The Annotation Budget That Would Never Close

A healthcare AI company had 180,000 clinical trial documents awaiting analysis. Each required a licensed pharmacologist to annotate drug interaction risks - 25 minutes per document, at $180/hour billed. Full annotation cost: $13.5 million. Their total annotation budget for the year: $150,000. At that budget, they could afford to annotate 2,000 documents.

The naive approach - random sampling - would give them 2,000 documents that looked like the overall distribution: mostly routine cases with straightforward drug interactions, a handful of complex cases. A model trained on random samples would learn to handle routine cases well but fail on the complex interactions that were the entire point of the system. After three months of labeling, they would have a model that could handle 85% of documents automatically but would collapse on the 15% that actually required expert analysis. The engineers knew this before they started - but they did not know what to do about it.

The answer was active learning. Instead of selecting documents randomly, they built a system that selected the documents where the current model was most uncertain - the cases at the boundary between label classes, the cases that would teach the model the most about the decision surface it needed to learn. After 600 annotations selected by uncertainty sampling, the model outperformed the model trained on 1,800 randomly selected annotations. After 1,200 strategically selected annotations, the model handled 94% of the corpus automatically and identified the high-risk interaction cases with high recall - which was the actual goal. They had effectively turned a $13.5 million annotation problem into a $90,000 one.

Active learning is the field that makes this possible. It is the theory and practice of selecting which unlabeled examples to annotate next, given a limited annotation budget, in order to maximize model performance improvement per annotation dollar. The key insight is that not all labels are equally valuable. A label on an example the model already handles confidently adds almost no information. A label on an example the model is genuinely uncertain about - right at the decision boundary - may update the model's parameters significantly. Active learning exploits this information asymmetry to stretch annotation budgets by factors of 5 to 20 in typical applications.


Why This Exists

Before active learning, the standard assumption in supervised machine learning was that labels were given - that the practitioner had access to a labeled dataset and trained on it. The annotation process was treated as a preprocessing step that happened before machine learning began.

This assumption reflected the academic environment where the field developed: benchmark datasets like MNIST, ImageNet, and Penn Treebank were laboriously annotated once and then used for years. The cost and logistics of annotation were someone else's problem.

In production environments, this assumption fails almost immediately. Real deployment requires continuous annotation as the input distribution evolves. New domains require annotations that do not exist in any existing benchmark. High-expertise domains - medical imaging, legal document review, financial risk assessment - have annotation costs that make exhaustive labeling economically impossible.

Active learning emerged from this gap. Its intellectual roots go back to Angluin's (1988) work on query learning, which asked: how many queries does a learning algorithm need to identify an unknown concept, and what should those queries be? The modern framework - pool-based active learning - was formalized by Lewis and Gale (1994) for text classification, and Cohn, Ghahramani, and Jordan (1996) gave a statistical treatment of optimal data selection for models such as mixtures of Gaussians and locally weighted regression. The field has grown substantially since, with particular acceleration after 2015 as deep learning created new annotation challenges and new uncertainty estimation techniques.

The core insight has remained constant since 1994: given a pool of unlabeled examples and a budget for annotation, the choice of which examples to annotate is itself an optimization problem. Solving that optimization problem well is the difference between a model that works and a model that works on a tenth of the annotation budget.


The Active Learning Loop

Active learning operates as a loop between model training, uncertainty estimation, sample selection, and annotation. Understanding this loop is essential to implementing it correctly.

Each iteration of this loop is called an active learning round. The number of examples selected per round (the batch size) and the selection strategy (the acquisition function) are the two primary design decisions. The annotation oracle can be a human expert, a crowd-sourcing platform, an automated system, or a combination - in modern deployments it is frequently a large language model supplemented by human review.
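The loop can be sketched in a few lines. The version below is a minimal illustration under stated assumptions, not a production recipe: the "oracle" is a lookup into ground-truth labels from a synthetic scikit-learn dataset, the model is a logistic regression, and the acquisition function is plain entropy. The function name `run_active_learning` and all of its parameters are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def entropy_scores(probs: np.ndarray) -> np.ndarray:
    """Shannon entropy (in nats) of each row of class probabilities."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)


def run_active_learning(X, y_oracle, seed_size=20, batch_size=10, rounds=5, seed=0):
    """Pool-based loop: train -> score pool -> select batch -> query oracle."""
    rng = np.random.default_rng(seed)
    labeled = [int(i) for i in rng.choice(len(X), size=seed_size, replace=False)]
    unlabeled = sorted(set(range(len(X))) - set(labeled))
    history = []

    for _ in range(rounds):
        # 1. Train on everything labeled so far
        model = LogisticRegression(max_iter=1000).fit(X[labeled], y_oracle[labeled])
        # 2. Estimate uncertainty on the unlabeled pool
        scores = entropy_scores(model.predict_proba(X[unlabeled]))
        # 3. Acquisition: take the batch_size most uncertain pool examples
        top = np.argsort(-scores)[:batch_size]
        batch = [unlabeled[i] for i in top]
        # 4. "Annotate": the simulated oracle simply reveals y_oracle for the batch
        labeled.extend(batch)
        batch_set = set(batch)
        unlabeled = [i for i in unlabeled if i not in batch_set]
        # Track this round's model accuracy on the pool that remains
        history.append(model.score(X[unlabeled], y_oracle[unlabeled]))

    return labeled, history


X, y = make_classification(n_samples=500, n_features=10, random_state=0)
labeled, history = run_active_learning(X, y)
print(f"Labeled after 5 rounds: {len(labeled)}")
print("Pool accuracy per round:", [round(a, 3) for a in history])
```

In a real deployment, step 4 is where the human expert (or LLM plus human review) sits, and the batch size trades retraining cost against how stale the uncertainty estimates become within a round.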


Uncertainty Sampling: The Foundational Strategy

Uncertainty sampling is the simplest and most widely used active learning strategy. The intuition is direct: if the model is uncertain about an example, that example is near a decision boundary and a label on it will move the model's parameters meaningfully. If the model is confident about an example, a label on it confirms what the model already knows and adds little information.

For a binary classifier producing probability $p$ for the positive class, three uncertainty measures are commonly used.

Least Confidence Sampling

Select the example where the model's confidence in its most likely prediction is lowest:

$$x^* = \underset{x \in U}{\arg\min} \; P_\theta(\hat{y} \mid x)$$

where $\hat{y} = \arg\max_y P_\theta(y \mid x)$ is the model's predicted label and $U$ is the pool of unlabeled examples.

For a binary classifier with output $p$, least confidence is equivalent to selecting the example closest to $p = 0.5$:

$$\text{uncertainty}(x) = 1 - \max(p, 1-p)$$

Least confidence is simple to compute and interpret, but it ignores the distribution of probability mass across all classes. For multi-class problems, a model might have low confidence in its top class while the probability mass is spread fairly uniformly across three of ten classes - and this is different from a case where the probability is concentrated between two classes.
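As a small sketch of the score above (the array `probs` is made-up illustrative data, and `least_confidence` is a hypothetical helper name), least confidence reduces to one line of NumPy:

```python
import numpy as np


def least_confidence(probs: np.ndarray) -> np.ndarray:
    """1 - max class probability per row; higher means more uncertain."""
    return 1.0 - probs.max(axis=1)


# Rows are examples, columns are class probabilities (illustrative values)
probs = np.array([
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.50, 0.49, 0.01],
])
lc_scores = least_confidence(probs)
query_idx = int(np.argmax(lc_scores))  # example to send for annotation
print(lc_scores, query_idx)
```

Note that the second row wins here because its top probability is lowest, even though the third row is the one with a near tie between two classes.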

Margin Sampling

Select the example where the margin between the top-two predicted classes is smallest:

$$x^* = \underset{x \in U}{\arg\min} \; \left[ P_\theta(\hat{y}_1 \mid x) - P_\theta(\hat{y}_2 \mid x) \right]$$

where $\hat{y}_1$ and $\hat{y}_2$ are the most and second-most probable classes.

$$\text{margin}(x) = P_\theta(\hat{y}_1 \mid x) - P_\theta(\hat{y}_2 \mid x)$$

Small margin means the model is nearly indecisive between two classes - these are typically decision boundary examples. Margin sampling is generally preferred over least confidence for multi-class problems because it captures the binary decision the model is struggling with, rather than just the confidence in the top class.
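A minimal sketch of the margin score, again on made-up probabilities (`margin_scores` is a hypothetical helper name). It uses `np.partition` to find the top two probabilities without a full sort:

```python
import numpy as np


def margin_scores(probs: np.ndarray) -> np.ndarray:
    """Top-1 minus top-2 class probability per row; lower means more uncertain."""
    # After partitioning at -2, the last two columns hold the two largest
    # values, with the maximum in the final column.
    top2 = np.partition(probs, -2, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]


probs = np.array([
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.50, 0.49, 0.01],
])
margins = margin_scores(probs)
query_idx = int(np.argmin(margins))  # smallest margin is queried first
print(margins, query_idx)
```

On this data the third row is selected: its top class has probability 0.50, which least confidence would rank as less uncertain than the second row, but the near tie between its top two classes gives it the smallest margin.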

Entropy Sampling

Select the example that maximizes the Shannon entropy of the model's predicted class distribution:

$$x^* = \underset{x \in U}{\arg\max} \; H\left[P_\theta(y \mid x)\right]$$

$$H\left[P_\theta(y \mid x)\right] = -\sum_{y \in \mathcal{Y}} P_\theta(y \mid x) \log P_\theta(y \mid x)$$

Entropy is maximized when probability mass is spread uniformly across all classes, and minimized (zero) when all mass is on a single class. Entropy captures uncertainty across the full class distribution, not just the top-one or top-two classes. For problems with many classes, entropy sampling is often a strong default because it accounts for all class probabilities.
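The two extremes described above are easy to check numerically. This is a small sketch with a hypothetical helper `entropy_bits`; it uses base-2 logarithms so the result is in bits, and clips probabilities to avoid `log(0)`:

```python
import numpy as np


def entropy_bits(probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Shannon entropy in bits for each row of class probabilities."""
    p = np.clip(probs, eps, 1.0)  # guard against log2(0)
    return -np.sum(p * np.log2(p), axis=1)


probs = np.array([
    [0.25, 0.25, 0.25, 0.25],  # uniform: maximum entropy, log2(4) bits
    [1.00, 0.00, 0.00, 0.00],  # one-hot: entropy is (numerically almost) zero
    [0.45, 0.40, 0.10, 0.05],  # somewhere in between
])
h = entropy_bits(probs)
query_idx = int(np.argmax(h))  # highest entropy is queried first
print(h, query_idx)
```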

Numerical Example

Consider a 4-class problem with model outputs for two candidate examples:

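| Example | Class 1 | Class 2 | Class 3 | Class 4 | LC Score | Margin | Entropy |
|---|---|---|---|---|---|---|---|
| A | 0.70 | 0.20 | 0.07 | 0.03 | 0.30 | 0.50 | 1.24 bits |
| B | 0.45 | 0.40 | 0.10 | 0.05 | 0.55 | 0.05 | 1.60 bits |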

Example B is the more uncertain example on all three measures: a higher least confidence score (0.55 vs. 0.30), a much smaller margin (0.05 vs. 0.50), and a higher entropy. All three measures would therefore select B over A. The margin is the most telling here: B's top two classes are nearly tied, so the model is effectively undecided between Class 1 and Class 2.
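Hand-computed scores are easy to get wrong, so it can be worth recomputing them from the definitions. The sketch below does that for the two example distributions; `score_all` is a hypothetical helper, and entropy is computed with base-2 logarithms so that it is in bits.

```python
import numpy as np

examples = {
    "A": np.array([0.70, 0.20, 0.07, 0.03]),
    "B": np.array([0.45, 0.40, 0.10, 0.05]),
}


def score_all(p: np.ndarray) -> dict:
    """Compute the three uncertainty scores for one probability vector."""
    ordered = np.sort(p)[::-1]  # probabilities in descending order
    return {
        "least_confidence": float(1.0 - ordered[0]),
        "margin": float(ordered[0] - ordered[1]),
        "entropy_bits": float(-np.sum(p * np.log2(p))),
    }


scores = {name: score_all(p) for name, p in examples.items()}
for name, s in scores.items():
    print(name, {k: round(v, 3) for k, v in s.items()})
```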


Diversity Sampling: Complementing Uncertainty

Uncertainty sampling has a critical weakness: it tends to select redundant examples. If a cluster of similar examples all lie near the same decision boundary, uncertainty sampling will select many of them in the same round. These examples are informative, but they are informative in the same way - each additional selection from the cluster adds diminishing marginal information.

Diversity sampling addresses this by ensuring that selected examples are representative of the full unlabeled pool - not just concentrated in the most uncertain region.

Core-Set Selection

Core-set selection (Sener and Savarese, 2018) frames diversity sampling as a covering problem: select $K$ points such that balls of radius $\delta$ around the selected points cover all other unlabeled points in the feature space, with $\delta$ as small as possible.

Equivalently, select points that minimize the maximum distance from any unlabeled point to its nearest selected point:

$$\text{core-set objective} = \underset{S \subseteq U, |S| = K}{\arg\min} \; \underset{x \in U}{\max} \; \underset{s \in S}{\min} \; d(x, s)$$

where $d$ is a distance function in the model's feature space (typically the representation layer before the classification head), $S$ is the set of selected examples, and $K$ is the budget.
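To make the objective concrete, the sketch below evaluates the inner max-min term (the coverage radius) for a given selection on a toy 1-D layout. The helper name `coverage_radius` and the four points are illustrative assumptions; for simplicity the maximum is taken over all points passed in rather than over a separate unlabeled pool.

```python
import numpy as np
from sklearn.metrics import pairwise_distances


def coverage_radius(embeddings: np.ndarray, selected: list[int]) -> float:
    """Largest distance from any point to its nearest selected point."""
    d = pairwise_distances(embeddings, embeddings[selected])
    return float(d.min(axis=1).max())


# Two tight clusters, far apart
points = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])

r_same_cluster = coverage_radius(points, [0, 1])  # both centers in one cluster
r_one_each = coverage_radius(points, [0, 2])      # one center per cluster
print(r_same_cluster, r_one_each)
```

Placing one center in each cluster gives a far smaller radius than spending both on the same cluster, which is exactly the behavior the objective rewards.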

This is the K-Center problem, which is NP-hard in general. In practice, a greedy approximation works well:

  1. Start with the current labeled set as the initial set of "centers"
  2. Repeatedly select the unlabeled point that is farthest from any current center
  3. Add that point to the set of centers
  4. Repeat until budget $K$ is exhausted

The greedy algorithm guarantees a 2-approximation: the radius of the solution is at most twice the optimal radius. For active learning purposes, this approximation quality is generally sufficient.


Core-Set Selection: Implementation

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from dataclasses import dataclass


@dataclass
class CoreSetResult:
    selected_indices: list[int]
    coverage_radius: float
    selection_order: list[int]


def greedy_coreset(
    embeddings: np.ndarray,
    labeled_indices: list[int],
    budget: int,
    distance_metric: str = "euclidean",
    verbose: bool = False
) -> CoreSetResult:
    """
    Greedy core-set selection for active learning.

    Selects `budget` examples from the unlabeled pool that maximize
    coverage of the embedding space - each selected point should be
    as far as possible from all previously selected points.

    Args:
        embeddings: Feature representations, shape (N, D)
        labeled_indices: Indices of already-labeled examples
        budget: Number of new examples to select
        distance_metric: Distance metric for embedding space
        verbose: Print progress

    Returns:
        CoreSetResult with selected indices and coverage radius
    """
    n = len(embeddings)
    # Sorted so that the selection is reproducible across runs
    unlabeled_indices = sorted(set(range(n)) - set(labeled_indices))

    if budget >= len(unlabeled_indices):
        # Edge case: budget exceeds unlabeled pool
        return CoreSetResult(
            selected_indices=unlabeled_indices,
            coverage_radius=0.0,
            selection_order=unlabeled_indices
        )

    # Start with labeled set as initial centers
    center_indices = list(labeled_indices)
    selected = []

    # Compute initial distances: each unlabeled point to nearest labeled center
    if center_indices:
        dists = pairwise_distances(
            embeddings[unlabeled_indices],
            embeddings[center_indices],
            metric=distance_metric
        )
        min_distances = dists.min(axis=1)  # shape: (n_unlabeled,)
    else:
        # No labeled examples yet - cold start
        # Seed with the geometric center of the embedding space
        center = embeddings[unlabeled_indices].mean(axis=0, keepdims=True)
        dists = pairwise_distances(
            embeddings[unlabeled_indices],
            center,
            metric=distance_metric
        )
        min_distances = dists.flatten()

    for i in range(budget):
        # Select the unlabeled point farthest from any current center
        farthest_local_idx = int(np.argmax(min_distances))
        farthest_global_idx = unlabeled_indices[farthest_local_idx]

        selected.append(farthest_global_idx)
        center_indices.append(farthest_global_idx)

        if verbose and (i % 50 == 0 or i == budget - 1):
            print(f"  Round {i+1}/{budget}: selected idx={farthest_global_idx}, "
                  f"coverage_radius={min_distances[farthest_local_idx]:.4f}")

        # Update minimum distances: each unlabeled point's distance to the
        # new center may be smaller than its previous minimum
        new_dists = pairwise_distances(
            embeddings[unlabeled_indices],
            embeddings[[farthest_global_idx]],
            metric=distance_metric
        ).flatten()

        min_distances = np.minimum(min_distances, new_dists)
        # Set distance of selected point to 0 so it won't be selected again
        min_distances[farthest_local_idx] = 0.0

    coverage_radius = float(min_distances.max())

    return CoreSetResult(
        selected_indices=selected,
        coverage_radius=coverage_radius,
        selection_order=selected
    )


def badge_selection(
    embeddings: np.ndarray,
    class_probs: np.ndarray,
    labeled_indices: list[int],
    budget: int
) -> list[int]:
    """
    BADGE-style selection, after Batch Active learning by Diverse Gradient
    Embeddings (Ash et al., 2020).

    Combines uncertainty and diversity: uses the gradient of the loss with respect
    to the last-layer parameters as the embedding space for selection. This
    captures both where the model is uncertain (gradient magnitude) and
    which inputs are structurally different (gradient direction).

    Note: the original BADGE samples the batch with k-means++ seeding in
    gradient space. This simplified version reuses greedy k-center selection
    instead, so treat it as an approximation of the published method.

    Args:
        embeddings: Feature layer representations, shape (N, D)
        class_probs: Predicted class probabilities, shape (N, C)
        labeled_indices: Already-labeled indices
        budget: Number to select

    Returns:
        List of selected indices
    """
    n, d = embeddings.shape
    n, c = class_probs.shape

    # The model's own prediction stands in for the unknown true label
    predicted_classes = class_probs.argmax(axis=1)  # (N,)

    # Gradient embedding: for softmax cross-entropy, the last-layer gradient
    # is proportional to feature * (one-hot(predicted) - softmax probability).
    # The overall sign does not affect pairwise distances.
    gradient_embeddings = np.zeros((n, d * c))

    for i in range(n):
        # Probability vector for this example
        p = class_probs[i]  # (C,)
        # Predicted class
        y_hat = predicted_classes[i]

        # Gradient: feature * (e_y_hat - p) where e_y_hat is one-hot
        indicator = np.zeros(c)
        indicator[y_hat] = 1.0
        delta = indicator - p  # (C,)

        # Outer product: (D,) x (C,) -> flattened (D*C,)
        gradient_embeddings[i] = np.outer(embeddings[i], delta).flatten()

    # Now apply core-set selection in gradient embedding space
    result = greedy_coreset(
        embeddings=gradient_embeddings,
        labeled_indices=labeled_indices,
        budget=budget,
        distance_metric="euclidean"
    )

    return result.selected_indices


# Demonstration
np.random.seed(42)
N, D, C = 1000, 128, 5

# Simulate embeddings and model predictions
embeddings = np.random.randn(N, D).astype(np.float32)
# Simulate a model that is uncertain in one region of the embedding space
logits = embeddings @ np.random.randn(D, C)
# Make some examples more uncertain by pushing logits toward uniform.
# With D=128 Gaussian features the norms cluster around sqrt(128) ~ 11.3,
# so an absolute threshold such as 1.0 would match nothing; use the 20%
# of examples with the smallest norm instead.
norms = np.linalg.norm(embeddings, axis=1)
uncertain_mask = norms < np.percentile(norms, 20)
logits[uncertain_mask] *= 0.2

exp_logits = np.exp(logits - logits.max(axis=1, keepdims=True))
class_probs = exp_logits / exp_logits.sum(axis=1, keepdims=True)

# Start with 50 random labeled examples
labeled = np.random.choice(N, 50, replace=False).tolist()
budget = 100

print("=== Core-Set Selection ===")
result = greedy_coreset(
    embeddings=embeddings,
    labeled_indices=labeled,
    budget=budget,
    verbose=True
)
print(f"Selected {len(result.selected_indices)} examples")
print(f"Coverage radius: {result.coverage_radius:.4f}")

print("\n=== BADGE Selection ===")
badge_selected = badge_selection(
    embeddings=embeddings,
    class_probs=class_probs,
    labeled_indices=labeled,
    budget=budget
)
print(f"Selected {len(badge_selected)} examples")

# Compare with pure uncertainty sampling (entropy)
entropies = -np.sum(class_probs * np.log(class_probs + 1e-9), axis=1)
unlabeled = sorted(set(range(N)) - set(labeled))
uncertainty_selected = sorted(unlabeled, key=lambda i: -entropies[i])[:budget]

# Check: how much do selections overlap?
coreset_set = set(result.selected_indices)
badge_set = set(badge_selected)
uncertainty_set = set(uncertainty_selected)

print(f"\nOverlap: coreset ∩ uncertainty: {len(coreset_set & uncertainty_set)}/{budget}")
print(f"Overlap: badge ∩ uncertainty: {len(badge_set & uncertainty_set)}/{budget}")
print(f"Overlap: coreset ∩ badge: {len(coreset_set & badge_set)}/{budget}")
print("\n(Low overlap = strategies select different, complementary examples)")
```

LLM Uncertainty Estimation

Traditional uncertainty estimation works well for models that produce calibrated class probabilities. LLMs present a different challenge: they produce token probabilities, not class probabilities, and their internal uncertainty does not map cleanly onto the uncertainty measures above.

Several techniques have emerged for LLM uncertainty estimation in the active learning context.

Verbalized Uncertainty with claude-haiku-4-5-20251001

The most practical approach for LLM-based active learning is to prompt the model to state its confidence, then use that stated confidence as the acquisition score. This technique is less theoretically grounded than probability-based measures, and stated confidence can skew high, so it is worth checking against a small labeled sample before relying on it. Where the stated confidence does track actual accuracy, it is a cheap and effective selection signal.

```python
import anthropic
import json
import re
from dataclasses import dataclass

client = anthropic.Anthropic()


@dataclass
class LLMUncertaintyEstimate:
    example_text: str
    predicted_label: str
    confidence: float
    reasoning: str
    alternative_labels: list[str]
    select_for_annotation: bool


def estimate_llm_uncertainty(
    example: str,
    label_schema: list[str],
    task_description: str,
    confidence_threshold: float = 0.80
) -> LLMUncertaintyEstimate:
    """
    Use claude-haiku-4-5-20251001 to estimate classification uncertainty
    for active learning sample selection.

    The model is asked to verbalize its uncertainty about the correct label.
    Examples where the model reports low confidence are selected for annotation.

    Args:
        example: Text to classify
        label_schema: List of valid label names
        task_description: Description of the classification task
        confidence_threshold: Select for annotation if confidence below this

    Returns:
        Uncertainty estimate with selection recommendation
    """
    label_list = ", ".join(f'"{l}"' for l in label_schema)

    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"""You are a classifier performing {task_description}.

Text to classify: {example}

Valid labels: {label_list}

Classify this text and express your genuine uncertainty. Be honest:
- If the label is clearly obvious, report high confidence (0.85-0.99)
- If there are plausible alternative interpretations, report moderate confidence (0.60-0.84)
- If you are genuinely unsure between two or more labels, report low confidence (below 0.60)

Do NOT inflate your confidence. Accurate uncertainty reporting is more important
than appearing confident.

Respond with JSON only:
{{
"predicted_label": "the label you would assign",
"confidence": 0.0 to 1.0,
"reasoning": "why you chose this label",
"alternative_labels": ["other labels you considered with their rough probabilities"],
"key_ambiguity": "what makes this classification difficult, or null if it is clear"
}}"""
        }]
    )

    try:
        raw = response.content[0].text.strip()
        # Handle case where model wraps JSON in a markdown code fence
        raw = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw).strip()
        result = json.loads(raw)
        # Parse inside the try block so a malformed value is also caught,
        # and clamp to the valid range
        confidence = min(max(float(result.get("confidence", 0.5)), 0.0), 1.0)
    except (json.JSONDecodeError, IndexError, AttributeError, TypeError, ValueError):
        # Parsing failure - treat as maximally uncertain
        return LLMUncertaintyEstimate(
            example_text=example,
            predicted_label="unknown",
            confidence=0.0,
            reasoning="Parsing failed",
            alternative_labels=[],
            select_for_annotation=True
        )

    return LLMUncertaintyEstimate(
        example_text=example,
        predicted_label=result.get("predicted_label", "unknown"),
        confidence=confidence,
        reasoning=result.get("reasoning", ""),
        alternative_labels=result.get("alternative_labels", []),
        select_for_annotation=confidence < confidence_threshold
    )


def batch_uncertainty_estimation(
    examples: list[str],
    label_schema: list[str],
    task_description: str,
    budget: int,
    confidence_threshold: float = 0.80
) -> tuple[list[int], list[LLMUncertaintyEstimate]]:
    """
    Run uncertainty estimation on a pool and select top-K most uncertain examples.

    Returns:
        (selected_indices, all_estimates)
    """
    estimates = []
    for i, example in enumerate(examples):
        est = estimate_llm_uncertainty(
            example=example,
            label_schema=label_schema,
            task_description=task_description,
            confidence_threshold=confidence_threshold
        )
        estimates.append(est)
        print(f"  [{i+1}/{len(examples)}] conf={est.confidence:.2f} label={est.predicted_label}")

    # Sort by confidence (ascending) - most uncertain first
    indexed = sorted(enumerate(estimates), key=lambda x: x[1].confidence)
    selected_indices = [idx for idx, _ in indexed[:budget]]

    return selected_indices, estimates


# Demonstration: customer support ticket classification
label_schema = [
    "billing_issue",
    "technical_problem",
    "account_access",
    "product_question",
    "cancellation_request",
    "general_feedback"
]

task_description = "customer support ticket classification for an AI software platform"

tickets = [
    "I can't log into my account since the password reset email never arrived",
    "Your AI keeps giving wrong answers and I want a refund immediately",
    "How do I increase my API rate limit? I need more requests per minute",
    "The product is great but the onboarding could be clearer for new users",
    "My credit card was charged twice for the same subscription month",
    "Can you help me understand how the context window works in your models?",
    "I've been locked out and also I need to cancel my subscription actually",
    "The API latency seems higher than yesterday, is there an outage?",
]

print("=== LLM Uncertainty Estimation for Active Learning ===\n")
selected_indices, estimates = batch_uncertainty_estimation(
    examples=tickets,
    label_schema=label_schema,
    task_description=task_description,
    budget=3,
    confidence_threshold=0.80
)

print("\n=== Top 3 most uncertain examples (selected for annotation) ===")
for idx in selected_indices:
    est = estimates[idx]
    print(f"\nExample: {est.example_text[:70]}...")
    print(f"  Predicted: {est.predicted_label} ({est.confidence:.0%} confidence)")
    print(f"  Reasoning: {est.reasoning}")
    if est.alternative_labels:
        print(f"  Alternatives: {est.alternative_labels}")

print("\n=== Examples auto-labeled (high confidence) ===")
for i, est in enumerate(estimates):
    if i not in selected_indices and est.confidence >= 0.80:
        print(f"  [{est.predicted_label}] ({est.confidence:.0%}) {est.example_text[:60]}...")
```

Consistency-Based Uncertainty

A more robust LLM uncertainty measure uses sampling consistency: run the same input through the model multiple times with non-zero temperature and measure how much the outputs vary. High variance across samples indicates genuine uncertainty about the correct answer.

```python
import anthropic
import json
import math
import re
from collections import Counter

client = anthropic.Anthropic()


def consistency_uncertainty(
    example: str,
    label_schema: list[str],
    task_description: str,
    n_samples: int = 5,
    temperature: float = 0.8
) -> dict:
    """
    Estimate uncertainty via sampling consistency.

    Sample the model n_samples times at non-zero temperature and compute:
    - Mode prediction (most frequent label)
    - Consistency score (fraction of samples agreeing with mode)
    - Entropy of label distribution across samples

    High entropy / low consistency = more uncertain = better annotation candidate
    """
    label_counts: Counter = Counter()

    for sample_idx in range(n_samples):
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=128,
            temperature=temperature,
            messages=[{
                "role": "user",
                "content": f"""Classify the following text for {task_description}.

Text: {example}

Valid labels: {", ".join(label_schema)}

Respond with JSON only: {{"label": "your classification"}}"""
            }]
        )

        try:
            raw = response.content[0].text.strip()
            # Strip a markdown code fence if the model added one
            raw = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw).strip()
            label = json.loads(raw).get("label", "unknown")
            if label not in label_schema:
                label = "unknown"
        except (json.JSONDecodeError, IndexError, AttributeError, TypeError):
            # Unparseable responses are counted under "unknown"
            label = "unknown"
        label_counts[label] += 1

    total = sum(label_counts.values())
    mode_label = label_counts.most_common(1)[0][0]
    consistency = label_counts[mode_label] / total

    # Entropy of sample distribution
    entropy = 0.0
    for count in label_counts.values():
        p = count / total
        if p > 0:
            entropy -= p * math.log2(p)

    # Upper bound for a distribution over the schema labels
    max_entropy = math.log2(len(label_schema))

    return {
        "example": example,
        "mode_label": mode_label,
        "consistency": consistency,
        "sample_entropy": entropy,
        "normalized_entropy": entropy / max_entropy if max_entropy > 0 else 0.0,
        "label_distribution": dict(label_counts),
        "n_samples": n_samples,
        "select_for_annotation": consistency < 0.80
    }


# Same label schema and task as the verbalized-uncertainty example,
# repeated here so this block runs on its own
label_schema = [
    "billing_issue",
    "technical_problem",
    "account_access",
    "product_question",
    "cancellation_request",
    "general_feedback"
]
task_description = "customer support ticket classification for an AI software platform"

# Test consistency-based uncertainty on an ambiguous ticket
ambiguous_ticket = "I've been charged twice and I also can't access any of my projects"

print("=== Consistency-Based Uncertainty Estimation ===\n")
result = consistency_uncertainty(
    example=ambiguous_ticket,
    label_schema=label_schema,
    task_description=task_description,
    n_samples=5
)

print(f"Text: {result['example']}")
print(f"Mode label: {result['mode_label']}")
print(f"Consistency: {result['consistency']:.0%} ({result['n_samples']} samples)")
print(f"Sample entropy: {result['sample_entropy']:.3f} bits (max: {math.log2(len(label_schema)):.3f})")
print(f"Normalized entropy: {result['normalized_entropy']:.3f}")
print(f"Label distribution: {result['label_distribution']}")
print(f"Select for annotation: {result['select_for_annotation']}")
```

Cold Start Strategies

Active learning assumes a model that can estimate uncertainty - but how do you start when you have no labeled data and no model? This is the cold start problem.

Cluster-Then-Sample Cold Start

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.preprocessing import normalize


def cluster_cold_start(
    embeddings: np.ndarray,
    seed_budget: int,
    use_mini_batch: bool = True,
    n_init: int = 10
) -> list[int]:
    """
    Cold-start active learning via K-Means clustering.

    Cluster the embedding space into seed_budget clusters.
    For each cluster, select the example closest to the cluster centroid.
    This ensures the seed set is diverse and representative of the
    full input space - better than random for starting AL.

    Args:
        embeddings: Feature representations, shape (N, D)
        seed_budget: Number of seed examples to select
        use_mini_batch: Use MiniBatchKMeans for large N
        n_init: Number of K-Means initializations to try

    Returns:
        List of selected indices (one per cluster)
    """
    # Normalize embeddings for cosine-like behavior
    embeddings_normed = normalize(embeddings, norm="l2")

    K = min(seed_budget, len(embeddings))

    if use_mini_batch and len(embeddings) > 10000:
        kmeans = MiniBatchKMeans(
            n_clusters=K,
            n_init=n_init,
            random_state=42,
            batch_size=min(1000, len(embeddings))
        )
    else:
        kmeans = KMeans(
            n_clusters=K,
            n_init=n_init,
            random_state=42
        )

    cluster_labels = kmeans.fit_predict(embeddings_normed)
    centroids = kmeans.cluster_centers_

    selected = []
    for cluster_id in range(K):
        # Find examples in this cluster
        cluster_mask = cluster_labels == cluster_id
        cluster_indices = np.where(cluster_mask)[0]

        if len(cluster_indices) == 0:
            continue

        # Select the example closest to the cluster centroid
        cluster_embeddings = embeddings_normed[cluster_indices]
        centroid = centroids[cluster_id]

        distances = np.linalg.norm(cluster_embeddings - centroid, axis=1)
        closest_local = int(np.argmin(distances))
        closest_global = int(cluster_indices[closest_local])

        selected.append(closest_global)

    return selected


# Demonstration
np.random.seed(42)
N, D = 5000, 256
# Simulate embeddings with natural clusters
cluster_centers = np.random.randn(8, D) * 3
assignments = np.random.randint(0, 8, N)
embeddings = cluster_centers[assignments] + np.random.randn(N, D) * 0.5

seed_indices = cluster_cold_start(embeddings, seed_budget=16)
print(f"Cluster cold start: selected {len(seed_indices)} diverse seed examples")
print(f"Seed indices: {seed_indices}")

# Verify diversity: compute average pairwise distance
seed_embeddings = embeddings[seed_indices]
dists = np.linalg.norm(
    seed_embeddings[:, np.newaxis] - seed_embeddings[np.newaxis, :],
    axis=2
)
avg_dist = dists[np.triu_indices(len(seed_indices), k=1)].mean()
print(f"Average pairwise distance in seed set: {avg_dist:.3f}")

# Compare with random selection
random_indices = list(np.random.choice(N, 16, replace=False))
random_embeddings = embeddings[random_indices]
random_dists = np.linalg.norm(
    random_embeddings[:, np.newaxis] - random_embeddings[np.newaxis, :],
    axis=2
)
random_avg_dist = random_dists[np.triu_indices(16, k=1)].mean()
print(f"Average pairwise distance in random set: {random_avg_dist:.3f}")
print(f"Diversity improvement: {avg_dist/random_avg_dist:.2f}x")
```

LLM-Assisted Weak Labeling for Cold Start

When you need a functional starting model quickly, use LLM zero-shot predictions as weak labels. These will not be perfect, but they are usually sufficient to bootstrap the first round of active learning, after which the model can generate its own uncertainty estimates.

```python
import anthropic
import json
import re
from dataclasses import dataclass

client = anthropic.Anthropic()


@dataclass
class WeakLabel:
    text: str
    label: str
    confidence: float
    is_high_confidence: bool


def llm_weak_labeling(
    texts: list[str],
    label_schema: list[str],
    task_description: str,
    high_confidence_threshold: float = 0.85
) -> list[WeakLabel]:
    """
    Generate weak labels using LLM for cold start.

    Only use high-confidence predictions as training data -
    low-confidence predictions should be routed to human annotation.

    The resulting dataset is imperfect but sufficient to bootstrap
    an uncertainty-capable model for subsequent active learning rounds.
    """
    weak_labels = []

    for text in texts:
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"""Classify for {task_description}.

Text: {text}
Valid labels: {", ".join(label_schema)}

JSON only:
{{"label": "classification", "confidence": 0.0-1.0}}

Confidence must reflect your genuine certainty. Low confidence = this example
should be reviewed by a human, not used for automated training."""
            }]
        )

        try:
            raw = response.content[0].text.strip()
            # Strip a markdown code fence if the model added one
            raw = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw).strip()
            result = json.loads(raw)
            label = result.get("label", "unknown")
            confidence = float(result.get("confidence", 0.5))

            # Validate label
            if label not in label_schema:
                label = "unknown"
                confidence = 0.0

        except (json.JSONDecodeError, IndexError, AttributeError, TypeError, ValueError):
            # Unparseable output is never used for training
            label = "unknown"
            confidence = 0.0

        weak_labels.append(WeakLabel(
            text=text,
            label=label,
            confidence=confidence,
            is_high_confidence=confidence >= high_confidence_threshold
        ))

    return weak_labels


# Same label schema and task as the earlier examples,
# repeated here so this block runs on its own
label_schema = [
    "billing_issue",
    "technical_problem",
    "account_access",
    "product_question",
    "cancellation_request",
    "general_feedback"
]
task_description = "customer support ticket classification for an AI software platform"

# Demonstration
sample_texts = [
    "I need to update my payment method",
    "The dashboard isn't loading on Chrome",
    "How do I add team members to my workspace?",
    "I want to cancel my subscription",
    "Great product, just wish it had dark mode",
    "Getting a 504 error when running large batch jobs",
]

print("=== LLM Weak Labeling for Cold Start ===\n")
weak_labels = llm_weak_labeling(
    texts=sample_texts,
    label_schema=label_schema,
    task_description=task_description
)

high_conf = [wl for wl in weak_labels if wl.is_high_confidence]
low_conf = [wl for wl in weak_labels if not wl.is_high_confidence]

print(f"Total: {len(weak_labels)}")
print(f"High confidence (use for training): {len(high_conf)}")
print(f"Low confidence (route to human): {len(low_conf)}\n")

for wl in weak_labels:
    status = "TRAIN" if wl.is_high_confidence else "REVIEW"
    print(f"[{status}] [{wl.label}] ({wl.confidence:.0%}) {wl.text[:55]}...")
```

Acquisition Function Comparison

Different acquisition functions have different strengths. The right choice depends on your problem structure, model type, and annotation budget characteristics.

| Acquisition Function | Uncertainty | Diversity | Computational Cost | Best For |
| --- | --- | --- | --- | --- |
| Least Confidence | Yes | No | Very Low - single forward pass | Binary classification, tight budgets |
| Margin Sampling | Yes | No | Very Low - single forward pass | Multi-class with clear runner-up |
| Entropy Sampling | Yes | No | Very Low - single forward pass | Multi-class, many labels |
| Core-Set (K-Center) | No | Yes | Medium - pairwise distances | Coverage-sensitive domains |
| BADGE | Yes | Yes | Medium - gradient computation | General purpose, best balance |
| Query by Committee | Yes | Partial | High - train multiple models | Small labeled sets, high-variance models |
| Bayesian Active Learning (BALD) | Yes | No | High - MC Dropout | Probabilistic models, uncertainty calibration matters |
| LLM Verbalized Uncertainty | Yes | No | Medium - LLM call per example | Text tasks, LLM-based classifiers |
| LLM Consistency Sampling | Yes | No | High - N LLM calls per example | When verbalized uncertainty is unreliable |
| Cluster-then-Sample (Cold Start) | No | Yes | Low - one KMeans fit | Initial seed selection, no model available |
:::tip

In practice, BADGE or a simple combination of uncertainty + diversity outperforms pure uncertainty or pure diversity sampling on most real-world tasks. The intuition: uncertainty sampling finds the right region of the input space to annotate, and diversity sampling ensures you do not over-annotate any single subregion of that space. For text tasks with LLMs, start with verbalized uncertainty and validate it against consistency sampling on a small sample - if they agree, verbalized uncertainty is reliable enough to use alone, which is much cheaper.

:::
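One simple way to combine the two signals is sketched below: pre-filter the pool to the most uncertain candidates, cluster those candidates, and take the most uncertain example from each cluster. This is an illustrative variant, not BADGE itself; the function name `uncertainty_then_diversity` and the `prefilter_factor` knob are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans


def entropy_scores(probs: np.ndarray) -> np.ndarray:
    """Shannon entropy of each row of an (N, C) probability matrix."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)


def uncertainty_then_diversity(
    probs: np.ndarray,
    embeddings: np.ndarray,
    budget: int,
    prefilter_factor: int = 5,  # hypothetical knob: candidates = factor * budget
    seed: int = 0,
) -> list[int]:
    """Select up to `budget` pool indices: uncertain first, then spread out."""
    scores = entropy_scores(probs)
    n_candidates = min(len(scores), prefilter_factor * budget)

    # Step 1: keep only the most uncertain candidates.
    candidates = np.argsort(-scores)[:n_candidates]

    # Step 2: cluster the candidates; each cluster stands for one subregion.
    k = min(budget, n_candidates)
    cluster_of = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(
        embeddings[candidates]
    )

    # Step 3: take the single most uncertain candidate from each cluster.
    selected = []
    for c in range(k):
        members = candidates[cluster_of == c]
        if len(members) > 0:
            selected.append(int(members[np.argmax(scores[members])]))
    return selected


# Synthetic stand-ins for model probabilities and embeddings.
rng = np.random.default_rng(0)
demo_probs = rng.dirichlet(np.ones(3), size=200)
demo_emb = rng.normal(size=(200, 8))
picked = uncertainty_then_diversity(demo_probs, demo_emb, budget=10)
print(f"Selected {len(picked)} examples: {picked}")
```

The pre-filter keeps the batch in the uncertain region, while the one-per-cluster rule prevents a single tight cluster from consuming the whole batch.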


Query by Committee

Query by Committee (QBC) is an active learning framework that does not require a single model with calibrated probabilities. Instead, it trains a committee of models on the current labeled set and selects examples where committee members disagree most.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings("ignore")

def query_by_committee(
    X_labeled: np.ndarray,
    y_labeled: np.ndarray,
    X_unlabeled: np.ndarray,
    budget: int,
    n_bootstrap: int = 5
) -> tuple[list[int], np.ndarray]:
    """
    Query by Committee via bootstrap aggregation.

    Train n_bootstrap committee members, each on a bootstrap resample
    of the labeled set. Measure disagreement as vote entropy across
    committee predictions. Select the most disagreed-upon examples.

    Args:
        X_labeled: Labeled features, shape (N_labeled, D)
        y_labeled: Labels, shape (N_labeled,)
        X_unlabeled: Unlabeled features, shape (N_unlabeled, D)
        budget: Number of examples to select
        n_bootstrap: Number of committee members

    Returns:
        (selected_indices, vote_entropy) pair
    """
    n_unlabeled = len(X_unlabeled)
    n_labeled = len(X_labeled)

    le = LabelEncoder()
    y_encoded = le.fit_transform(y_labeled)
    n_classes = len(le.classes_)

    # Collect vote counts from all committee members
    vote_counts = np.zeros((n_unlabeled, n_classes))

    for i in range(n_bootstrap):
        # Bootstrap resample
        bootstrap_idx = np.random.choice(n_labeled, n_labeled, replace=True)
        X_boot = X_labeled[bootstrap_idx]
        y_boot = y_encoded[bootstrap_idx]

        # Train committee member - use diverse model types for better disagreement
        if i % 3 == 0:
            model = LogisticRegression(max_iter=1000, C=1.0, random_state=i)
        elif i % 3 == 1:
            model = RandomForestClassifier(n_estimators=50, random_state=i)
        else:
            model = GradientBoostingClassifier(n_estimators=50, random_state=i)

        model.fit(X_boot, y_boot)
        preds = model.predict(X_unlabeled)

        # Tally votes
        for j, pred in enumerate(preds):
            vote_counts[j, pred] += 1

    # Normalize to vote fractions
    vote_fractions = vote_counts / vote_counts.sum(axis=1, keepdims=True)

    # Vote entropy: high entropy = high disagreement = good annotation candidate
    vote_entropy = -np.sum(
        vote_fractions * np.log2(vote_fractions + 1e-9),
        axis=1
    )

    # Select top-K most disagreed-upon examples
    selected_indices = list(np.argsort(-vote_entropy)[:budget])

    return selected_indices, vote_entropy

# Quick demonstration with synthetic data
np.random.seed(42)
N_labeled, N_unlabeled, D, C = 100, 500, 20, 3

X_labeled = np.random.randn(N_labeled, D)
y_labeled = np.random.choice(C, N_labeled)
X_unlabeled = np.random.randn(N_unlabeled, D)

selected, entropies = query_by_committee(
    X_labeled=X_labeled,
    y_labeled=y_labeled,
    X_unlabeled=X_unlabeled,
    budget=20
)

print("=== Query by Committee ===")
print(f"Selected {len(selected)} examples for annotation")
print(f"Average entropy of selected: {entropies[selected].mean():.3f}")
print(f"Average entropy of pool: {entropies.mean():.3f}")
print(f"Ratio: {entropies[selected].mean()/entropies.mean():.2f}x more uncertain than pool average")

Common Mistakes

:::danger

Mistake: Using uncertainty sampling alone on clustered data. If your unlabeled pool has a large, tight cluster of similar examples near a decision boundary, uncertainty sampling will select dozens of examples from that single cluster. These examples are all informative in the same way - you are paying annotation cost for marginal additional information. Always pair uncertainty sampling with a diversity criterion (BADGE, core-set, or simple clustering) to avoid this.

:::

:::danger

Mistake: Neglecting the cold start problem. Jumping directly into uncertainty sampling with a randomly initialized model produces uniform confidence scores - the model cannot distinguish informative from uninformative examples because it has not learned anything yet. Always use a structured cold start strategy (cluster-then-sample, LLM weak labeling, or expert curation) to get the first functional model before starting active learning.

:::

:::warning

Mistake: Treating LLM verbalized confidence as ground truth. LLM confidence expressions correlate with actual accuracy but are imperfect, especially for distribution-shifted examples. Before relying on LLM verbalized confidence as your acquisition function, validate it: collect 100+ examples where you have both LLM confidence estimates and ground-truth labels, and check the calibration, for example with expected calibration error (ECE). If ECE is high, consider using consistency sampling instead, or recalibrating the confidence scores.

:::
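The calibration check described above can be sketched as a binned ECE computation: group predictions by stated confidence, then compare the average confidence in each bin with the observed accuracy. The function name and the choice of 10 equal-width bins are assumptions made for this example; the sample values are made up for illustration.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)

    # Interior bin edges; np.digitize then maps each confidence to 0..n_bins-1.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1])

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of examples in the bin
    return float(ece)


# Illustrative values: LLM-stated confidence vs. whether the label was right.
llm_confidence = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55, 0.55, 0.55]
label_was_right = [1, 1, 0, 0, 1, 1, 0, 0]
print(f"ECE: {expected_calibration_error(llm_confidence, label_was_right):.3f}")
```

An ECE near zero means stated confidence tracks accuracy; what counts as "high" depends on the task, so compare it against the threshold you intend to use for routing.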

:::warning

Mistake: Forgetting to retrain after each annotation round. Active learning is iterative - each round of labels should trigger a model retraining before the next acquisition function evaluation. Evaluating the acquisition function on a stale model wastes annotation budget by selecting examples that were informative for the old model but may not be for the updated one. In practice, batch sizes of 50-200 with retraining between rounds work well for most tasks.

:::
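A minimal sketch of the retrain-then-select loop, assuming a synthetic pool where the hidden labels `y_pool` stand in for the human annotator. The data, batch size, and round count are illustrative choices, not recommendations from a specific project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic 2-class pool; y_pool plays the role of the annotator (oracle).
X_pool = rng.normal(size=(600, 5))
y_pool = (X_pool[:, 0] + 0.5 * X_pool[:, 1] > 0).astype(int)

# Small seed set containing both classes.
labeled = [int(i) for i in np.where(y_pool == 0)[0][:5]]
labeled += [int(i) for i in np.where(y_pool == 1)[0][:5]]
seed_set = set(labeled)
unlabeled = [i for i in range(len(X_pool)) if i not in seed_set]

batch_size, n_rounds = 50, 4
history = []  # (labels used for training, accuracy on the remaining pool)

for round_id in range(n_rounds):
    # Retrain on everything labeled so far BEFORE scoring the pool.
    model = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
    accuracy = model.score(X_pool[unlabeled], y_pool[unlabeled])
    history.append((len(labeled), accuracy))

    # Score the remaining pool with the fresh model (entropy sampling).
    probs = model.predict_proba(X_pool[unlabeled])
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    top = np.argsort(-entropy)[:batch_size]

    # "Annotate" the batch and move it from the pool to the labeled set.
    picked = [unlabeled[i] for i in top]
    labeled.extend(picked)
    picked_set = set(picked)
    unlabeled = [i for i in unlabeled if i not in picked_set]

for n_train, acc in history:
    print(f"trained on {n_train:3d} labels -> pool accuracy {acc:.3f}")
```

The key ordering is that `fit` runs at the top of every round, so each batch is selected by a model that has already seen the previous batch.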

:::tip

Best practice: Track learning curves across strategies. Always measure and compare your active learning strategy against a random sampling baseline. Plot model performance (accuracy, F1, etc.) as a function of number of annotated examples for both strategies. If your AL strategy does not beat random by at least 20-30% at the same annotation count, revisit the acquisition function or the embedding space it operates on.

:::

:::tip

Best practice: Evaluate on the full unlabeled pool, not just a test set. Standard train/test evaluation tells you how well the model performs on the test distribution. For active learning, you also want to track how well the model performs on the examples that have not been annotated yet - because those are the examples the production system will handle autonomously. Consider splitting the unlabeled pool: a validation pool from which you estimate acquisition function scores, and a held-out test pool for evaluation.

:::


Interview Q&A

Q1: What is active learning and what problem does it solve?

Active learning is the machine learning discipline of selecting which unlabeled examples to annotate, given a limited annotation budget, in order to maximize model performance improvement per annotation dollar. It solves the annotation bottleneck problem: in many real-world domains, annotation is expensive (requiring domain expertise), slow (requiring careful review), or simply scarce (limited number of qualified annotators). The key insight is that not all labels are equally valuable - a label on an example the model already handles confidently adds almost no information, while a label on an example near a decision boundary may significantly update the model's parameters. Active learning exploits this information asymmetry to stretch annotation budgets by factors of 5 to 20 in typical applications.

The practical impact is large. A model trained on 500 strategically selected examples routinely outperforms a model trained on 2,000 randomly selected examples. This is not a marginal improvement - it changes the economics of annotation projects that would otherwise be infeasible.

Q2: Explain the difference between least confidence, margin, and entropy sampling. When would you use each?

All three are uncertainty-based acquisition functions that score examples by how uncertain the model is about the correct label, but they differ in what aspect of uncertainty they capture.

Least confidence scores by $1 - \max_y P(y|x)$ - the gap between the highest class probability and certainty. It captures uncertainty about the top prediction but ignores how probability mass is distributed across other classes. Use it for binary classification or when you only care whether the model can commit to a single label.

Margin sampling scores by $P(\hat{y}_1|x) - P(\hat{y}_2|x)$ - the gap between the top-two predicted classes, where a smaller margin means more uncertainty. It captures the specific binary ambiguity the model is experiencing. Use it when you are working with a multi-class problem and the model's key difficulty is deciding between two competing labels, which is common in hierarchical classification tasks.

Entropy sampling scores by $-\sum_y P(y|x) \log P(y|x)$ - the Shannon entropy of the full predicted distribution. It captures uncertainty across all classes simultaneously. Use it when you have many classes and examples can be uncertain between more than two of them, which is the common case in general text or image classification. Entropy is the most general measure and the default choice when you do not have strong prior knowledge about your problem structure.

In practice, the three measures often agree on the most uncertain examples. Entropy is the most theoretically well-grounded and the typical default; margin sampling is often used in NLP where the top-two comparison is meaningful.
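A small worked example makes the differences concrete. The three probability vectors below are hand-picked for illustration; the helper names are assumptions for this sketch.

```python
import numpy as np


def least_confidence(p: np.ndarray) -> np.ndarray:
    """1 - max probability; higher = more uncertain."""
    return 1.0 - p.max(axis=1)


def margin(p: np.ndarray) -> np.ndarray:
    """Top-1 minus top-2 probability; LOWER = more uncertain."""
    top2 = np.sort(p, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]


def entropy(p: np.ndarray) -> np.ndarray:
    """Shannon entropy in nats; higher = more uncertain."""
    return -np.sum(p * np.log(p + 1e-12), axis=1)


probs = np.array([
    [0.50, 0.49, 0.01],  # torn between two classes
    [0.40, 0.30, 0.30],  # spread across all three classes
    [0.90, 0.05, 0.05],  # confident
])

print("least confidence:", np.round(least_confidence(probs), 3))
print("margin:          ", np.round(margin(probs), 3))
print("entropy:         ", np.round(entropy(probs), 3))
```

Margin sampling ranks the first row as most uncertain (margin 0.01), while least confidence and entropy both rank the second row highest, because its probability mass is spread over all three classes. The measures agree that the third row is the least informative.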

Q3: What is the core-set selection algorithm and why does it outperform pure uncertainty sampling?

Core-set selection (Sener and Savarese, 2018) selects examples that maximize coverage of the embedding space - specifically, it solves the K-Center problem: find $K$ points that minimize the largest distance from any pool point to its nearest selected point. The greedy approximation works by repeatedly selecting the unlabeled point that is farthest from its nearest current center (labeled or already selected), which spreads the selected set across the embedding space rather than concentrating it in a single region.

Pure uncertainty sampling outperforms core-set selection in the early rounds of active learning when the model's uncertainty estimates are reliable - it efficiently finds the decision boundary. But it has a critical weakness: it is greedy at the example level, not the batch level. If ten examples in a tight cluster are all near the decision boundary, uncertainty sampling will select many or all of them in the same round, paying annotation cost ten times for essentially the same information.

Core-set selection avoids this by enforcing diversity - once a point in a cluster is selected, the remaining points in that cluster sit close to a center, so they are unlikely to be selected until more distant regions have a representative. The result: with a limited annotation budget, core-set selection provides better coverage of the input space and typically produces more robust models on rare or underrepresented subpopulations.

BADGE (Ash et al., 2020) combines both: it applies core-set selection in the gradient embedding space, where both the magnitude (capturing uncertainty) and direction (capturing structural diversity) of gradients are represented. This is generally the best practical choice for batch active learning when computational cost is not a constraint.
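The greedy K-Center approximation described above fits in a few lines of NumPy. This is a sketch of the farthest-point heuristic only, not the full method from the paper; the function name and the synthetic three-cluster data are assumptions for illustration.

```python
import numpy as np


def greedy_k_center(
    embeddings: np.ndarray, labeled_idx: list[int], budget: int
) -> list[int]:
    """Repeatedly pick the point farthest from its nearest current center."""
    if labeled_idx:
        # Distance from every point to its nearest already-labeled point.
        diffs = embeddings[:, None, :] - embeddings[labeled_idx][None, :, :]
        min_dist = np.linalg.norm(diffs, axis=2).min(axis=1)
    else:
        min_dist = np.full(len(embeddings), np.inf)

    selected = []
    for _ in range(budget):
        idx = int(np.argmax(min_dist))  # farthest from all current centers
        selected.append(idx)
        # The new center may now be the nearest one for some points.
        new_dist = np.linalg.norm(embeddings - embeddings[idx], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return selected


# Three well-separated synthetic clusters; one labeled point in cluster 0.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
cluster_ids = np.repeat([0, 1, 2], 20)
points = centers[cluster_ids] + rng.normal(scale=0.3, size=(60, 2))

chosen = greedy_k_center(points, labeled_idx=[0], budget=2)
print("Selected indices:", chosen, "from clusters", cluster_ids[chosen].tolist())
```

With a budget of two and a labeled point already in cluster 0, the two picks land in the two uncovered clusters, which is the coverage behavior the answer describes.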

Q4: What is the cold start problem in active learning and how do you address it?

The cold start problem is the circular dependency at the beginning of active learning: you need a model to generate uncertainty estimates for sample selection, but you need labeled data to train the model. With no labeled data, you have no model, and with no model, you cannot run uncertainty-based selection.

Three practical approaches:

Cluster-then-sample: embed all unlabeled examples (using a pretrained foundation model or feature extractor), cluster the embedding space into $K$ clusters, and select the example nearest each cluster centroid as the initial seed set. This gives a diverse, representative initial labeled set without requiring a task-specific model. It is the most commonly used cold-start strategy because it is fast, model-free, and reliable.

LLM zero-shot weak labeling: use a capable LLM to assign labels to several hundred examples without human annotation, filtering to only keep high-confidence predictions (above 85% expressed confidence). Use these weak labels to train the initial model. Weak labels are imperfect but usually sufficient to produce a model that can generate meaningful uncertainty estimates for the first active learning round.

Expert curation: have a domain expert select 50-100 examples that they consider the hardest, most representative, or most edge-case-like. Domain experts are often very good at identifying the examples that span the decision boundary - which is exactly what active learning needs for the first round. This is the highest-quality approach but requires expert time upfront.

In practice, combining approaches works well: cluster-then-sample for broad coverage, then LLM weak labeling to bootstrap the model quickly, then switch to uncertainty-based selection once the model has enough signal.
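The cluster-then-sample step can be sketched with a single KMeans fit. The embeddings here are synthetic blobs standing in for the output of a pretrained encoder, and the function name is an assumption for this example.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_then_sample(embeddings: np.ndarray, n_seed: int, seed: int = 0) -> list[int]:
    """Return the index of the pool example nearest each KMeans centroid."""
    km = KMeans(n_clusters=n_seed, n_init=10, random_state=seed).fit(embeddings)
    # transform() gives the distance of every point to every centroid: (N, n_seed).
    dist_to_centroids = km.transform(embeddings)
    nearest = np.argmin(dist_to_centroids, axis=0)
    # Deduplicate in case two centroids share the same nearest example.
    return sorted({int(i) for i in nearest})


# Four synthetic blobs standing in for pretrained-encoder embeddings.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0], [8.0, 8.0]])
blob_ids = np.repeat([0, 1, 2, 3], 25)
pool_emb = blob_centers[blob_ids] + rng.normal(scale=0.4, size=(100, 2))

seed_indices = cluster_then_sample(pool_emb, n_seed=4)
print("Seed set:", seed_indices, "covering blobs", blob_ids[seed_indices].tolist())
```

The returned indices are the first examples to send for human annotation; in a real project `n_seed` would be set to the size of the initial labeling batch.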

Q5: How do you use LLMs for uncertainty estimation in active learning? What are the limitations?

Two main approaches: verbalized uncertainty and consistency sampling.

Verbalized uncertainty prompts the LLM to classify an example and simultaneously report its confidence on a 0-1 scale. The expressed confidence is used directly as the acquisition function - low expressed confidence means select for annotation. This is computationally cheap (one LLM call per example) and works surprisingly well for modern LLMs that are well-calibrated to express genuine uncertainty. The key requirement is prompting the model to be honest: explicitly asking it not to inflate confidence, and framing low confidence as a feature rather than a failure.

Consistency sampling runs the same classification prompt multiple times at non-zero temperature and measures how much the predictions vary. High entropy across samples indicates genuine model uncertainty. This is more expensive ($n$ LLM calls per example) but more robust - it does not depend on the LLM's ability to introspect and accurately report its confidence, only on whether it produces consistent outputs.

Limitations of both approaches: LLM uncertainty does not necessarily reflect the uncertainty of the task-specific model you are trying to train - the LLM may be confident about examples that are hard for a smaller fine-tuned model, and vice versa. For tasks where you have a specific production model (not an LLM), use that model's uncertainty estimates, not the LLM's. LLM-based uncertainty estimation is most appropriate when the LLM itself is the production model, or when you are in a cold-start situation and need any uncertainty signal to bootstrap the first round.
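The scoring half of consistency sampling is independent of any particular LLM API, so it can be sketched with a stand-in callable. In the sketch below, `classify_once` is a hypothetical function representing one LLM call at non-zero temperature; the two stub classifiers exist only to make the example runnable.

```python
import math
from collections import Counter
from itertools import cycle
from typing import Callable


def consistency_entropy(
    text: str, classify_once: Callable[[str], str], n_samples: int = 5
) -> float:
    """Entropy (bits) of the labels returned by repeated sampled classifications."""
    votes = Counter(classify_once(text) for _ in range(n_samples))
    fractions = [count / n_samples for count in votes.values()]
    return sum(-p * math.log2(p) for p in fractions)


# Hypothetical stand-ins for an LLM call; a real one would sample the model.
def stable_classifier(text: str) -> str:
    return "billing"


_alternating = cycle(["billing", "cancellation"])


def unstable_classifier(text: str) -> str:
    return next(_alternating)


stable_score = consistency_entropy(
    "I need to update my payment method", stable_classifier, n_samples=4
)
unstable_score = consistency_entropy(
    "I want to cancel my subscription", unstable_classifier, n_samples=4
)
print(f"stable: {stable_score:.2f} bits, unstable: {unstable_score:.2f} bits")
```

Examples with the highest score are the ones to route to annotators; a score of zero means every sampled run returned the same label.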

Q6: How do you measure the effectiveness of your active learning strategy?

The primary measurement is the learning curve: plot model performance (accuracy, F1, AUC, or your domain-relevant metric) as a function of the number of annotated examples, for both your active learning strategy and a random sampling baseline. A good active learning strategy should beat random by a meaningful margin - typically 15-30% in terms of annotation efficiency (the number of examples needed to reach a given performance level).
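Annotation efficiency can be read directly off two learning curves: find how many labels each strategy needs to reach the same target score. The curve values below are invented purely to show the arithmetic and are not results from any experiment; the helper name is an assumption.

```python
from typing import Optional


def examples_to_reach(curve: list[tuple[int, float]], target: float) -> Optional[int]:
    """First annotation count at which the curve reaches the target score."""
    for n_labeled, score in sorted(curve):
        if score >= target:
            return n_labeled
    return None  # target never reached within the measured range


# Made-up (n_labeled, F1) checkpoints for illustration only.
random_curve = [(200, 0.70), (400, 0.78), (600, 0.82), (800, 0.85)]
active_curve = [(200, 0.74), (400, 0.81), (600, 0.85), (800, 0.88)]

target_f1 = 0.85
n_random = examples_to_reach(random_curve, target_f1)
n_active = examples_to_reach(active_curve, target_f1)

if n_random is not None and n_active is not None:
    savings = 1.0 - n_active / n_random
    print(f"random: {n_random} labels, active: {n_active} labels, savings: {savings:.0%}")
```

Comparing at a fixed target score, rather than at a fixed annotation count, expresses the gain in the unit that matters for budgeting: labels not purchased.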

Secondary measurements: (1) Coverage: does the selected set cover the full input distribution, or does it over-represent certain subregions? Measure by checking model performance on held-out subpopulations, not just aggregate metrics. (2) Calibration: as you annotate more examples, does the model's calibration improve along with its accuracy? Poor calibration suggests the selected examples are not providing diverse enough supervision. (3) Redundancy rate: of the examples selected in each batch, what fraction are within a small distance of previously selected examples in the embedding space? High redundancy indicates the acquisition function is selecting clustered examples.
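The redundancy rate in point (3) is a nearest-neighbor check in embedding space. A minimal sketch, assuming Euclidean distance and a hand-chosen `radius`; both are illustrative choices that depend on the embedding model in practice.

```python
import numpy as np


def redundancy_rate(
    batch_emb: np.ndarray, previous_emb: np.ndarray, radius: float
) -> float:
    """Fraction of batch examples within `radius` of any previously selected one."""
    # Pairwise distances, shape (n_batch, n_previous).
    dists = np.linalg.norm(batch_emb[:, None, :] - previous_emb[None, :, :], axis=2)
    return float((dists.min(axis=1) < radius).mean())


previous = np.array([[0.0, 0.0], [10.0, 10.0]])
batch = np.array([
    [0.1, 0.0],     # near the first previous example
    [5.0, 5.0],     # new region
    [10.0, 10.05],  # near the second previous example
    [20.0, 20.0],   # new region
])

rate = redundancy_rate(batch, previous, radius=0.5)
print(f"Redundancy rate: {rate:.0%}")
```

Tracking this number per round gives an early warning that the acquisition function has started selecting near-duplicates.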

In production, the ultimate measure is annotation ROI: how much model performance improvement did you get per dollar of annotation cost, compared to random sampling? This requires tracking both the cost of each annotation round and the downstream impact of the model improvements.

Q7: Describe a scenario where active learning provides the largest benefit versus one where it provides the smallest benefit.

Largest benefit scenario: you are building a medical coding model to classify radiology reports into ICD-10 codes. You have 200,000 unlabeled reports, an annotation budget for 3,000 (because each annotation requires a certified medical coder at $120/hour), and 15,000 possible ICD-10 codes, of which about 200 appear regularly and the rest rarely. Random sampling gives you mostly common codes with sparse representation of rare codes. Active learning - particularly uncertainty sampling combined with diversity sampling - identifies the examples where the model is confused between similar ICD-10 codes, including many rare code cases the model has never seen. After 1,000 strategically selected annotations, you have similar performance on common codes and dramatically better performance on rare codes compared to 3,000 random annotations. The benefit is maximized because: annotation is expensive, the class distribution is highly imbalanced, and model errors on rare classes are the ones that matter most clinically.

Smallest benefit scenario: you are training a spam filter for a well-established email domain where you have 50 million labeled historical examples and spam characteristics are stable. The training distribution already covers the full input space with many labeled examples per region. Annotating additional examples adds diminishing marginal value regardless of selection strategy because the model has already converged to near-optimal performance for this distribution. Active learning provides the smallest benefit when: you have abundant labeled data already, the training distribution is fully representative of the production distribution, and the model has already converged. Equivalently, active learning is most valuable when you are far from the optimal learning curve and annotation is the binding constraint - least valuable when you have labeled data in abundance.


Summary

Active learning is the systematic approach to making annotation budgets go further. The key ideas are:

  1. Not all labels are equally informative. Labels on examples near decision boundaries train models more efficiently than labels on examples the model already handles correctly.

  2. Uncertainty sampling - least confidence, margin, or entropy - finds the examples where the model is most uncertain. It is computationally cheap and effective, but it selects redundant examples when the uncertain region is clustered.

  3. Diversity sampling - core-set, BADGE - ensures the selected batch covers the input space broadly. It is best combined with uncertainty sampling for batch active learning.

  4. The cold start problem requires special handling: cluster-then-sample or LLM weak labeling to bootstrap the first model before uncertainty-based selection is possible.

  5. LLM-based uncertainty estimation - verbalized confidence or consistency sampling - extends active learning to text tasks where traditional probability-based uncertainty is not available or calibrated.

  6. Measure learning curves against a random sampling baseline to confirm that your strategy is actually providing annotation efficiency gains.

The practical payoff is substantial: annotation budgets can be stretched by 5-20x in favorable conditions, making feasible annotation projects that would otherwise be economically impossible. Active learning is not an academic curiosity - it is a standard tool in the production ML toolkit for any team that cannot annotate its entire unlabeled corpus.

In the next lesson, we will look at annotation quality - inter-rater reliability, label consolidation, and how to detect and correct annotation errors after the fact.

© 2026 EngineersOfAI. All rights reserved.