Probabilistic ML - Beyond Point Predictions
Reading time: ~30 min | Interview relevance: High | Roles: MLE, Research Scientist, Applied Scientist, AI Engineer
The Real Interview Moment
The interviewer presents a scenario: "Your autonomous driving system needs to decide whether to brake. Your object detection model outputs 72% confidence that there's a pedestrian ahead. Should the car brake? What if the model is poorly calibrated and 72% confidence actually corresponds to a 45% true probability? How would you design a system that accounts for prediction uncertainty?"
You realize this isn't a question about threshold tuning - it's about whether the model knows what it doesn't know. A standard neural network that outputs 72% hasn't told you anything about whether it's uncertain because the image is ambiguous (the object could be a pedestrian or a mailbox) or because the scene looks nothing like the training data (a scenario the model has never encountered). These two types of uncertainty - aleatoric and epistemic - require fundamentally different responses, and distinguishing them is the heart of probabilistic ML.
What You Will Master
- Bayes' theorem and its role as the foundation of probabilistic ML
- MLE vs. MAP estimation - when and why each matters
- Naive Bayes: assumptions, strengths, and surprising effectiveness
- Gaussian Processes: non-parametric Bayesian regression with uncertainty
- Bayesian Neural Networks: epistemic uncertainty in deep learning
- Aleatoric vs. epistemic uncertainty and practical quantification
- Calibration, reliability diagrams, and when probabilities matter
- Applications: A/B testing, anomaly detection, active learning
Self-Assessment: Where Are You Now?
| Level | Description | Target |
|---|---|---|
| Beginner | "I know Bayes' theorem from probability class" | Read all parts carefully |
| Intermediate | "I've used naive Bayes and know about priors, but GPs and BNNs are fuzzy" | Focus on Parts 2-3 and practice problems |
| Advanced | "I understand probabilistic models but want to nail uncertainty quantification" | Jump to Part 3 (uncertainty), calibration, and practice problems |
Part 1 - Foundations of Probabilistic Thinking
Bayes' Theorem - The Engine of Probabilistic ML
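$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$$
| Term | Name | Meaning |
|---|---|---|
| $P(\theta \mid D)$ | Posterior | Updated belief about parameters after seeing data |
| $P(D \mid \theta)$ | Likelihood | Probability of observed data given parameters |
| $P(\theta)$ | Prior | Belief about parameters before seeing data |
| $P(D)$ | Evidence (marginal likelihood) | Normalizing constant; $P(D) = \int P(D \mid \theta)\, P(\theta)\, d\theta$ |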
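As a rough illustration of the update rule, the sketch below applies Bayes' theorem to a binary hypothesis. The function name `posterior_probability` and all numbers (a 10% prior, a detector that fires on 90% of pedestrians and on 20% of non-pedestrians) are assumptions chosen for illustration, not values from a real system.

```python
def posterior_probability(prior, likelihood_if_true, likelihood_if_false):
    """P(H | E) for a binary hypothesis H, computed with Bayes' theorem."""
    # Evidence P(E): total probability of the observation under both hypotheses
    evidence = likelihood_if_true * prior + likelihood_if_false * (1.0 - prior)
    return likelihood_if_true * prior / evidence

# Hypothetical numbers: P(pedestrian) = 0.10 before looking at the detector,
# P(detector fires | pedestrian) = 0.90, P(detector fires | no pedestrian) = 0.20
p = posterior_probability(prior=0.10, likelihood_if_true=0.90, likelihood_if_false=0.20)
print(f"P(pedestrian | detector fired) = {p:.3f}")
```

With these assumed numbers the posterior is 0.09 / (0.09 + 0.18) = 1/3: the detection raises the belief from 10% to about 33%, but the low prior keeps it well below the detector's 90% hit rate.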
"Probabilistic ML treats model parameters and predictions as distributions rather than point values. Instead of finding the single best weights, we maintain a distribution over possible weights, which naturally gives us uncertainty estimates. Bayes' theorem is the update rule: we start with a prior belief, observe data, and compute the posterior belief. The key advantage over standard ML is that probabilistic models know what they don't know - they output high uncertainty for inputs far from the training distribution. This is critical for safety-critical applications like medical diagnosis, autonomous driving, and financial risk."
MLE vs. MAP Estimation
Maximum Likelihood Estimation (MLE): Find parameters that maximize the likelihood of observed data:
$$\hat{\theta}_{MLE} = \arg\max_{\theta} P(D \mid \theta)$$
- No prior assumed (or equivalently, a uniform/flat prior)
- Can overfit with limited data
- Equivalent to standard neural network training with no regularization
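Maximum A Posteriori (MAP): Find parameters that maximize the posterior:
$$\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} \left[ \log P(D \mid \theta) + \log P(\theta) \right]$$
- Incorporates prior belief about reasonable parameter values
- The log-prior $\log P(\theta)$ acts as a regularization term
- With a Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2 I)$, MAP is equivalent to L2 regularization
- With a Laplace prior $\theta \sim \text{Laplace}(0, b)$, MAP is equivalent to L1 regularization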
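The Gaussian-prior case can be checked numerically. The sketch below is a minimal example under assumed settings (synthetic linear-regression data, noise standard deviation `sigma`, prior standard deviation `tau`): it computes the ridge solution with penalty `lam = sigma**2 / tau**2` and confirms that the gradient of the negative log-posterior vanishes there, which is what being the MAP estimate means for this convex problem.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])   # assumed ground-truth weights
sigma = 0.5                            # assumed observation noise std
tau = 1.0                              # assumed prior std: w ~ N(0, tau^2 I)
y = X @ w_true + sigma * rng.normal(size=n)

# Ridge regression: minimize ||y - Xw||^2 + lam * ||w||^2
lam = sigma**2 / tau**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def grad_neg_log_posterior(w):
    """Gradient of -log P(D | w) - log P(w) for Gaussian likelihood and prior."""
    return -(X.T @ (y - X @ w)) / sigma**2 + w / tau**2

grad_at_ridge = grad_neg_log_posterior(w_ridge)
print("max |gradient| at ridge solution:", np.abs(grad_at_ridge).max())
```

A larger `lam` corresponds to a smaller `tau` (a tighter prior), which pulls the weights harder toward zero.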
"What's the connection between L2 regularization and Bayesian inference?" This is a classic that separates surface-level from deep understanding. The answer: L2 regularization is equivalent to MAP estimation with a Gaussian prior on the weights. The regularization strength lambda corresponds to the prior precision (inverse variance). When lambda is large, the prior is tight (strong belief weights should be small), pulling the estimate toward zero. This is a beautiful bridge between frequentist and Bayesian perspectives.
Full Bayesian Inference vs. Point Estimates
| Approach | What You Get | Computation | Uncertainty |
|---|---|---|---|
| MLE | Single best parameters | Easy (gradient descent) | No |
| MAP | Single best parameters (regularized) | Easy (gradient descent + penalty) | No |
| Full Bayesian | Distribution over parameters | Hard (MCMC, variational inference) | Yes |
Why go full Bayesian?
- Point estimates (MLE/MAP) give you one prediction. But how confident is that prediction?
- Full Bayesian inference marginalizes over all possible parameters:
$$P(y^* \mid x^*, D) = \int P(y^* \mid x^*, \theta)\, P(\theta \mid D)\, d\theta$$
This integral averages predictions over all plausible parameter settings, weighted by how likely each setting is given the data. The spread of this predictive distribution is the model's uncertainty.
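For a model where the posterior is available in closed form, the predictive integral can be approximated by sampling. The sketch below uses an assumed toy setup (a coin with 3 heads in 3 flips and a uniform Beta(1, 1) prior) and compares the plug-in MLE prediction with the posterior-averaged one.

```python
import numpy as np

heads, tails = 3, 0
a, b = 1 + heads, 1 + tails          # posterior Beta(a, b) under a Beta(1, 1) prior

# Plug-in prediction from the MLE: claims the next flip is certainly heads
plug_in = heads / (heads + tails)

# Monte Carlo version of the predictive integral: average P(heads | theta)
# over samples of theta drawn from the posterior
rng = np.random.default_rng(0)
theta_samples = rng.beta(a, b, size=200_000)
predictive_mc = theta_samples.mean()

# For this conjugate model the integral is also available exactly
predictive_exact = a / (a + b)

print(f"plug-in: {plug_in:.3f}, MC predictive: {predictive_mc:.3f}, exact: {predictive_exact:.3f}")
```

The averaged prediction (0.8 exactly) is less extreme than the plug-in value of 1.0 because it accounts for the many values of theta that remain plausible after only three flips.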
Part 2 - Probabilistic Models
Naive Bayes
Despite its "naive" assumption of feature independence, naive Bayes is surprisingly effective for many problems.
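Model:
$$P(y \mid x_1, \dots, x_d) \propto P(y) \prod_{i=1}^{d} P(x_i \mid y)$$
The "naive" assumption: features are conditionally independent given the class. This means $P(x_1, \dots, x_d \mid y) = \prod_{i=1}^{d} P(x_i \mid y)$.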
Why does it work despite the wrong assumption?
- Classification only needs the correct ranking of class probabilities, not calibrated values
- The decision boundary can still be correct even with incorrect probability estimates
- Feature correlations often "cancel out" in the posterior
Variants:
| Variant | Feature Distribution | Use Case |
|---|---|---|
| Gaussian NB | Continuous, roughly Gaussian | General continuous features |
| Multinomial NB | Discrete counts | Text classification (word counts) |
| Bernoulli NB | Binary (0/1) | Text classification (word presence) |
| Complement NB | Discrete counts, adjusted | Imbalanced text classification |
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
# Text classification with Naive Bayes
vectorizer = TfidfVectorizer(max_features=10000)
X_train_tfidf = vectorizer.fit_transform(train_texts)
X_test_tfidf = vectorizer.transform(test_texts)  # Reuse the vocabulary fitted on the training set
nb = MultinomialNB(alpha=1.0)  # alpha = Laplace smoothing
nb.fit(X_train_tfidf, y_train)
# Probabilities are often poorly calibrated
probs = nb.predict_proba(X_test_tfidf)  # Use with caution!
"Naive Bayes is bad because the independence assumption is always wrong." True, the assumption is almost always violated. But NB is fast, works well with limited data, handles high-dimensional sparse features excellently (text), and often competitive with more complex models. The real failure mode isn't the assumption - it's when the probability estimates are used directly (e.g., for calibrated risk scores). For classification decisions, NB is often good enough.
Gaussian Processes
A Gaussian Process (GP) is a non-parametric Bayesian approach that defines a distribution over functions.
Intuition: Instead of fitting a single function $f(x)$, a GP maintains a distribution over all possible functions that are consistent with the observed data. Predictions come with uncertainty bands that widen in regions far from training data.
Definition: A GP is fully specified by:
- Mean function $m(x)$: typically zero (the prior mean)
- Kernel (covariance) function $k(x, x')$: encodes similarity between inputs
Common kernels:
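| Kernel | Formula | Properties |
|---|---|---|
| RBF (Squared Exponential) | $k(x, x') = \sigma^2 \exp\left(-\frac{\lVert x - x' \rVert^2}{2\ell^2}\right)$ | Smooth, infinitely differentiable |
| Matern | Various | Controls smoothness (nu parameter) |
| Linear | $k(x, x') = \sigma^2\, x^\top x'$ | Equivalent to Bayesian linear regression |
| Periodic | $k(x, x') = \sigma^2 \exp\left(-\frac{2 \sin^2(\pi \lvert x - x' \rvert / p)}{\ell^2}\right)$ | For periodic patterns |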
GP Regression - making predictions:
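Given training data $(X, y)$ and a test point $x_*$, with $K = k(X, X)$, $k_* = k(X, x_*)$, and observation noise variance $\sigma_n^2$:
$$\mu_* = k_*^\top (K + \sigma_n^2 I)^{-1} y$$
$$\sigma_*^2 = k(x_*, x_*) - k_*^\top (K + \sigma_n^2 I)^{-1} k_*$$
The mean $\mu_*$ is the prediction, and $\sigma_*^2$ is the uncertainty.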
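The standard GP regression posterior (mean `k_*^T (K + noise * I)^{-1} y`, variance `k(x_*, x_*) - k_*^T (K + noise * I)^{-1} k_*`) can be sketched directly in NumPy. This is a minimal illustration with an RBF kernel and fixed, assumed hyperparameters; the helper names `rbf_kernel` and `gp_posterior` are made up for this example, and a library such as scikit-learn would also tune the hyperparameters.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0, signal_var=1.0):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    )
    return signal_var * np.exp(-0.5 * np.maximum(sq_dists, 0.0) / length_scale**2)

def gp_posterior(X_train, y_train, X_test, length_scale=1.0, signal_var=1.0, noise_var=1e-2):
    """Posterior mean and variance of the latent function at X_test."""
    K = rbf_kernel(X_train, X_train, length_scale, signal_var)
    K += noise_var * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test, length_scale, signal_var)  # (n_train, n_test)
    # Cholesky factorization is the numerically stable way to apply K^{-1}
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = signal_var - np.sum(v**2, axis=0)  # k(x*, x*) = signal_var for the RBF kernel
    return mean, var

X_train = np.array([[-2.0], [-1.0], [0.0], [1.0], [2.0]])
y_train = np.sin(X_train).ravel()
X_test = np.array([[0.5], [10.0]])  # one point near the data, one far away
mean, var = gp_posterior(X_train, y_train, X_test)
print("mean:", mean, "variance:", var)
```

Far from the training inputs the kernel values vanish, so the mean reverts to the zero prior mean and the variance returns to the prior variance; this is the widening of the uncertainty band described above.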
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)
gp.fit(X_train, y_train)
# Predictions with uncertainty
y_pred, y_std = gp.predict(X_test, return_std=True)
# Approximate 95% predictive interval (assumes a Gaussian predictive distribution)
lower = y_pred - 1.96 * y_std
upper = y_pred + 1.96 * y_std
Strengths:
- Natural uncertainty quantification (no extra work)
- Works extremely well with small datasets
- Non-parametric - automatically adapts complexity
- Kernel hyperparameters can be optimized via marginal likelihood
Limitations:
- $O(n^3)$ computation (inverting the kernel matrix) - impractical for >10K samples
- Scales poorly to high dimensions (kernel length scales become hard to learn)
- Choice of kernel encodes strong assumptions about function smoothness
"When would you use a GP instead of a neural network?" Best answer: "When I have small data (<10K samples), need calibrated uncertainty estimates, and the input is low-dimensional (<20 features). GPs give uncertainty for free and work well with limited data. For large datasets or high-dimensional inputs (images, text), neural networks are more practical, but I'd add uncertainty via MC Dropout or ensembles."
Bayesian Neural Networks (BNNs)
Standard neural networks learn point estimates of weights. BNNs maintain distributions over weights.
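Standard NN: $w = \hat{w}$ (single value per weight)
BNN: $w \sim \mathcal{N}(\mu, \sigma^2)$ (distribution per weight, typically Gaussian)
Prediction with uncertainty:
$$p(y \mid x, D) \approx \frac{1}{T} \sum_{t=1}^{T} p(y \mid x, w_t), \quad w_t \sim q(w)$$
where $q(w)$ is the (approximate) posterior over weights.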
Sample T different weight configurations, make a prediction with each, and average. The spread of predictions IS the epistemic uncertainty.
Practical approaches (since exact Bayesian inference is intractable for NNs):
1. MC Dropout (Monte Carlo Dropout): Use dropout at test time and make multiple forward passes. The variance across predictions approximates Bayesian uncertainty.
import torch
import torch.nn as nn
class MCDropoutModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, dropout_rate=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout_rate),  # Kept on during inference
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(hidden_dim, output_dim)
        )
    def forward(self, x):
        return self.net(x)
# Uncertainty estimation
model.train()  # Keep dropout ON
T = 50  # Number of forward passes
predictions = []
with torch.no_grad():
    for _ in range(T):
        pred = model(x_test)
        predictions.append(pred)
predictions = torch.stack(predictions)
mean_prediction = predictions.mean(dim=0)
uncertainty = predictions.std(dim=0)  # Epistemic uncertainty
2. Deep Ensembles: Train M independent models with different random initializations. The variance across ensemble predictions captures uncertainty.
# Train 5 independent models
models = []
for seed in range(5):
    torch.manual_seed(seed)
    model = create_model()
    train(model, X_train, y_train)
    models.append(model)
# Predict with uncertainty (no gradients needed at inference time)
with torch.no_grad():
    predictions = torch.stack([model(x_test) for model in models])
mean_pred = predictions.mean(dim=0)
uncertainty = predictions.std(dim=0)
Google: Published seminal work on deep ensembles and uncertainty. Uses uncertainty for active learning in data labeling and for abstaining from predictions in safety-critical systems.
Meta: Uses MC Dropout for uncertainty in recommendation systems - low-uncertainty predictions served directly, high-uncertainty ones get human review.
Autonomous vehicles (Waymo, Tesla): Uncertainty quantification is critical for deciding when to hand control back to the human driver.
Drug discovery: Bayesian approaches are standard because data is scarce and each experiment is expensive - uncertainty guides which experiments to run next.
Part 3 - Uncertainty Quantification
Aleatoric vs. Epistemic Uncertainty
| Property | Aleatoric | Epistemic |
|---|---|---|
| Source | Inherent randomness in data | Lack of knowledge (limited data) |
| Reducible? | No (irreducible) | Yes (with more data) |
| Example | Noisy sensor readings, ambiguous labels | Out-of-distribution inputs, data-sparse regions |
| How to capture | Predict variance as model output | Bayesian inference, ensembles, MC Dropout |
| Action | Accept it; inform decision-makers | Collect more data; flag for human review |
Heteroscedastic model (captures both):
class UncertaintyModel(nn.Module):
    """Predicts mean AND variance (aleatoric + epistemic via MC Dropout)"""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1)
        )
        self.mean_head = nn.Linear(hidden_dim, 1)  # Predicted mean
        self.logvar_head = nn.Linear(hidden_dim, 1)  # Predicted log-variance (aleatoric)
    def forward(self, x):
        h = self.shared(x)
        mean = self.mean_head(h)
        log_var = self.logvar_head(h)  # Aleatoric uncertainty
        return mean, log_var
# Loss: negative log-likelihood of Gaussian (up to an additive constant)
def nll_loss(mean, log_var, target):
    precision = torch.exp(-log_var)
    return 0.5 * (precision * (target - mean)**2 + log_var).mean()
# Uncertainty decomposition at test time:
# 1. Aleatoric: mean of predicted variances
# 2. Epistemic: variance of predicted means (via MC Dropout)
model.train()  # Keep dropout ON
means, logvars = [], []
with torch.no_grad():
    for _ in range(50):
        m, lv = model(x_test)
        means.append(m)
        logvars.append(lv)
means = torch.stack(means)
logvars = torch.stack(logvars)
aleatoric = torch.exp(logvars).mean(dim=0)  # Mean of predicted variances
epistemic = means.var(dim=0)  # Variance of predicted means
total = aleatoric + epistemic
"Uncertainty doesn't matter - we just need the most likely prediction." In safety-critical applications (medical, autonomous driving, finance), knowing how confident a prediction is can be more important than the prediction itself. A model that says "I'm 99% sure" when it's wrong is far more dangerous than one that says "I'm 50% sure - please escalate to a human." Dismissing uncertainty estimation shows a lack of awareness about responsible AI deployment.
When Probabilistic Approaches Beat Point Estimates
| Scenario | Why Probabilistic Wins |
|---|---|
| Small data | Prior regularizes; uncertainty quantifies data scarcity |
| Safety-critical | Model can abstain when uncertain, deferring to humans |
| Active learning | Uncertainty guides which samples to label next |
| Anomaly detection | High epistemic uncertainty = out-of-distribution input |
| A/B testing | Bayesian A/B tests provide posterior probability of variant winning |
| Decision-making under uncertainty | Expected utility maximization requires probability distributions |
| Online learning | Prior from previous model; posterior from new data; continuous updating |
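The expected-utility row can be made concrete with the braking scenario from the opening. The sketch below is a minimal illustration under assumed costs (a missed pedestrian is taken to be 1000 times as costly as an unnecessary brake); the function name `should_brake` and the cost values are hypothetical.

```python
def should_brake(p_pedestrian, cost_missed_pedestrian, cost_false_brake):
    """Choose the action with the lower expected cost."""
    # Not braking only costs something if a pedestrian is actually there
    expected_cost_no_brake = p_pedestrian * cost_missed_pedestrian
    # Braking only costs something if there was no pedestrian
    expected_cost_brake = (1.0 - p_pedestrian) * cost_false_brake
    return expected_cost_brake < expected_cost_no_brake

# Assumed costs, in arbitrary units
COST_MISS, COST_FALSE_BRAKE = 1000.0, 1.0

for p in (0.72, 0.45, 0.0005):
    print(f"P(pedestrian) = {p}: brake = {should_brake(p, COST_MISS, COST_FALSE_BRAKE)}")
```

Under these assumed costs the break-even probability is `cost_false_brake / (cost_false_brake + cost_missed_pedestrian)`, roughly 0.001, so the car brakes at both 72% and 45%. The decision only makes sense if the probability fed into it is calibrated, which is the subject of the next part.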
Part 4 - Calibration and Reliability
What Is Calibration?
A model is calibrated if its predicted probabilities match observed frequencies:
- When it says "80% chance of rain," it should rain 80% of the time
- When it says "70% confidence this is spam," 70% of such emails should actually be spam
Why do models become miscalibrated?
- Neural networks are notoriously overconfident (modern NNs are more expressive but worse calibrated than older architectures)
- Training with cross-entropy loss doesn't guarantee calibration
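- Training-time and post-hoc modifications (class resampling or reweighting, label smoothing, a mis-set temperature) can shift predicted probabilities away from the true frequencies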
Reliability Diagrams
Bin predictions by confidence, then plot predicted probability vs. actual frequency:
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt
prob_true, prob_pred = calibration_curve(y_test, y_probs, n_bins=10, strategy='uniform')
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Reliability diagram
axes[0].plot(prob_pred, prob_true, 's-', label='Model')
axes[0].plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
axes[0].set_xlabel('Mean predicted probability')
axes[0].set_ylabel('Fraction of positives')
axes[0].set_title('Reliability Diagram')
axes[0].legend()
# Histogram of predictions
axes[1].hist(y_probs, bins=50, range=(0, 1), edgecolor='black')
axes[1].set_xlabel('Predicted probability')
axes[1].set_ylabel('Count')
axes[1].set_title('Prediction Distribution')
Reading the diagram:
- Points above the diagonal: model is underconfident (says 60%, actual is 80%)
- Points below the diagonal: model is overconfident (says 80%, actual is 60%)
- Points on the diagonal: perfectly calibrated
Calibration Metrics
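Expected Calibration Error (ECE):
$$\text{ECE} = \sum_{m=1}^{M} \frac{\lvert B_m \rvert}{n} \left\lvert \text{acc}(B_m) - \text{conf}(B_m) \right\rvert$$
Where $\text{acc}(B_m)$ is the accuracy in bin $B_m$, $\text{conf}(B_m)$ is the mean confidence in bin $B_m$, and $n$ is the total number of samples.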
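import numpy as np
def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Binary ECE: weighted gap between observed positive rate and mean predicted probability."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    bin_boundaries = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = bin_boundaries[i], bin_boundaries[i + 1]
        # The last bin is right-inclusive so that y_prob == 1.0 is counted
        upper = (y_prob <= hi) if i == n_bins - 1 else (y_prob < hi)
        mask = (y_prob >= lo) & upper
        if mask.sum() == 0:
            continue
        bin_acc = y_true[mask].mean()  # Observed frequency of positives in the bin
        bin_conf = y_prob[mask].mean()  # Mean predicted probability in the bin
        bin_weight = mask.sum() / len(y_true)
        ece += bin_weight * abs(bin_acc - bin_conf)
    return ece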
Calibration Methods
Temperature Scaling (most common for neural networks):
Learn a single scalar T on the validation set:
$$\hat{p} = \text{softmax}(z / T)$$
where $z$ is the logit output. T > 1 softens probabilities (reduces overconfidence). T < 1 sharpens them.
import torch
import torch.nn as nn
class TemperatureScaling(nn.Module):
    def __init__(self):
        super().__init__()
        self.temperature = nn.Parameter(torch.ones(1))
    def forward(self, logits):
        return logits / self.temperature
# Optimize temperature on validation set
temp_model = TemperatureScaling()
optimizer = torch.optim.LBFGS([temp_model.temperature], lr=0.01, max_iter=50)
def closure():
    optimizer.zero_grad()
    scaled_logits = temp_model(val_logits)
    loss = nn.CrossEntropyLoss()(scaled_logits, val_labels)
    loss.backward()
    return loss
optimizer.step(closure)
print(f"Optimal temperature: {temp_model.temperature.item():.3f}")
Platt Scaling: Fit a logistic regression on the model's raw outputs.
Isotonic Regression: Non-parametric; fits a non-decreasing step function. More flexible but needs more calibration data.
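Platt scaling and isotonic regression can both be sketched with scikit-learn. The example below is a minimal illustration on synthetic data: the score distribution, the true relationship `sigmoid(0.5 * score)`, and the deliberately overconfident model `sigmoid(3 * score)` are all assumptions made up for the demonstration. Each calibrator is fitted on a calibration split and compared on a separate test split using the Brier score.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
scores = rng.normal(size=n)                        # Raw model scores (synthetic)
true_prob = 1.0 / (1.0 + np.exp(-0.5 * scores))    # Assumed true P(y=1 | score)
labels = (rng.uniform(size=n) < true_prob).astype(int)
overconfident = 1.0 / (1.0 + np.exp(-3.0 * scores))  # What the uncalibrated model reports

cal, test = slice(0, n // 2), slice(n // 2, n)     # Calibration split / test split

# Platt scaling: logistic regression on the raw score
platt = LogisticRegression()
platt.fit(scores[cal].reshape(-1, 1), labels[cal])
platt_probs = platt.predict_proba(scores[test].reshape(-1, 1))[:, 1]

# Isotonic regression: non-decreasing map from reported probability to frequency
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(overconfident[cal], labels[cal])
iso_probs = iso.predict(overconfident[test])

def brier(p, y):
    """Mean squared error between predicted probability and outcome (lower is better)."""
    return float(np.mean((p - y) ** 2))

print("Brier uncalibrated:", brier(overconfident[test], labels[test]))
print("Brier Platt:       ", brier(platt_probs, labels[test]))
print("Brier isotonic:    ", brier(iso_probs, labels[test]))
```

Platt scaling has only two parameters, so it is stable with little calibration data; isotonic regression can correct non-sigmoid distortions but tends to overfit when the calibration set is small.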
"I'll use temperature scaling, so now my model is calibrated forever." Calibration drifts over time as data distributions change. In production, you need to monitor calibration (ECE, reliability diagrams) and recalibrate periodically. Models can become uncalibrated due to data drift, concept drift, or changes in the population.
Part 5 - Applications
Bayesian A/B Testing
Frequentist A/B test: "Is the difference statistically significant?" (p-value)
Bayesian A/B test: "What is the probability that variant B is better than A?" (posterior)
import numpy as np
# Observed data
conversions_A, trials_A = 120, 1000
conversions_B, trials_B = 145, 1000
# Beta posterior (conjugate prior for Bernoulli)
# Prior: Beta(1, 1) = uniform
alpha_A = 1 + conversions_A
beta_A = 1 + (trials_A - conversions_A)
alpha_B = 1 + conversions_B
beta_B = 1 + (trials_B - conversions_B)
# Monte Carlo estimate of P(B > A)
rng = np.random.default_rng(42)  # Seeded for reproducibility
samples_A = rng.beta(alpha_A, beta_A, size=100_000)
samples_B = rng.beta(alpha_B, beta_B, size=100_000)
lift = samples_B - samples_A  # Absolute difference in conversion rate
prob_B_better = (samples_B > samples_A).mean()
print(f"P(B > A) = {prob_B_better:.3f}")
print(f"Expected absolute lift: {lift.mean() * 100:.2f} percentage points")
lo, hi = np.percentile(lift, [2.5, 97.5])
print(f"95% credible interval for lift: [{lo * 100:.2f}, {hi * 100:.2f}] percentage points")
Advantages over frequentist:
- Directly answers "how likely is B better?" (not "would we reject H0?")
- Can incorporate prior knowledge (previous experiments)
- Can stop early without p-value inflation concerns (with proper sequential methods)
- Provides a full distribution over the effect size, not just a point estimate
Bayesian Anomaly Detection
Use a probabilistic model of "normal" behavior. Observations with low likelihood under the model are anomalies.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture
# Fit a Bayesian Gaussian Mixture Model on normal data
bgm = BayesianGaussianMixture(
    n_components=10,  # Upper bound; unneeded components end up with near-zero weight
    weight_concentration_prior_type='dirichlet_process',
    weight_concentration_prior=0.01,  # Smaller = fewer active components
    random_state=42
)
bgm.fit(X_normal)
# Anomaly score: per-sample log-likelihood (lower = more anomalous)
log_likelihood = bgm.score_samples(X_test)
anomaly_threshold = np.percentile(bgm.score_samples(X_normal), 1)  # 1st percentile
anomalies = X_test[log_likelihood < anomaly_threshold]
Why Bayesian? The Dirichlet Process prior lets the model infer how many Gaussian components it needs: you specify only an upper bound (n_components), and components that don't explain data are driven to near-zero weight.
Active Learning with Uncertainty
Use model uncertainty to choose which samples to label next - label the most informative (uncertain) samples first.
import numpy as np
def active_learning_loop(model, oracle, X_pool, X_train, y_train, n_iterations=10, batch_size=10):
    # `oracle` is any object with a .label(X) method (e.g., a human annotation queue)
    for iteration in range(n_iterations):
        # Train model on current labeled data
        model.fit(X_train, y_train)
        # Estimate uncertainty on unlabeled pool
        # For BNNs/MC Dropout: use prediction variance
        # For GPs: use predicted std
        _, uncertainties = model.predict(X_pool, return_std=True)
        # Select most uncertain samples
        query_idx = np.argsort(uncertainties)[-batch_size:]
        # Query labels (human annotation)
        new_labels = oracle.label(X_pool[query_idx])
        # Add to training set
        X_train = np.vstack([X_train, X_pool[query_idx]])
        y_train = np.concatenate([y_train, new_labels])
        # Remove from pool
        X_pool = np.delete(X_pool, query_idx, axis=0)
    # Refit so the returned model has seen the final batch of labels
    model.fit(X_train, y_train)
    return model
Acquisition strategies:
- Uncertainty sampling: Select samples where the model is most uncertain (max entropy, max variance)
- Query by committee: Train multiple models, select samples where they disagree most
- Expected information gain: Select samples that would maximally reduce posterior uncertainty
- Batch Active learning by Diverse Gradient Embeddings (BADGE): Balances uncertainty and diversity in batch selection
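The first two strategies in the list above reduce to simple scoring functions over predicted class probabilities. The sketch below is one possible formulation, with made-up function names: entropy and margin scores for uncertainty sampling, and a committee disagreement score computed as the entropy of the averaged prediction minus the average entropy of the members (an assumption; vote entropy or KL divergence are common alternatives).

```python
import numpy as np

def entropy_scores(probs):
    """Predictive entropy per sample; probs has shape (n_samples, n_classes)."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def margin_scores(probs):
    """1 - (top probability - runner-up probability); higher means more uncertain."""
    sorted_probs = np.sort(probs, axis=1)
    return 1.0 - (sorted_probs[:, -1] - sorted_probs[:, -2])

def committee_disagreement(member_probs):
    """Disagreement across models; member_probs has shape (n_members, n_samples, n_classes)."""
    mean_probs = member_probs.mean(axis=0)
    member_entropy = np.mean([entropy_scores(p) for p in member_probs], axis=0)
    return entropy_scores(mean_probs) - member_entropy

probs = np.array([[0.98, 0.01, 0.01],    # Confident prediction
                  [0.34, 0.33, 0.33]])   # Nearly uniform prediction
committee = np.array([
    [[0.9, 0.1], [0.9, 0.1]],            # Member 1
    [[0.9, 0.1], [0.1, 0.9]],            # Member 2: agrees on sample 0, disagrees on sample 1
])
print("entropy:", entropy_scores(probs))
print("margin:", margin_scores(probs))
print("disagreement:", committee_disagreement(committee))
```

Entropy and margin react to any kind of uncertainty, including irreducible label noise, whereas the disagreement score is high only when the models contradict each other, which makes it a closer proxy for epistemic uncertainty.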
Probabilistic Inference Pipeline
Practice Problems
Problem 1: MLE vs. MAP with Small Data (Mid-Level)
Scenario: You have 10 coin flips: 9 heads, 1 tail. What's the MLE estimate of P(heads)? What's the MAP estimate with a Beta(2, 2) prior? Which is better and why?
Hint 1 - Direction
MLE is just the frequency. For MAP with a Beta prior, the posterior of a Bernoulli likelihood with a Beta prior is also Beta (conjugate pair). The MAP of a Beta distribution is (alpha - 1) / (alpha + beta - 2).
Hint 2 - Insight
The MLE says P(heads) = 0.9, but you probably don't believe a coin is that biased based on just 10 flips. The Beta(2,2) prior expresses mild belief that the coin is roughly fair, pulling the estimate toward 0.5.
Hint 3 - Full Solution
MLE: $\hat{p}_{MLE} = 9 / 10 = 0.9$
MAP with Beta(2, 2) prior: Posterior is Beta(alpha + heads, beta + tails) = Beta(2 + 9, 2 + 1) = Beta(11, 3)
MAP of Beta(a, b) = (a - 1) / (a + b - 2): $\hat{p}_{MAP} = (11 - 1) / (11 + 3 - 2) = 10 / 12 \approx 0.833$
Why MAP is better here:
- With only 10 flips, the MLE of 0.9 is very sensitive to sampling variation
- If you flipped 10 more times, you'd likely see a different ratio
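- The Beta(2,2) prior acts like pseudo-observations that smooth the estimate: for the MAP it adds one extra head and one extra tail, (9 + 1) / (10 + 2); for the posterior mean it adds two of each, (9 + 2) / (10 + 4)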
- The MAP estimate (0.833) is pulled toward 0.5, reflecting that extreme probabilities are less likely a priori
- With more data (e.g., 9000 heads out of 10000), MLE (0.9) and MAP (9001/10002 ≈ 0.8999) nearly agree - the prior becomes irrelevant
Full Bayesian approach: Rather than a point estimate, report the full posterior: Beta(11, 3). Mean = 11/14 = 0.786, 95% equal-tailed credible interval ≈ [0.55, 0.95].
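The numbers in this solution can be reproduced in a few lines. The sketch below uses `scipy.stats.beta` for the posterior; the variable names are illustrative, and the interval computed here is the equal-tailed one (a highest-density interval would differ slightly).

```python
from scipy import stats

heads, tails = 9, 1
prior_a, prior_b = 2, 2                   # Beta(2, 2) prior

mle = heads / (heads + tails)

# Conjugate update: Beta prior + Bernoulli likelihood gives a Beta posterior
post_a, post_b = prior_a + heads, prior_b + tails
map_estimate = (post_a - 1) / (post_a + post_b - 2)

posterior = stats.beta(post_a, post_b)
posterior_mean = posterior.mean()
ci_low, ci_high = posterior.ppf([0.025, 0.975])   # Equal-tailed 95% credible interval

print(f"MLE = {mle:.3f}, MAP = {map_estimate:.3f}, posterior mean = {posterior_mean:.3f}")
print(f"95% credible interval: [{ci_low:.2f}, {ci_high:.2f}]")
```

The three point summaries are ordered MLE > MAP > posterior mean here because the prior pulls toward 0.5 and the posterior mean weighs the prior more heavily than the mode does.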
Scoring Rubric:
- Strong Hire: Computes both correctly, explains conjugacy, discusses when prior matters (small data) vs. doesn't (large data), mentions full posterior as ideal
- Lean Hire: Computes MLE correctly, gets MAP formula but struggles with derivation
- No Hire: Can't compute MLE from data or doesn't understand what a prior does
Problem 2: Uncertainty in Production (Senior-Level)
Scenario: Your medical imaging model outputs 0.85 probability of a tumor being malignant. The model was trained on 50K images. A doctor asks: "How confident is the model in this prediction?" Just saying "85%" isn't sufficient. Design a system that provides meaningful uncertainty information.
Hint 1 - Direction
Distinguish between "the model thinks 85%" (which could be miscalibrated) and "we're confident the model's estimate is around 85%" (epistemic uncertainty). You need to address calibration, epistemic uncertainty, and how to communicate this to a non-technical user.
Hint 2 - Insight
The 0.85 might have high aleatoric uncertainty (the image itself is ambiguous) or high epistemic uncertainty (the model has never seen a tumor like this). The doctor needs to know WHICH type of uncertainty is present because the action differs: aleatoric means "get a biopsy regardless," epistemic means "get a second opinion from a specialist."
Hint 3 - Full Solution
System design:
1. Calibrate the model:
- Apply temperature scaling on a held-out calibration set
- Verify with a reliability diagram that predicted probabilities match actual frequencies
- Report the calibrated probability (e.g., 0.85 might become 0.78 after calibration)
2. Quantify aleatoric uncertainty:
- Train the model to output mean AND variance (heteroscedastic output)
- High aleatoric uncertainty = the image is inherently ambiguous (e.g., borderline lesion)
3. Quantify epistemic uncertainty:
- Use MC Dropout (T=50 forward passes) or a deep ensemble (M=5 models)
- Compute prediction variance across passes/models
- High epistemic uncertainty = this image is unlike the training data (rare pathology, unusual imaging conditions)
4. Out-of-distribution detection:
- Compare the image's embedding to training data embeddings
- If the image is far from all training examples, flag for manual review regardless of prediction
5. Communicate to the doctor:
Prediction: 78% probability of malignancy (calibrated)
Confidence assessment:
- Image clarity: HIGH (low aleatoric uncertainty)
- Model familiarity: MEDIUM (moderate epistemic uncertainty -
this tumor type is underrepresented in training data)
Recommendation: Consider specialist consultation due to
moderate model uncertainty on this pathology type.
Similar cases in training data: 47 (23 malignant, 24 benign)
6. Monitoring:
- Track calibration drift monthly
- Flag cases where the model was highly confident but wrong (for review and retraining)
- Track epistemic uncertainty distribution - increasing average uncertainty suggests distribution shift
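Step 4 of the design (out-of-distribution detection by comparing embeddings) could be prototyped with a nearest-neighbor distance check. The sketch below is one simple option under stated assumptions: random vectors stand in for real image embeddings, the helper `knn_distance` is hypothetical, and the flagging threshold is set at the 99th percentile of distances observed within the training set.

```python
import numpy as np

def knn_distance(query, reference, k=5, skip_self=False):
    """Mean Euclidean distance from each query row to its k nearest reference rows."""
    dists = np.linalg.norm(query[:, None, :] - reference[None, :, :], axis=2)
    dists = np.sort(dists, axis=1)
    start = 1 if skip_self else 0          # Skip the zero distance to itself
    return dists[:, start:start + k].mean(axis=1)

rng = np.random.default_rng(0)
train_emb = rng.normal(size=(500, 16))     # Stand-in for training-set embeddings

# Threshold from the training data itself (leave-one-out neighbor distances)
train_scores = knn_distance(train_emb, train_emb, k=5, skip_self=True)
threshold = np.percentile(train_scores, 99)

typical_case = rng.normal(size=(1, 16))            # Drawn like the training data
unusual_case = rng.normal(size=(1, 16)) + 8.0      # Shifted far from the training data
scores = knn_distance(np.vstack([typical_case, unusual_case]), train_emb, k=5)
flag_for_review = scores > threshold
print("scores:", scores, "threshold:", threshold, "flagged:", flag_for_review)
```

By construction about 1% of in-distribution cases will also exceed this threshold, so the percentile is a direct knob on how many cases are sent for manual review.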
Scoring Rubric:
- Strong Hire: Separates aleatoric from epistemic uncertainty, calibrates the model, designs OOD detection, proposes doctor-friendly communication, discusses monitoring
- Lean Hire: Mentions calibration and uncertainty but doesn't separate types or design the communication
- No Hire: Thinks 0.85 is the answer or suggests "just set a threshold"
Problem 3: Bayesian Optimization (Senior-Level)
Scenario: You need to tune 8 hyperparameters for a deep learning model. Each training run takes 6 hours on a GPU. You have budget for 50 runs. Grid search is impractical (3^8 = 6,561 combinations). Random search is better but wasteful. Propose a more efficient approach.
Hint 1 - Direction
Bayesian optimization uses a probabilistic surrogate model (often a GP) to model the objective function and intelligently choose which hyperparameters to try next.
Hint 2 - Insight
The key idea: use a GP to model "validation accuracy as a function of hyperparameters." The GP's predictive uncertainty tells you where the objective is uncertain. An acquisition function balances exploiting known good regions and exploring uncertain regions.
Hint 3 - Full Solution
Bayesian Optimization with Gaussian Processes:
- Surrogate model: A GP models the unknown function $f(\text{hyperparameters}) \rightarrow \text{validation metric}$
- Acquisition function: Decides which hyperparameters to try next
  - Expected Improvement (EI): $\text{EI}(x) = \mathbb{E}\left[\max(f(x) - f(x^+), 0)\right]$, where $x^+$ is the best configuration found so far
  - Balances exploitation (high predicted mean) and exploration (high predicted variance)
- Loop:
Initialize: Try 5-10 random hyperparameter combinations
For i = 11 to 50:
    Fit GP to (hyperparams, metrics) observed so far
    Find next hyperparams that maximize acquisition function
    Train model with those hyperparams
    Record validation metric
Return best hyperparams found
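For a Gaussian predictive distribution, Expected Improvement has a closed form. The sketch below implements it for a maximization problem, taking the GP posterior mean and standard deviation at each candidate; the function name and the optional exploration margin `xi` are conventions assumed for this example rather than part of any specific library.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_so_far, xi=0.0):
    """Closed-form EI (maximization) given GP posterior mean and std at candidates."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    improvement = mu - best_so_far - xi
    ei = np.zeros_like(mu)
    uncertain = sigma > 0
    z = improvement[uncertain] / sigma[uncertain]
    # Exploitation term (improvement * cdf) plus exploration term (sigma * pdf)
    ei[uncertain] = improvement[uncertain] * norm.cdf(z) + sigma[uncertain] * norm.pdf(z)
    # With zero predictive variance, EI is just the positive part of the improvement
    ei[~uncertain] = np.maximum(improvement[~uncertain], 0.0)
    return ei

# Three hypothetical candidates, with best validation accuracy so far = 0.85
mu = [0.80, 0.80, 0.90]
sigma = [0.00, 0.10, 0.00]
ei = expected_improvement(mu, sigma, best_so_far=0.85)
print("EI:", ei)
```

The second candidate has the same predicted mean as the first but a nonzero EI purely because of its uncertainty, which is how the acquisition function rewards exploration.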
from skopt import gp_minimize
from skopt.space import Real, Integer, Categorical
# Define search space
space = [
    Real(1e-5, 1e-1, prior='log-uniform', name='learning_rate'),
    Integer(32, 256, name='batch_size'),
    Real(0.0, 0.5, name='dropout'),
    Integer(1, 5, name='num_layers'),
    Integer(64, 512, name='hidden_dim'),
    Real(1e-6, 1e-2, prior='log-uniform', name='weight_decay'),
    Categorical(['relu', 'gelu', 'silu'], name='activation'),
    Categorical(['adam', 'sgd', 'adamw'], name='optimizer'),
]
def objective(params):
    lr, batch_size, dropout, n_layers, hidden, wd, act, opt = params
    # Train model with these hyperparameters
    val_metric = train_and_evaluate(lr, batch_size, dropout, n_layers,
                                    hidden, wd, act, opt)
    return -val_metric  # Minimize negative metric
result = gp_minimize(objective, space, n_calls=50, n_initial_points=10,
                     random_state=42)
print(f"Best metric: {-result.fun:.4f}")
print(f"Best params: {result.x}")
Why this works better than random search:
- After 10 random points, the GP has a rough model of the objective surface
- It explores uncertain regions AND exploits promising regions
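- Often reaches a comparable configuration in noticeably fewer evaluations than random search, although the size of the gain depends on the problem and on the dimensionality of the search space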
Advanced considerations:
- Multi-fidelity: Use short training runs (1 epoch) to discard bad configurations quickly, full runs only for promising ones (Hyperband, BOHB)
- Transfer learning: Use results from previous similar experiments as the GP prior
- Parallelism: Batch acquisition - select multiple candidates simultaneously for parallel GPUs
Scoring Rubric:
- Strong Hire: Explains GP surrogate, acquisition function (EI), the explore-exploit tradeoff, proposes multi-fidelity extension, implements correctly
- Lean Hire: Knows Bayesian optimization uses a surrogate model but can't explain the acquisition function or the GP
- No Hire: Suggests grid search or doesn't know about Bayesian optimization
Problem 4: Naive Bayes vs. Logistic Regression (Screening-Level)
Scenario: You have 1000 labeled emails (500 spam, 500 not spam) with TF-IDF features. Compare naive Bayes and logistic regression. When would you prefer each?
Hint 1 - Direction
Think about their assumptions, training speed, behavior with limited data, and the quality of probability estimates.
Hint 2 - Insight
Naive Bayes is a generative model (models P(x|y)); logistic regression is discriminative (models P(y|x)). They have different convergence rates and different assumptions. NB wins with very little data; LR wins with more data.
Hint 3 - Full Solution
| Aspect | Naive Bayes | Logistic Regression |
|---|---|---|
| Model type | Generative | Discriminative |
| Assumption | Feature independence given class | Linear decision boundary in feature space |
| Training | Counting (extremely fast) | Iterative optimization (fast) |
| Small data (<100 samples) | Better (fewer parameters) | Worse (more parameters, can overfit) |
| Moderate data (1000 samples) | Good | Usually better |
| Large data (100K+ samples) | Competitive | Better (tighter asymptotic fit) |
| Calibration | Poor (probabilities pushed to 0/1) | Better (but still imperfect) |
| Correlated features | Double-counts correlated evidence | Weights adjust for correlation (explicit interaction terms are still needed for true interactions) |
| Regularization | Laplace smoothing (alpha) | L1/L2 penalty |
| Missing features | Handles naturally | Requires imputation |
For 1000 labeled emails:
- Naive Bayes is a strong baseline - fast, works well with TF-IDF, handles high-dimensional sparse data
- Logistic regression will likely perform slightly better because 1000 samples is enough to fit a discriminative model
- Both will work well; the difference is probably <2% accuracy
Prefer Naive Bayes when:
- Very limited training data (<100 labeled examples)
- Need extremely fast training/inference (real-time system)
- Features are reasonably independent
- Need a baseline quickly
Prefer Logistic Regression when:
- Enough training data (500+)
- Need calibrated probabilities
- Features are strongly correlated (LR adjusts its weights instead of double-counting evidence)
- Want regularization control (L1 for sparsity, L2 for smoothness)
Scoring Rubric:
- Strong Hire: Explains generative vs. discriminative, discusses convergence rates (NB faster with small data, LR better asymptotically), mentions calibration difference
- Lean Hire: Knows the basics but can't explain why NB wins with small data
- No Hire: Dismisses naive Bayes as "too simple" without understanding its strengths
Problem 5: Designing an Active Learning System (Staff-Level)
Scenario: You're building an NLP model for classifying customer support tickets into 50 categories. You have 1 million unlabeled tickets and budget to label 10,000. Design the labeling strategy using active learning.
Hint 1 - Direction
Start with a small seed set, train an initial model, then iteratively select the most informative unlabeled examples based on model uncertainty. But with 50 categories, pure uncertainty sampling may not be enough - you also need diversity and coverage of rare categories.
Hint 2 - Insight
Consider: entropy-based uncertainty overselects from ambiguous categories. You need a strategy that ensures coverage of ALL 50 categories, especially rare ones. Also, the first few iterations are critical - the initial model is very poor, so early uncertainty estimates are unreliable. Use a warm-up phase with stratified random sampling.
Hint 3 - Full Solution
Phase 1: Warm-up (first 1,000 labels)
- Cluster the 1M unlabeled tickets using sentence embeddings (SBERT)
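- Sample a roughly equal number of tickets from each cluster (rather than proportionally to cluster size), so that small clusters are represented in the seed set
- This improves the odds of covering rare categories, though it cannot guarantee it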
- Train initial model on these 1,000 labeled tickets
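A minimal sketch of the cluster-based seed selection, assuming cluster assignments have already been computed (for instance by k-means over sentence embeddings). The function name `sample_seed_set` and the round-robin allocation are illustrative choices rather than part of the design above; round-robin gives small clusters as many picks as large ones, which tends to help rare categories more than size-proportional sampling does.

```python
import random
from collections import defaultdict


def sample_seed_set(cluster_ids, n_total, seed=0):
    """Pick up to n_total indices, spreading picks evenly across clusters."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        by_cluster[cid].append(idx)
    for members in by_cluster.values():
        rng.shuffle(members)

    selected = []
    # Round-robin over clusters so small clusters are not crowded out.
    while len(selected) < n_total:
        progressed = False
        for members in by_cluster.values():
            if members and len(selected) < n_total:
                selected.append(members.pop())
                progressed = True
        if not progressed:
            break  # every cluster is exhausted
    return selected


# Toy pool: one large cluster, one small, one tiny.
cluster_ids = [0] * 90 + [1] * 8 + [2] * 2
picks = sample_seed_set(cluster_ids, n_total=6)
print(sorted(cluster_ids[i] for i in picks))  # [0, 0, 1, 1, 2, 2]
```

Once a small cluster is exhausted, the remaining budget flows to the larger clusters, so the total still reaches `n_total` whenever the pool is large enough.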
Phase 2: Active Learning Loop (next 8,000 labels, in batches of 500, leaving ~1,000 for Phase 3)
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

# Assumed to exist: model (fitted classifier), encoder (sentence encoder),
# human_annotate (labeling step), X_train, y_train, X_unlabeled (NumPy arrays).
for batch in range(16):  # 16 batches of 500 = 8,000
    # 1. Get model predictions on the unlabeled pool
    probs = model.predict_proba(X_unlabeled)

    # 2. Compute acquisition scores (hybrid strategy)
    # Entropy for uncertainty
    entropy = -np.sum(probs * np.log(probs + 1e-10), axis=1)
    # Diversity via embedding clustering (in practice, cache the embeddings
    # rather than re-encoding the whole pool every batch)
    embeddings = encoder.encode(X_unlabeled)
    cluster_ids = KMeans(n_clusters=100).fit_predict(embeddings)

    # 3. Select batch: balance uncertainty and diversity
    selected = []
    for cluster in range(100):
        member_idx = np.where(cluster_ids == cluster)[0]
        # Select the top-5 most uncertain tickets from each cluster
        top_k = min(5, len(member_idx))
        top_idx = np.argsort(entropy[member_idx])[-top_k:]
        selected.extend(member_idx[top_idx])
    # Cap at 500 overall, keeping the highest-entropy candidates
    batch_idx = np.array(selected)
    final_idx = batch_idx[np.argsort(entropy[batch_idx])[-500:]]

    # 4. Label and retrain
    new_labels = human_annotate(X_unlabeled[final_idx])
    X_train = np.vstack([X_train, X_unlabeled[final_idx]])
    y_train = np.concatenate([y_train, new_labels])
    X_unlabeled = np.delete(X_unlabeled, final_idx, axis=0)
    model.fit(X_train, y_train)

    # 5. Monitor category coverage
    category_counts = Counter(y_train)
    print(f"Batch {batch}: {len(category_counts)}/50 categories covered")
    print(f"Rarest category: {min(category_counts.values())} samples")
Phase 3: Targeted Gap-Filling (last ~1,000 labels)
- Identify underrepresented categories (<20 labeled examples)
- Use model predictions to find unlabeled examples likely in those categories
- Prioritize labeling these to ensure minimum coverage
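The gap-filling step can be sketched as a simple ranking over the model's predicted probabilities. This is an illustration under stated assumptions: `probs` is a list of per-ticket class-probability rows for the unlabeled pool, `labeled_counts` maps class index to the number of labels collected so far, and the function name and the `per_class` cap are hypothetical.

```python
from collections import Counter


def gap_fill_candidates(probs, labeled_counts, n_classes, min_count=20, per_class=50):
    """For each under-represented class, return the unlabeled indices
    the model considers most likely to belong to that class."""
    candidates = {}
    for c in range(n_classes):
        if labeled_counts.get(c, 0) >= min_count:
            continue  # class already has enough labels
        ranked = sorted(range(len(probs)), key=lambda i: probs[i][c], reverse=True)
        candidates[c] = ranked[:per_class]
    return candidates


# Toy example: 4 unlabeled tickets, 3 classes, class 2 is under-labeled.
probs = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.3, 0.6],
    [0.3, 0.3, 0.4],
]
counts = Counter({0: 25, 1: 30, 2: 3})
print(gap_fill_candidates(probs, counts, n_classes=3, per_class=2))  # {2: [2, 3]}
```

One caveat: for a category with very few labels, the model's probabilities for that category are themselves unreliable, so in practice this ranking may need to be combined with keyword search or nearest-neighbor lookup around the few known examples.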
Key design decisions:
- Hybrid acquisition: Pure uncertainty sampling leads to redundant selections; adding diversity prevents this
- Batch size of 500: Small enough to update the model frequently, large enough for efficient annotation
- Category monitoring: Track coverage to ensure no category is left behind
- Stopping criterion: Stop when validation performance plateaus between batches
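The stopping criterion can be made operational with a patience rule on the per-batch validation scores. The sketch below is one possible definition of "plateau"; the function name and the `patience` and `min_delta` defaults are assumptions to be tuned, not values from the design above.

```python
def has_plateaued(val_scores, patience=3, min_delta=0.002):
    """Return True if the best score over the last `patience` batches
    failed to beat the best earlier score by at least `min_delta`."""
    if len(val_scores) <= patience:
        return False  # not enough history to judge
    best_before = max(val_scores[:-patience])
    best_recent = max(val_scores[-patience:])
    return best_recent - best_before < min_delta


still_improving = [0.60, 0.65, 0.70, 0.74, 0.78, 0.81]
flattened = [0.60, 0.70, 0.75, 0.751, 0.7505, 0.7512]
print(has_plateaued(still_improving), has_plateaued(flattened))  # False True
```

Comparing against the best earlier score, rather than only the previous batch, makes the rule less sensitive to batch-to-batch noise on a small validation set.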
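Expected outcome: Active learning typically reaches a given accuracy with noticeably fewer labels than random sampling. A gain on the order of 3x (10,000 strategically chosen labels matching roughly 30,000 random ones) is a reasonable target, but the actual figure depends on the task and should be checked against a random-sampling baseline.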
Scoring Rubric:
- Strong Hire: Warm-up phase with clustering, hybrid uncertainty+diversity acquisition, category coverage monitoring, gap-filling phase, discusses annotation efficiency
- Lean Hire: Gets the basic active learning loop right but misses diversity or category coverage
- No Hire: Suggests random labeling or doesn't know about active learning
Interview Cheat Sheet
| Question | Key Points |
|---|---|
| What is Bayes' theorem? | Posterior = (Likelihood x Prior) / Evidence; updates beliefs with data |
| MLE vs. MAP? | MLE maximizes likelihood (no prior); MAP maximizes likelihood x prior (= regularization); Gaussian prior gives L2, Laplace prior gives L1 |
| Why go full Bayesian? | Get uncertainty estimates; marginalize over parameters; better with small data |
| Naive Bayes: why "naive"? | Assumes feature independence given class; works well despite violation |
| What are Gaussian Processes? | Non-parametric Bayesian; distribution over functions; exact inference costs O(n^3) in the number of training points; natural uncertainty |
| GPs vs neural nets? | GPs for small data, low dims, need uncertainty. NNs for large data, high dims, need scalability |
| Aleatoric vs. epistemic? | Aleatoric: data noise (irreducible). Epistemic: model ignorance (reducible with more data) |
| How to get NN uncertainty? | MC Dropout and deep ensembles (epistemic); heteroscedastic outputs (aleatoric); temperature scaling (calibrates confidence, adds no new uncertainty estimate) |
| What is calibration? | Predicted probabilities match actual frequencies; check with reliability diagram |
| Temperature scaling? | Divide logits by T; T > 1 softens overconfident predictions; fit T on validation set |
| Bayesian A/B testing? | Beta posterior; P(B > A) directly; can stop early; credible intervals |
| Active learning? | Use uncertainty to select the most informative samples to label; can cut labeling cost severalfold (task-dependent) |
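As a worked companion to the temperature scaling row, here is a minimal sketch that fits T by grid search on validation negative log-likelihood. It uses only the standard library; the grid range, the helper names, and the toy validation set are assumptions for illustration. In practice T is usually fit with a gradient-based optimizer on real validation logits.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y]) for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature on the grid that minimizes validation NLL."""
    grid = grid or [0.5 + 0.05 * i for i in range(91)]  # 0.5 .. 5.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))


# Toy validation set: the model is ~98% confident but right only 3 times out of 4.
val_logits = [[4.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]
T = fit_temperature(val_logits, val_labels)
print(round(T, 2), [round(p, 3) for p in softmax([4.0, 0.0], T)])
```

On this toy set the fitted T lands near 4 / ln 3 (about 3.64), the value at which the top-class probability equals the observed 75% accuracy. Dividing by a positive T never changes the argmax, so accuracy is unaffected; only the confidence is rescaled.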
Spaced Repetition Checkpoints
Day 0 - Initial Learning
- Write Bayes' theorem and label each term
- Explain MLE vs. MAP with a coin flip example
- List two strengths of naive Bayes and two weaknesses
- Define aleatoric vs. epistemic uncertainty with examples
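For the coin flip checkpoint, a short reference sketch may help when checking your answer. It assumes a Beta(a, b) prior on the heads probability with a, b > 1; the default Beta(2, 2) prior and the function names are illustrative choices.

```python
def coin_mle(heads: int, flips: int) -> float:
    """Maximum likelihood estimate of P(heads): the observed frequency."""
    return heads / flips


def coin_map(heads: int, flips: int, a: float = 2.0, b: float = 2.0) -> float:
    """MAP estimate: the mode of the Beta(a + heads, b + tails) posterior.
    The closed form below is valid when both posterior parameters exceed 1."""
    return (heads + a - 1.0) / (flips + a + b - 2.0)


print(coin_mle(3, 3), coin_map(3, 3))                 # 1.0 0.8
print(coin_mle(300, 300), round(coin_map(300, 300), 4))
```

After three heads in three flips, MLE concludes the coin always lands heads, while MAP is pulled toward the prior and reports 0.8. With 300 flips the two estimates nearly coincide, which is the usual pattern: the prior matters most when data is scarce.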
Day 3 - Recall
- Explain why L2 regularization is MAP with a Gaussian prior
- Describe how a Gaussian Process makes predictions with uncertainty
- Explain MC Dropout for uncertainty estimation (step by step)
- Define calibration and Expected Calibration Error
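For the calibration checkpoint, the following is a minimal sketch of Expected Calibration Error with equal-width confidence bins. It assumes `confidences` holds the predicted probability of the chosen class and `correct` holds 1 or 0 for whether that prediction was right; the bin count of 10 is a common default rather than a fixed part of the definition.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


overconfident = expected_calibration_error([0.95] * 4, [1, 1, 1, 0])
calibrated = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
print(round(overconfident, 3), round(calibrated, 3))  # 0.2 0.0
```

The first case claims 95% confidence but achieves 75% accuracy, giving an ECE of 0.2; the second claims exactly what it delivers. A reliability diagram plots the same per-bin accuracy against per-bin confidence.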
Day 7 - Application
- Implement Bayesian A/B testing with Beta posteriors
- Design an uncertainty-aware prediction system for a safety-critical application
- Explain when GPs beat neural networks (and vice versa)
- Solve Practice Problem 1 without hints
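For the Bayesian A/B testing checkpoint, one way to check an implementation is against a Monte Carlo sketch like the one below. It assumes a uniform Beta(1, 1) prior on each conversion rate and estimates P(B > A) by sampling from the two Beta posteriors; the function name, prior, and draw count are illustrative.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm: Beta(1 + conversions, 1 + non-conversions)
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical experiment: 100/1000 conversions for A, 130/1000 for B.
p_better = prob_b_beats_a(100, 1000, 130, 1000)
p_tie = prob_b_beats_a(100, 1000, 100, 1000)
print(p_better, p_tie)
```

The first call should return a value well above 0.95, and the second a value close to 0.5, since identical data gives neither arm an edge. Unlike a p-value, the result is a direct statement about the quantity of interest: the probability that B's true rate exceeds A's, given the data and the prior.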
Day 14 - Integration
- Design a complete active learning system with uncertainty-based acquisition
- Explain the connection between Bayesian inference and regularization
- Compare 4 methods for uncertainty in neural networks (MC Dropout, ensembles, BNN, temperature scaling)
- Solve Practice Problem 2 with full system design
Day 21 - Mastery
- Teach Bayesian ML from Bayes' theorem to Gaussian Processes to BNNs
- Design a probabilistic production system with calibration, uncertainty, and monitoring
- Explain Bayesian optimization with GP surrogates and acquisition functions
- Confidently handle: "When is probabilistic ML overkill?"
