
Probabilistic ML - Beyond Point Predictions

Reading time: ~30 min | Interview relevance: High | Roles: MLE, Research Scientist, Applied Scientist, AI Engineer

The Real Interview Moment

The interviewer presents a scenario: "Your autonomous driving system needs to decide whether to brake. Your object detection model outputs 72% confidence that there's a pedestrian ahead. Should the car brake? What if the model is poorly calibrated and 72% confidence actually corresponds to a 45% true probability? How would you design a system that accounts for prediction uncertainty?"

You realize this isn't a question about threshold tuning - it's about whether the model knows what it doesn't know. A standard neural network that outputs 72% hasn't told you anything about whether it's uncertain because the image is ambiguous (the object could be a pedestrian or a mailbox) or because the scene looks nothing like the training data (a scenario the model has never encountered). These two types of uncertainty - aleatoric and epistemic - require fundamentally different responses, and distinguishing them is the heart of probabilistic ML.

What You Will Master

  • Bayes' theorem and its role as the foundation of probabilistic ML
  • MLE vs. MAP estimation - when and why each matters
  • Naive Bayes: assumptions, strengths, and surprising effectiveness
  • Gaussian Processes: non-parametric Bayesian regression with uncertainty
  • Bayesian Neural Networks: epistemic uncertainty in deep learning
  • Aleatoric vs. epistemic uncertainty and practical quantification
  • Calibration, reliability diagrams, and when probabilities matter
  • Applications: A/B testing, anomaly detection, active learning

Self-Assessment: Where Are You Now?

| Level | Description | Target |
|---|---|---|
| Beginner | "I know Bayes' theorem from probability class" | Read all parts carefully |
| Intermediate | "I've used naive Bayes and know about priors, but GPs and BNNs are fuzzy" | Focus on Parts 2-3 and practice problems |
| Advanced | "I understand probabilistic models but want to nail uncertainty quantification" | Jump to Part 3 (uncertainty), calibration, and practice problems |

Part 1 - Foundations of Probabilistic Thinking

Bayes' Theorem - The Engine of Probabilistic ML

$$P(\theta | D) = \frac{P(D | \theta) \cdot P(\theta)}{P(D)}$$

| Term | Name | Meaning |
|---|---|---|
| $P(\theta \mid D)$ | Posterior | Updated belief about parameters after seeing data |
| $P(D \mid \theta)$ | Likelihood | Probability of observed data given parameters |
| $P(\theta)$ | Prior | Belief about parameters before seeing data |
| $P(D)$ | Evidence (marginal likelihood) | Normalizing constant; $\int P(D \mid \theta) P(\theta) \, d\theta$ |

60-Second Answer

"Probabilistic ML treats model parameters and predictions as distributions rather than point values. Instead of finding the single best weights, we maintain a distribution over possible weights, which naturally gives us uncertainty estimates. Bayes' theorem is the update rule: we start with a prior belief, observe data, and compute the posterior belief. The key advantage over standard ML is that probabilistic models know what they don't know - they output high uncertainty for inputs far from the training distribution. This is critical for safety-critical applications like medical diagnosis, autonomous driving, and financial risk."

MLE vs. MAP Estimation

Maximum Likelihood Estimation (MLE): Find parameters that maximize the likelihood of observed data:

$$\hat{\theta}_{MLE} = \arg\max_\theta P(D | \theta) = \arg\max_\theta \sum_{i=1}^n \log P(x_i | \theta)$$

  • No prior assumed (or equivalently, a uniform/flat prior)
  • Can overfit with limited data
  • Equivalent to standard neural network training with no regularization

Maximum A Posteriori (MAP): Find parameters that maximize the posterior:

$$\hat{\theta}_{MAP} = \arg\max_\theta P(\theta | D) = \arg\max_\theta [\log P(D | \theta) + \log P(\theta)]$$

  • Incorporates prior belief about reasonable parameter values
  • $\log P(\theta)$ acts as a regularization term
  • With a Gaussian prior $P(\theta) = N(0, \sigma^2)$, MAP is equivalent to L2 regularization
  • With a Laplace prior $P(\theta) = \text{Laplace}(0, b)$, MAP is equivalent to L1 regularization
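The Gaussian-prior case can be checked numerically. The sketch below assumes a linear model with Gaussian observation noise; the data, `sigma_n` (noise standard deviation), and `sigma_p` (prior standard deviation) are made up for illustration. It computes the ridge solution with penalty `lam = sigma_n**2 / sigma_p**2` and confirms that the gradient of the negative log-posterior vanishes there, which is what "MAP equals L2-regularized least squares" means in this setting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data (hypothetical, for illustration only)
n, d = 50, 3
X = rng.normal(size=(n, d))
true_w = np.array([1.5, -2.0, 0.5])
sigma_n = 0.5  # assumed observation-noise std
sigma_p = 2.0  # assumed prior std: w ~ N(0, sigma_p^2 I)
y = X @ true_w + sigma_n * rng.normal(size=n)

# Ridge solution: argmin ||y - Xw||^2 + lam * ||w||^2
lam = sigma_n**2 / sigma_p**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def neg_log_posterior_grad(w):
    """Gradient of -log P(D|w) - log P(w) for a Gaussian likelihood and prior."""
    grad_likelihood = -(X.T @ (y - X @ w)) / sigma_n**2
    grad_prior = w / sigma_p**2
    return grad_likelihood + grad_prior


grad_at_ridge = neg_log_posterior_grad(w_ridge)
print("ridge weights:", w_ridge)
print("max |grad| of negative log-posterior:", np.abs(grad_at_ridge).max())
```

A larger `sigma_p` (a weaker prior) gives a smaller `lam`, and the estimate moves toward the unregularized least-squares solution.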

[Figure: MLE vs MAP vs Full Bayesian Inference]

Interviewer's Perspective

"What's the connection between L2 regularization and Bayesian inference?" This is a classic that separates surface-level from deep understanding. The answer: L2 regularization is equivalent to MAP estimation with a Gaussian prior on the weights. The regularization strength lambda corresponds to the prior precision (inverse variance). When lambda is large, the prior is tight (strong belief weights should be small), pulling the estimate toward zero. This is a beautiful bridge between frequentist and Bayesian perspectives.

Full Bayesian Inference vs. Point Estimates

| Approach | What You Get | Computation | Uncertainty |
|---|---|---|---|
| MLE | Single best parameters | Easy (gradient descent) | No |
| MAP | Single best parameters (regularized) | Easy (gradient descent + penalty) | No |
| Full Bayesian | Distribution over parameters | Hard (MCMC, variational inference) | Yes |

Why go full Bayesian?

  • Point estimates (MLE/MAP) give you one prediction. But how confident is that prediction?
  • Full Bayesian inference marginalizes over all possible parameters:

$$P(y^* | x^*, D) = \int P(y^* | x^*, \theta) P(\theta | D) \, d\theta$$

This integral averages predictions over all plausible parameter settings, weighted by how likely each setting is given the data. The spread of this predictive distribution is the model's uncertainty.
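A small sketch of what "averaging over parameter settings" means in practice, assuming a coin-flip model whose posterior is Beta(8, 4) (a hypothetical posterior chosen only for illustration). The integral is approximated by sampling parameters from the posterior and averaging the per-sample predictions, then compared with the closed form and with the plug-in MAP prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior over a coin's heads probability: theta | D ~ Beta(8, 4)
a, b = 8.0, 4.0
theta_samples = rng.beta(a, b, size=200_000)

# P(y* = heads | theta) = theta, so the predictive integral is the
# posterior average of theta. Monte Carlo: average over posterior samples.
predictive_mc = theta_samples.mean()

# Closed form for this conjugate model: the posterior mean
predictive_exact = a / (a + b)

# Plug-in prediction from a single point estimate (the posterior mode)
theta_map = (a - 1) / (a + b - 2)

print(f"Monte Carlo predictive P(heads): {predictive_mc:.4f}")
print(f"Exact predictive P(heads):       {predictive_exact:.4f}")
print(f"Plug-in MAP prediction:          {theta_map:.4f}")
```

The plug-in prediction differs from the marginalized one because it ignores the posterior mass away from the mode; the gap shrinks as the posterior concentrates with more data.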

Part 2 - Probabilistic Models

Naive Bayes

Despite its "naive" assumption of feature independence, naive Bayes is surprisingly effective for many problems.

Model:

$$P(y | x_1, ..., x_d) \propto P(y) \prod_{i=1}^d P(x_i | y)$$

The "naive" assumption: features are conditionally independent given the class. This means $P(x_1, x_2 | y) = P(x_1 | y) \cdot P(x_2 | y)$.

Why does it work despite the wrong assumption?

  1. Classification only needs the correct ranking of class probabilities, not calibrated values
  2. The decision boundary can still be correct even with incorrect probability estimates
  3. Feature correlations often "cancel out" in the posterior

Variants:

| Variant | Feature Distribution | Use Case |
|---|---|---|
| Gaussian NB | Continuous, roughly Gaussian | General continuous features |
| Multinomial NB | Discrete counts | Text classification (word counts) |
| Bernoulli NB | Binary (0/1) | Text classification (word presence) |
| Complement NB | Discrete counts, adjusted | Imbalanced text classification |

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

# Text classification with Naive Bayes
# train_texts / test_texts are lists of documents; y_train holds the labels
vectorizer = TfidfVectorizer(max_features=10000)
X_train_tfidf = vectorizer.fit_transform(train_texts)
X_test_tfidf = vectorizer.transform(test_texts)  # reuse the fitted vocabulary

nb = MultinomialNB(alpha=1.0)  # alpha = additive smoothing (1.0 is Laplace smoothing)
nb.fit(X_train_tfidf, y_train)

# Probabilities are often poorly calibrated
probs = nb.predict_proba(X_test_tfidf)  # Use with caution!

Common Trap

"Naive Bayes is bad because the independence assumption is always wrong." True, the assumption is almost always violated. But NB is fast, works well with limited data, handles high-dimensional sparse features such as text very well, and is often competitive with more complex models. The real failure mode isn't the assumption - it's using the probability estimates directly (e.g., as calibrated risk scores). For classification decisions, NB is often good enough.

Gaussian Processes

A Gaussian Process (GP) is a non-parametric Bayesian approach that defines a distribution over functions.

Intuition: Instead of fitting a single function $f(x)$, a GP maintains a distribution over all possible functions that are consistent with the observed data. Predictions come with uncertainty bands that widen in regions far from training data.

Definition: A GP is fully specified by:

  • Mean function $m(x)$: typically zero (the prior mean)
  • Kernel (covariance) function $k(x, x')$: encodes similarity between inputs

$$f(x) \sim \mathcal{GP}(m(x), k(x, x'))$$

Common kernels:

| Kernel | Formula | Properties |
|---|---|---|
| RBF (Squared Exponential) | $k(x,x') = \sigma^2 \exp\left(-\frac{\lVert x-x' \rVert^2}{2l^2}\right)$ | Smooth, infinitely differentiable |
| Matern | Various | Controls smoothness ($\nu$ parameter) |
| Linear | $k(x,x') = \sigma^2 x^T x'$ | Equivalent to Bayesian linear regression |
| Periodic | $k(x,x') = \sigma^2 \exp\left(-\frac{2\sin^2(\pi \lVert x-x' \rVert / p)}{l^2}\right)$ | For periodic patterns |

GP Regression - making predictions:

Given training data $(X, y)$ and a test point $x^*$:

$$P(f^* | x^*, X, y) = \mathcal{N}(\mu^*, \sigma^{*2})$$

$$\mu^* = k(x^*, X) [k(X, X) + \sigma_n^2 I]^{-1} y$$

$$\sigma^{*2} = k(x^*, x^*) - k(x^*, X) [k(X, X) + \sigma_n^2 I]^{-1} k(X, x^*)$$

The mean $\mu^*$ is the prediction, and $\sigma^{*2}$ is the uncertainty.
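The two prediction equations translate almost line by line into NumPy. This is a minimal sketch, assuming an RBF kernel with unit signal variance and length scale, a zero prior mean, and a hand-picked noise variance; the helper names `rbf_kernel` and `gp_posterior` are illustrative, not part of any library. It uses a Cholesky factorization instead of an explicit matrix inverse for numerical stability.

```python
import numpy as np


def rbf_kernel(A, B, signal_var=1.0, length_scale=1.0):
    """k(x, x') = signal_var * exp(-||x - x'||^2 / (2 * length_scale^2))."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return signal_var * np.exp(-0.5 * np.maximum(sq_dists, 0.0) / length_scale**2)


def gp_posterior(X_train, y_train, X_test, noise_var=1e-2):
    """Posterior mean and variance of f* at each test point (zero prior mean)."""
    K = rbf_kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_s = rbf_kernel(X_test, X_train)  # k(x*, X)
    L = np.linalg.cholesky(K)  # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))  # K^{-1} y
    mean = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)  # L^{-1} k(X, x*)
    var = np.diag(rbf_kernel(X_test, X_test)) - np.sum(v**2, axis=0)
    return mean, var


X_train = np.array([[-2.0], [-1.0], [0.0], [1.0], [2.0]])
y_train = np.sin(X_train).ravel()
X_test = np.array([[0.5], [10.0]])  # one point near the data, one far away

mean, var = gp_posterior(X_train, y_train, X_test)
for x, m, s2 in zip(X_test.ravel(), mean, var):
    print(f"x*={x:5.1f}  mean={m:+.3f}  variance={s2:.3f}")
```

At the far-away test point the kernel vector $k(x^*, X)$ is essentially zero, so the mean falls back to the prior mean and the variance returns to the prior variance $k(x^*, x^*)$: the widening uncertainty band described above.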

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)
gp.fit(X_train, y_train)

# Predictions with uncertainty
y_pred, y_std = gp.predict(X_test, return_std=True)

# 95% predictive interval (Gaussian predictive distribution)
lower = y_pred - 1.96 * y_std
upper = y_pred + 1.96 * y_std

Strengths:

  • Natural uncertainty quantification (no extra work)
  • Works extremely well with small datasets
  • Non-parametric - automatically adapts complexity
  • Kernel hyperparameters can be optimized via marginal likelihood

Limitations:

  • $O(n^3)$ computation (inverting the kernel matrix) - impractical for >10K samples
  • Scales poorly to high dimensions (kernel length scales become hard to learn)
  • Choice of kernel encodes strong assumptions about function smoothness
Interviewer's Perspective

"When would you use a GP instead of a neural network?" Best answer: "When I have small data (<10K samples), need calibrated uncertainty estimates, and the input is low-dimensional (<20 features). GPs give uncertainty for free and work well with limited data. For large datasets or high-dimensional inputs (images, text), neural networks are more practical, but I'd add uncertainty via MC Dropout or ensembles."

Bayesian Neural Networks (BNNs)

Standard neural networks learn point estimates of weights. BNNs maintain distributions over weights.

Standard NN: $w = \hat{w}$ (single value per weight)

BNN: $w \sim q(w)$ (distribution per weight, typically Gaussian)

Prediction with uncertainty:

$$P(y^* | x^*, D) \approx \frac{1}{T} \sum_{t=1}^T P(y^* | x^*, w_t), \quad w_t \sim q(w)$$

Sample T different weight configurations, make a prediction with each, and average. The spread of predictions IS the epistemic uncertainty.

Practical approaches (since exact Bayesian inference is intractable for NNs):

1. MC Dropout (Monte Carlo Dropout): Use dropout at test time and make multiple forward passes. The variance across predictions approximates Bayesian uncertainty.

import torch
import torch.nn as nn

class MCDropoutModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, dropout_rate=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout_rate),  # Kept on during inference
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        return self.net(x)

# Uncertainty estimation
# `model` is a trained MCDropoutModel; `x_test` is a float tensor of inputs
model.train()  # Keep dropout ON (this would also put any BatchNorm layers in training mode)
T = 50  # Number of forward passes

predictions = []
with torch.no_grad():
    for _ in range(T):
        pred = model(x_test)
        predictions.append(pred)

predictions = torch.stack(predictions)
mean_prediction = predictions.mean(dim=0)
uncertainty = predictions.std(dim=0)  # Epistemic uncertainty

2. Deep Ensembles: Train M independent models with different random initializations. The variance across ensemble predictions captures uncertainty.

# Train 5 independent models
# (create_model and train stand in for your own model factory and training loop)
models = []
for seed in range(5):
    torch.manual_seed(seed)
    model = create_model()
    train(model, X_train, y_train)
    models.append(model)

# Predict with uncertainty
with torch.no_grad():
    predictions = torch.stack([model(x_test) for model in models])
mean_pred = predictions.mean(dim=0)
uncertainty = predictions.std(dim=0)

Company Variation

Google: Published seminal work on deep ensembles and uncertainty. Uses uncertainty for active learning in data labeling and for abstaining from predictions in safety-critical systems.

Meta: Uses MC Dropout for uncertainty in recommendation systems - low-uncertainty predictions served directly, high-uncertainty ones get human review.

Autonomous vehicles (Waymo, Tesla): Uncertainty quantification is critical for deciding when to hand control back to the human driver.

Drug discovery: Bayesian approaches are standard because data is scarce and each experiment is expensive - uncertainty guides which experiments to run next.

Part 3 - Uncertainty Quantification

Aleatoric vs. Epistemic Uncertainty

[Figure: Aleatoric vs Epistemic Uncertainty]

| Property | Aleatoric | Epistemic |
|---|---|---|
| Source | Inherent randomness in data | Lack of knowledge (limited data) |
| Reducible? | No (irreducible) | Yes (with more data) |
| Example | Noisy sensor readings, ambiguous labels | Out-of-distribution inputs, data-sparse regions |
| How to capture | Predict variance as model output | Bayesian inference, ensembles, MC Dropout |
| Action | Accept it; inform decision-makers | Collect more data; flag for human review |

Heteroscedastic model (captures both):

import torch
import torch.nn as nn

class UncertaintyModel(nn.Module):
    """Predicts mean AND variance (aleatoric + epistemic via MC Dropout)"""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1)
        )
        self.mean_head = nn.Linear(hidden_dim, 1)  # Predicted mean
        self.logvar_head = nn.Linear(hidden_dim, 1)  # Predicted log-variance (aleatoric)

    def forward(self, x):
        h = self.shared(x)
        mean = self.mean_head(h)
        log_var = self.logvar_head(h)  # Aleatoric uncertainty
        return mean, log_var

# Loss: negative log-likelihood of Gaussian (up to an additive constant)
def nll_loss(mean, log_var, target):
    precision = torch.exp(-log_var)
    return 0.5 * (precision * (target - mean)**2 + log_var).mean()

# Uncertainty decomposition at test time:
# 1. Aleatoric: mean of predicted variances
# 2. Epistemic: variance of predicted means (via MC Dropout)
# `model` is a trained UncertaintyModel; `x_test` is a float tensor of inputs
model.train()  # Keep dropout ON
means, logvars = [], []
with torch.no_grad():
    for _ in range(50):
        m, lv = model(x_test)
        means.append(m)
        logvars.append(lv)

means = torch.stack(means)
logvars = torch.stack(logvars)

aleatoric = torch.exp(logvars).mean(dim=0)  # Mean of predicted variances
epistemic = means.var(dim=0)  # Variance of predicted means
total = aleatoric + epistemic

Instant Rejection

"Uncertainty doesn't matter - we just need the most likely prediction." In safety-critical applications (medical, autonomous driving, finance), knowing how confident a prediction is can be more important than the prediction itself. A model that says "I'm 99% sure" when it's wrong is far more dangerous than one that says "I'm 50% sure - please escalate to a human." Dismissing uncertainty estimation shows a lack of awareness about responsible AI deployment.

When Probabilistic Approaches Beat Point Estimates

| Scenario | Why Probabilistic Wins |
|---|---|
| Small data | Prior regularizes; uncertainty quantifies data scarcity |
| Safety-critical | Model can abstain when uncertain, deferring to humans |
| Active learning | Uncertainty guides which samples to label next |
| Anomaly detection | High epistemic uncertainty = out-of-distribution input |
| A/B testing | Bayesian A/B tests provide posterior probability of variant winning |
| Decision-making under uncertainty | Expected utility maximization requires probability distributions |
| Online learning | Prior from previous model; posterior from new data; continuous updating |
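The "decision-making under uncertainty" row can be made concrete with the braking scenario from the opening. This is a toy sketch: the function `should_brake` and both cost values are hypothetical, chosen only to show how asymmetric costs, not a 50% threshold, determine the action once a probability is available.

```python
def should_brake(p_pedestrian, cost_missed_pedestrian, cost_unneeded_brake):
    """Brake when the expected cost of not braking exceeds that of braking."""
    expected_cost_no_brake = p_pedestrian * cost_missed_pedestrian
    expected_cost_brake = (1.0 - p_pedestrian) * cost_unneeded_brake
    return expected_cost_no_brake > expected_cost_brake


# Hypothetical costs in arbitrary units: missing a pedestrian is assumed to be
# far worse than braking unnecessarily.
COST_MISS = 1000.0
COST_FALSE_BRAKE = 1.0

# Break-even probability implied by these costs
break_even = COST_FALSE_BRAKE / (COST_MISS + COST_FALSE_BRAKE)
print(f"brake whenever P(pedestrian) > {break_even:.5f}")

for p in (0.72, 0.45, 0.0005):
    print(f"P(pedestrian)={p}: brake={should_brake(p, COST_MISS, COST_FALSE_BRAKE)}")
```

Under these assumed costs the car brakes at both 72% and 45%, so the miscalibration in the opening scenario does not change the action here. It would matter for probabilities near the break-even point, which is why calibration is most important where decisions are close.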

Part 4 - Calibration and Reliability

What Is Calibration?

A model is calibrated if its predicted probabilities match observed frequencies:

  • When it says "80% chance of rain," it should rain 80% of the time
  • When it says "70% confidence this is spam," 70% of such emails should actually be spam

Why do models become miscalibrated?

  • Neural networks are notoriously overconfident (modern NNs are more expressive but worse calibrated than older architectures)
  • Training with cross-entropy loss doesn't guarantee calibration
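  • Resampling or reweighting classes changes the base rate the model sees, and rescaling logits with a poorly chosen temperature shifts confidence, so either can move predicted probabilities away from observed frequencies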

Reliability Diagrams

Bin predictions by confidence, then plot predicted probability vs. actual frequency:

from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

prob_true, prob_pred = calibration_curve(y_test, y_probs, n_bins=10, strategy='uniform')

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Reliability diagram
axes[0].plot(prob_pred, prob_true, 's-', label='Model')
axes[0].plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
axes[0].set_xlabel('Mean predicted probability')
axes[0].set_ylabel('Fraction of positives')
axes[0].set_title('Reliability Diagram')
axes[0].legend()

# Histogram of predictions
axes[1].hist(y_probs, bins=50, range=(0, 1), edgecolor='black')
axes[1].set_xlabel('Predicted probability')
axes[1].set_ylabel('Count')
axes[1].set_title('Prediction Distribution')

Reading the diagram:

  • Points above the diagonal: model is underconfident (says 60%, actual is 80%)
  • Points below the diagonal: model is overconfident (says 80%, actual is 60%)
  • Points on the diagonal: perfectly calibrated

Calibration Metrics

Expected Calibration Error (ECE):

$$ECE = \sum_{b=1}^{B} \frac{n_b}{n} \, |acc(b) - conf(b)|$$

Where $acc(b)$ is the accuracy in bin $b$, $conf(b)$ is the mean confidence in bin $b$, $n_b$ is the number of samples in bin $b$, and $n$ is the total number of samples.

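import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Binary ECE: y_prob is P(y=1); per-bin 'accuracy' is the fraction of positives."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    bin_boundaries = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lower, upper = bin_boundaries[i], bin_boundaries[i + 1]
        if i == n_bins - 1:
            # Close the last bin on the right so y_prob == 1.0 is counted
            mask = (y_prob >= lower) & (y_prob <= upper)
        else:
            mask = (y_prob >= lower) & (y_prob < upper)
        if mask.sum() == 0:
            continue
        bin_acc = y_true[mask].mean()
        bin_conf = y_prob[mask].mean()
        bin_weight = mask.sum() / len(y_true)
        ece += bin_weight * abs(bin_acc - bin_conf)
    return ece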

Calibration Methods

Temperature Scaling (most common for neural networks):

Learn a single scalar T on the validation set:

$$p_{calibrated} = \text{softmax}(z / T)$$

where $z$ is the logit output. T > 1 softens probabilities (reduces overconfidence). T < 1 sharpens them.
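To see the softening and sharpening effect directly, here is a small NumPy sketch with made-up logits. It applies three temperatures to the same logit vector and records the top-class probability; the predicted class never changes because dividing by a positive T preserves the ordering of the logits.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


logits = np.array([4.0, 1.0, 0.5])  # hypothetical logits for a 3-class problem

max_prob = {}
top_class = {}
for T in (0.5, 1.0, 2.0):
    p = softmax(logits / T)
    max_prob[T] = p.max()
    top_class[T] = int(p.argmax())
    print(f"T={T}: top-class probability={p.max():.3f}, predicted class={p.argmax()}")
```

Because the argmax is unchanged, temperature scaling adjusts confidence without affecting accuracy.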

import torch
import torch.nn as nn

class TemperatureScaling(nn.Module):
    def __init__(self):
        super().__init__()
        self.temperature = nn.Parameter(torch.ones(1))

    def forward(self, logits):
        return logits / self.temperature

# Optimize temperature on validation set
temp_model = TemperatureScaling()
optimizer = torch.optim.LBFGS([temp_model.temperature], lr=0.01, max_iter=50)

def closure():
    optimizer.zero_grad()
    scaled_logits = temp_model(val_logits)
    loss = nn.CrossEntropyLoss()(scaled_logits, val_labels)
    loss.backward()
    return loss

optimizer.step(closure)
print(f"Optimal temperature: {temp_model.temperature.item():.3f}")

Platt Scaling: Fit a logistic regression on the model's raw outputs.

Isotonic Regression: Non-parametric; fits a non-decreasing step function. More flexible but needs more calibration data.
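Both methods are available through scikit-learn's `CalibratedClassifierCV`, where `method="sigmoid"` is Platt scaling and `method="isotonic"` is isotonic regression. The sketch below uses a synthetic dataset and Gaussian naive Bayes as the base model purely for illustration; the dataset parameters are arbitrary, and the Brier score is used as one possible summary of probability quality.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data with redundant features, which tends to make NB overconfident
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, n_redundant=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

results = {}

# Baseline: uncalibrated probabilities
raw = GaussianNB().fit(X_train, y_train)
results["uncalibrated"] = brier_score_loss(y_test, raw.predict_proba(X_test)[:, 1])

# Calibrated variants, fit with 5-fold cross-validation on the training split
for method in ("sigmoid", "isotonic"):
    calibrated = CalibratedClassifierCV(GaussianNB(), method=method, cv=5)
    calibrated.fit(X_train, y_train)
    results[method] = brier_score_loss(
        y_test, calibrated.predict_proba(X_test)[:, 1]
    )

for name, score in results.items():
    print(f"{name:>12}: Brier score = {score:.4f} (lower is better)")
```

Which method does better depends on the data: the sigmoid fit has only two parameters, while the isotonic fit can follow any monotone distortion but overfits more easily on small calibration sets.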

Common Trap

"I'll use temperature scaling, so now my model is calibrated forever." Calibration drifts over time as data distributions change. In production, you need to monitor calibration (ECE, reliability diagrams) and recalibrate periodically. Models can become uncalibrated due to data drift, concept drift, or changes in the population.

Part 5 - Applications

Bayesian A/B Testing

Frequentist A/B test: "Is the difference statistically significant?" (p-value)

Bayesian A/B test: "What is the probability that variant B is better than A?" (posterior)

import numpy as np

rng = np.random.default_rng(42)  # Seeded for reproducible estimates

# Observed data
conversions_A, trials_A = 120, 1000
conversions_B, trials_B = 145, 1000

# Beta posterior (conjugate prior for Bernoulli)
# Prior: Beta(1, 1) = uniform
alpha_A = 1 + conversions_A
beta_A = 1 + (trials_A - conversions_A)
alpha_B = 1 + conversions_B
beta_B = 1 + (trials_B - conversions_B)

# Monte Carlo estimate of P(B > A)
samples_A = rng.beta(alpha_A, beta_A, size=100000)
samples_B = rng.beta(alpha_B, beta_B, size=100000)
diff = samples_B - samples_A  # Absolute difference in conversion rate
prob_B_better = (samples_B > samples_A).mean()

print(f"P(B > A) = {prob_B_better:.3f}")
print(f"Expected absolute lift: {diff.mean() * 100:.2f} percentage points")
print(f"95% credible interval for lift: "
      f"[{np.percentile(diff, 2.5) * 100:.2f}, "
      f"{np.percentile(diff, 97.5) * 100:.2f}] percentage points")

Advantages over frequentist:

  • Directly answers "how likely is B better?" (not "would we reject H0?")
  • Can incorporate prior knowledge (previous experiments)
  • Can stop early without p-value inflation concerns (with proper sequential methods)
  • Provides a full distribution over the effect size, not just a point estimate

Bayesian Anomaly Detection

Use a probabilistic model of "normal" behavior. Observations with low likelihood under the model are anomalies.

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Fit a Bayesian Gaussian Mixture Model on normal data
bgm = BayesianGaussianMixture(
    n_components=10,  # Upper bound; unneeded components get near-zero weight
    weight_concentration_prior_type='dirichlet_process',
    weight_concentration_prior=0.01,  # Smaller = fewer active components
    random_state=42
)
bgm.fit(X_normal)

# Anomaly score: per-sample log-likelihood (lower = more anomalous)
log_likelihood = bgm.score_samples(X_test)
anomaly_threshold = np.percentile(bgm.score_samples(X_normal), 1)  # 1st percentile

anomalies = X_test[log_likelihood < anomaly_threshold]

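Why Bayesian? With the Dirichlet Process prior you only set an upper bound on the number of Gaussian components (n_components). Components the data does not support end up with weights near zero, so the effective number of components is inferred from the data rather than fixed in advance.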

Active Learning with Uncertainty

Use model uncertainty to choose which samples to label next - label the most informative (uncertain) samples first.

import numpy as np

def active_learning_loop(model, X_pool, X_train, y_train, oracle, n_iterations=10, batch_size=10):
    # `oracle` is any object with a label(X) method (e.g., a human annotation queue)
    for iteration in range(n_iterations):
        # Train model on current labeled data
        model.fit(X_train, y_train)

        # Estimate uncertainty on unlabeled pool
        # For BNNs/MC Dropout: use prediction variance
        # For GPs: use predicted std
        _, uncertainties = model.predict(X_pool, return_std=True)

        # Select most uncertain samples
        query_idx = np.argsort(uncertainties)[-batch_size:]

        # Query labels (human annotation)
        new_labels = oracle.label(X_pool[query_idx])

        # Add to training set
        X_train = np.vstack([X_train, X_pool[query_idx]])
        y_train = np.concatenate([y_train, new_labels])

        # Remove from pool
        X_pool = np.delete(X_pool, query_idx, axis=0)

    # Refit so the returned model has seen the final batch of labels
    model.fit(X_train, y_train)
    return model

Acquisition strategies:

  • Uncertainty sampling: Select samples where the model is most uncertain (max entropy, max variance)
  • Query by committee: Train multiple models, select samples where they disagree most
  • Expected information gain: Select samples that would maximally reduce posterior uncertainty
  • Batch Active learning by Diverse Gradient Embeddings (BADGE): Balances uncertainty and diversity in batch selection
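For classifiers, the first strategy (uncertainty sampling with maximum entropy) can be sketched in a few lines. The helper names and the pool probabilities below are made up for illustration; in practice `pool_probs` would come from the model's predicted class probabilities on the unlabeled pool.

```python
import numpy as np


def predictive_entropy(probs, eps=1e-12):
    """Entropy of each row of class probabilities (higher = more uncertain)."""
    return -np.sum(probs * np.log(probs + eps), axis=1)


def select_most_uncertain(probs, batch_size):
    """Indices of the batch_size highest-entropy samples, most uncertain first."""
    entropy = predictive_entropy(probs)
    return np.argsort(entropy)[-batch_size:][::-1]


# Hypothetical predicted class probabilities for four unlabeled samples
pool_probs = np.array([
    [0.98, 0.01, 0.01],  # confident
    [0.34, 0.33, 0.33],  # nearly uniform: most uncertain
    [0.70, 0.20, 0.10],
    [0.50, 0.49, 0.01],  # torn between two classes
])

query_idx = select_most_uncertain(pool_probs, batch_size=2)
print("entropies:", np.round(predictive_entropy(pool_probs), 3))
print("samples to label next:", query_idx)
```

Entropy computed from a single model's softmax output mixes aleatoric and epistemic uncertainty, so in noisy regions it can keep selecting samples that more labels will not help with.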

Probabilistic Inference Pipeline

[Figure: Probabilistic Inference Pipeline]

Practice Problems

Problem 1: MLE vs. MAP with Small Data (Mid-Level)

Scenario: You have 10 coin flips: 9 heads, 1 tail. What's the MLE estimate of P(heads)? What's the MAP estimate with a Beta(2, 2) prior? Which is better and why?

Hint 1 - Direction

MLE is just the frequency. For MAP, the posterior of a Bernoulli likelihood with a Beta prior is also Beta (conjugate pair). The MAP estimate is the mode of that posterior: for Beta(alpha, beta) with alpha, beta > 1, the mode is (alpha - 1) / (alpha + beta - 2).

Hint 2 - Insight

The MLE says P(heads) = 0.9, but you probably don't believe a coin is that biased based on just 10 flips. The Beta(2,2) prior expresses mild belief that the coin is roughly fair, pulling the estimate toward 0.5.

Hint 3 - Full Solution

MLE: $\hat{p}_{MLE} = \frac{\text{heads}}{\text{total}} = \frac{9}{10} = 0.9$

MAP with Beta(2, 2) prior: Posterior is Beta(alpha + heads, beta + tails) = Beta(2 + 9, 2 + 1) = Beta(11, 3)

MAP of Beta(a, b) = (a - 1) / (a + b - 2): $\hat{p}_{MAP} = \frac{11 - 1}{11 + 3 - 2} = \frac{10}{12} \approx 0.833$

Why MAP is better here:

  • With only 10 flips, the MLE of 0.9 is very sensitive to sampling variation
  • If you flipped 10 more times, you'd likely see a different ratio
  • The Beta(2,2) prior acts like pseudo-observations that smooth the estimate: one extra head and one extra tail for the MAP estimate (10/12), or two of each for the posterior mean (11/14)
  • The MAP estimate (0.833) is pulled toward 0.5, reflecting that extreme probabilities are less likely a priori
  • With more data (e.g., 9000 heads out of 10000), MLE (0.9) and MAP (9001/10002 ≈ 0.8999) nearly agree - the prior becomes irrelevant

Full Bayesian approach: Rather than a point estimate, report the full posterior: Beta(11, 3). Mean = 11/14 = 0.786, 95% equal-tailed credible interval ≈ [0.55, 0.95].
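The numbers in this solution can be reproduced with a few lines of NumPy. This is a sketch rather than a reference implementation: the point estimates are exact, while the credible interval is estimated from posterior samples, so its endpoints carry a small Monte Carlo error.

```python
import numpy as np

heads, tails = 9, 1
prior_a, prior_b = 2, 2  # Beta(2, 2) prior

# Point estimates
mle = heads / (heads + tails)
post_a, post_b = prior_a + heads, prior_b + tails  # conjugate update: Beta(11, 3)
map_estimate = (post_a - 1) / (post_a + post_b - 2)  # mode of the posterior
posterior_mean = post_a / (post_a + post_b)

# Equal-tailed 95% credible interval, estimated from posterior samples
rng = np.random.default_rng(0)
samples = rng.beta(post_a, post_b, size=500_000)
ci_low, ci_high = np.percentile(samples, [2.5, 97.5])

print(f"MLE:            {mle:.3f}")
print(f"MAP:            {map_estimate:.3f}")
print(f"Posterior mean: {posterior_mean:.3f}")
print(f"95% credible interval: [{ci_low:.2f}, {ci_high:.2f}]")
```

The width of the interval is the main message: after only 10 flips the data are compatible with anything from a mildly biased coin to a heavily biased one.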

Scoring Rubric:

  • Strong Hire: Computes both correctly, explains conjugacy, discusses when prior matters (small data) vs. doesn't (large data), mentions full posterior as ideal
  • Lean Hire: Computes MLE correctly, gets MAP formula but struggles with derivation
  • No Hire: Can't compute MLE from data or doesn't understand what a prior does

Problem 2: Uncertainty in Production (Senior-Level)

Scenario: Your medical imaging model outputs 0.85 probability of a tumor being malignant. The model was trained on 50K images. A doctor asks: "How confident is the model in this prediction?" Just saying "85%" isn't sufficient. Design a system that provides meaningful uncertainty information.

Hint 1 - Direction

Distinguish between "the model thinks 85%" (which could be miscalibrated) and "we're confident the model's estimate is around 85%" (epistemic uncertainty). You need to address calibration, epistemic uncertainty, and how to communicate this to a non-technical user.

Hint 2 - Insight

The 0.85 might have high aleatoric uncertainty (the image itself is ambiguous) or high epistemic uncertainty (the model has never seen a tumor like this). The doctor needs to know WHICH type of uncertainty is present because the action differs: aleatoric means "get a biopsy regardless," epistemic means "get a second opinion from a specialist."

Hint 3 - Full Solution

System design:

1. Calibrate the model:

  • Apply temperature scaling on a held-out calibration set
  • Verify with a reliability diagram that predicted probabilities match actual frequencies
  • Report the calibrated probability (e.g., 0.85 might become 0.78 after calibration)

2. Quantify aleatoric uncertainty:

  • Train the model to output mean AND variance (heteroscedastic output)
  • High aleatoric uncertainty = the image is inherently ambiguous (e.g., borderline lesion)

3. Quantify epistemic uncertainty:

  • Use MC Dropout (T=50 forward passes) or a deep ensemble (M=5 models)
  • Compute prediction variance across passes/models
  • High epistemic uncertainty = this image is unlike the training data (rare pathology, unusual imaging conditions)

4. Out-of-distribution detection:

  • Compare the image's embedding to training data embeddings
  • If the image is far from all training examples, flag for manual review regardless of prediction

5. Communicate to the doctor:

Prediction: 78% probability of malignancy (calibrated)

Confidence assessment:
- Image clarity: HIGH (low aleatoric uncertainty)
- Model familiarity: MEDIUM (moderate epistemic uncertainty -
this tumor type is underrepresented in training data)

Recommendation: Consider specialist consultation due to
moderate model uncertainty on this pathology type.

Similar cases in training data: 47 (23 malignant, 24 benign)

6. Monitoring:

  • Track calibration drift monthly
  • Flag cases where the model was highly confident but wrong (for review and retraining)
  • Track epistemic uncertainty distribution - increasing average uncertainty suggests distribution shift

Scoring Rubric:

  • Strong Hire: Separates aleatoric from epistemic uncertainty, calibrates the model, designs OOD detection, proposes doctor-friendly communication, discusses monitoring
  • Lean Hire: Mentions calibration and uncertainty but doesn't separate types or design the communication
  • No Hire: Thinks 0.85 is the answer or suggests "just set a threshold"

Problem 3: Bayesian Optimization (Senior-Level)

Scenario: You need to tune 8 hyperparameters for a deep learning model. Each training run takes 6 hours on a GPU. You have budget for 50 runs. Grid search is impractical (3^8 = 6,561 combinations). Random search is better but wasteful. Propose a more efficient approach.

Hint 1 - Direction

Bayesian optimization uses a probabilistic surrogate model (often a GP) to model the objective function and intelligently choose which hyperparameters to try next.

Hint 2 - Insight

The key idea: use a GP to model "validation accuracy as a function of hyperparameters." The GP's predictive uncertainty tells you where the objective is uncertain. An acquisition function balances exploiting known good regions and exploring uncertain regions.

Hint 3 - Full Solution

Bayesian Optimization with Gaussian Processes:

  1. Surrogate model: A GP models the unknown function $f(\text{hyperparams}) = \text{validation metric}$

  2. Acquisition function: Decides which hyperparameters to try next

    • Expected Improvement (EI): $EI(x) = \mathbb{E}[\max(f(x) - f_{best}, 0)]$
    • Balances exploitation (high predicted mean) and exploration (high predicted variance)
  3. Loop:

    Initialize: Try 5-10 random hyperparameter combinations
    For i = 11 to 50:
        Fit GP to (hyperparams, metrics) observed so far
        Find next hyperparams that maximize acquisition function
        Train model with those hyperparams
        Record validation metric
    Return best hyperparams found

from skopt import gp_minimize
from skopt.space import Real, Integer, Categorical

# Define search space
space = [
    Real(1e-5, 1e-1, prior='log-uniform', name='learning_rate'),
    Integer(32, 256, name='batch_size'),
    Real(0.0, 0.5, name='dropout'),
    Integer(1, 5, name='num_layers'),
    Integer(64, 512, name='hidden_dim'),
    Real(1e-6, 1e-2, prior='log-uniform', name='weight_decay'),
    Categorical(['relu', 'gelu', 'silu'], name='activation'),
    Categorical(['adam', 'sgd', 'adamw'], name='optimizer'),
]

def objective(params):
    lr, batch_size, dropout, n_layers, hidden, wd, act, opt = params
    # Train model with these hyperparameters
    val_metric = train_and_evaluate(lr, batch_size, dropout, n_layers,
                                    hidden, wd, act, opt)
    return -val_metric  # Minimize negative metric

result = gp_minimize(objective, space, n_calls=50, n_initial_points=10,
                     random_state=42)
print(f"Best metric: {-result.fun:.4f}")
print(f"Best params: {result.x}")
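The Expected Improvement acquisition from step 2 has a closed form when the surrogate's prediction at a candidate is Gaussian with mean `mu` and standard deviation `sigma`: $EI = (\mu - f_{best})\,\Phi(z) + \sigma\,\phi(z)$ with $z = (\mu - f_{best})/\sigma$. The sketch below implements that formula for maximization using only the standard library; the three candidates and their predicted values are invented to illustrate the explore-exploit tradeoff.

```python
import math


def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for maximization under a Gaussian predictive N(mu, sigma^2)."""
    if sigma <= 0.0:
        # No predictive uncertainty: improvement is deterministic
        return max(mu - f_best, 0.0)
    z = (mu - f_best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    return (mu - f_best) * cdf + sigma * pdf


f_best = 0.90  # best validation metric observed so far (hypothetical)

# Hypothetical GP predictions (mean, std) for three candidate configurations
candidates = {
    "exploit": (0.91, 0.01),  # slightly better mean, very certain
    "explore": (0.85, 0.10),  # worse mean, but highly uncertain
    "poor": (0.70, 0.01),  # clearly worse and certain
}

ei = {
    name: expected_improvement(mu, sigma, f_best)
    for name, (mu, sigma) in candidates.items()
}
for name, value in ei.items():
    print(f"{name:>8}: EI = {value:.5f}")
```

With these made-up numbers the uncertain candidate scores higher than the marginally better but certain one, which is the exploration behavior described above. Note that this version is written for maximization, whereas `gp_minimize` works with the negated metric.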

Why this works better than random search:

  • After 10 random points, the GP has a rough model of the objective surface
  • It explores uncertain regions AND exploits promising regions
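  • Often reaches near-optimal hyperparameters in noticeably fewer evaluations than random search, though the size of the gain depends on the problem and on the dimensionality of the search space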

Advanced considerations:

  • Multi-fidelity: Use short training runs (1 epoch) to discard bad configurations quickly, full runs only for promising ones (Hyperband, BOHB)
  • Transfer learning: Use results from previous similar experiments as the GP prior
  • Parallelism: Batch acquisition - select multiple candidates simultaneously for parallel GPUs

Scoring Rubric:

  • Strong Hire: Explains GP surrogate, acquisition function (EI), the explore-exploit tradeoff, proposes multi-fidelity extension, implements correctly
  • Lean Hire: Knows Bayesian optimization uses a surrogate model but can't explain the acquisition function or the GP
  • No Hire: Suggests grid search or doesn't know about Bayesian optimization

Problem 4: Naive Bayes vs. Logistic Regression (Screening-Level)

Scenario: You have 1000 labeled emails (500 spam, 500 not spam) with TF-IDF features. Compare naive Bayes and logistic regression. When would you prefer each?

Hint 1 - Direction

Think about their assumptions, training speed, behavior with limited data, and the quality of probability estimates.

Hint 2 - Insight

Naive Bayes is a generative model (models P(x|y)); logistic regression is discriminative (models P(y|x)). They have different convergence rates and different assumptions. NB wins with very little data; LR wins with more data.

Hint 3 - Full Solution
| Aspect | Naive Bayes | Logistic Regression |
| --- | --- | --- |
| Model type | Generative | Discriminative |
| Assumption | Feature independence given class | Log-odds are linear in the features (linear decision boundary) |
| Training | Counting (extremely fast) | Iterative optimization (fast) |
| Small data (<100 samples) | Better (approaches its asymptotic error with fewer samples) | Worse (needs more samples; can overfit without regularization) |
| Moderate data (1000 samples) | Good | Usually better |
| Large data (100K+ samples) | Competitive | Better (lower asymptotic error) |
| Calibration | Poor (probabilities pushed to 0/1) | Better (but still imperfect) |
| Feature interactions | Ignores them | Accounts for correlated features through jointly fit weights; true interactions still need explicit terms |
| Regularization | Laplace smoothing (alpha) | L1/L2 penalty |
| Missing features | Handles naturally | Requires imputation |
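
The calibration row is easiest to see with an extreme case of correlated features. The sketch below is a toy calculation with made-up likelihoods, not a trained model: it duplicates one word feature and shows how naive Bayes multiplies the same evidence in repeatedly, pushing the posterior toward 1.

```python
def nb_posterior_spam(n_copies, p_word_given_spam=0.8, p_word_given_ham=0.2,
                      prior_spam=0.5):
    """P(spam | word present) when the same word is counted n_copies times."""
    # Naive Bayes treats each copy as independent evidence and multiplies them
    spam_score = prior_spam * p_word_given_spam ** n_copies
    ham_score = (1.0 - prior_spam) * p_word_given_ham ** n_copies
    return spam_score / (spam_score + ham_score)

one_copy = nb_posterior_spam(1)      # the honest posterior for a single feature
three_copies = nb_posterior_spam(3)  # same information, counted three times
```

One copy gives a posterior of 0.8. Three perfectly correlated copies carry no extra information, yet the posterior rises above 0.98. Real TF-IDF features are correlated rather than duplicated, so the effect is milder but points the same way. Logistic regression would instead split the weight across the copies.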

For 1000 labeled emails:

  • Naive Bayes is a strong baseline - fast, works well with TF-IDF, handles high-dimensional sparse data
  • Logistic regression will likely perform slightly better because 1000 samples is enough to fit a discriminative model
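  • Both will work well; the accuracy gap is usually small (often no more than a couple of percentage points), so confirm with cross-validation rather than assuming a winner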

Prefer Naive Bayes when:

  • Very limited training data (<100 labeled examples)
  • Need extremely fast training/inference (real-time system)
  • Features are reasonably independent
  • Need a baseline quickly

Prefer Logistic Regression when:

  • Enough training data (500+)
  • Need better-calibrated probabilities
  • Features are strongly correlated (LR fits the weights jointly instead of assuming independence)
  • Want regularization control (L1 for sparsity, L2 for smoothness)

Scoring Rubric:

  • Strong Hire: Explains generative vs. discriminative, discusses convergence rates (NB faster with small data, LR better asymptotically), mentions calibration difference
  • Lean Hire: Knows the basics but can't explain why NB wins with small data
  • No Hire: Dismisses naive Bayes as "too simple" without understanding its strengths

Problem 5: Designing an Active Learning System (Staff-Level)

Scenario: You're building an NLP model for classifying customer support tickets into 50 categories. You have 1 million unlabeled tickets and budget to label 10,000. Design the labeling strategy using active learning.

Hint 1 - Direction

Start with a small seed set, train an initial model, then iteratively select the most informative unlabeled examples based on model uncertainty. But with 50 categories, pure uncertainty sampling may not be enough - you also need diversity and coverage of rare categories.

Hint 2 - Insight

Consider: entropy-based uncertainty overselects from ambiguous categories. You need a strategy that ensures coverage of ALL 50 categories, especially rare ones. Also, the first few iterations are critical - the initial model is very poor, so early uncertainty estimates are unreliable. Use a warm-up phase with random sampling stratified by cluster (true labels are not available yet, so clusters stand in for categories).

Hint 3 - Full Solution

Phase 1: Warm-up (first 1,000 labels)

  • Cluster the 1M unlabeled tickets using sentence embeddings (SBERT)
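  • Sample from every cluster, roughly in proportion to cluster size but with a minimum quota per cluster, to get diverse initial labels
  • This makes coverage of rare categories far more likely than uniform random sampling, though it cannot guarantee it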
  • Train initial model on these 1,000 labeled tickets
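
One way to turn cluster assignments into a seed set is sketched below. The per-cluster floor (`min_per_cluster`) is an assumption of this sketch, added so that small clusters are never skipped. The function name and the toy cluster sizes are hypothetical.

```python
import random
from collections import defaultdict

def sample_seed_set(cluster_ids, budget, min_per_cluster=2, seed=0):
    """Pick `budget` indices: a floor per cluster first, then fill the rest at random."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    chosen = []
    for idxs in members.values():
        rng.shuffle(idxs)
        chosen.extend(idxs[:min_per_cluster])  # floor: every cluster is represented

    chosen_set = set(chosen)
    remaining = [i for i in range(len(cluster_ids)) if i not in chosen_set]
    extra = max(0, budget - len(chosen))
    # Uniform sampling of the remainder is proportional to cluster size in expectation
    chosen.extend(rng.sample(remaining, min(extra, len(remaining))))
    return chosen[:budget]

# Toy pool: one large cluster (95 tickets) and one small cluster (5 tickets)
cluster_ids = [0] * 95 + [1] * 5
picks = sample_seed_set(cluster_ids, budget=10)
small_cluster_picks = sum(1 for i in picks if cluster_ids[i] == 1)
```

If the floors alone exceed the budget, the final slice truncates them, so the floor should be chosen with the number of clusters in mind.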

Phase 2: Active Learning Loop (next 8,000 labels, in batches of 500)

import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

# Assumed to exist already: model, encoder, human_annotate,
# X_train, y_train, X_unlabeled

for batch in range(16):  # 16 batches of 500 = 8,000
    # 1. Get model predictions on unlabeled pool
    probs = model.predict_proba(X_unlabeled)

    # 2. Compute acquisition scores (hybrid strategy)
    # Entropy for uncertainty
    entropy = -np.sum(probs * np.log(probs + 1e-10), axis=1)

    # Diversity via embedding clustering
    # (in practice, encode once and cache; re-encoding the pool every batch is costly)
    embeddings = encoder.encode(X_unlabeled)
    cluster_ids = KMeans(n_clusters=100).fit_predict(embeddings)

    # 3. Select batch: balance uncertainty and diversity
    selected = []
    for cluster in range(100):
        cluster_members = np.where(cluster_ids == cluster)[0]
        cluster_entropy = entropy[cluster_members]
        # Shortlist the 10 most uncertain per cluster (up to 1,000 candidates),
        # so that the top-500 cut below actually filters something
        top_k = min(10, len(cluster_members))
        top_idx = np.argsort(cluster_entropy)[-top_k:]
        selected.extend(cluster_members[top_idx])

    # Take top 500 overall
    batch_idx = np.array(selected)
    batch_entropy = entropy[batch_idx]
    final_idx = batch_idx[np.argsort(batch_entropy)[-500:]]

    # 4. Label and retrain
    new_labels = human_annotate(X_unlabeled[final_idx])
    X_train = np.vstack([X_train, X_unlabeled[final_idx]])
    y_train = np.concatenate([y_train, new_labels])
    X_unlabeled = np.delete(X_unlabeled, final_idx, axis=0)

    model.fit(X_train, y_train)

    # 5. Monitor category coverage
    category_counts = Counter(y_train)
    print(f"Batch {batch}: {len(category_counts)}/50 categories covered")
    print(f"Rarest category: {min(category_counts.values())} samples")

Phase 3: Targeted Gap-Filling (last ~1,000 labels)

  • Identify underrepresented categories (<20 labeled examples)
  • Use model predictions to find unlabeled examples likely in those categories
  • Prioritize labeling these to ensure minimum coverage
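
The gap-filling step can be sketched as follows. The function name, the toy probabilities, and the `per_category` quota are hypothetical. The threshold of 20 comes from the text.

```python
from collections import Counter

def gap_fill_candidates(labeled_y, pool_probs, min_count=20, per_category=3):
    """For each category with fewer than min_count labels, pick the unlabeled
    items the model scores highest for that category."""
    counts = Counter(labeled_y)
    n_categories = len(pool_probs[0])
    picks = {}
    taken = set()  # avoid sending the same item to annotators twice
    for cat in range(n_categories):
        if counts.get(cat, 0) >= min_count:
            continue
        ranked = sorted(range(len(pool_probs)),
                        key=lambda i: pool_probs[i][cat], reverse=True)
        picks[cat] = [i for i in ranked if i not in taken][:per_category]
        taken.update(picks[cat])
    return picks

# Toy example: 3 categories, category 2 has only 4 labels so far
labeled_y = [0] * 30 + [1] * 25 + [2] * 4
pool_probs = [
    [0.70, 0.20, 0.10],
    [0.10, 0.10, 0.80],
    [0.30, 0.30, 0.40],
    [0.20, 0.70, 0.10],
    [0.05, 0.05, 0.90],
]
picks = gap_fill_candidates(labeled_y, pool_probs, per_category=2)
```

Because the model has seen few examples of the rare categories, its scores for them are noisy. Expect a fair share of the candidates to turn out to belong to other categories once labeled.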

Key design decisions:

  1. Hybrid acquisition: Pure uncertainty sampling leads to redundant selections; adding diversity prevents this
  2. Batch size of 500: Small enough to update the model frequently, large enough for efficient annotation
  3. Category monitoring: Track coverage to ensure no category is left behind
  4. Stopping criterion: Stop when validation performance plateaus between batches
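
The stopping criterion can be made concrete with a small plateau check. This is a sketch under assumed settings: the `patience` and `min_delta` defaults and the score history are illustrative, not recommendations.

```python
def has_plateaued(val_scores, patience=3, min_delta=0.002):
    """True if the best of the last `patience` batches improves on the
    earlier best by less than min_delta."""
    if len(val_scores) <= patience:
        return False  # not enough history to judge
    best_before = max(val_scores[:-patience])
    best_recent = max(val_scores[-patience:])
    return best_recent - best_before < min_delta

# Hypothetical validation accuracy after each labeling batch
history = [0.61, 0.70, 0.76, 0.79, 0.801, 0.802, 0.8015, 0.8025]
stop_now = has_plateaued(history)
stop_early = has_plateaued(history[:5])
```

Comparing the best of several recent batches against the earlier best, rather than two consecutive batches, keeps one noisy evaluation from ending the loop too soon.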

Expected outcome: In favorable cases, 10,000 strategically labeled examples can match the performance of 30,000+ randomly labeled examples - roughly a 3x efficiency gain. The actual gain is task-dependent, so measure it against a random-sampling baseline.

Scoring Rubric:

  • Strong Hire: Warm-up phase with clustering, hybrid uncertainty+diversity acquisition, category coverage monitoring, gap-filling phase, discusses annotation efficiency
  • Lean Hire: Gets the basic active learning loop right but misses diversity or category coverage
  • No Hire: Suggests random labeling or doesn't know about active learning

Interview Cheat Sheet

| Question | Key Points |
| --- | --- |
| What is Bayes' theorem? | Posterior = (Likelihood × Prior) / Evidence; updates beliefs with data |
| MLE vs. MAP? | MLE maximizes likelihood (no prior); MAP adds a prior (= regularization); MAP with a Gaussian prior = L2 regularization |
| Why go full Bayesian? | Get uncertainty estimates; marginalize over parameters; better with small data |
| Naive Bayes: why "naive"? | Assumes feature independence given class; works well despite violation |
| What are Gaussian Processes? | Non-parametric Bayesian; distribution over functions; O(n^3) training cost; natural uncertainty |
| GPs vs neural nets? | GPs for small data, low dims, need uncertainty. NNs for large data, high dims, need scalability |
| Aleatoric vs. epistemic? | Aleatoric: data noise (irreducible). Epistemic: model ignorance (reducible with more data) |
| How to get NN uncertainty? | MC Dropout, deep ensembles, heteroscedastic outputs; temperature scaling (fixes calibration only) |
| What is calibration? | Predicted probabilities match actual frequencies; check with reliability diagram |
| Temperature scaling? | Divide logits by T; T > 1 softens overconfident predictions; fit T on validation set |
| Bayesian A/B testing? | Beta posterior; P(B > A) directly; can stop early; credible intervals |
| Active learning? | Use uncertainty to select the most informative samples to label; up to ~3x labeling efficiency in favorable cases |
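
As a quick illustration of the Bayesian A/B testing row, the sketch below estimates P(B > A) by sampling from two Beta posteriors. It assumes uniform Beta(1, 1) priors and made-up conversion counts. The function name is hypothetical.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under independent Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for a conversion rate: Beta(1 + successes, 1 + failures)
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical experiment: 100/1000 conversions for A, 150/1000 for B
p_clear = prob_b_beats_a(100, 1000, 150, 1000)
# Identical data for both arms should give a probability near 0.5
p_tie = prob_b_beats_a(100, 1000, 100, 1000)
```

The output is a direct probability statement about the two variants, which is what the cheat sheet means by "P(B > A) directly". A decision rule (for example, ship B once the probability exceeds a pre-agreed threshold) still has to be fixed before looking at the data.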

Spaced Repetition Checkpoints

Day 0 - Initial Learning

  • Write Bayes' theorem and label each term
  • Explain MLE vs. MAP with a coin flip example
  • List two strengths of naive Bayes and two weaknesses
  • Define aleatoric vs. epistemic uncertainty with examples

Day 3 - Recall

  • Explain why L2 regularization is MAP with a Gaussian prior
  • Describe how a Gaussian Process makes predictions with uncertainty
  • Explain MC Dropout for uncertainty estimation (step by step)
  • Define calibration and Expected Calibration Error

Day 7 - Application

  • Implement Bayesian A/B testing with Beta posteriors
  • Design an uncertainty-aware prediction system for a safety-critical application
  • Explain when GPs beat neural networks (and vice versa)
  • Solve Practice Problem 1 without hints

Day 14 - Integration

  • Design a complete active learning system with uncertainty-based acquisition
  • Explain the connection between Bayesian inference and regularization
  • Compare 4 methods for uncertainty in neural networks (MC Dropout, ensembles, BNN, temperature scaling)
  • Solve Practice Problem 2 with full system design

Day 21 - Mastery

  • Teach Bayesian ML from Bayes' theorem to Gaussian Processes to BNNs
  • Design a probabilistic production system with calibration, uncertainty, and monitoring
  • Explain Bayesian optimization with GP surrogates and acquisition functions
  • Confidently handle: "When is probabilistic ML overkill?"
© 2026 EngineersOfAI. All rights reserved.