
Loss Functions - Translating Goals into Gradients

Reading time: ~40 min | Interview relevance: Critical | Roles: MLE, AI Eng, Data Scientist, Research Engineer

The Real Interview Moment

You are in a Meta ML Engineer interview. The interviewer describes a real problem: "We are building a content moderation system. 0.5% of posts violate our policies. We are using binary cross-entropy loss. The model achieves 99.5% accuracy but catches almost no violations. What is wrong, and how would you fix the loss function?"

The weak candidate says "use class weights." The decent candidate explains focal loss. The strong hire designs a custom loss function from first principles - articulating what properties the loss needs (high penalty for false negatives, robustness to label noise in the majority class, stable gradients), then constructing it mathematically and explaining how to validate that it works.

Loss functions are where ML theory meets product requirements. Every model you train is only as good as the loss function that guides it. This page gives you the depth to not just pick a loss function from a menu, but to understand why each works, when each fails, and how to design new ones.

What You Will Master

  • Derive MSE, MAE, Huber, cross-entropy, and hinge loss from first principles
  • Analyze the gradient behavior of each loss function and explain why it matters
  • Choose the right loss for any problem type using a systematic decision framework
  • Explain focal loss, contrastive loss, and triplet loss with mathematical precision
  • Design custom loss functions that encode business requirements
  • Debug training failures caused by loss function issues (vanishing gradients, explosion, misalignment)
  • Compare loss functions along axes that interviewers care about: robustness, convergence, interpretability
  • Answer loss function interview questions at any major company

Self-Assessment: Where Are You Now?

| Skill | 1 - Cannot | 2 - Vaguely | 3 - Can Explain | 4 - Can Derive | 5 - Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Define MSE and compute its gradient | | | | | | ___ |
| Explain when MAE beats MSE | | | | | | ___ |
| Derive binary cross-entropy | | | | | | ___ |
| Explain hinge loss geometrically | | | | | | ___ |
| Describe focal loss and its motivation | | | | | | ___ |
| Explain contrastive/triplet loss | | | | | | ___ |
| Design a custom loss for a given problem | | | | | | ___ |
| Debug training issues from loss behavior | | | | | | ___ |

Part 1 - Regression Losses

Mean Squared Error (MSE)

$$L_{\text{MSE}} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

Gradient: dL/dy_hat = -2(y - y_hat)/n

Properties:

  • Penalizes large errors quadratically - one outlier with error 10 contributes 100 to the loss, while ten errors of 1 contribute only 10
  • Gradient is proportional to error magnitude - large errors get corrected faster
  • Corresponds to Maximum Likelihood Estimation under Gaussian noise assumption: if y = f(x) + epsilon where epsilon ~ N(0, sigma^2), then minimizing MSE = maximizing the likelihood
  • Differentiable everywhere - clean optimization

When to use: Default for regression when you want to penalize large errors heavily and your noise is approximately Gaussian.

When NOT to use: When your data has outliers. A single outlier can dominate the entire loss.

Interviewer's Perspective

When I ask "why MSE?", I want to hear the Gaussian MLE connection. This shows the candidate understands that loss function choice is an assumption about the noise distribution. If they just say "it is the default for regression," that is a lean-hire answer.

Mean Absolute Error (MAE)

$$L_{\text{MAE}} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$

Gradient: dL/dy_hat = -sign(y - y_hat)/n

Properties:

  • Penalizes all errors linearly - outliers have proportional, not quadratic, influence
  • More robust to outliers than MSE
  • Gradient is constant in magnitude (always +1/n or -1/n) - does not adapt to error size
  • Not differentiable at y = y_hat - requires subgradient methods
  • Corresponds to MLE under Laplacian noise
  • Predicts the median rather than the mean

When to use: When your data has outliers or heavy-tailed noise. When you want the median prediction rather than the mean.

When NOT to use: When you need smooth gradients for stable optimization, or when large errors should be penalized more than proportionally.

Common Trap

Candidates often say "MAE is always better than MSE because it handles outliers." This is wrong. MAE's constant gradient magnitude means it does not correct large errors faster than small ones. In clean data, MSE converges faster because large errors produce large gradients. The choice depends on your data's noise distribution.

Huber Loss (Smooth Combination)

$$L_{\text{Huber}} = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \leq \delta \\ \delta|y - \hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$$

Gradient:

  • |error| <= delta: gradient = -(y - y_hat) (like MSE)
  • |error| > delta: gradient = -delta * sign(y - y_hat) (like MAE, capped)

Properties:

  • Quadratic for small errors (smooth, fast convergence near optimum)
  • Linear for large errors (robust to outliers)
  • Differentiable everywhere (unlike MAE)
  • delta is a hyperparameter: as delta → infinity Huber becomes MSE (up to the 1/2 factor), and as delta → 0 it behaves like MAE scaled by delta
  • Best of both worlds - but adds a hyperparameter to tune

When to use: Regression with some outliers but you still want smooth optimization. The default choice for production regression systems at many companies.
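The gradient behavior of the three regression losses can be compared directly. The following is a minimal sketch in plain Python; the function names are hypothetical, the 1/n factor is dropped, and delta = 1.0 is an assumed value chosen only for illustration.

```python
def mse_grad(y, y_hat):
    """Gradient of (y - y_hat)^2 with respect to y_hat."""
    return -2.0 * (y - y_hat)


def mae_grad(y, y_hat):
    """Subgradient of |y - y_hat| with respect to y_hat (0 at the kink)."""
    error = y - y_hat
    return -float((error > 0) - (error < 0))


def huber_grad(y, y_hat, delta=1.0):
    """Gradient of Huber loss: MSE-like inside delta, capped outside."""
    error = y - y_hat
    if abs(error) <= delta:
        return -error
    return -delta * float((error > 0) - (error < 0))


# Compare gradients for a small error and an outlier-sized error.
for error in (0.5, 10.0):
    y, y_hat = error, 0.0
    print(error, mse_grad(y, y_hat), mae_grad(y, y_hat), huber_grad(y, y_hat))
```

For the error of 10, the MSE gradient is 20 times larger in magnitude than the MAE and Huber gradients, which is the outlier sensitivity described above.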

Comparison Table: Regression Losses

| Property | MSE | MAE | Huber |
|---|---|---|---|
| Outlier robustness | Low | High | Medium-High |
| Gradient magnitude | Proportional to error | Constant | Proportional then capped |
| Differentiability | Everywhere | Not at 0 | Everywhere |
| Convergence speed | Fast (near optimum) | Slow (constant gradient) | Fast |
| Statistical estimand | Mean | Median | Mean (small errors), Median-like (large errors) |
| MLE assumption | Gaussian noise | Laplacian noise | Mixture |
| Hyperparameters | None | None | delta |

[Figure: Regression Loss Selection]

Part 2 - Classification Losses

Binary Cross-Entropy (Log Loss)

$$L_{\text{BCE}} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)\right]$$

where y_i in {0, 1} and p_hat_i is the predicted probability.

Derivation from MLE:

If y ~ Bernoulli(p), the likelihood of observing a single data point is:

P(y|p) = p^y * (1-p)^(1-y)

The log-likelihood is:

log P(y|p) = y * log(p) + (1-y) * log(1-p)

Maximizing the log-likelihood = minimizing the negative log-likelihood = minimizing binary cross-entropy.

Gradient (with respect to predicted probability p_hat):

dL/dp_hat = -(y/p_hat) + (1-y)/(1-p_hat)

= (p_hat - y) / (p_hat * (1 - p_hat))

Key property: The gradient is large when the prediction is confident and wrong (p_hat near 0 when y=1, or p_hat near 1 when y=0). This is exactly what we want - strong correction for confident mistakes.

With logit parameterization (z = log(p/(1-p))):

dL/dz = p_hat - y

This is remarkably clean: the gradient is simply the difference between prediction and truth. This is why logistic regression with cross-entropy loss converges so well.
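One way to check the closed form dL/dz = p_hat - y is to compare it against a finite-difference estimate. This is a small sketch with hypothetical helper names, not a reference implementation; it evaluates a single example at a time.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_logit(y, z):
    """Binary cross-entropy for one example, parameterized by the logit z."""
    p_hat = sigmoid(z)
    return -(y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat))


def numerical_grad(y, z, h=1e-6):
    """Central finite difference of the loss with respect to z."""
    return (bce_from_logit(y, z + h) - bce_from_logit(y, z - h)) / (2 * h)


for y, z in [(1, -2.0), (0, 0.5), (1, 3.0)]:
    analytic = sigmoid(z) - y  # the closed form dL/dz = p_hat - y
    print(y, z, round(analytic, 6), round(numerical_grad(y, z), 6))
```

The two columns should agree to several decimal places, and the gradient magnitude never exceeds 1 because p_hat and y both lie in [0, 1].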

60-Second Answer

"Cross-entropy is the negative log-likelihood under a Bernoulli model. It penalizes confident wrong predictions steeply, because -log(p) grows without bound as p approaches 0: predicting 0.01 when the true label is 1 costs about 4.6, versus about 0.9 for predicting 0.4. This is the correct loss for classification because it directly optimizes the predicted probability distribution. The gradient with respect to logits is simply p_hat minus y, which makes optimization clean and stable."

Categorical Cross-Entropy (Multi-class)

$$L_{\text{CCE}} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{c=1}^{C}y_{i,c}\log(\hat{p}_{i,c})$$

where y is one-hot encoded and p_hat is the softmax output.

Softmax: p_hat_c = exp(z_c) / sum_j(exp(z_j))

Gradient (with respect to logit z_c): dL/dz_c = p_hat_c - y_c (same clean form as binary)

Properties:

  • Natural extension of binary cross-entropy to C classes
  • Combined with softmax, it is a proper scoring rule, so it rewards calibrated probabilities (deep networks trained with it can still end up overconfident)
  • Gradient is clean and well-behaved
  • Corresponds to MLE under a Categorical distribution
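The softmax and its logit gradient can be sketched for a single example as follows. The helper names are hypothetical, and subtracting the maximum logit is a common stability trick assumed here rather than something the text above prescribes.

```python
import math


def softmax(logits):
    """Softmax with the max logit subtracted for numerical stability."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, true_class):
    """Categorical cross-entropy for one example with an integer label."""
    return -math.log(softmax(logits)[true_class])


def grad_wrt_logits(logits, true_class):
    """dL/dz_c = p_hat_c - y_c for a one-hot target."""
    probs = softmax(logits)
    return [p - (1.0 if c == true_class else 0.0) for c, p in enumerate(probs)]


logits = [2.0, 1.0, 0.1]
print(cross_entropy(logits, 0))
print(grad_wrt_logits(logits, 0))  # entries sum to ~0
```

The gradient entries sum to zero because the softmax probabilities and the one-hot target each sum to 1.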

Hinge Loss (SVM Loss)

$$L_{\text{hinge}} = \frac{1}{n}\sum_{i=1}^{n}\max(0, 1 - y_i \cdot \hat{y}_i)$$

where y_i in {-1, +1} and y_hat_i is the raw (pre-sigmoid) model output.

Gradient:

  • If y * y_hat >= 1: gradient = 0 (correctly classified with margin, no update)
  • If y * y_hat < 1: gradient = -y (push toward correct side)

Properties:

  • Creates a margin of 1 around the decision boundary
  • Correctly classified points beyond the margin contribute zero loss - the model stops updating on "easy" examples
  • Not differentiable at y * y_hat = 1 (use subgradient)
  • Does not produce probability estimates (outputs are not calibrated)
  • Sparse gradients - most training examples may contribute zero gradient

Geometric interpretation: Hinge loss maximizes the margin between classes. Points within the margin (support vectors) determine the boundary; points outside are ignored.

When to use: When you care about the decision boundary, not probability calibration. When you want sparse solutions. In practice, largely replaced by cross-entropy for neural networks.
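The margin behavior is easy to see on a few hand-picked points. This is an illustrative sketch with hypothetical function names; it picks the zero subgradient at the kink y * y_hat = 1.

```python
def hinge_loss(y, score):
    """Hinge loss for a label y in {-1, +1} and a raw model score."""
    return max(0.0, 1.0 - y * score)


def hinge_subgrad(y, score):
    """Subgradient with respect to the score; zero once the margin is met."""
    return 0.0 if y * score >= 1.0 else -float(y)


examples = [(+1, 2.5), (+1, 0.3), (-1, 0.3), (-1, -1.0)]
for y, score in examples:
    print(y, score, hinge_loss(y, score), hinge_subgrad(y, score))
```

The second example is on the correct side of the boundary but inside the margin, so it still receives a non-zero loss and update. The first and last examples receive none.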

Company Variation

Google and Meta almost never ask about hinge loss for practical use - they use cross-entropy everywhere. But they may ask about it to test your understanding of margins and SVMs. Research labs sometimes ask about hinge loss in the context of contrastive learning and RLHF, where margin-based losses have made a comeback.

Focal Loss (Handling Class Imbalance)

$$L_{\text{focal}} = -\frac{1}{n}\sum_{i=1}^{n}\alpha_i(1-\hat{p}_{t,i})^{\gamma}\log(\hat{p}_{t,i})$$

where p_t = p_hat if y=1, else 1-p_hat (the model's estimated probability for the true class), alpha is the class weight, and gamma is the focusing parameter.

How it works:

  • When p_t is high (easy, correctly classified example): (1-p_t)^gamma is small → loss contribution is downweighted
  • When p_t is low (hard, misclassified example): (1-p_t)^gamma is close to 1 → loss contribution is preserved
  • gamma = 0 recovers standard cross-entropy
  • gamma = 2 is a common default

Why it was invented: For object detection (RetinaNet paper), where the background class dominates by 1000:1. Cross-entropy with class weights still produces a large total loss from easy negatives. Focal loss downweights easy examples regardless of class, focusing training on hard examples.
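The effect of the modulating factor can be checked numerically. The sketch below uses a hypothetical single-example function with alpha fixed at 1 so that only the focusing term changes; the probabilities 0.95 and 0.10 are illustrative values.

```python
import math


def focal_loss(y, p_hat, gamma=2.0, alpha=1.0):
    """Focal loss for one example; alpha=1 and gamma=0 give cross-entropy."""
    p_t = p_hat if y == 1 else 1.0 - p_hat
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(1, 0.95)   # confident and correct
hard = focal_loss(1, 0.10)   # confident and wrong
ce_easy = focal_loss(1, 0.95, gamma=0.0)
ce_hard = focal_loss(1, 0.10, gamma=0.0)

# Ratio of focal loss to cross-entropy is the modulating factor (1 - p_t)^gamma.
print(easy / ce_easy)  # about 0.0025
print(hard / ce_hard)  # about 0.81
```

With gamma = 2, the easy example keeps about 0.25% of its cross-entropy loss while the hard example keeps about 81%, which is why a flood of easy negatives stops dominating the total.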

[Figure: Focal Loss Selection Flowchart]

Interviewer's Perspective

When I ask about focal loss, I want to hear three things: (1) it downweights easy examples, not just the majority class, (2) the gamma parameter controls how aggressively it focuses, and (3) it was designed for dense object detection where the background overwhelms the signal. Candidates who only say "it handles class imbalance" get partial credit - class weights also do that. The insight about easy vs hard examples is key.

Part 3 - Metric Learning Losses

Contrastive Loss (Siamese Networks)

$$L_{\text{contrastive}} = \frac{1}{2n}\sum_{i=1}^{n}\left[y_i \cdot d_i^2 + (1-y_i) \cdot \max(0, m - d_i)^2\right]$$

where d_i = ||f(x_i^a) - f(x_i^b)||_2 is the distance between embeddings, y_i = 1 if same class, y_i = 0 if different class, and m is the margin.

How it works:

  • Same class (y=1): Loss = d^2. Pushes embeddings closer together. Zero loss when identical.
  • Different class (y=0): Loss = max(0, m-d)^2. Pushes embeddings apart until they are at least m apart. Zero loss when already beyond margin.

Properties:

  • Operates on pairs of examples
  • Requires careful pair mining (hard negatives matter most)
  • The margin m defines the minimum separation between different classes
  • Embedding space is shaped by relative distances, not absolute positions
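A per-pair version of the loss can be sketched in a few lines. The function names and the 2-D toy embeddings are hypothetical, and margin = 1.0 is an assumed value.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """Per-pair contrastive loss (the 1/2 factor from the formula is kept)."""
    d = euclidean(emb_a, emb_b)
    if same_class:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


print(contrastive_loss([0.0, 0.0], [0.3, 0.4], same_class=True))   # d = 0.5
print(contrastive_loss([0.0, 0.0], [0.3, 0.4], same_class=False))  # inside margin
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], same_class=False))  # beyond margin
```

The last pair is already farther apart than the margin, so it contributes zero loss and zero gradient. This is why easy negatives teach the model nothing.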

Triplet Loss

$$L_{\text{triplet}} = \frac{1}{n}\sum_{i=1}^{n}\max\left(0, d(a_i, p_i) - d(a_i, n_i) + m\right)$$

where a is the anchor, p is a positive (same class), n is a negative (different class), d is a distance function, and m is the margin.

How it works:

  • For each anchor, the positive should be closer than the negative by at least margin m
  • Loss = 0 when d(anchor, positive) + m < d(anchor, negative)
  • Only non-zero when the condition is violated (hard or semi-hard triplets)

Triplet mining strategies:

  • Easy triplets: d(a,p) + m < d(a,n) - already satisfied, contribute zero loss, useless for learning
  • Hard negatives: d(a,n) < d(a,p) - negative is closer than positive, maximum gradient signal
  • Semi-hard negatives: d(a,p) < d(a,n) < d(a,p) + m - within the margin, most stable training
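The three mining categories can be expressed as a small classifier over precomputed distances. This is a sketch with hypothetical names and an assumed margin of 0.2; ties where d(a,n) equals d(a,p) are treated as semi-hard here, which is a choice of this example.

```python
def triplet_loss(d_ap, d_an, margin=0.2):
    """Triplet loss from precomputed anchor-positive / anchor-negative distances."""
    return max(0.0, d_ap - d_an + margin)


def triplet_kind(d_ap, d_an, margin=0.2):
    """Classify a triplet as easy, semi-hard, or hard."""
    if d_an < d_ap:
        return "hard"
    if d_an < d_ap + margin:
        return "semi-hard"
    return "easy"


for d_ap, d_an in [(0.5, 1.0), (0.5, 0.6), (0.5, 0.3)]:
    print(triplet_kind(d_ap, d_an), triplet_loss(d_ap, d_an))
```

Only the semi-hard and hard triplets produce a non-zero loss, so a mining step would typically filter a batch with something like `triplet_kind` before averaging.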

[Figure: Metric Learning Loss Selection]

InfoNCE / NT-Xent (Modern Contrastive Learning)

$$L_{\text{InfoNCE}} = -\log\frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N}\mathbb{1}_{k \neq i}\exp(\text{sim}(z_i, z_k)/\tau)}$$

where sim is cosine similarity, tau is the temperature, z_i and z_j are augmented views of the same image, and the denominator sums over all other examples in the batch.

Why it matters: This is the loss behind SimCLR, CLIP, and most modern self-supervised learning. It treats every other example in the batch as a negative, avoiding explicit negative mining.

Temperature tau:

  • Low tau: sharper distribution, focuses on hardest negatives, but can be unstable
  • High tau: smoother distribution, more uniform gradient contribution, but less discriminative
  • Typical values: 0.05 to 0.5

Common Trap

Do not confuse contrastive loss (pairs) with InfoNCE (batch). Contrastive loss uses explicit positive/negative pairs. InfoNCE uses one positive pair and treats all other batch elements as negatives. InfoNCE scales much better because every other element in the batch serves as a negative (2(N-1) negatives per anchor when the batch holds 2N augmented views), with no explicit mining.
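The InfoNCE computation for a single anchor can be sketched as below. The function names and the toy 2-D batch are hypothetical; the log-sum-exp shift is a standard stability trick assumed here, not something taken from SimCLR or CLIP code.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(embeddings, i, j, tau=0.5):
    """InfoNCE loss for anchor i whose positive is j; all other rows are negatives."""
    logits = [cosine(embeddings[i], z) / tau
              for k, z in enumerate(embeddings) if k != i]
    positive = cosine(embeddings[i], embeddings[j]) / tau
    # log-sum-exp with the max subtracted for numerical stability
    shift = max(logits)
    log_denominator = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return -(positive - log_denominator)


# Toy batch: rows 0 and 1 are two views of one image, rows 2 and 3 of another.
batch = [[1.0, 0.1], [0.9, 0.2], [-0.2, 1.0], [0.1, 0.9]]
print(info_nce(batch, 0, 1, tau=0.5))
print(info_nce(batch, 0, 1, tau=0.05))  # lower tau sharpens the distribution
```

In this toy batch the positive is already the most similar row, so lowering tau drives the loss toward zero. When a hard negative is more similar than the positive, the same sharpening makes the loss and its gradient much larger.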

Part 4 - Loss Function Properties and Analysis

What Makes a Good Loss Function?

| Property | Why It Matters | Example |
|---|---|---|
| Convexity | Guarantees global optimum | MSE with linear model is convex |
| Smoothness | Stable gradients | Cross-entropy is smooth; hinge loss is not |
| Bounded gradients | Prevents explosion | Huber caps gradients; MSE does not |
| Calibration | Predicted probabilities are meaningful | Cross-entropy is calibrated; hinge is not |
| Robustness | Handles outliers | MAE is robust; MSE is not |
| Alignment | Matches business objective | Custom loss for ranking vs classification |

Gradient Behavior Comparison

Understanding gradient behavior is critical for diagnosing training issues:

| Loss | Gradient Magnitude Near Optimum | Gradient Magnitude for Large Errors | Training Behavior |
|---|---|---|---|
| MSE | Small (proportional to error) | Very large (can explode) | Fast near optimum, unstable for outliers |
| MAE | Constant | Constant | Stable but slow convergence |
| Huber | Small (like MSE) | Capped (like MAE) | Fast + stable |
| Cross-Entropy (logit) | Small (p_hat near y) | Moderate | Well-behaved everywhere |
| Hinge | Zero (beyond margin) | Constant (within margin) | Sparse updates, stops at margin |
| Focal | Very small (easy examples) | Moderate (hard examples) | Focuses on hard cases |

Instant Rejection

Never say "the loss function does not matter much - any reasonable loss works." Loss function choice is one of the most important decisions in ML. A model trained with MSE on imbalanced classification data will fail. A model trained with cross-entropy on a regression problem does not even make sense. Interviewers view loss function carelessness as a sign that you do not understand ML at a fundamental level.

The Connection Between Loss and Probability Distribution

Most likelihood-based loss functions implicitly assume a noise distribution:

| Loss Function | Noise Distribution | Estimand |
|---|---|---|
| MSE | Gaussian: N(0, sigma^2) | Mean |
| MAE | Laplacian: Laplace(0, b) | Median |
| Huber | Gaussian core + Laplacian tails | Robust mean |
| Cross-Entropy | Bernoulli / Categorical | Mode (via probabilities) |
| Quantile Loss | Asymmetric Laplacian | Specified quantile |

This connection is powerful: if you know your noise distribution, you know the optimal loss function, and vice versa.
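For the first two rows of the table, the correspondence can be written out as a short derivation. This is a sketch under the usual assumptions of i.i.d. additive noise with a fixed scale (sigma or b), where f denotes the model's prediction function.

```latex
% Gaussian noise: y_i = f(x_i) + \epsilon_i, \epsilon_i \sim \mathcal{N}(0, \sigma^2)
-\log p(y_i \mid x_i) = \frac{(y_i - f(x_i))^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)

% Summing over n i.i.d. examples; the second term does not depend on f
\text{NLL}(f) = \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - f(x_i))^2 + \frac{n}{2}\log(2\pi\sigma^2)
\;\Rightarrow\; \arg\min_f \text{NLL}(f) = \arg\min_f L_{\text{MSE}}

% Laplacian noise: \epsilon_i \sim \text{Laplace}(0, b)
-\log p(y_i \mid x_i) = \frac{|y_i - f(x_i)|}{b} + \log(2b)
\;\Rightarrow\; \arg\min_f \text{NLL}(f) = \arg\min_f L_{\text{MAE}}
```

In both cases the noise scale only rescales and shifts the objective, so it does not change which f minimizes it.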

Part 5 - The Complete Decision Tree

[Figure: Complete Loss Function Decision Tree]

Part 6 - Designing Custom Loss Functions

The Interview Question

"Design a loss function for [specific business problem]."

This is a senior-level question that tests whether you can think from first principles rather than picking from a menu.

Framework for Custom Loss Design

Step 1: Define what correct behavior looks like

  • What should the model predict?
  • What errors are costly? What errors are acceptable?
  • Are there asymmetric costs (false positive vs false negative)?

Step 2: Encode the requirements mathematically

  • Start with a base loss (usually cross-entropy or MSE)
  • Add terms for specific requirements
  • Ensure differentiability (or use subgradient-friendly formulations)

Step 3: Analyze the gradient

  • Does the gradient push the model in the right direction?
  • Are there vanishing or exploding gradient issues?
  • Does the loss produce the right behavior at the boundaries?

Step 4: Validate empirically

  • Does the model trained with this loss actually perform better on the business metric?
  • Are there unexpected failure modes?

Example: Asymmetric Classification

Problem: Medical diagnosis where false negatives (missing a disease) cost 10x more than false positives (unnecessary follow-up).

Custom loss:

$$L = -\frac{1}{n}\sum_{i=1}^{n}\left[w_+ \cdot y_i \log(\hat{p}_i) + w_- \cdot (1-y_i)\log(1-\hat{p}_i)\right]$$

with w+ = 10, w- = 1. This is weighted cross-entropy.

But can we do better? Yes - if we also want the model to be confident when it predicts positive:

$$L = -\frac{1}{n}\sum_{i=1}^{n}\left[w_+ \cdot y_i \log(\hat{p}_i) + w_- \cdot (1-y_i)\log(1-\hat{p}_i)\right] + \lambda \cdot \text{entropy}(\hat{p})$$

The entropy term penalizes uncertain predictions, pushing the model toward confident decisions. Lambda controls the strength.
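A direct translation of this loss might look like the sketch below. The names are hypothetical, lam = 0.1 is an arbitrary illustrative value, and averaging the entropy term per example is an assumption, since the formula above does not pin down how entropy(p_hat) is aggregated.

```python
import math


def binary_entropy(p):
    """Entropy of a Bernoulli(p) prediction; highest at p = 0.5."""
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


def asymmetric_loss(labels, probs, w_pos=10.0, w_neg=1.0, lam=0.1):
    """Weighted cross-entropy plus an entropy penalty, averaged over examples."""
    total = 0.0
    for y, p in zip(labels, probs):
        weighted_ce = -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1 - p))
        total += weighted_ce + lam * binary_entropy(p)
    return total / len(labels)


# A missed positive (y=1, p=0.2) costs far more than a false alarm (y=0, p=0.8).
print(asymmetric_loss([1], [0.2]))
print(asymmetric_loss([0], [0.8]))
```

With lam set to 0, the ratio between the two printed losses is the 10:1 cost ratio from the problem statement. The entropy term adds the same small penalty to both because the two predictions are equally uncertain.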

Example: Ordinal Regression

Problem: Predicting product ratings (1-5 stars). MSE penalizes a 1→5 error (squared error 16) sixteen times as much as a 1→2 error (squared error 1). But ratings are ordinal - the loss should increase with the distance between predicted and true rating, but not quadratically.

Custom loss:

$$L = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|^{1.5}$$

This is between MAE (exponent 1) and MSE (exponent 2). Or use a more structured approach with cumulative link models.

60-Second Answer

"To design a custom loss, I follow four steps: (1) Define what correct behavior looks like, (2) Encode those requirements mathematically starting from a base loss, (3) Analyze the gradient to ensure it pushes the model correctly, and (4) Validate empirically that the custom loss improves the business metric. The most common customizations are asymmetric weighting for different error types, auxiliary terms for regularization or calibration, and temperature scaling for controlling prediction confidence."

Part 7 - Common Loss Function Bugs and Debugging

Bug 1: Loss is NaN

Cause: log(0) in cross-entropy when p_hat = 0 or p_hat = 1.

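Fix: Clip the probability on both sides before taking logs: p = min(max(p_hat, 1e-7), 1 - 1e-7), so that neither log(p) nor log(1 - p) sees zero. Better still, compute the loss directly from logits with a numerically stable formulation. Most frameworks do this automatically, but custom implementations may not.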
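Both fixes can be sketched in a few lines of plain Python. The function names are hypothetical; the logit version uses the identity log(1 + exp(z)) = max(z, 0) + log(1 + exp(-|z|)), which avoids overflow for large |z|.

```python
import math


def bce_clipped(y, p_hat, eps=1e-7):
    """BCE on probabilities, clipped on both sides to avoid log(0)."""
    p = min(max(p_hat, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def bce_with_logits(y, z):
    """BCE computed from the logit: max(z, 0) - y*z + log(1 + exp(-|z|))."""
    return max(z, 0.0) - y * z + math.log1p(math.exp(-abs(z)))


print(bce_clipped(1, 0.0))          # finite instead of an error or inf
print(bce_with_logits(1, -800.0))   # 800.0, no overflow
```

The clipped version caps the loss at roughly -log(eps), which also flattens the gradient for extreme predictions. The logit version keeps the exact loss and is usually the safer default.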

Bug 2: Loss decreases but metrics do not improve

Cause: Loss function is not aligned with the evaluation metric.

Example: Training with MSE on a ranking problem. MSE decreases (predictions get closer to labels) but NDCG does not improve (ranking order is not improving).

Fix: Use a loss function that correlates with the metric, or use a surrogate loss that better approximates the metric.

Bug 3: Training is unstable (loss oscillates)

Cause: Gradient magnitude is too variable. Common with MSE on data with outliers.

Fix: Switch to Huber loss or clip gradients.

Bug 4: Model predicts the majority class only

Cause: Cross-entropy on imbalanced data. The model learns that always predicting the majority class minimizes the average loss.

Fix: Class weights, focal loss, or resampling.

Bug 5: Embeddings collapse (all outputs are identical)

Cause: Contrastive or triplet loss without proper negative mining. If all negatives are easy, gradients vanish and the model stops learning.

Fix: Implement hard or semi-hard negative mining. Or use InfoNCE with large batch sizes.

[Figure: Loss Training Issue Debugger]

Practice Problems

Problem 1: The Outlier Problem

You are training a regression model to predict house prices. Your dataset has 10,000 houses with prices between $100K and $500K, plus 50 luxury houses priced between $2M and $10M. You train with MSE loss and the model performs terribly on the 10,000 regular houses.

(a) Explain mathematically why MSE fails here. (b) Propose three alternative loss functions, ranked by expected effectiveness. (c) If you must use MSE, what data preprocessing could you apply?

Hint 1 - Direction

Think about how MSE weights errors. What is the squared error for a $5M prediction error vs a $100K prediction error?

Hint 2 - Insight

MSE on the 50 luxury houses dominates the total loss. A $2M error on one luxury house contributes (2M)^2 = 4 * 10^12, while a $100K error on a regular house contributes (100K)^2 = 10^10, a 400x difference per house. At those error sizes, the 50 luxury houses contribute 2 * 10^14 in total, about twice the 10^14 contributed by all 10,000 regular houses combined.

Hint 3 - Full Solution + Rubric

(a) Mathematical explanation:

Total MSE = (1/10050) * [sum of regular house errors^2 + sum of luxury house errors^2]

Even if the model perfectly fits all regular houses (0 error), the 50 luxury houses dominate the gradient. The model shifts its predictions toward the luxury range, degrading performance on the 10,000 regular houses.

Quantitatively: if average regular error is $50K and average luxury error is $3M:

  • Regular contribution: 10000 * (50K)^2 = 2.5 * 10^13
  • Luxury contribution: 50 * (3M)^2 = 4.5 * 10^14

The 50 luxury houses contribute ~18x more than the 10,000 regular houses.
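The arithmetic is quick to reproduce. This sketch treats every house in a group as having exactly the stated average error, which is a simplification made only for illustration.

```python
n_regular, n_luxury = 10_000, 50
avg_regular_error = 50_000       # $50K
avg_luxury_error = 3_000_000     # $3M

# Sum of squared errors contributed by each group.
regular_sse = n_regular * avg_regular_error ** 2
luxury_sse = n_luxury * avg_luxury_error ** 2

print(f"regular: {regular_sse:.2e}")
print(f"luxury:  {luxury_sse:.2e}")
print(f"ratio:   {luxury_sse / regular_sse:.0f}x")  # 18x
print(f"luxury share of loss: {luxury_sse / (regular_sse + luxury_sse):.1%}")
```

About 0.5% of the examples account for roughly 95% of the squared-error loss, so the gradient is driven almost entirely by the luxury houses.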

(b) Three alternatives, ranked:

  1. Huber Loss (delta = $200K): Best choice. Treats regular house errors quadratically (fast convergence) but caps the gradient from luxury house errors. The model focuses on the bulk of the data.

  2. MAE (Mean Absolute Error): Good for robustness. Every house contributes linearly regardless of price, so luxury houses do not dominate. But convergence is slower near the optimum.

  3. Log-transformed MSE: Train on log(price) instead of price. MSE on log(price) makes relative errors equal. A 50% error on a $200K house and a $5M house contribute equally. Transform predictions back with exp().

(c) Preprocessing approaches with MSE:

  • Log-transform the target: MSE on log(y) penalizes relative (multiplicative) error rather than absolute error
  • Winsorize: Cap prices at the 99th percentile ($500K)
  • Remove outliers: Exclude the 50 luxury houses and build a separate model for them
  • Stratified training: Weight samples inversely proportional to their target magnitude

Scoring Rubric:

  • Strong Hire: Quantifies the dominance of outliers, proposes 3+ solutions with tradeoffs, mentions log-transform as a clever preprocessing approach, and discusses the merits of building separate models.
  • Lean Hire: Correctly identifies the outlier problem and proposes Huber or MAE, but cannot quantify the issue or discuss tradeoffs.
  • No Hire: Does not understand why MSE fails with outliers, or proposes only "remove the outliers."

Problem 2: Custom Loss Design

You are building a medical imaging model that classifies X-rays as "normal" or "abnormal." The dataset is 95% normal, 5% abnormal. The business requirements are:

  • Missing an abnormal case (false negative) is 20x worse than flagging a normal case as abnormal (false positive)
  • The model must be well-calibrated (predicted probabilities should be meaningful)
  • False negatives on severe cases should be penalized even more than false negatives on mild cases (severity is available as a continuous label 1-10)

Design a loss function that satisfies all three requirements.

Hint 1 - Direction

Start with binary cross-entropy (it preserves calibration). Then add asymmetric weighting. Then incorporate the severity information.

Hint 2 - Insight

You need three modifications to BCE: (1) class weights for imbalance, (2) asymmetric penalty for FN vs FP, and (3) severity-dependent weighting. These can be combined into a single per-sample weight.

Hint 3 - Full Solution + Rubric

Designed loss function:

$$L = -\frac{1}{n}\sum_{i=1}^{n}w_i\left[y_i\log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)\right]$$

where the per-sample weight is:

$$w_i = \begin{cases} 20 \cdot (1 + \alpha \cdot s_i) & \text{if } y_i = 1 \text{ (abnormal)} \\ 1 & \text{if } y_i = 0 \text{ (normal)} \end{cases}$$

Here s_i is the severity score (1-10) divided by 10, so it lies in [0.1, 1], and alpha controls how much severity affects the weight. With alpha = 1:

  • Normal case: weight = 1
  • Mild abnormal (severity 1): weight = 20 * (1 + 0.1) = 22
  • Severe abnormal (severity 10): weight = 20 * (1 + 1.0) = 40

Why this works:

  1. Imbalance: The base weight of 20 for abnormal cases counteracts the 95:5 ratio (effective ratio becomes ~50:50 in loss contribution)
  2. Asymmetry: False negatives (missing abnormal) are penalized 20x more than false positives
  3. Severity: The severity multiplier ensures the model prioritizes severe cases
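  4. Calibration: Cross-entropy as the base loss preserves probability calibration up to a weight-induced shift toward the positive class, which can be corrected post-training with a recalibration step such as Platt scaling on a held-out set (temperature scaling alone rescales logits but cannot undo an offset)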
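The per-sample weight and the resulting loss can be sketched as follows. The function names are hypothetical, and dividing severity by 10 is the normalization implied by the worked weights of 22 and 40 above.

```python
import math


def sample_weight(y, severity, base_weight=20.0, alpha=1.0):
    """Weight for one X-ray; severity is on the 1-10 scale and ignored if y = 0."""
    if y == 0:
        return 1.0
    return base_weight * (1.0 + alpha * severity / 10.0)


def weighted_bce(labels, severities, probs):
    """Mean of per-sample weighted binary cross-entropy."""
    total = 0.0
    for y, s, p in zip(labels, severities, probs):
        ce = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += sample_weight(y, s) * ce
    return total / len(labels)


print(round(sample_weight(0, 0), 6))    # 1.0
print(round(sample_weight(1, 1), 6))    # 22.0
print(round(sample_weight(1, 10), 6))   # 40.0

# Same predicted probability, but the severe miss costs more than the mild one.
print(weighted_bce([1], [10], [0.3]), weighted_bce([1], [1], [0.3]))
```

Keeping the weight in its own function makes it easy to tune alpha or swap the severity mapping without touching the loss itself.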

Additional considerations:

  • Monitor calibration during training with reliability diagrams
  • The severity weighting alpha should be tuned via cross-validation on the clinical metric (e.g., weighted recall)
  • Consider adding focal loss (gamma > 0) if many normal cases are easy

Scoring Rubric:

  • Strong Hire: Designs a principled loss that addresses all three requirements. Uses weighted BCE as the base. Correctly incorporates severity as a continuous weight. Discusses calibration implications. Mentions validation strategy.
  • Lean Hire: Addresses 2 of 3 requirements. Uses class weights for imbalance and asymmetry but does not incorporate severity, or incorporates severity but does not preserve calibration.
  • No Hire: Proposes only class weights without addressing severity, or proposes a loss that breaks calibration without acknowledging it.

Problem 3: Loss Function Debugging

Your team trains a BERT-based model for multi-label classification (each example can have 0 or more of 100 labels). They use categorical cross-entropy with softmax and report that the model always predicts exactly one label per example, even though most examples have 3-5 labels.

(a) Explain the bug. (b) What loss function should they use? (c) What activation function should replace softmax?

Hint 1 - Direction

Think about what softmax does - does it allow multiple labels to have high probability simultaneously?

Hint 2 - Insight

Softmax normalizes across classes so all probabilities sum to 1. This forces the model to "choose" one label. For multi-label classification, each label should be an independent binary decision.

Hint 3 - Full Solution + Rubric

(a) The bug:

Softmax + categorical cross-entropy assumes mutually exclusive classes. The softmax function normalizes probabilities to sum to 1:

p_c = exp(z_c) / sum_j(exp(z_j))

This means increasing the probability of one class necessarily decreases the probability of all others. For multi-label classification, labels are NOT mutually exclusive - an example can be "funny," "political," and "viral" simultaneously.

(b) Correct loss: Binary cross-entropy per label

$$L = -\frac{1}{n}\sum_{i=1}^{n}\frac{1}{C}\sum_{c=1}^{C}\left[y_{i,c}\log(\hat{p}_{i,c}) + (1-y_{i,c})\log(1-\hat{p}_{i,c})\right]$$

This treats each of the 100 labels as an independent binary classification problem. The loss for label c does not depend on the predictions for label c'.

(c) Correct activation: Sigmoid (per label)

p_hat_c = sigmoid(z_c) = 1 / (1 + exp(-z_c))

Each label gets an independent probability between 0 and 1. Multiple labels can have high probability simultaneously.

Key insight: multi-class (one-of-K) uses softmax + categorical CE. Multi-label (any-of-K) uses sigmoid + binary CE per label.
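The difference shows up immediately on a toy example. In this sketch the four label names and the logit values are hypothetical, chosen so that three labels should fire at once.

```python
import math


def softmax(logits):
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical logits for labels ["funny", "political", "viral", "sports"].
logits = [3.0, 3.0, 3.0, -4.0]

softmax_probs = softmax(logits)
sigmoid_probs = [sigmoid(z) for z in logits]

# Softmax splits the probability mass: no label can exceed ~1/3 here.
print([round(p, 3) for p in softmax_probs])
# Sigmoid scores each label independently: three labels are near 0.95.
print([round(p, 3) for p in sigmoid_probs])
print([p > 0.5 for p in sigmoid_probs])
```

With softmax, three equally strong labels each get about a third of the mass, so none clears a 0.5 threshold. With sigmoid, all three do.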

Scoring Rubric:

  • Strong Hire: Immediately identifies the softmax normalization as the bug. Clearly explains why sigmoid + binary CE is correct for multi-label. Notes that this is 100 independent binary problems.
  • Lean Hire: Identifies that softmax is the problem but is vague about the fix.
  • No Hire: Cannot identify the bug, or suggests "use a threshold" without changing the loss/activation.

Problem 4: The Ranking Problem

You are building a search ranking system. You have (query, document, relevance) triples where relevance is 0 (irrelevant), 1 (somewhat relevant), or 2 (highly relevant). Your colleague trains with MSE loss (predicting the relevance score) and reports that NDCG@10 is poor even though MSE is low.

Explain why MSE is a bad choice for ranking and propose a better loss function.

Hint 1 - Direction

Think about what ranking needs vs what MSE optimizes. Does getting the exact relevance score matter, or does getting the ORDER right matter?

Hint 2 - Insight

MSE optimizes for accurate score prediction. Ranking optimizes for correct ordering. A model that predicts [1.8, 1.2, 0.3] for true relevances [2, 1, 0] has lower MSE than [0.9, 0.5, 0.1], but both have perfect ranking. Conversely, [1.4, 1.5, 0.2] has a much lower MSE (about 0.22) than [0.3, 0.2, 0.1] (about 1.18), yet it swaps the top two documents while the latter ranks all three correctly.

Hint 3 - Full Solution + Rubric

Why MSE fails:

MSE minimizes (predicted_score - true_relevance)^2. But NDCG cares about relative ordering, not absolute scores. MSE can decrease (scores get closer to true relevances) while NDCG stays flat or even decreases.

Example:

  • True relevances: [2, 1, 0] for documents [A, B, C]
  • Prediction 1: [1.5, 0.8, 0.3] - MSE = 0.127, ranking = A>B>C (correct, NDCG = 1.0)
  • Prediction 2: [1.9, 1.1, 0.1] - MSE = 0.010, ranking = A>B>C (correct, NDCG = 1.0)
  • Prediction 3: [1.0, 0.9, 1.1] - MSE = 0.740, ranking = C>A>B (wrong, NDCG is low)

Prediction 2 has the best MSE but the same NDCG as Prediction 1. The two objectives are misaligned.
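The three predictions can be scored on both metrics with a short script. The helper names are hypothetical, and the NDCG here assumes the common 2^rel - 1 gain with a log2 position discount; other gain conventions give different absolute values but the same qualitative picture.

```python
import math


def mse(true, pred):
    return sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true)


def dcg(relevances):
    """DCG with gain 2^rel - 1 and log2 position discount."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))


def ndcg(true, pred):
    """NDCG of the ordering induced by sorting documents by predicted score."""
    order = sorted(range(len(pred)), key=lambda i: pred[i], reverse=True)
    ranked = [true[i] for i in order]
    ideal = sorted(true, reverse=True)
    return dcg(ranked) / dcg(ideal)


true = [2, 1, 0]
for pred in ([1.5, 0.8, 0.3], [1.9, 1.1, 0.1], [1.0, 0.9, 1.1]):
    print(pred, round(mse(true, pred), 3), round(ndcg(true, pred), 3))
```

The first two rows differ by more than 10x in MSE but have identical NDCG, because NDCG depends only on the order the scores induce.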

Better loss functions for ranking:

  1. Pairwise hinge loss (RankSVM): L = sum over pairs (i,j) where relevance_i > relevance_j of max(0, 1 - (s_i - s_j)). Directly optimizes for correct pairwise ordering.

  2. LambdaRank / LambdaMART: Weights pairwise losses by the change in NDCG from swapping the pair. Pairs that matter more for NDCG get larger gradients.

  3. ListMLE: Models the probability of the correct ranking as a Plackett-Luce model and maximizes the likelihood.

  4. Softmax cross-entropy over relevance levels: Treat as a classification problem (3 classes) and rank by predicted probability of the highest class. Better than MSE but still not directly optimizing ranking.

Recommended: LambdaMART (used at Microsoft Bing, widely used). For neural models, LambdaRank-style gradients combined with pairwise losses.
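The simplest of these options, the pairwise hinge loss, can be sketched for a single query as below. The function name is hypothetical and the O(n^2) double loop is written for clarity, not efficiency.

```python
def pairwise_hinge_loss(scores, relevances, margin=1.0):
    """Sum of hinge penalties over all pairs where doc i is more relevant than doc j."""
    loss = 0.0
    for i, rel_i in enumerate(relevances):
        for j, rel_j in enumerate(relevances):
            if rel_i > rel_j:
                loss += max(0.0, margin - (scores[i] - scores[j]))
    return loss


relevances = [2, 1, 0]
print(pairwise_hinge_loss([3.0, 1.5, 0.0], relevances))  # 0.0, all pairs separated by >= 1
print(pairwise_hinge_loss([1.0, 0.9, 1.1], relevances))  # misordered pairs are penalized
```

The loss depends only on score differences, so shifting every score by a constant leaves it unchanged. That is the invariance MSE lacks.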

Scoring Rubric:

  • Strong Hire: Clearly articulates the score vs order disconnect. Provides a concrete numerical example. Proposes 2+ ranking-specific losses with tradeoffs. Mentions LambdaRank and its connection to NDCG.
  • Lean Hire: Understands that ordering matters but can only propose pairwise hinge loss without deeper analysis.
  • No Hire: Cannot explain why MSE is wrong for ranking, or proposes "use a different metric" without changing the loss.

Interview Cheat Sheet

| Loss Function | Formula (Simplified) | Best For | Key Property | Red Flag Answer |
|---|---|---|---|---|
| MSE | (y - y_hat)^2 | Regression, Gaussian noise | Gradients proportional to error | "Always use MSE for regression" |
| MAE | abs(y - y_hat) | Robust regression, outliers | Constant gradient magnitude | "MAE is always better than MSE" |
| Huber | MSE if small, MAE if large | Production regression | Best of both worlds | Not knowing delta exists |
| Binary CE | -[y log p + (1-y) log(1-p)] | Binary classification | Calibrated probabilities | "Same as MSE for classification" |
| Categorical CE | -sum(y_c log p_c) | Multi-class (one-of-K) | Requires softmax | Using for multi-label |
| Hinge | max(0, 1 - y*y_hat) | Margin-based (SVM) | Sparse gradients | "Hinge gives probabilities" |
| Focal | -alpha*(1-p_t)^gamma * log(p_t) | Extreme imbalance | Downweights easy examples | "Same as class weights" |
| Contrastive | d^2 or max(0, m-d)^2 | Pair-based similarity | Needs pair mining | "Works without hard negatives" |
| Triplet | max(0, d(a,p) - d(a,n) + m) | Embedding learning | Needs triplet mining | "Any negative works" |
| InfoNCE | -log(exp(sim(pos)/tau) / sum(exp(sim(all)/tau))) | Self-supervised | In-batch negatives | Confusing with contrastive |

Spaced Repetition Checkpoints

Day 0 - Initial Learning

  • Read this entire page and complete the self-assessment
  • Derive binary cross-entropy from Maximum Likelihood
  • Draw the loss function decision tree from memory

Day 3 - First Recall

  • Without notes, list 8 loss functions with their formulas and use cases
  • Explain why MSE corresponds to Gaussian MLE and MAE to Laplacian MLE
  • Give the "60-Second Answer" for cross-entropy out loud

Day 7 - Connections

  • Explain how loss functions connect to: optimization (gradients), regularization (implicit), evaluation metrics (alignment)
  • Do Problem 2 (custom loss design) without looking at hints
  • Explain focal loss to someone with no ML background

Day 14 - Application

  • Do Problem 3 (multi-label debugging) under timed conditions (5 minutes)
  • Design a custom loss for a problem of your choice following the 4-step framework
  • Review any loss functions you cannot derive

Day 21 - Mock Interview

  • Answer: "When would you NOT use cross-entropy for classification?" (timed, 90 seconds)
  • Answer: "Design a loss function for content moderation" (timed, 5 minutes)
  • Do all 4 practice problems under timed conditions (30 minutes total)

Key Takeaways

  1. Loss functions are assumptions. MSE assumes Gaussian noise, MAE assumes Laplacian noise, cross-entropy assumes Bernoulli/Categorical. Choosing a loss function is choosing a statistical model of your data.

  2. Gradient behavior matters as much as the formula. Two losses with similar values can have wildly different training dynamics because of their gradients. Always analyze the gradient.

  3. Loss-metric alignment is critical. If your loss function optimizes something different from your evaluation metric, training loss will decrease while your metric stagnates. This is one of the most common production ML bugs.

  4. Custom loss design is a senior-level skill. Being able to design, justify, and validate a custom loss function for a specific business problem is what separates senior engineers from junior ones.

© 2026 EngineersOfAI. All rights reserved.