Framing ML Problems - Turning Business Goals into Training Objectives
Reading time: ~35 minutes | Level: ML System Design | Role: MLE, AI Engineer, Data Scientist
The Interview Room
It is 10:17 AM on a Tuesday. A Meta ML engineer - four years of experience, strong on transformers and attention mechanisms - is sitting across from an interviewer for a Staff ML Engineer role. The question lands clean and open-ended: "Design the Facebook News Feed ranking system."
The candidate takes a breath and starts well: "I'd use a BERT-based model to understand content semantics, probably fine-tuned on engagement signals. Then a two-tower architecture to capture user preferences..." The interviewer listens for two minutes, then holds up a hand.
"Before we go further - what's your objective function? What is the model actually predicting?"
Silence. Not a short pause. A long, uncomfortable silence. The candidate has been designing the model architecture - the how - without ever answering the fundamental question: what exactly are we training this model to do?
"I guess... engagement?" the candidate offers. The interviewer nods, but the expression doesn't change. "What does engagement mean, numerically? What is the label in your training data? What does a positive example look like? What does a negative example look like?"
More silence. The candidate never recovers. The debrief feedback is almost identical to what it always is for this failure mode: "Strong technical knowledge, but jumped to solution before defining the problem."
This is the most common failure mode in ML system design interviews - and in real ML projects that fail in production. The model architecture is the last thing you should be thinking about. The objective function is the first. And getting it wrong means you will train the wrong model, optimize the wrong metric, and ship something that looks good on paper but makes the business worse.
This lesson teaches you to get the framing right - every time, under pressure, in front of any interviewer at any company.
Why Framing Comes First
Here is the uncomfortable truth about ML in production: the majority of ML projects that fail do not fail because of a bad model. They fail because the model was optimizing the wrong thing.
YouTube's early recommendation system was optimizing for clicks. It was very good at predicting clicks. But click-optimized recommendations led users into increasingly extreme content because extreme content gets more clicks. The model was doing exactly what it was trained to do. The framing was wrong.
Amazon's early review ranking system optimized for helpfulness votes. Turns out, lengthy, detailed reviews get more helpfulness votes regardless of accuracy. The model surfaced long reviews. Some were excellent. Some were confidently wrong. The proxy metric - helpfulness votes - was an imperfect proxy for the actual goal - helping customers make good decisions.
Uber's surge pricing model optimized for driver availability. In practice, this meant surge prices in areas where drivers didn't want to go regardless of price (unsafe neighborhoods, low-demand destinations). The model was technically correct and operationally wrong.
In each case, the engineers who built these systems were not bad engineers. They made a specific class of mistake: they conflated the proxy metric - the thing they could measure and optimize - with the actual business goal - the thing they actually cared about. Learning to see this gap, and to design proxy metrics that faithfully represent business goals, is the core skill this lesson develops.
The Framing Hierarchy
Every ML problem lives inside a three-level hierarchy. You must work through all three levels before touching a line of code.
Level 1: The Business Goal
Business goals are always expressed in terms of the company's mission and economic model. They are usually vague in the way that matters for ML. Examples:
- "Increase user time on platform" (Facebook, TikTok)
- "Increase purchase conversion" (Amazon, Shopify)
- "Reduce fraud losses" (Stripe, PayPal)
- "Improve driver and rider match quality" (Uber, Lyft)
- "Help users find relevant content" (YouTube, Spotify, Netflix)
None of these are directly trainable. You cannot write a loss function over "user satisfaction." You need to decompose.
Level 2: The Proxy Metric
A proxy metric is something you can actually measure - typically from logs - that correlates with the business goal. The art is choosing a proxy that is:
- Measurable - you can compute it from your data
- Correlated with the business goal - when it goes up, the business goal goes up
- Not gameable - optimizing it doesn't create perverse incentives
- Attributable - you can trace a model decision to a metric change
For "increase user time on platform" the obvious proxy is total watch time or session length. But watch time can be inflated by showing users content they cannot stop watching even if they feel bad about it (outrage content, clickbait). A better proxy might be watch time on content the user later rates positively, or watch time that doesn't lead to immediate app uninstall.
Level 3: The ML Objective
Once you have a proxy metric, you define the ML objective: what does the model predict, what is the label, what is the loss function?
News Feed Ranking example:
| Level | Definition |
|---|---|
| Business goal | Increase meaningful social interactions (Zuckerberg's 2018 announcement) |
| Proxy metric | P(user comments or shares within 1 hour of seeing post) × estimated post quality |
| ML objective | Binary classification: will user comment or share this post within 1 hour? Label = 1 if user commented/shared, 0 otherwise |
Fraud Detection example:
| Level | Definition |
|---|---|
| Business goal | Minimize fraud losses while minimizing false declines (lost legitimate revenue) |
| Proxy metric | Recall at a fixed high-precision operating point (or an F-beta score with beta > 1 when recall should count more than precision) |
| ML objective | Binary classification: is this transaction fraudulent? Label = 1 if transaction was disputed and confirmed fraud |
The Proxy Metric Trap
The proxy metric trap is the gap between what you optimize and what you actually want. Every ML system lives in this gap. The question is how wide the gap is.
The trap has a specific structure: the proxy metric and the business goal are correlated in the training distribution but decorrelated by optimization pressure. When you push the model to maximize the proxy metric, the correlation breaks down.
Classic examples:
YouTube watch time. Correlated with satisfaction in the training distribution - people watch videos they enjoy. But as the recommendation engine gets better at maximizing watch time, it discovers that outrage and conspiracy content keeps people watching longer than educational content. The model exploits this. Watch time goes up. User satisfaction and mental health go down.
App store ratings. Apps that prompt users to rate them immediately after a positive in-app event get higher average ratings. The rating reflects the prompting strategy as much as the app quality. Developers learned to optimize the prompt timing rather than the app itself.
Click-through rate (CTR). Clickbait headlines have high CTR. Substantive, accurate headlines have lower CTR. An ML system optimizing CTR learns to recommend clickbait. This is why every platform that optimized purely for CTR ended up with a quality problem.
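The fix: Multi-objective optimization. Instead of a single proxy, use a weighted combination of positive and negative signals. An illustrative form (the specific signals and weights are a product decision):

$$
\text{score} = \sum_i w_i \, P(\text{positive action}_i) - \sum_j v_j \, P(\text{negative action}_j), \qquad w_i, v_j > 0
$$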
The negative terms penalize the model for showing content users want to suppress. The weights encode your product judgment about what matters. This is not a fully automated decision - the weights embed human values - and that is appropriate.
Prediction Types and When to Use Each
Choosing the right prediction type is part of problem framing, not model selection. It follows from the ML objective.
Binary Classification
Use when: the outcome is yes/no and the positive class has a clear definition.
Examples:
- CTR prediction: will this user click on this ad? (label = click)
- Fraud detection: is this transaction fraudulent? (label = confirmed fraud)
- Churn prediction: will this user cancel within 30 days? (label = cancellation)
- Spam detection: is this email spam? (label = spam)
Label construction: define the positive class carefully. For churn, does "cancel" include users who just stopped using the app but didn't formally cancel? Does it include free-tier users? The label definition is part of the model design.
Key consideration: class imbalance. Fraud is a small fraction of all transactions (typically well under 1%), and the spam rate varies by domain. Naive training on imbalanced data can produce a model that always predicts the majority class, achieving high accuracy but zero recall on the rare class. Use stratified sampling, class weights, or oversampling (SMOTE), and evaluate with recall and precision on the rare class rather than accuracy.
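To make the accuracy trap concrete, here is a minimal sketch using made-up counts (999 legitimate transactions and 1 fraudulent one; the data and helper names are illustrative assumptions). It scores a "model" that always predicts the majority class, then computes per-class weights with the common "balanced" heuristic, `n_samples / (n_classes * class_count)`.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall on the positive class)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    positives = sum(y_true)
    recall = true_pos / positives if positives else 0.0
    return correct / len(y_true), recall


def balanced_class_weights(y_true):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n = len(y_true)
    counts = {c: y_true.count(c) for c in set(y_true)}
    return {c: n / (len(counts) * k) for c, k in counts.items()}


# Hypothetical data: 999 legitimate transactions (0), 1 fraudulent (1).
labels = [0] * 999 + [1]
always_legit = [0] * 1000  # a "model" that always predicts the majority class

acc, rec = accuracy_and_recall(labels, always_legit)
weights = balanced_class_weights(labels)
print(f"accuracy={acc:.3f}, fraud recall={rec:.1f}")
print(f"class weights: {weights}")
```

The majority-class predictor looks excellent on accuracy and is useless on recall, which is why the evaluation metric has to be chosen together with the label definition. The weights make one fraud example count as much as roughly a thousand legitimate ones in the loss.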
Multi-Class Classification
Use when: the outcome is one of K discrete categories (K > 2, typically K < 10,000).
Examples:
- Intent classification: what does the user want to do? (book_flight, check_balance, report_problem)
- Category prediction: which product category does this item belong to?
- Language identification: what language is this text?
Label construction: the categories must be mutually exclusive and exhaustive. If they are not exhaustive, add a catch-all class such as "other". If they are not mutually exclusive (one example can belong to several categories at once), you need multi-label classification, which is commonly framed as an independent binary prediction for each class.
Regression
Use when: the outcome is a continuous value.
Examples:
- Demand forecasting: how many units will we sell next week?
- Bid price prediction: what is the optimal CPM to bid for this impression?
- Time-to-event: how many days until this customer churns?
- ETA prediction: how many minutes until this driver arrives?
Key consideration: regression is often not what you actually want even when the output is continuous. If you are predicting bid prices, you probably care more about rank-ordering bids correctly than about exact value accuracy. A ranking model is often better than a regression model for ranking tasks.
Ranking
Use when: you need to order a set of items by relevance.
Examples:
- Search ranking: order documents by relevance to query
- Feed ranking: order posts by likelihood of engagement
- Recommendation: order items by likelihood of purchase
- Ads ranking: order ads by expected value (CTR × bid price)
Ranking is different from regression because the loss function cares about relative order, not absolute values. Pointwise ranking treats each item independently and predicts a score. Pairwise ranking trains on pairs (item A should rank above item B). Listwise ranking optimizes over the entire ranked list (direct NDCG or MAP optimization).
For most production systems, pointwise ranking with a binary or multi-class label is used because it is the simplest to implement with existing classification infrastructure. The ranker outputs a score per item; items are sorted by score.
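The difference between pointwise and pairwise training is easiest to see in the loss functions themselves. The sketch below is illustrative only: the function names and scores are assumptions, and the pairwise loss is the standard logistic (RankNet-style) form rather than any specific production system's objective.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pointwise_loss(score, label):
    """Binary cross-entropy on one item; depends on the absolute score."""
    p = sigmoid(score)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


def pairwise_loss(score_preferred, score_other):
    """Logistic pairwise loss; depends only on the score difference."""
    return math.log(1.0 + math.exp(-(score_preferred - score_other)))


def rank_items(scores):
    """Serving only needs the ordering: sort item ids by descending score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical model scores for three posts.
scores = {"post_a": 2.0, "post_b": 0.5, "post_c": -1.0}
print(rank_items(scores))
print(pairwise_loss(2.0, 0.5), pairwise_loss(0.5, 2.0))
```

Adding a constant to every score leaves the pairwise loss and the final ordering unchanged, but it changes the pointwise loss. That is the practical meaning of "the loss cares about relative order, not absolute values".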
Structured Output
Use when: the output is a structured object - a sequence, a graph, a bounding box, a parse tree.
Examples:
- Machine translation: sequence of tokens in target language
- Object detection: bounding box coordinates + class labels
- Named entity recognition: per-token label sequence
These typically require specialized architectures (seq2seq, detection heads) and are less common in ML system design interviews for non-NLP/CV roles.
Decomposing Complex Objectives: The Netflix Playbook
When the business goal is complex, decompose it into a product of simpler predictions. Netflix's approach to optimizing "will a user watch and enjoy this title?" is one of the clearest examples in the industry.
A naive approach: train a single model to predict P(user enjoys title). Label: post-watch star rating. Problem: most users who click on a title don't rate it, so you only observe labels for titles users chose to engage with - a highly biased sample.
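Netflix's decomposition has roughly this shape (the stage names here are illustrative):

$$
P(\text{watch and enjoy}) = P(\text{click} \mid \text{impression}) \times P(\text{complete} \mid \text{click}) \times P(\text{enjoy} \mid \text{complete})
$$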
Each model has a well-defined label, abundant training data, and can be trained independently. The final ranking score is the product of three separate models' outputs. Each model can be improved separately. Each model's failure modes are distinct and diagnosable.
This decomposition pattern generalizes:
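E-commerce purchase prediction, for example:

$$
P(\text{purchase}) = P(\text{click} \mid \text{impression}) \times P(\text{add to cart} \mid \text{click}) \times P(\text{purchase} \mid \text{add to cart})
$$

Ad conversion, for example:

$$
P(\text{conversion} \mid \text{impression}) = P(\text{click} \mid \text{impression}) \times P(\text{conversion} \mid \text{click})
$$

Content recommendation, for example:

$$
P(\text{satisfied view}) = P(\text{click} \mid \text{impression}) \times P(\text{complete} \mid \text{click}) \times P(\text{positive rating} \mid \text{complete})
$$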
Each sub-model can be a binary classifier. The composition allows complex multi-stage behavior to emerge from simple components. And critically, when something goes wrong, you can inspect each component separately to find the failure.
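A minimal sketch of this composition pattern, assuming three hypothetical sub-models. The stage names (`p_click`, `p_complete_given_click`, `p_enjoy_given_complete`) and the probabilities are invented for illustration and do not describe any company's actual system.

```python
from math import prod


def funnel_score(stage_probs):
    """Multiply the conditional stage probabilities into one ranking score."""
    return prod(stage_probs.values())


def weakest_stage(stage_probs):
    """Name the stage with the lowest probability, a first place to look when debugging."""
    return min(stage_probs, key=stage_probs.get)


# Hypothetical sub-model outputs for two candidate titles.
candidates = {
    "title_a": {"p_click": 0.30, "p_complete_given_click": 0.20, "p_enjoy_given_complete": 0.90},
    "title_b": {"p_click": 0.10, "p_complete_given_click": 0.80, "p_enjoy_given_complete": 0.90},
}

ranked = sorted(candidates, key=lambda t: funnel_score(candidates[t]), reverse=True)
for title in ranked:
    stages = candidates[title]
    print(title, round(funnel_score(stages), 3), "weakest:", weakest_stage(stages))
```

Here `title_a` is three times more clickable but ranks lower, because most people who click it do not finish it. A click-only model would have ranked it first, and the per-stage breakdown shows exactly where it loses.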
Label Construction: The Gap Between Logs and Truth
Your training data comes from logs. Your logs record what happened. What you actually want to predict is what the user preferred - and these are not the same thing.
Implicit Feedback Bias
When a user clicks on a link, your log records a positive example. But:
- The user may have clicked because the thumbnail was misleading (clickbait)
- The user may have clicked and immediately bounced (30-second read of a 10-minute article)
- The user may have clicked because there was nothing better visible (the model is already surfacing bad content)
Clicks are proxies for interest, not measures of interest. Treating them as ground truth creates a self-reinforcing bias: the model learns what gets clicks, shows more of that content, collects more clicks on that content type, and reinforces itself. The training data distribution is shaped by the model's own past decisions - this is called feedback loop bias or exposure bias.
Mitigations:
- Use downstream signals as labels instead of upstream clicks. Instead of "did user click?", use "did user read > 80% of the article?" or "did user return to the platform within 2 days of this session?"
- Dwell time weighting. Weight positive examples by how long the user spent. A 30-second dwell on a 10-minute article is a weak positive. A 9-minute dwell is a strong positive.
- Negative signal inclusion. Explicitly label content the user scrolled past quickly, hid, or reported as negative examples. This is more informative than just treating unclicked content as negative.
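These mitigations can be combined in the label-construction step. The sketch below is one possible way to do it, not a prescribed recipe: the `ArticleImpression` fields, the completion-based weight, and the 2.0 weight on explicit negatives are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ArticleImpression:
    clicked: bool
    dwell_seconds: float = 0.0
    expected_read_seconds: float = 0.0
    hidden_or_reported: bool = False


def label_and_weight(imp):
    """Return (label, sample_weight) for one logged impression."""
    if imp.hidden_or_reported:
        return 0, 2.0  # explicit negative overrides any click (weight is an assumption)
    if not imp.clicked:
        return 0, 1.0  # shown but not clicked: ordinary negative
    if imp.expected_read_seconds <= 0:
        return 1, 1.0  # no length estimate available: fall back to an unweighted positive
    # Dwell-time weighting: fraction of the article actually read, capped at 1.
    completion = min(imp.dwell_seconds / imp.expected_read_seconds, 1.0)
    return 1, completion


# A 30-second and a 9-minute dwell on a 10-minute article.
print(label_and_weight(ArticleImpression(True, 30, 600)))
print(label_and_weight(ArticleImpression(True, 540, 600)))
print(label_and_weight(ArticleImpression(True, 30, 600, hidden_or_reported=True)))
```

The returned weight would typically be passed as a per-example weight to the loss, so a bounced click still counts as a positive but contributes far less than a completed read.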
Survivorship Bias
You only observe labels for items the model decided to show. Items the model ranked low never appear in your logs. This creates selection bias: your training data systematically underrepresents items the current model dislikes.
Fix: exploration. Occasionally surface random items (or items ranked by an alternative strategy) and log their outcome. This is the explore-exploit tradeoff that underlies recommendation systems. A small epsilon-greedy component in your ranker gives you training data coverage beyond what the current model would show.
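A small sketch of what an epsilon-greedy slate might look like. The function name `build_slate`, the item ids, and the per-slot exploration rule are assumptions for illustration; real systems often use more careful exploration policies.

```python
import random


def build_slate(ranked_items, candidate_pool, k, epsilon, rng):
    """Fill k slots. Each slot explores with probability epsilon, else exploits.

    Returns a list of (item, explored) pairs; the flag should be logged with the outcome.
    """
    slate, used = [], set()
    for _ in range(k):
        unused_pool = [c for c in candidate_pool if c not in used]
        top_unused = next((i for i in ranked_items if i not in used), None)
        if unused_pool and rng.random() < epsilon:
            item, explored = rng.choice(unused_pool), True  # uniform random pick
        elif top_unused is not None:
            item, explored = top_unused, False  # best remaining item by model rank
        else:
            break  # nothing left to show
        used.add(item)
        slate.append((item, explored))
    return slate


rng = random.Random(0)  # seeded only to make the sketch reproducible
ranked = ["a", "b", "c", "d"]          # current model's ordering
pool = ranked + ["x", "y", "z"]        # includes items the model would never surface
print(build_slate(ranked, pool, k=3, epsilon=0.1, rng=rng))
```

Logging the `explored` flag matters as much as the exploration itself: outcomes from exploration slots were not chosen by the model, so they give a less biased view of items the current ranker would otherwise hide.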
Delayed Labels
For churn prediction with a 30-day window, you do not know whether a user has churned until 30 days after you make the prediction. For fraud, chargebacks can arrive 60-90 days after the transaction. For medical outcome prediction, you may wait years.
This creates a label delay problem: you have features (user behavior, transaction characteristics) at prediction time, but you cannot construct training labels until much later. Meanwhile, the feature distribution drifts.
Fix: define your label window carefully and understand the delay. For churn, the label at day 30 requires waiting 30 days. During those 30 days, you are serving predictions without knowing if your most recent training data reflects current user behavior. Track label delay explicitly and factor it into your retraining schedule.
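One way to track label delay explicitly is to make "label not yet known" a first-class value during label construction. The sketch below assumes a 30-day churn window; the function name and dates are hypothetical.

```python
from datetime import date, timedelta

LABEL_WINDOW = timedelta(days=30)  # churn is defined over the 30 days after prediction


def churn_label(prediction_date, cancel_date, today):
    """Return 1 or 0 once the label is known, or None while it is still pending."""
    window_end = prediction_date + LABEL_WINDOW
    if cancel_date is not None and cancel_date <= window_end:
        return 1  # positives are known as soon as the cancellation happens
    if today >= window_end:
        return 0  # window closed with no cancellation inside it
    return None  # window still open: do not train on this example yet


today = date(2024, 3, 1)
print(churn_label(date(2024, 1, 1), None, today))               # matured negative
print(churn_label(date(2024, 2, 15), None, today))              # still pending
print(churn_label(date(2024, 2, 15), date(2024, 2, 20), today)) # early positive
```

Treating pending examples as negatives would silently bias the model toward predicting "no churn" for recent users. Note also the asymmetry: positives mature early and negatives only after the full window, so the freshest fully labeled data is always at least one window old.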
A Framework for ML System Design Interviews
Use this framework in every ML system design interview. It takes 45 minutes to walk through and covers everything an interviewer at a top company wants to see.
Step 1: Clarify the Business Goal and Constraints
Ask these questions explicitly - do not assume:
- What is the business goal? What metric does this system need to move?
- Who are the users? What is the scale (1M users? 1B?)
- What are the latency constraints? (50ms? 500ms? Batch OK?)
- What data do we have? What data can we collect?
- Are there regulatory constraints? (GDPR, HIPAA, financial regulations?)
- What is the cost of errors? (False positive vs false negative asymmetry)
The last question is critical for setting the operating threshold. In fraud detection, a false negative (missing fraud) costs the company money. A false positive (blocking a legitimate transaction) costs a customer relationship. The tradeoff is a product decision, not a model decision - but it shapes everything downstream.
Step 2: Define the ML Objective
State explicitly:
- What does the model predict? (binary classification, regression, ranking score)
- What is the label? (click within 1 hour, churn within 30 days, confirmed fraud)
- What is a positive example? (user who clicked, user who churned, fraudulent transaction)
- What is a negative example? (user who saw the post but did not click within 1 hour)
- What is the loss function? (binary cross-entropy, mean squared error, pairwise ranking loss)
Step 3: Define the Data Pipeline
- Where does training data come from? (interaction logs, purchase history, content metadata)
- How are labels constructed? (click logs, chargeback reports, human annotation)
- How much data is available? (1M labeled examples? 10B interaction events?)
- How is data refreshed? (daily batch? streaming?)
Step 4: Feature Engineering
- What signals predict the label? (user history, item features, context features)
- How are features computed? (batch pipeline for historical features, real-time for session context)
- How do you avoid training-serving skew? (feature store, consistent transformations)
Step 5: Model Architecture
Now - after all four steps above - you can talk about the model.
Choose the architecture based on:
- Available training data volume (100K examples → logistic regression/GBT; 100M+ examples → deep learning)
- Latency requirements (neural networks have higher inference cost than GBT)
- Interpretability requirements (regulated industries need explainability)
- Iteration speed requirements (how fast do you need to retrain?)
Step 6: Offline Evaluation
Define metrics before you train:
- Binary classification: AUC-ROC, precision at a fixed recall (or recall at a fixed precision), F1 at the operating threshold; for heavily imbalanced labels, also PR-AUC
- Regression: RMSE, MAE, MAPE (mean absolute percentage error)
- Ranking: NDCG@k, MRR, precision@k
Choose a held-out evaluation set that reflects the serving distribution - and if data is temporal, always hold out a future time window, not a random sample.
Step 7: Online A/B Testing
Offline metrics do not always predict online impact. AUC improvement does not always mean conversion improvement. Design the A/B test:
- What is the randomization unit? (user-level, session-level, request-level)
- What is the primary metric? (the business goal)
- What are the guardrail metrics? (things you cannot let go down)
- What sample size do you need? (power analysis for the expected effect size)
- How long does the experiment run? (account for day-of-week effects: minimum 1-2 weeks)
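For the sample-size question, a rough power calculation is usually enough in an interview. The sketch below uses the standard normal approximation for comparing two proportions; the baseline rate and lift are hypothetical, and the function name is an assumption.

```python
from math import ceil
from statistics import NormalDist


def samples_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate per-arm sample size for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical: 5% baseline conversion, detecting a 2% vs a 1% relative lift.
n_two_percent = samples_per_arm(0.05, 0.02)
n_one_percent = samples_per_arm(0.05, 0.01)
print(n_two_percent, n_one_percent)
```

The useful intuition is the scaling: halving the effect you want to detect roughly quadruples the traffic you need, which is often what determines how long the experiment has to run.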
Step 8: Monitoring and Feedback Loops
After deployment:
- Monitor input feature distributions (detect covariate shift)
- Monitor prediction distributions (detect model drift)
- Monitor downstream business metrics (detect degradation)
- Define retraining triggers (schedule-based vs metric-based)
- Log model decisions for future training data
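Input-distribution monitoring is often implemented with a simple binned divergence such as the Population Stability Index (PSI). The sketch below is a minimal version under stated assumptions: the bin edges and sample values are made up, and the 0.1 / 0.25 alert levels mentioned afterwards are a commonly cited rule of thumb, not a standard.

```python
import math


def psi(expected, actual, bin_edges, eps=1e-6):
    """Population Stability Index between a reference sample and a live sample.

    bin_edges must be sorted; values below the first edge fall in bin 0.
    """
    def fractions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            idx = sum(v >= edge for edge in bin_edges)  # number of edges at or below v
            counts[idx] += 1
        # eps avoids log(0) when a bin is empty in one of the samples
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical feature values at training time and at serving time.
training = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
serving_same = list(training)
serving_shifted = [v + 4 for v in training]
edges = [3, 6, 9]

print(psi(training, serving_same, edges))
print(psi(training, serving_shifted, edges))
```

A PSI near zero means the serving distribution matches training. Teams often treat values above about 0.1 as worth investigating and above about 0.25 as a likely retraining trigger, but the thresholds should be calibrated per feature.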
Common Mistakes
:::danger Jumping to model architecture before defining the objective
The most common failure in interviews and in real projects. You will design the wrong model for the wrong task. Spend at least 20% of an interview on framing before touching architecture.
:::

:::danger Treating implicit feedback as ground truth
Clicks are noisy proxies for interest, not measures of interest. A model trained on clicks as positive examples will optimize for clickability, not quality. Use downstream signals (completion, return visit, explicit rating) wherever possible.
:::

:::danger Ignoring the asymmetry of errors
A fraud model that misses 10% of fraud (false negatives) and a fraud model that blocks 10% of legitimate transactions (false positives) have completely different business implications. Always ask: what is the cost of each error type? Set your operating threshold accordingly.
:::

:::warning Defining a proxy metric that is gameable
Before committing to a proxy metric, ask: if we optimize this metric aggressively, what behavior does the system develop? If the answer is "clickbait" or "engagement farming" or "false positives," the proxy is too gameable. Add correction terms.
:::

:::warning Ignoring latency constraints
A 200ms inference model is useless in a 50ms system. Latency is a hard constraint, not an optimization target. Define it in Step 1 and let it constrain your model choices in Step 5.
:::
Video Resources
| Resource | Creator | What It Covers |
|---|---|---|
| ML System Design Interview | Exponent | Full system design interview walkthrough |
| How to Frame ML Problems | Chip Huyen | ML production pitfalls |
| YouTube Recommendation System | Yannic Kilcher | Real objective function analysis |
| Design a Search Ranking System | Tech Dummies | ML system design example |
Interview Q&A
Q1: How do you decompose "user satisfaction" into an ML objective?
You cannot train a model to predict "user satisfaction" directly because there is no label for it in your logs. You need to decompose it into observable proxies.
Start by asking: what behaviors indicate satisfaction? For a content platform, satisfied users: (1) watch videos to completion, (2) return to the platform the next day, (3) do not immediately close the app after watching, (4) rate content positively when prompted.
Define a multi-label target: for each content item shown, construct labels for completion rate (regression: fraction watched), return visit (binary: did user open app within 24 hours?), and explicit satisfaction (binary: positive rating if rated). Train separate models for each and combine with a weighted score. The weights encode your product judgment about what "satisfaction" means.
The key insight is that satisfaction is a concept, not a measurement. You need to operationalize it into measurements - and each measurement is an imperfect proxy. Using multiple proxies and combining them is more robust than betting on any single proxy.
Q2: How do you construct labels when you have only implicit feedback (no explicit ratings)?
Implicit feedback gives you behavioral signals: clicks, scrolls, time spent, shares, re-reads. These are noisier than explicit ratings but far more abundant.
Best practices:
- Treat different behaviors as different signal strengths. A share is stronger than a like, which is stronger than a click, which is stronger than a scroll-past. Design your label to incorporate signal strength: weight positive examples by signal type.
- Treat the absence of engagement as a weak negative. If a user scrolled through 10 posts and engaged with one, the engaged post is a strong positive and the scrolled-past posts are weak negatives (the user saw them but did not engage).
- Use downstream behavior as a delayed label. If a user returns to the platform the next day, tag their engagement history from the prior session as positive. This avoids optimizing for immediate engagement over long-term satisfaction.
- Include explicit negative signals. Users who hide, report, or unsubscribe from content are sending a strong negative signal. Treat these as strong negative examples even if the user initially clicked.
Q3: Walk me through the proxy metric trap with a concrete example.
Consider Airbnb's search ranking system. The naive proxy metric is booking rate: rank listings by P(booking). The model learns what gets booked. But booking rate is affected by price (cheaper listings get booked more often), availability (listings available on popular dates get booked more), and location (central listings get booked more).
If you optimize booking rate, the model surfaces the cheapest, most centrally located listings with the best availability. This is not wrong per se, but it underserves quality listings that are slightly more expensive or less central but offer a superior guest experience - and superior guest experience is Airbnb's real differentiation.
The proxy metric - booking rate - is correlated with quality in the training distribution but decorrelated when pushed hard by optimization. Airbnb's actual ranking objective includes price-adjusted booking rate, review scores, and host response rate. Each term corrects for one way the simpler proxy could be gamed or mismeasured.
Q4: How do you design the objective function for a fraud detection system?
The fraud detection objective has three components: what to predict, what the label is, and how to handle the class imbalance.
What to predict: P(fraudulent | transaction features). Binary classification.
Label construction: a transaction is labeled as fraud when it results in a confirmed chargeback or when the fraud review team manually flags it. This introduces label delay (chargebacks take 60-90 days) and label noise (some fraud is never reported, some legitimate transactions are incorrectly flagged).
Class imbalance: fraud rates are typically 0.1-1% of transactions. Naive training on imbalanced data yields a model that predicts "legitimate" always, achieving 99%+ accuracy but zero fraud recall. Use:
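- Class weighting: upweight positive (fraud) examples, for example by the class ratio $w_{\text{pos}} = N_{\text{neg}} / N_{\text{pos}}$
- Focal loss: $\text{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$ - downweights easy negative examples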
- SMOTE: synthetic minority oversampling to balance the training set
Operating threshold: do not use the default 0.5 threshold. Set the threshold based on the precision-recall tradeoff that matches your business constraints. If you can tolerate 0.5% false positive rate (blocking 1 in 200 legitimate transactions), find the threshold that maximizes recall at that false positive rate.
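Picking the threshold for a false-positive budget can be sketched as a simple sweep. The scores and labels below are made up, the function name is an assumption, and the code assumes both classes are present in the evaluation set.

```python
def threshold_at_fpr(scores, labels, max_fpr):
    """Return (threshold, recall, fpr) for the lowest threshold with fpr <= max_fpr.

    Lowering the threshold can only raise recall and fpr, so the lowest
    threshold inside the budget is the one that maximizes recall.
    Returns None if even the highest threshold exceeds the budget.
    """
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(labels)
    best = None
    for t in sorted(set(scores), reverse=True):
        flagged = [y for s, y in zip(scores, labels) if s >= t]
        fpr = sum(1 for y in flagged if y == 0) / negatives
        recall = sum(flagged) / positives
        if fpr > max_fpr:
            break  # every lower threshold would also exceed the budget
        best = (t, recall, fpr)
    return best


# Hypothetical validation scores; label 1 = confirmed fraud.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

print(threshold_at_fpr(scores, labels, max_fpr=0.15))
print(threshold_at_fpr(scores, labels, max_fpr=0.0))
```

With a 15% budget on this toy data the chosen threshold is 0.70, not 0.5, and tightening the budget to zero costs a third of the recall. In practice the sweep runs on a large held-out set, since a 0.5% budget cannot be resolved with a handful of negatives.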
Q5: What are the tradeoffs between multi-task learning and training separate models for each objective?
Separate models: each model is independently optimizable. You can tune the architecture, features, and training process for each objective without affecting others. Simpler to debug - when model 1 degrades, it is isolated from model 2. But: features and representations are not shared; each model has its own infrastructure footprint; and the objectives may conflict in ways that are not explicitly managed.
Multi-task learning (MTL): train one model with multiple output heads, each predicting a different objective. Shared representations mean the model can transfer knowledge across tasks - features useful for predicting clicks are often useful for predicting completion. This is particularly valuable when one task has limited labeled data: learning from related tasks compensates.
The risk of MTL is task interference: if tasks have conflicting gradients, the shared representation may be pulled in incompatible directions and all tasks suffer. Mitigation: gradient surgery (project out conflicting gradient components), task-specific learning rates, or uncertainty-weighted loss: $\mathcal{L} = \sum_i \left( \frac{1}{2\sigma_i^2} \mathcal{L}_i + \log \sigma_i \right)$, where $\sigma_i$ is a learned uncertainty per task.
When to use MTL: when tasks are closely related (click and completion), when one task has abundant data and another is data-sparse, or when inference latency budget requires a single model call. When to use separate models: when tasks are loosely related, when interpretability requirements differ across tasks, or when different teams own different objectives and need to iterate independently.
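To show how the uncertainty-weighted loss behaves, here is a plain-Python sketch of the combination step. It parameterizes each task by $s_i = \log \sigma_i^2$, so each term is $\tfrac{1}{2} e^{-s_i} \mathcal{L}_i + \tfrac{1}{2} s_i$, which equals $\tfrac{1}{2\sigma_i^2}\mathcal{L}_i + \log \sigma_i$. The task names and loss values are hypothetical, and in a real model the log-variances would be trainable parameters rather than fixed numbers.

```python
import math


def uncertainty_weighted_loss(task_losses, log_variances):
    """Combine per-task losses using per-task log-variances s = log(sigma^2)."""
    total = 0.0
    for name, loss in task_losses.items():
        s = log_variances[name]
        # exp(-s) downweights a noisy task; the + 0.5 * s term stops s from growing freely
        total += 0.5 * math.exp(-s) * loss + 0.5 * s
    return total


# Hypothetical per-task losses from one training batch.
losses = {"click": 0.6, "completion": 1.4}

equal = uncertainty_weighted_loss(losses, {"click": 0.0, "completion": 0.0})
noisy_completion = uncertainty_weighted_loss(
    losses, {"click": 0.0, "completion": math.log(4.0)}
)
print(equal, noisy_completion)
```

With all log-variances at zero this reduces to half the plain sum of losses. Raising a task's variance shrinks that task's gradient contribution, which is the mechanism that lets the model balance tasks without hand-tuned weights.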
Case Study: Framing the LinkedIn Feed Ranking Problem
Walking through a complete framing exercise makes the abstract hierarchy concrete. Consider the LinkedIn feed ranking problem - a common ML design question at LinkedIn and a useful analog for any professional network.
Business goal: "Increase professional value derived from LinkedIn."
This is too vague. The first thing to do in an interview is push back and ask: how does LinkedIn make money? LinkedIn makes money through premium subscriptions, recruiter tools, and advertising. "Professional value" probably correlates with: users who get jobs through LinkedIn, users who successfully recruit through LinkedIn, users who consume content relevant to their career, and users who stay engaged enough to see ads.
Clarifying questions:
- Are we optimizing for job seekers, recruiters, both?
- Is the primary metric revenue (ad impressions × CPM) or retention (monthly active users)?
- What's the time horizon - immediate engagement or long-term retention?
Assume the interviewer answers: "Focus on content feed ranking. Goal is to increase meaningful engagement - not just likes, but content that users find professionally valuable. Measure by weekly active user retention."
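Proxy metric: weekly active user retention is a lagging indicator - you cannot optimize a model on whether users are retained 7 days from now, because you need training labels now. Choose a leading indicator that predicts retention, for example the weighted rate of meaningful engagement per feed impression:

$$
\text{proxy} = \frac{\sum_{\text{impressions}} \text{weighted meaningful engagement}}{\text{number of impressions}}
$$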
where "meaningful engagement" means: comment (weighted 5x), share with caption (weighted 4x), reaction (weighted 1x), click-through with dwell time greater than 60 seconds (weighted 3x). Explicitly exclude: reactions on posts from close connections (already personalized), reactions that occur within 2 seconds of scrolling past (accidental).
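ML objective: weighted multi-label classification. For each (user, post) pair, predict:
- $P(\text{comment})$
- $P(\text{share with caption})$
- $P(\text{click-through with dwell} > 60\text{s})$
- $P(\text{reaction})$
- $P(\text{quick skip})$ - strong negative signal

Final score:

$$
\text{score} = 5 \, P(\text{comment}) + 4 \, P(\text{share}) + 3 \, P(\text{click with dwell}) + 1 \, P(\text{reaction}) - 2 \, P(\text{quick skip})
$$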
Labels:
- Positive labels: extracted from engagement logs with timestamp constraints (engagement within 2 hours of impression)
- Negative labels: posts that were shown (appeared in the feed viewport for greater than 3 seconds) but received no engagement within 2 hours - these are true negatives, not just unranked examples
- Quick-skip negatives: posts that were scrolled past within 3 seconds of appearing - high-confidence negative signal
What this framing gets right: (1) it decomposes the complex objective into tractable predictions, (2) it uses time-bounded labels to avoid leakage, (3) it includes negative signals to prevent the model from learning only positive patterns, (4) the weights encode explicit product judgment about what "meaningful" means.
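The label rules above translate directly into a small labeling function. This is a sketch under assumptions: the `FeedImpression` fields are hypothetical log columns, and an impression viewed for exactly 3 seconds is treated as a quick skip, which the rules leave ambiguous.

```python
from dataclasses import dataclass
from typing import Optional

ENGAGEMENT_WINDOW_S = 2 * 60 * 60  # engagement must occur within 2 hours of impression
MIN_VIEW_S = 3.0                   # viewport time separating a quick skip from a real view


@dataclass
class FeedImpression:
    viewport_seconds: float
    seconds_to_engagement: Optional[float] = None  # None if the user never engaged


def impression_label(imp):
    """Classify one logged impression as 'positive', 'negative', or 'quick_skip'."""
    engaged_in_window = (
        imp.seconds_to_engagement is not None
        and imp.seconds_to_engagement <= ENGAGEMENT_WINDOW_S
    )
    if engaged_in_window:
        return "positive"
    if imp.viewport_seconds <= MIN_VIEW_S:
        return "quick_skip"  # scrolled past almost immediately
    return "negative"        # seen for a while, no engagement inside the window


print(impression_label(FeedImpression(10.0, 600.0)))
print(impression_label(FeedImpression(10.0, None)))
print(impression_label(FeedImpression(1.5, None)))
print(impression_label(FeedImpression(10.0, 3 * 60 * 60)))
```

The last case is the one the time bound exists for: engagement three hours later falls outside the window, so the impression is labeled negative rather than letting late events leak into the label.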
Worked Example: Search Ranking Objective
Search ranking is one of the most common ML design problems in interviews. The framing is deceptively simple - "rank documents by relevance" - and the details are where candidates succeed or fail.
Business goal: help users find what they are looking for quickly, increasing search satisfaction and platform retention.
Clarifying questions:
- What type of search? (web search, product search, video search, people search)
- What is the latency budget? (50ms for Google, 100ms for e-commerce is typical)
- Do we have explicit relevance judgments (human raters), or only implicit signals (clicks)?
Assume: product search on an e-commerce platform. 100ms latency. Implicit click signals plus some human-rated query-product pairs.
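Proxy metric: we want to rank products the user will purchase highly. Purchase is the ideal signal - but purchase happens for only 1-3% of searches. Click is more abundant but noisier (users click products that look good in the thumbnail but are not actually relevant). A weighted combination, for example:

$$
\text{relevance} = w_{\text{click}} \cdot \mathbb{1}[\text{click with sufficient dwell}] + w_{\text{purchase}} \cdot \mathbb{1}[\text{purchase}], \qquad w_{\text{purchase}} > w_{\text{click}}
$$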
ML objective: learning to rank. Train a pointwise ranker that predicts a relevance score $s(q, d)$ for query $q$ and document $d$. The ranker is trained on:
- Positive examples: (query, clicked product) pairs where the click was followed by dwell time greater than 30 seconds
- Negative examples: (query, shown product) pairs that were not clicked - sampled from products shown on the same page (hard negatives) rather than randomly sampled products (easy negatives)
- Expert labels: (query, product, relevance grade) pairs rated by human raters on a 5-point scale (Irrelevant, Slightly Relevant, Relevant, Highly Relevant, Perfect)
Why hard negatives? If you train with random negative samples, the model learns to separate obviously irrelevant products (random noise) from relevant ones - a trivial task. Hard negatives (products the system already ranked highly enough to show but that users did not click) are much more informative. They force the model to learn subtle relevance distinctions.
The label construction challenge: position bias. Products shown in position 1 get far more clicks than products in position 5, even if they are equally relevant. A model trained naively on position-confounded clicks learns to predict "is this product in position 1?" rather than "is this product relevant?" Mitigation: Inverse Propensity Scoring (IPS) - weight each clicked example by the inverse of the estimated probability that a user examines a result at that position, so that clicks at low-visibility positions count for more.
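A minimal sketch of IPS weighting applied to a click loss. The propensity values in `PROPENSITY` are made up for illustration (in practice they would be estimated, for example from randomized result swaps), and the weight clipping and the choice to leave non-clicks at weight 1 are assumptions of this sketch rather than part of the method as described above.

```python
import math

# Hypothetical examination propensities by position (1-indexed).
PROPENSITY = {1: 1.0, 2: 0.6, 3: 0.4, 4: 0.3, 5: 0.2}


def ips_weight(position, clicked, max_weight=10.0):
    """Inverse-propensity weight for one logged example.

    Clicks at rarely examined positions are up-weighted; the weight is
    clipped to limit variance. Non-clicks keep weight 1 in this sketch.
    """
    if not clicked:
        return 1.0
    return min(1.0 / PROPENSITY[position], max_weight)


def weighted_bce(examples):
    """Weighted binary cross-entropy.

    examples: list of (predicted_prob, clicked, position) tuples.
    """
    total, weight_sum = 0.0, 0.0
    for p, clicked, position in examples:
        w = ips_weight(position, clicked)
        y = 1.0 if clicked else 0.0
        total += -w * (y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
        weight_sum += w
    return total / weight_sum


batch = [(0.7, True, 1), (0.7, True, 5), (0.2, False, 3)]
weights = [ips_weight(pos, clicked) for _, clicked, pos in batch]
loss = weighted_bce(batch)
```

With these illustrative propensities, the click at position 5 counts five times as much as the click at position 1, which offsets the fact that position 5 is seen far less often.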
The Objective Function in Code
Translating the framing into a concrete loss function makes the design tangible. Here is how the multi-objective news feed ranking objective translates to a training setup:
import torch
import torch.nn as nn
import torch.nn.functional as F


class NewsFeedRanker(nn.Module):
    """
    Multi-task model for news feed ranking.
    Predicts P(comment), P(share), P(click_dwell), P(reaction), P(quick_skip)
    """

    def __init__(self, input_dim: int, hidden_dim: int = 256):
        super().__init__()
        # Shared representation used by every task head
        self.shared = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.LayerNorm(hidden_dim),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        # Task-specific output heads (one logit per example)
        self.comment_head = nn.Linear(hidden_dim, 1)
        self.share_head = nn.Linear(hidden_dim, 1)
        self.dwell_head = nn.Linear(hidden_dim, 1)
        self.reaction_head = nn.Linear(hidden_dim, 1)
        self.quick_skip_head = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        shared = self.shared(x)
        # Each output has shape (batch, 1) and is a probability in (0, 1)
        return {
            'p_comment': torch.sigmoid(self.comment_head(shared)),
            'p_share': torch.sigmoid(self.share_head(shared)),
            'p_dwell': torch.sigmoid(self.dwell_head(shared)),
            'p_reaction': torch.sigmoid(self.reaction_head(shared)),
            'p_quick_skip': torch.sigmoid(self.quick_skip_head(shared)),
        }


def ranking_score(outputs: dict) -> torch.Tensor:
    """Compute the final ranking score from multi-task outputs."""
    return (
        5.0 * outputs['p_comment']
        + 4.0 * outputs['p_share']
        + 3.0 * outputs['p_dwell']
        + 1.0 * outputs['p_reaction']
        - 2.0 * outputs['p_quick_skip']  # quick skips lower the score
    )


def multi_task_loss(outputs: dict, labels: dict) -> torch.Tensor:
    """
    Compute multi-task binary cross-entropy loss.
    Each task is weighted by its business importance.
    """
    task_weights = {
        'p_comment': 5.0,
        'p_share': 4.0,
        'p_dwell': 3.0,
        'p_reaction': 1.0,
        'p_quick_skip': 2.0,
    }
    total_loss = 0.0
    for task, weight in task_weights.items():
        task_loss = F.binary_cross_entropy(
            # squeeze(-1) drops only the trailing dim, so a batch of
            # size 1 keeps its batch dimension and matches the labels
            outputs[task].squeeze(-1),
            labels[task].float(),
            reduction='mean'
        )
        total_loss += weight * task_loss
    return total_loss


# Training step
model = NewsFeedRanker(input_dim=512)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)


def train_step(batch_features, batch_labels):
    optimizer.zero_grad()
    outputs = model(batch_features)
    loss = multi_task_loss(outputs, batch_labels)
    loss.backward()
    optimizer.step()
    return loss.item(), ranking_score(outputs).mean().item()
This code makes the objective function concrete: five tasks, each predicting one behavioral signal, with weights encoding product judgment. The multi-task loss sums the weighted per-task losses into one scalar, and its gradient updates both the task heads and the shared representation. The ranking score is a weighted combination of the task outputs, with the quick-skip probability subtracted - directly implementing the proxy metric defined in the framing stage.
How to Clarify Constraints in an Interview
Most ML design questions are intentionally underspecified. The constraints you clarify in the first 5 minutes shape every subsequent decision. Here is a structured set of clarifying questions for any ML design problem:
Scale constraints
- How many users does the system serve? (thousands, millions, billions)
- How many predictions per second? (determines serving infrastructure choice)
- How much training data is available? (shapes model complexity choices)
Latency constraints
- What is the end-to-end latency budget? (50ms, 200ms, offline batch)
- Is this real-time serving or batch prediction? (batch: more complex models OK; real-time: latency is a hard constraint)
- What is the acceptable tail latency? (p99 latency, not just mean)
Accuracy vs cost constraints
- What is the cost of a false positive? (blocking a legitimate transaction, showing irrelevant content)
- What is the cost of a false negative? (missing fraud, not surfacing relevant content)
- What is the minimum acceptable precision/recall at the operating threshold?
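The false positive and false negative costs from these questions translate directly into an operating threshold. The sketch below picks the threshold that minimizes total cost on a labeled validation sample; the function names, the toy scores, and the cost values are all illustrative assumptions.

```python
def expected_cost(y_true, y_prob, threshold, cost_fp, cost_fn):
    """Total cost of acting on predictions at a given threshold."""
    cost = 0.0
    for y, p in zip(y_true, y_prob):
        flagged = p >= threshold
        if flagged and y == 0:
            cost += cost_fp      # false positive
        elif not flagged and y == 1:
            cost += cost_fn      # false negative
    return cost


def best_threshold(y_true, y_prob, cost_fp, cost_fn, candidates):
    """Return the candidate threshold with the lowest total cost."""
    return min(
        candidates,
        key=lambda t: expected_cost(y_true, y_prob, t, cost_fp, cost_fn),
    )


# Toy validation sample (illustrative values only)
y_true = [0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.20, 0.35, 0.40, 0.55, 0.70, 0.90]
candidates = [0.1, 0.3, 0.5, 0.7]

# Misses are expensive (fraud-like): prefer a low threshold
fn_costly = best_threshold(y_true, y_prob, 1.0, 20.0, candidates)
# False alarms are expensive: prefer a high threshold
fp_costly = best_threshold(y_true, y_prob, 5.0, 1.0, candidates)
```

The same model yields different operating points depending on the cost ratio, which is why the cost questions belong in the first five minutes rather than after the model is built.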
Data constraints
- What labeled data exists? (interaction logs, human annotations, both?)
- Are there privacy constraints on the data? (GDPR, HIPAA, COPPA for minors)
- How fresh does the model need to be? (daily retraining, weekly, monthly?)
Business constraints
- Are there fairness or non-discrimination requirements? (models in credit, hiring, housing)
- Are there explainability requirements? (regulated industries need feature importance or SHAP)
- Are there budget constraints on inference cost? (GPU serving vs CPU serving)
Asking these questions at the start of an interview demonstrates that you understand ML systems in their full context - not just as optimization problems, but as products that must satisfy business, legal, and operational constraints simultaneously.
Objective Functions and Loss Functions: The Translation Layer
Once you have defined the ML objective - what to predict, what the label is, what counts as a positive - the next step is translating it into a loss function that trains the model correctly. This translation has several important choices.
Binary Cross-Entropy for Click Prediction
The standard loss for binary classification is binary cross-entropy:
$$\mathcal{L}_{\text{BCE}} = -\left[\, y \log \hat{p} + (1 - y) \log(1 - \hat{p}) \,\right]$$
where $y \in \{0, 1\}$ is the label and $\hat{p}$ is the model's predicted probability of the positive class.
For click prediction, $y = 1$ if the user clicked on the post within 1 hour of seeing it, and $y = 0$ if not. The loss penalizes the model when it assigns low probability to posts the user clicked and high probability to posts they did not click.
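A few concrete values show how binary cross-entropy behaves for a clicked post. This is a plain-Python sketch; the function name and the `eps` clamp are illustrative choices.

```python
import math


def binary_cross_entropy(y, p, eps=1e-12):
    """BCE for one example: -[y*log(p) + (1-y)*log(1-p)]."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


# The user clicked (y = 1); compare three predictions
confident_right = binary_cross_entropy(1, 0.9)
unsure = binary_cross_entropy(1, 0.5)
confident_wrong = binary_cross_entropy(1, 0.1)
```

The loss grows slowly as the prediction drifts toward 0.5 and sharply as the model becomes confidently wrong, which is what pushes the predicted probabilities toward the observed click rates.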
Pairwise Ranking Loss for Search
For ranking problems, pairwise losses compare pairs of items and train the model to assign a higher score to the more relevant item:
$$\mathcal{L}_{\text{pair}} = \max\left(0,\; 1 - \left(s(q, d^{+}) - s(q, d^{-})\right)\right)$$
This is the hinge loss for ranking (BPR, Bayesian Personalized Ranking, uses a sigmoid variant). For each query $q$, a relevant document $d^{+}$ (one the user clicked), and an irrelevant document $d^{-}$ (one that was shown but not clicked), the loss is zero if the relevant document is scored higher by at least margin 1. If not, the loss equals the shortfall.
Training with pairwise loss directly optimizes the ordering, not individual score accuracy. This is more aligned with ranking metrics (NDCG, MRR) than pointwise losses.
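A small sketch of the two pairwise losses mentioned above, operating on already-computed scores. The function names are illustrative; the BPR form shown is the standard negative log-sigmoid of the score difference.

```python
import math


def pairwise_hinge(score_pos, score_neg, margin=1.0):
    """Zero once the relevant item leads by the margin, else the shortfall."""
    return max(0.0, margin - (score_pos - score_neg))


def bpr_loss(score_pos, score_neg):
    """Sigmoid variant: -log(sigmoid(score_pos - score_neg))."""
    diff = score_pos - score_neg
    return math.log1p(math.exp(-diff))


well_separated = pairwise_hinge(2.5, 1.0)   # lead of 1.5, no loss
inside_margin = pairwise_hinge(1.2, 1.0)    # lead of 0.2, shortfall of 0.8
wrong_order = pairwise_hinge(0.5, 1.0)      # negative ranked above positive
```

Only the score difference enters either loss, so adding a constant to every score changes nothing. That is the sense in which pairwise training targets ordering rather than score accuracy.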
Listwise Ranking Loss for Full-List Optimization
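Listwise losses optimize over the entire ranked list, not just pairs. LambdaRank and LambdaMART are designed to optimize NDCG:
$$\text{DCG@}k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}, \qquad \text{NDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}$$
where $rel_i$ is the relevance grade of the item at rank $i$ and IDCG is the ideal DCG (the maximum possible DCG for this query). NDCG depends on sort order and has no useful gradient of its own, so LambdaRank instead scales each pairwise gradient by the change in NDCG that swapping the two items would cause - a technique used in production at Microsoft Bing and many other search systems.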
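NDCG is straightforward to compute for a single query, which helps when sanity-checking an offline evaluation. The sketch below assumes the common exponential gain `2**rel - 1` with a `log2(rank + 1)` discount; some systems use a linear gain instead.

```python
import math


def dcg_at_k(relevances, k):
    """DCG over the top-k items, given relevance grades in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_k([3, 2, 1, 0], k=4)    # already in ideal order
reversed_ = ndcg_at_k([0, 1, 2, 3], k=4)  # best item ranked last
```

The ideal ordering is computed from the full list before truncating to `k`, so a highly relevant item that the ranker placed below position `k` still lowers the score.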
Calibration: Making Probabilities Trustworthy
An ML model's output probability is calibrated if $P(y = 1 \mid \hat{p} = p) = p$ for every $p$ - that is, if the model predicts 80% probability, 80% of those examples should actually be positive.
Calibration matters when:
- You are composing multiple models (CTR model × bid price model × quality score) - if any sub-model is miscalibrated, the composition is wrong
- You are setting an operating threshold - "show this ad if predicted CTR greater than 0.5%" requires calibrated probabilities
- You are communicating uncertainty to downstream systems or users
Platt scaling is the simplest calibration method. After training, fit a logistic regression on the validation set with the model's raw scores as inputs and the true labels as outputs:
$$P(y = 1 \mid s) = \sigma(a s + b) = \frac{1}{1 + e^{-(a s + b)}}$$
where $s$ is the model's raw score and $a, b$ are learned on the validation set. Isotonic regression is a more flexible non-parametric alternative.
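import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data and model so the snippet runs end to end;
# substitute your own model and train/validation/test splits.
X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.5, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Fit calibration on the validation set, never on the training set
# val_scores: model output scores (e.g., from gradient boosting)
# val_labels: true binary labels
val_scores = model.predict_proba(X_val)[:, 1]
val_labels = y_val

# Platt scaling: fit a sigmoid to (score, label) pairs
calibrator = LogisticRegression()
calibrator.fit(val_scores.reshape(-1, 1), val_labels)

# Calibrated probabilities on the test set
test_scores = model.predict_proba(X_test)[:, 1]
calibrated_probs = calibrator.predict_proba(
    test_scores.reshape(-1, 1)
)[:, 1]


# Verify calibration with a reliability diagram
def reliability_diagram(y_true, y_prob, n_bins=10):
    """Check if predicted probabilities match empirical frequencies."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    bins = np.linspace(0, 1, n_bins + 1)
    bin_means = []
    bin_freqs = []
    for i, (lo, hi) in enumerate(zip(bins[:-1], bins[1:])):
        # The last bin includes its right edge so y_prob == 1.0 is counted
        upper = (y_prob <= hi) if i == n_bins - 1 else (y_prob < hi)
        mask = (y_prob >= lo) & upper
        if mask.sum() > 0:
            bin_means.append(y_prob[mask].mean())
            bin_freqs.append(y_true[mask].mean())
    return np.array(bin_means), np.array(bin_freqs)


bin_means, bin_freqs = reliability_diagram(y_test, calibrated_probs)
# Perfect calibration: bin_means ≈ bin_freqs (diagonal line)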
Designing for Regulatory Constraints
Some ML domains operate under regulatory constraints that fundamentally shape the objective function. Understanding these constraints is part of problem framing, not an afterthought.
Regulated Financial Services
Credit scoring in the US is governed by the Equal Credit Opportunity Act (ECOA) and, for housing-related lending, the Fair Housing Act (FHA). These laws prohibit discrimination based on race, color, religion, national origin, sex, marital status, and other protected characteristics. This imposes constraints on both the features you can use and the objective function you can optimize.
Features: you cannot use protected attributes such as race or sex as direct features. Features that are highly correlated with protected attributes (zip code is highly correlated with race in many US cities) may also violate fair lending law through disparate impact, even when there is no intent to discriminate.
Objective function constraints: the model must satisfy group fairness constraints - for example, the approval rate gap between protected groups must stay within a tolerance agreed with compliance and regulators. This turns the optimization problem from "maximize approval rate for qualified applicants" to "maximize approval rate for qualified applicants subject to fairness constraints."
A common approach: add a fairness penalty to the loss function:
$$\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda \cdot \mathcal{L}_{\text{fairness}}$$
where the fairness violation term $\mathcal{L}_{\text{fairness}}$ measures, for example, the difference in false positive rate between demographic groups (adverse action rate disparity). The weight $\lambda$ is tuned to satisfy regulatory requirements.
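A minimal sketch of a penalized loss of this kind. For simplicity it uses the approval rate gap between groups as the violation term; a false positive rate gap would plug into the same slot. The function names, group labels, and numbers are illustrative assumptions.

```python
def mean(values):
    return sum(values) / len(values)


def approval_rate_gap(approval_probs, groups):
    """Largest difference in mean predicted approval rate across groups."""
    by_group = {}
    for p, g in zip(approval_probs, groups):
        by_group.setdefault(g, []).append(p)
    rates = [mean(ps) for ps in by_group.values()]
    return max(rates) - min(rates)


def penalized_loss(task_loss, approval_probs, groups, lam):
    """Task loss plus lam times the fairness violation."""
    return task_loss + lam * approval_rate_gap(approval_probs, groups)


probs = [0.8, 0.6, 0.4, 0.2]
groups = ["a", "a", "b", "b"]
gap = approval_rate_gap(probs, groups)        # about 0.7 - 0.3
total = penalized_loss(0.5, probs, groups, lam=2.0)
```

Raising `lam` trades task accuracy for a smaller gap. In a real training loop the gap would be computed on model outputs inside the autograd graph so that the penalty contributes gradients.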
Explainability: US regulations require that lenders provide adverse action notices explaining why a credit application was denied. A black-box model with no reliable way to extract the specific reasons behind a decision cannot satisfy this requirement. The model must be interpretable enough to produce per-application explanations: "Your application was denied primarily because your debt-to-income ratio (42%) exceeded our threshold (35%)."
Healthcare AI
HIPAA governs the use of protected health information (PHI) in model training. A vendor generally cannot train on a covered entity's patient records without a Business Associate Agreement (BAA). Data de-identified under HIPAA's Safe Harbor method (all 18 identifiers removed) falls outside the Privacy Rule, though contracts and other laws may still restrict its use.
FDA 510(k) clearance or Pre-Market Approval (PMA) is required for AI/ML systems that are "intended to influence clinical decisions." This regulatory pathway imposes requirements on training data documentation, validation study design, and post-market performance monitoring.
The practical implication for ML framing: in healthcare AI, the model objective must account for regulatory requirements from day one. "We'll add explainability later" is not a viable plan if regulatory approval requires explainability. Define the explainability requirement in Step 1 of the framing exercise, not Step 5.
Translating Framing into a Model Card
The outputs of the framing exercise should be documented in a model card - a structured document that records the model's intended use, limitations, performance characteristics, and ethical considerations.
Google introduced model cards in 2019. They are now widely used at Google, Meta, Hugging Face (for open-source models), and increasingly required by enterprise ML governance policies.
Key sections of a model card:
## Model Details
- **Model name:** News Feed Engagement Predictor v3.2
- **Model type:** Multi-task binary classifier (5 tasks)
- **Intended use:** Rank posts in Facebook News Feed by expected meaningful engagement
- **Out-of-scope uses:** Any use outside of News Feed ranking; any use for targeting based on protected characteristics
## Training Data
- **Source:** Interaction logs, January 2023 – December 2023
- **Label:** Behavioral engagement signals (comment, share, reaction, dwell time, quick skip)
- **Label window:** 2-hour window after impression
- **Known limitations:** Labels are implicit (behavioral), not explicit (user preference)
## Evaluation Data
- **Holdout period:** January 2024 – February 2024 (temporal holdout)
- **Key metrics:**
  - AUC-ROC: 0.84 (comment task), 0.81 (share task), 0.76 (reaction task)
  - Online A/B test: +3.2% meaningful engagement rate, +1.1% 7-day retention
## Ethical Considerations
- Model outputs influence content visibility for 2B+ users
- Known risk: optimizing for engagement can amplify emotionally provocative content
- Mitigation: explicit penalty for content receiving high hide/report rates
- Fairness audit: engagement prediction parity tested across gender and age groups
## Limitations
- Performance degrades for new users with fewer than 10 posts in history (cold start)
- Not tested on users in regions with less than 1% of training data representation
- Assumes stable engagement behavior; may require retraining during major world events
Producing a model card forces precision about the objective function, evaluation methodology, known limitations, and ethical considerations. An ML engineer who can articulate all of these - before deployment - is far more likely to ship a model that works as intended.
When Heuristics Beat ML: The Framing Decision
One of the least-discussed outputs of the framing exercise is: should this be an ML problem at all?
ML is not always the right tool. Before investing in data collection, feature engineering, and model training, it is worth asking whether a rule-based system or simple heuristic could solve the problem adequately. The reasons to avoid ML when it is unnecessary:
- ML models require maintenance. They drift, they need retraining, they require monitoring infrastructure. A rule-based system is static and predictable.
- ML models are harder to debug. When a rule fires incorrectly, you can read the rule and fix it. When a model makes a wrong prediction, diagnosing the root cause requires feature analysis, error analysis, and sometimes interpretability tools.
- ML models require labeled data. Collecting and labeling training data is expensive. If a rule achieves 95% of the performance and the use case does not justify the additional 5%, the rule is the right choice.
- ML introduces regulatory complexity. Automated decisions by ML models are subject to increasing regulatory scrutiny (EU AI Act, algorithmic accountability laws). A rule-based system is fully explainable and auditable.
A framework for deciding between rules and ML:
| Criterion | Rules/Heuristics Win | ML Wins |
|---|---|---|
| Rule complexity | Can encode in fewer than 50 rules | Requires thousands of rules to approximate behavior |
| Edge case coverage | Edge cases are enumerable | Edge cases are numerous and unpredictable |
| Distribution stability | Input distribution is stable | Input distribution changes over time |
| Performance gap | Acceptable with rules | Significant gap that ML can close |
| Data availability | Insufficient labeled data | Abundant labeled data |
| Regulatory context | Requires full explainability | Explainability not required |
The canonical example: a new fraud detection system at a startup with 10,000 transactions per day. The fraud team has domain expertise. They know specific fraud patterns: cards from certain country combinations, specific transaction velocity patterns, specific merchant categories combined with specific card types. They can encode these patterns in rules.
An ML model at this scale would have too little training data to learn anything more than what the rules already capture. The rules are more interpretable, easier to update when fraud patterns change, and require no ML infrastructure. At 10M transactions per day with 100,000+ fraud cases in training data, the calculus flips - ML can detect patterns too complex for rules.
In an interview, demonstrating that you considered whether ML is necessary - and can articulate the conditions under which it is and is not - signals engineering maturity that purely model-focused candidates do not demonstrate.
:::tip The best ML system is sometimes no ML
At a large fintech, a team spent three months building an ML model for detecting duplicate invoices. The model achieved 94% accuracy. A rule-based system using invoice number normalization + fuzzy string matching on the vendor name and amount achieved 91% accuracy - in two weeks, with no training data and no retraining infrastructure. The ML model was 3 percentage points better and 10x more expensive to maintain. The rule-based system stayed in production for four years.
:::
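To make the rule-based alternative concrete, here is a sketch of what a duplicate-invoice rule of that kind might look like using only the standard library. It is not the system from the anecdote: the field names, the normalization rule, and both thresholds are hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize_invoice_number(raw):
    """Uppercase, drop separators, then strip a letter prefix and leading zeros."""
    cleaned = re.sub(r"[^A-Z0-9]", "", raw.upper())
    return re.sub(r"^[A-Z]*0*", "", cleaned)


def vendor_similarity(a, b):
    """Fuzzy string similarity in [0, 1] between two vendor names."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def is_duplicate(inv_a, inv_b, vendor_threshold=0.85, amount_tolerance=0.01):
    """Flag two invoices as duplicates when all three rules agree."""
    same_number = (
        normalize_invoice_number(inv_a["number"])
        == normalize_invoice_number(inv_b["number"])
    )
    similar_vendor = (
        vendor_similarity(inv_a["vendor"], inv_b["vendor"]) >= vendor_threshold
    )
    same_amount = abs(inv_a["amount"] - inv_b["amount"]) <= amount_tolerance
    return same_number and similar_vendor and same_amount


a = {"number": "INV-00123", "vendor": "Acme Corp.", "amount": 1500.00}
b = {"number": "inv 123", "vendor": "ACME Corp", "amount": 1500.00}
c = {"number": "INV-00124", "vendor": "Acme Corp.", "amount": 1500.00}

dup_ab = is_duplicate(a, b)
dup_ac = is_duplicate(a, c)
```

Every decision this function makes can be traced to one of three readable conditions, which is the debugging and auditability advantage described above.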
The Distinction Between Optimization and Decision
The final, most important insight in ML problem framing: the model is an optimization engine, not a decision-maker. The objective function encodes human values. The model maximizes those values. If the values encoded in the objective function are wrong, the model will faithfully optimize the wrong thing.
This is not a technical failure - it is a values failure disguised as a technical problem. YouTube's extremist content rabbit hole was not a model bug. The model was doing exactly what it was designed to do: maximize watch time. The bug was in the objective function: watch time was chosen as a proxy for value when it was actually a proxy for compulsive engagement.
The lesson for ML engineers: you are not just solving optimization problems. You are deciding what values to encode in the objective function. "What should the model optimize?" is a product, business, and ethical question, not just a technical one. Sitting in the framing conversation with business stakeholders and asking "what happens if we optimize this metric aggressively?" is one of the highest-value contributions an ML engineer can make.
When an interviewer asks you to design an ML system, they are evaluating whether you understand this. A candidate who asks "what are the risks if we optimize this proxy metric too hard?" signals depth that a candidate who jumps straight to architecture never does.
Quick Reference: Prediction Type Decision Guide
Use this table when you need to quickly decide which prediction type fits a given ML problem:
| Situation | Prediction Type | Loss Function | Example |
|---|---|---|---|
| Yes/no outcome, rare positive class | Binary classification | Focal loss or weighted BCE | Fraud detection (0.1% positive rate) |
| Yes/no outcome, balanced classes | Binary classification | Binary cross-entropy | Email spam (50% spam rate on filtered data) |
| One of K mutually exclusive outcomes | Multi-class classification | Categorical cross-entropy | Query intent (K=10 intents) |
| Each item has multiple labels | Multi-label classification | Per-class BCE, independently | Tag prediction (multiple tags per post) |
| Continuous positive outcome | Regression | Mean squared error (MSE) or MAE | ETA prediction in seconds |
| Heavy-tailed continuous outcome | Regression on log | MSE on log-transformed target | Revenue prediction |
| Rank items by relevance | Pointwise ranking | BCE on (query, item) pair | Feed ranking, ad ranking |
| Optimize full ranking list | Listwise ranking | LambdaRank / LambdaMART | Search result ranking |
| Multiple related objectives | Multi-task learning | Weighted sum of per-task losses | News feed (click + share + completion) |
This table is a starting point, not a rule book. The right choice always depends on the data, the label availability, the latency constraints, and the business goal.
Key Takeaways
The framing hierarchy - Business Goal → Proxy Metric → ML Objective - is the foundation of every ML system design interview. Interviewers are evaluating whether you default to defining the problem before reaching for a solution.
The proxy metric trap is real and well documented: it has shown up on major platforms that optimized a single metric aggressively. The fix is multi-objective optimization with terms that correct for metric gaming.
Label construction is not a detail - it is a core design decision. Implicit feedback bias, delayed labels, and survivorship bias each require explicit handling. The quality of your labels determines the ceiling of your model's performance.
The eight-step framework gives you a reliable scaffold for any ML system design problem. Use it in interviews and in real projects. It is not a checklist to rush through - it is a thinking tool to slow you down long enough to get the framing right.
The code examples in this lesson illustrate how abstract framing decisions become concrete model architectures and loss functions. Multi-task learning with weighted losses is the direct implementation of a multi-objective proxy metric. Position bias correction with IPS is the direct implementation of accounting for selection bias in click data.
Regulatory constraints, ethical considerations, and model cards are not separate from the framing exercise - they are part of it. The framing conversation is where you decide not just what to optimize but what you are not allowed to optimize, and what you must document about your choices.
