ML Knowledge Round - Show Your Depth

Reading time: ~20 min | Interview relevance: Critical | Roles: MLE, DS, RE

The Real Interview Moment

The interviewer leans forward: "You're training a model and the training loss is decreasing but the validation loss is flat from the start. What's different about this from the case where validation loss increases? What could cause this, and what would you do?"

This isn't the simple "training loss goes down, validation loss goes up = overfitting" scenario. The validation loss is flat, not increasing. This requires deeper thinking - is the model not learning anything generalizable at all? Is there a data leakage issue? Is the validation set from a different distribution? The ML Knowledge round tests whether you can reason about ML from first principles, not just recite definitions.
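To make the distinction concrete, here is a minimal sketch that labels a pair of loss curves by their overall trend. The function name, the labels, and the `tol` threshold are illustrative assumptions, not a standard API, and real curves are noisy enough that you would normally smooth them first.

```python
def classify_loss_curves(train_loss, val_loss, tol=0.01):
    """Label per-epoch loss curves by their overall trend.

    Assumes positive losses. `tol` is a hypothetical relative-change
    threshold below which a curve counts as flat.
    """
    def rel_change(curve):
        # Relative change from the first to the last epoch (negative = improving).
        return (curve[-1] - curve[0]) / abs(curve[0])

    if rel_change(train_loss) >= -tol:
        return "not_learning"
    # Validation loss ends clearly above its best value: it degraded at some point.
    if val_loss[-1] > min(val_loss) * (1 + tol):
        return "overfitting"
    # Validation loss never moved while training loss fell.
    if abs(rel_change(val_loss)) <= tol:
        return "flat_validation"
    return "healthy"


train = [1.0, 0.6, 0.3, 0.2]
flat_case = classify_loss_curves(train, [0.9, 0.9, 0.9, 0.9])
rising_case = classify_loss_curves(train, [0.9, 0.7, 0.8, 1.0])
```

In the flat case the model never picked up anything that transfers to the validation set, so the first things to check are the split itself, the labels, and whether validation data goes through the same preprocessing as training data. In the rising case the model generalized at first and then started memorizing, which points to regularization or early stopping.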

What You Will Master

  • The format and scoring of ML knowledge rounds
  • How to structure ML answers (the WHAT-WHY-WHEN framework)
  • The 30 most-asked ML questions with answer templates
  • How to handle questions you don't know
  • The depth expected at different levels (L3 vs. L5 vs. L6+)

Part 1 - How the Round Works

Format

Aspect | Details
Duration | 45-60 minutes
Questions | 5-8 questions, increasing depth
Style | Conversational - interviewer probes deeper based on your answers
Scoring | Depth of understanding, ability to explain intuition, awareness of trade-offs

The Answer Framework: WHAT → WHY → WHEN → TRADE-OFFS

For every ML concept, structure your answer as:

  1. WHAT: Define the concept clearly (1-2 sentences)
  2. WHY: Explain the intuition - why does this work? (2-3 sentences)
  3. WHEN: When to use it and when NOT to (examples)
  4. TRADE-OFFS: Limitations and alternatives

Interviewer's Perspective

I grade ML knowledge answers on a spectrum. Level 1: "Can define it." Level 2: "Can explain the intuition." Level 3: "Knows when to use it and when not to." Level 4: "Can discuss trade-offs and alternatives." Level 5: "Can teach me something I didn't know." Most candidates are at Level 1-2. Strong Hire candidates consistently reach Level 3-4.

Part 2 - The 30 Must-Know Questions

Fundamentals (Asked at Every Level)

Q1: Explain the bias-variance trade-off.

Template answer: "The bias-variance trade-off describes the tension between two sources of model error. Bias is error from oversimplified assumptions - the model can't capture the true pattern (underfitting). Variance is error from sensitivity to the particular training set - the model fits noise (overfitting). For squared-error loss, expected error = bias² + variance + irreducible noise. Simple models tend to have high bias and low variance. Complex models tend to have low bias and high variance. The goal is to find the sweet spot. Practically, I use learning and validation curves to diagnose: if train and val error are both high, the model is too simple (high bias). If train error is low but val error is high, the model is too complex (high variance)."
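If the interviewer asks you to back up the decomposition, a small simulation helps. The sketch below estimates bias² and variance at one test point by refitting two deliberately extreme models on many noisy resamples. The target function, noise level, and both toy models are assumptions chosen for illustration.

```python
import random


def true_fn(x):
    # Hypothetical ground-truth function for the simulation.
    return x * x


def constant_model(xs, ys, x0):
    # Ignores x0 entirely and predicts the mean label: very simple, high bias.
    return sum(ys) / len(ys)


def nearest_neighbor_model(xs, ys, x0):
    # Copies the label of the closest training point, noise included: high variance.
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x0))
    return ys[i]


def bias_variance(model, x0=0.9, n_points=20, noise_sd=0.3, trials=500, seed=0):
    """Estimate bias^2 and variance of `model` at x0 over resampled training sets."""
    rng = random.Random(seed)
    xs = [i / (n_points - 1) for i in range(n_points)]
    preds = []
    for _ in range(trials):
        # Same inputs each trial, fresh label noise.
        ys = [true_fn(x) + rng.gauss(0.0, noise_sd) for x in xs]
        preds.append(model(xs, ys, x0))
    mean_pred = sum(preds) / trials
    bias_sq = (mean_pred - true_fn(x0)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / trials
    return bias_sq, variance


bias_const, var_const = bias_variance(constant_model)
bias_nn, var_nn = bias_variance(nearest_neighbor_model)
```

The constant model should come out with the larger bias² and the 1-nearest-neighbor model with the larger variance, which is the trade-off in miniature.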

Q2: How do you handle overfitting?

(See the detailed answer template in the MLE Role page)

Q3: L1 vs. L2 regularization - when do you use each?

Template: "Both add a penalty to the loss function to prevent large weights. L2 (Ridge) adds the sum of squared weights - it shrinks all weights toward zero but rarely makes them exactly zero. Good when all features are potentially useful. L1 (Lasso) adds the sum of absolute weights - it drives some weights to exactly zero, performing implicit feature selection. Good when you suspect many features are irrelevant. Geometrically: L1's diamond-shaped constraint region has corners on the axes, which is why solutions tend to be sparse. In practice, I start with L2 unless I explicitly want feature selection."
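A one-dimensional sketch shows why L1 produces exact zeros and L2 does not. For a single weight with unpenalized solution `z`, both penalized problems have closed-form minimizers. The function names and the example weights below are illustrative assumptions.

```python
def l1_shrink(z, lam):
    # Soft-thresholding: the minimizer of 0.5 * (w - z)**2 + lam * abs(w).
    # Anything with |z| <= lam is set to exactly zero.
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def l2_shrink(z, lam):
    # The minimizer of 0.5 * (w - z)**2 + lam * w**2.
    # Scales every weight by the same factor, so nonzero stays nonzero.
    return z / (1.0 + 2.0 * lam)


unpenalized = [2.0, -0.3, 0.05, -1.2]
l1_weights = [l1_shrink(z, 0.5) for z in unpenalized]
l2_weights = [l2_shrink(z, 0.5) for z in unpenalized]
```

With `lam = 0.5`, L1 zeroes out the two small weights and subtracts a fixed amount from the large ones, while L2 halves every weight and leaves all four nonzero.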

Q4: Explain gradient descent and its variants.

Template: "Gradient descent minimizes a loss function by iteratively stepping in the direction of the negative gradient. Batch GD computes the gradient on the full dataset - stable but slow. Stochastic GD (SGD) uses one sample per step - noisy but cheap per step, and the noise can help it escape shallow local minima. Mini-batch GD (most common) uses a batch of samples - balances speed and stability. Adam combines momentum (exponential moving average of past gradients) and RMSProp (adapts the learning rate per parameter). Adam often converges faster in practice and usually needs less learning-rate tuning. I use Adam as my default optimizer, and SGD with momentum for fine-tuning when I want tighter convergence control."
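Being able to write the Adam update from memory is a common follow-up. The sketch below applies it to a single scalar parameter. The hyperparameter defaults are the commonly used ones, and the quadratic objective is an assumption picked so the answer is easy to check.

```python
import math


def adam_minimize(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimize a 1-D function with Adam, given its gradient function."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g        # momentum: EMA of gradients
        v = beta2 * v + (1 - beta2) * g * g    # RMSProp part: EMA of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero init
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

Note that the step size is roughly `lr` regardless of the gradient's scale early on, because `m_hat / sqrt(v_hat)` is close to ±1. That scale invariance is a large part of why Adam is forgiving about the learning rate.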

Q5: What evaluation metrics would you use for a classification problem?

Template: "It depends on the problem. Accuracy is misleading with class imbalance. For balanced classes: accuracy + F1. For imbalanced problems (fraud detection, disease diagnosis): precision-recall AUC is usually more informative than ROC-AUC because it focuses on the positive (minority) class and is not inflated by the large number of true negatives. Precision matters when false positives are costly (spam filter - don't want to block real emails). Recall matters when false negatives are costly (cancer screening - don't want to miss a case). In production, I pair offline metrics with online A/B test metrics to validate that offline improvements translate to real-world impact."
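A quick way to show why accuracy misleads under imbalance is to score a classifier that always predicts the majority class. This is a minimal sketch; the helper name and the convention of returning 0.0 when a ratio is undefined are assumptions.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Convention (an assumption): undefined ratios are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# 1% positive rate, and a "model" that never predicts the positive class.
y_true = [1] * 10 + [0] * 990
always_negative = [0] * 1000
metrics = classification_metrics(y_true, always_negative)
```

The model scores 99% accuracy while catching none of the positives, so recall and F1 are both zero. That single example is usually enough to justify leading with precision and recall.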

Deep Learning (L4+ Roles)

Q6: How does the transformer architecture work?

Q7: What is attention, and why is it important?

Q8: Explain backpropagation step by step.

Q9: What is transfer learning, and when does it work?

Q10: How does batch normalization work, and why does it help?

Applied ML (Senior+ Roles)

Q11: How do you handle class imbalance?

Q12: How do you handle missing data?

Q13: What's the difference between online and batch learning?

Q14: How do you choose between different model architectures?

Q15: Explain cross-validation and its variants.

Production ML (Senior+ MLE)

Q16: What is data leakage and how do you prevent it?

Q17: How do you detect model drift in production?

Q18: How do you set up an A/B test for a model?

Q19: What's the difference between offline and online evaluation?

Q20: How do you handle training-serving skew?

LLM-Specific (AI Engineer)

Q21: How does RAG work end-to-end?

Q22: When would you fine-tune vs. use RAG vs. use in-context learning?

Q23: What are the main failure modes of LLM-based systems?

Q24: How do you evaluate LLM outputs?

Q25: How do agents work? Explain the ReAct pattern.

Statistics (Data Scientist)

Q26: Explain p-values and their limitations.

Q27: How do you design an A/B test?

Q28: What's the difference between correlation and causation?

Q29: How do you handle multiple comparisons?

Q30: Explain the Central Limit Theorem and why it matters.

Part 3 - Depth by Level

Level | Expected Depth | Example
L3 (Junior) | Define concepts, give textbook explanations | "Overfitting is when the model performs well on training data but poorly on test data"
L4 (Mid) | Explain intuition, know when to apply | "I'd use L1 regularization here because we have 500 features but suspect only 50 are relevant"
L5 (Senior) | Discuss trade-offs, connect to production experience | "In my experience, L1 regularization makes the feature importance more interpretable but can be unstable with correlated features - I've used Elastic Net to balance both"
L6+ (Staff) | Teach the interviewer something, connect to broader systems | "The regularization choice depends on the downstream serving architecture - if we need fast inference, L1's sparse weights enable optimizations that L2 can't"

Part 4 - Handling Questions You Don't Know

The worst thing you can do: Make up an answer confidently.

The best thing you can do: Be honest, then reason from first principles.

Script: "I haven't worked with [X] directly, but based on my understanding of [related concept], I'd reason about it like this: [thoughtful reasoning]. Am I on the right track?"

Common Trap

Some candidates try to redirect every question to a topic they're comfortable with. "You asked about kernel methods, but let me tell you about transformers instead." Interviewers notice this immediately, and it's a strong negative signal. If you don't know something, say so and reason through it. Don't deflect.

Practice Problems

Problem 1: Debugging

Your model has high accuracy on the test set but performs poorly in production. What could be going wrong?

Full Answer + Rubric

Strong answer:

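  1. Data leakage: Information that won't be available at prediction time leaked into training or evaluation (e.g., features derived from future data, target encoding fitted on the full dataset, the same records appearing in both train and test). Check the data pipeline.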
  2. Distribution mismatch: Test set doesn't represent production data (different user demographics, seasonal effects, new product categories).
  3. Feature unavailability: Features used in training aren't available in real-time serving (e.g., features computed from future events, features with high latency).
  4. Training-serving skew: Features are computed differently in training (batch SQL) vs. serving (real-time calculation).
  5. Stale model: Model was evaluated on recent test data but production data has drifted since.
  6. Wrong metric: High accuracy masks poor performance on important subgroups (e.g., model is accurate overall but terrible on high-value users).
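For the training-serving skew item, one practical check is to log the features the serving path actually computed and compare them against the batch pipeline's values for the same entities. The sketch below assumes both sides can be loaded as dictionaries keyed by entity ID; the function name, data layout, and feature names are hypothetical.

```python
def find_skewed_features(batch_rows, serving_rows, rel_tol=1e-6):
    """Compare feature values for the same entity IDs from two pipelines.

    Both arguments map entity_id -> {feature_name: float}.
    Returns {feature_name: fraction of shared entities whose values disagree}.
    """
    mismatches, totals = {}, {}
    for entity_id in batch_rows.keys() & serving_rows.keys():
        batch, serving = batch_rows[entity_id], serving_rows[entity_id]
        for name in batch.keys() & serving.keys():
            totals[name] = totals.get(name, 0) + 1
            a, b = batch[name], serving[name]
            # Relative tolerance, with a floor of 1.0 so values near zero compare sanely.
            if abs(a - b) > rel_tol * max(abs(a), abs(b), 1.0):
                mismatches[name] = mismatches.get(name, 0) + 1
    return {name: mismatches[name] / totals[name] for name in mismatches}


# Hypothetical feature snapshots for two users.
batch = {
    "u1": {"avg_spend_7d": 12.5, "n_orders": 3.0},
    "u2": {"avg_spend_7d": 40.0, "n_orders": 1.0},
}
serving = {
    "u1": {"avg_spend_7d": 12.5, "n_orders": 3.0},
    "u2": {"avg_spend_7d": 35.0, "n_orders": 1.0},
}
skew_report = find_skewed_features(batch, serving)
```

Here `avg_spend_7d` disagrees for half of the shared users, which would point you at a windowing or time-zone difference between the two implementations. Mentioning a check like this is what turns a list of causes into a debugging methodology.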

Scoring:

  • Strong Hire: Lists 4+ root causes, prioritizes by likelihood, has debugging methodology
  • Lean Hire: Identifies 2-3 causes
  • No Hire: Can't explain why test performance wouldn't match production

Problem 2: Model Selection

You have 1M samples, 100 features, and a binary classification task. Walk me through your model selection process.

Full Answer + Rubric

Strong answer:

  1. Baseline: Logistic regression. Fast, interpretable, gives calibrated probabilities. This is my benchmark - everything else must beat it.
  2. Strong default: Gradient boosting (XGBoost/LightGBM). For tabular data with 100 features and 1M samples, this is likely the best model. Fast to train, handles mixed feature types, built-in feature importance.
  3. Neural networks: With 1M samples, deep learning might work, but gradient boosting typically wins on tabular data. I'd only try it if the features have spatial or sequential structure.
  4. Evaluation: Stratified k-fold cross-validation (k=5). If the classes are imbalanced, use PR-AUC as the primary metric.
  5. Hyperparameter tuning: Bayesian optimization (Optuna) on the gradient boosting model. Key params: learning rate, max depth, num estimators, min child weight.
  6. Ensemble: If logistic regression and GBM score similarly but make different errors, I'd check whether a simple ensemble (GBM + logistic regression) improves further. If GBM wins by a wide margin, blending in the weaker model rarely helps.
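If asked what "stratified" means in the evaluation step, you can sketch the fold assignment directly. The version below is a minimal pure-Python illustration, with the function name and the 5% positive rate as assumptions; in practice you would normally reach for a library implementation such as scikit-learn's `StratifiedKFold`.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Return k (train_indices, val_indices) pairs that preserve class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a near-equal share of it.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    splits = []
    for f in range(k):
        val = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        splits.append((train, val))
    return splits


# Imbalanced toy labels: 5% positives.
labels = [1] * 50 + [0] * 950
splits = stratified_kfold(labels, k=5)
positives_per_fold = [sum(labels[i] for i in val) for _, val in splits]
```

Each validation fold ends up with exactly 10 of the 50 positives, so every fold sees the same 5% positive rate as the full dataset. A plain random split could leave some folds with very few positives, which makes PR-AUC estimates noisy.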

Scoring:

  • Strong Hire: Starts with baseline, justifies model choices for this specific problem, mentions evaluation methodology
  • Lean Hire: Makes reasonable choices but doesn't explain why
  • No Hire: Jumps to deep learning without considering simpler models

Interview Cheat Sheet

Question Type | Answer Structure | Time Target
"Explain concept X" | WHAT → WHY → WHEN → TRADE-OFFS | 2-3 minutes
"How would you handle X?" | Diagnosis → Options → Selection → Justification | 3-5 minutes
"Compare X vs Y" | Define both → Key differences → When to use each | 2-3 minutes
"Walk me through your approach" | Problem → Data → Baseline → Iterate → Evaluate | 5-7 minutes

Spaced Repetition Checkpoints

  • Day 0: Read this page. Answer 5 questions from the list without looking at references.
  • Day 3: Answer 10 more questions. Practice the WHAT-WHY-WHEN-TRADE-OFFS structure.
  • Day 7: Have a friend quiz you on 10 random questions. Time yourself.
  • Day 14: Focus on your weakest topic area. Read the relevant section in ML Fundamentals.
  • Day 21: Do a full mock ML knowledge round (45 min, 6-8 questions).

© 2026 EngineersOfAI. All rights reserved.