ML Interview Questions Bank - Your Complete Preparation Guide

Reading time: ~35 min | Interview relevance: Critical | Roles: MLE, Data Scientist, AI Engineer, Applied Scientist, Research Scientist

The Real Interview Moment

You're 15 minutes into an ML knowledge round at a top tech company. The interviewer has asked you three questions - bias-variance tradeoff, regularization, and gradient descent - and you've answered them well. Then the difficulty ramps up: "You've deployed a model to production and it's performing well. Three months later, performance starts degrading even though the model hasn't changed. Walk me through your debugging process."

This is where rote memorization fails. The interviewer is testing whether you can synthesize knowledge across multiple ML fundamentals - data drift, evaluation metrics, feature engineering, cross-validation - into a coherent debugging workflow. The questions in this bank are organized to build you up to exactly this level of integration.

What You Will Master

  • 50+ questions spanning all ML fundamentals topics in this section
  • Structured model answers with the depth expected at each level
  • Company-specific question patterns (Google, Meta, Amazon, Apple, OpenAI)
  • Quick-fire rapid response format for screening rounds
  • Cross-references to detailed explanations in other pages

How to Use This Question Bank

Practice method:

  1. Set a timer for 2 minutes per question (screening) or 5 minutes (deep dive)
  2. Answer aloud as if speaking to an interviewer
  3. Check your answer against the model answer
  4. Grade yourself: Strong Hire / Lean Hire / No Hire
  5. For any "No Hire," study the linked page before retrying

Section 1 - Screening Questions (Phone Screen Level)

These questions are asked in 30-minute phone screens. Expect 5-8 questions. Each answer should take 1-2 minutes. The interviewer wants crisp, accurate definitions with intuition.

Q1: Explain the bias-variance tradeoff.

Company Variation

Asked at virtually every company. Google and Meta expect mathematical formulation. Amazon expects practical examples.

Model Answer: "The bias-variance tradeoff describes two competing sources of prediction error. Bias is systematic error from oversimplified assumptions - the model can't capture the true pattern. Variance is error from sensitivity to specific training data - the model captures noise. Total error decomposes as: Error = Bias^2 + Variance + Irreducible Noise. Simple models (linear regression) have high bias, low variance. Complex models (deep neural nets) have low bias, high variance. Practically, I diagnose this by comparing training and validation error: both high means high bias (underfit), training low but validation high means high variance (overfit). I adjust complexity through regularization, model selection, or data augmentation."
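
Written out for squared-error loss, the decomposition quoted in the answer reads as follows. The notation is an assumption made here, not taken from the answer: $f$ is the true function, $\hat{f}$ is the model fitted on a random training set, and $y = f(x) + \varepsilon$ with noise variance $\sigma^2$.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible noise}}
```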

Scoring: Strong Hire = decomposition formula + practical diagnosis. Lean Hire = correct intuition without formula. No Hire = can't explain the tradeoff.

Deep dive: Bias-Variance Tradeoff

Q2: What loss function would you use for binary classification? Why?

Model Answer: "Binary cross-entropy (log loss): $L = -[y \log(p) + (1-y) \log(1-p)]$. It penalizes confident wrong predictions heavily - if the true label is 1 and the model predicts 0.01, the loss is -log(0.01) = 4.6, much larger than if it predicted 0.4 (loss = 0.92). This encourages calibrated probabilities. For imbalanced data, I'd use weighted cross-entropy or focal loss. For ranking tasks where I don't need calibrated probabilities, hinge loss or contrastive loss might be better."
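
The two loss values quoted in the answer can be checked with a few lines of Python. This is a minimal sketch using natural logarithms; the function name `binary_cross_entropy` is just an illustrative choice.

```python
import math

def binary_cross_entropy(y_true: int, p: float) -> float:
    """Log loss for one example; p is the predicted probability of class 1."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# True label is 1 in both cases; only the predicted probability differs.
confident_wrong = binary_cross_entropy(1, 0.01)
mildly_wrong = binary_cross_entropy(1, 0.4)
print(f"confident wrong: {confident_wrong:.2f}, mildly wrong: {mildly_wrong:.2f}")
```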

Scoring: Strong Hire = formula + explains penalty structure + mentions alternatives for edge cases. Lean Hire = knows it's cross-entropy but can't explain why. No Hire = says "accuracy" or "MSE."

Deep dive: Loss Functions

Q3: L1 vs. L2 regularization - when do you use each?

Model Answer: "Both add a penalty to the loss function. L2 (Ridge) adds the sum of squared weights - shrinks all weights toward zero but rarely to exactly zero. L1 (Lasso) adds the sum of absolute weights - can drive weights to exactly zero, performing feature selection. Geometrically, L1's diamond constraint region has corners on axes, so solutions tend to land on axes (sparse). I use L1 when I suspect many irrelevant features and want automatic selection. L2 when all features are potentially useful and I just want to prevent large weights. Elastic Net combines both when I want sparsity but also stability with correlated features."
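
The sparsity difference shows up even with a single weight. The sketch below assumes a simplified one-dimensional objective, 0.5*(w - w_ols)^2 plus the penalty, where `w_ols` is a hypothetical unregularized estimate; under that assumption L1 has a closed-form soft-thresholding solution and L2 a proportional shrinkage.

```python
def l1_solution(w_ols: float, lam: float) -> float:
    """Minimizer of 0.5*(w - w_ols)**2 + lam*abs(w): soft-thresholding."""
    magnitude = max(abs(w_ols) - lam, 0.0)
    return magnitude if w_ols >= 0 else -magnitude

def l2_solution(w_ols: float, lam: float) -> float:
    """Minimizer of 0.5*(w - w_ols)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return w_ols / (1.0 + lam)

unregularized = [2.0, 0.3, -0.1]  # illustrative values only
lasso_weights = [l1_solution(w, 0.5) for w in unregularized]
ridge_weights = [l2_solution(w, 0.5) for w in unregularized]

# L1 zeroes out every weight whose magnitude is below lam; L2 never does.
print("L1 zeros:", sum(1 for w in lasso_weights if w == 0))
print("L2 zeros:", sum(1 for w in ridge_weights if w == 0))
```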

Scoring: Strong Hire = geometric intuition + when to use each + Elastic Net. Lean Hire = knows the difference but not the geometric reason. No Hire = confuses L1 and L2.

Deep dive: Regularization

Q4: Explain gradient descent and its variants.

Model Answer: "Gradient descent minimizes a loss function by iteratively stepping in the direction of steepest descent. Batch GD uses the full dataset per step - stable but slow. SGD uses one sample - noisy but fast, helps escape local minima. Mini-batch (typical 32-256) balances both. Adam combines momentum (exponential moving average of gradients, handles noisy gradients) with RMSProp (adapts learning rate per parameter, handles different scales). I default to Adam with lr=3e-4 for deep learning. For fine-tuning or when I need tight convergence, I switch to SGD with momentum and a learning rate schedule."
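
The two components of Adam named in the answer map onto two running averages. Below is a minimal scalar sketch of the standard Adam update, applied to a toy quadratic; the objective, the starting point, and the larger-than-default `lr=0.1` are assumptions chosen so the example converges in a few hundred steps.

```python
import math

def adam_minimize(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a one-dimensional function given its gradient, using Adam."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g        # momentum: moving average of gradients
        v = beta2 * v + (1 - beta2) * g * g    # RMSProp part: moving average of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_final)
```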

Scoring: Strong Hire = all three variants + Adam's two components + practical defaults. Lean Hire = knows variants but not Adam's mechanism. No Hire = can't explain gradient descent.

Deep dive: Optimization

Q5: How would you evaluate a classification model?

Model Answer: "It depends on the problem. For balanced classes: accuracy and F1. For imbalanced data (fraud, disease): precision-recall AUC over ROC-AUC, because ROC can be misleadingly high when negatives dominate. Precision matters when false positives are costly (spam filter). Recall when false negatives are costly (cancer screening). I always check the confusion matrix to understand error patterns. In production, I pair offline metrics with online A/B tests. And I never use a 0.5 threshold by default - I tune it on a validation set."
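
Threshold tuning on a validation set can be sketched without any library. The labels and scores below are made up for illustration, and maximizing F1 is only one possible selection criterion.

```python
def precision_recall_f1(y_true, scores, threshold):
    """Precision, recall and F1 when predicting positive for score >= threshold."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical validation labels and model scores.
y_val = [0, 0, 0, 0, 1, 0, 1, 1]
val_scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.7, 0.9]

candidates = sorted(set(val_scores))
best_threshold = max(candidates, key=lambda t: precision_recall_f1(y_val, val_scores, t)[2])
print("best threshold:", best_threshold)
print("F1 at 0.5:", precision_recall_f1(y_val, val_scores, 0.5)[2])
```

On this toy data the F1-maximizing threshold sits below 0.5, which is the point of the answer's last sentence.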

Scoring: Strong Hire = metric selection by context + PR-AUC over ROC-AUC reasoning + threshold tuning. Lean Hire = lists metrics correctly. No Hire = "I use accuracy."

Deep dive: Evaluation Metrics

Q6: What is overfitting and how do you prevent it?

Model Answer: "Overfitting is when the model performs well on training data but poorly on new data - it's learned noise rather than patterns. Prevention: (1) Regularization: L1/L2 penalty, dropout for neural nets. (2) More data: augmentation, synthetic data. (3) Simpler model: fewer parameters, shallower trees. (4) Early stopping: stop training when validation loss stops improving. (5) Cross-validation: use k-fold to get robust performance estimates. (6) Ensembles: bagging reduces variance. I diagnose it by the gap between training and validation performance."

Scoring: Strong Hire = 4+ techniques with practical details + diagnosis method. Lean Hire = lists techniques without detail. No Hire = can't define overfitting clearly.

Deep dive: Bias-Variance Tradeoff, Regularization

Q7: Explain cross-validation. When would you NOT use random k-fold?

Model Answer: "Cross-validation splits data into k folds, trains on k-1, tests on the remaining fold, and rotates. It gives a more robust performance estimate than a single split, plus a sense of how much the score varies across folds. I would NOT use random k-fold for: (1) time series data - future data would leak into training; I'd use expanding or sliding window CV. (2) Grouped data like medical images from the same patient - I'd use group k-fold to prevent patient-level leakage. (3) Imbalanced classification - I'd use stratified k-fold to preserve class ratios per fold."
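
An expanding-window split for time-ordered data can be sketched as a small generator. The function name and its parameters are hypothetical; the only property it is meant to demonstrate is that every training index precedes every test index.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_idx, test_idx) pairs; training data always precedes test data."""
    for i in range(n_splits):
        test_start = n_samples - (n_splits - i) * test_size
        if test_start <= 0:
            raise ValueError("not enough samples for the requested splits")
        train_idx = list(range(test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx

splits = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```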

Scoring: Strong Hire = all three exceptions with reasoning. Lean Hire = knows about time series but misses groups. No Hire = doesn't know what cross-validation is.

Deep dive: Cross-Validation

Q8: How do you handle missing data?

Model Answer: "First, I understand why data is missing: MCAR (random), MAR (depends on observed values), or MNAR (depends on the missing value itself - most dangerous). For MCAR/MAR: mean/median imputation for simple cases, KNN or iterative imputation for more accuracy. For MNAR: the missingness itself is informative, so I create a binary indicator feature ('is_missing'). Tree-based models (XGBoost, LightGBM) handle missing values natively. I never impute the target variable - I drop those rows. And I always fit the imputer on the training set only, then apply the same fitted transformation to test data to prevent leakage."
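
The leakage-safe pattern is to learn the fill value from training rows only and reuse it everywhere else. This is a minimal sketch with a median imputer and an `is_missing` indicator; the column values and helper names are made up for illustration.

```python
import statistics

def fit_median_imputer(train_column):
    """Learn the fill value from observed training values only."""
    observed = [x for x in train_column if x is not None]
    return statistics.median(observed)

def apply_imputer(column, fill_value):
    """Return (imputed values, is_missing indicator) using a pre-fitted fill value."""
    imputed = [fill_value if x is None else x for x in column]
    is_missing = [int(x is None) for x in column]
    return imputed, is_missing

train_income = [40.0, None, 55.0, 70.0]
test_income = [None, 1000.0]  # the test outlier must not influence the fill value

fill = fit_median_imputer(train_income)
train_imputed, train_flag = apply_imputer(train_income, fill)
test_imputed, test_flag = apply_imputer(test_income, fill)
print(fill, test_imputed, test_flag)
```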

Scoring: Strong Hire = MCAR/MAR/MNAR distinction + leakage prevention + indicator feature for MNAR. Lean Hire = knows about imputation methods. No Hire = "just drop rows with missing data."

Deep dive: Feature Engineering

Q9: Explain ensemble methods. When do bagging and boosting each shine?

Model Answer: "Ensembles combine multiple models for better performance. Bagging (Bootstrap Aggregating) trains models on random data subsets and averages predictions - reduces variance. Random Forest adds feature randomness on top. Best when individual models overfit (high variance). Boosting trains models sequentially, each correcting errors of the previous - reduces bias. XGBoost, LightGBM, CatBoost are state-of-the-art for tabular data. Best when individual models underfit. Key insight: bagging works because averaging reduces variance; boosting works because sequential correction reduces bias."

Scoring: Strong Hire = bias/variance reduction distinction + specific algorithms + when each shines. Lean Hire = knows the difference conceptually. No Hire = confuses bagging and boosting.

Deep dive: Ensemble Methods

Q10: What is data leakage? Give three examples.

Model Answer: "Data leakage is when information from outside the training set is used to create the model, giving unrealistically good performance that won't generalize. Three examples: (1) Feature leakage: using future information as a feature (e.g., 'days_since_churn' predicting churn). (2) Preprocessing leakage: fitting StandardScaler on the full dataset before cross-validation - test fold statistics leak into training. (3) Temporal leakage: random train/test split on time series data lets the model see future data. Fix with pipelines, temporal splits, and careful feature auditing."

Scoring: Strong Hire = three distinct types with concrete examples + how to fix. Lean Hire = one example. No Hire = can't define data leakage.

Deep dive: Cross-Validation, Feature Engineering

Q11: PCA - what is it and when would you use it?

Model Answer: "PCA finds orthogonal directions of maximum variance and projects data onto the top k. Step by step: center the data, compute covariance matrix, eigendecompose it, project onto top eigenvectors. I'd use it when: dimensions far exceed samples (gene expression data), need to reduce computation, or need to remove multicollinearity. I wouldn't use it when interpretability is required (components are linear combinations, not original features), the signal is in low-variance directions, or relationships are highly non-linear. Always standardize features first since PCA is scale-sensitive."

Scoring: Strong Hire = mathematical steps + when to use and not use + scaling caveat. Lean Hire = knows PCA reduces dimensions but can't explain the math. No Hire = "PCA selects the best features."

Deep dive: Dimensionality Reduction

Q12: Why might accuracy be a bad metric?

Model Answer: "Accuracy is misleading when classes are imbalanced. A fraud detection model with 0.1% fraud rate achieves 99.9% accuracy by predicting 'not fraud' for everything - zero useful predictions. Better alternatives: precision (when FP is costly), recall (when FN is costly), F1 (balance), PR-AUC (threshold-independent, sensitive to minority class). ROC-AUC is also threshold-independent but can be misleadingly high for severe imbalance because FPR stays low when there are many negatives."

Scoring: Strong Hire = concrete example + PR-AUC over ROC-AUC reasoning. Lean Hire = knows accuracy fails for imbalance. No Hire = "accuracy is always fine."

Deep dive: Evaluation Metrics, Handling Imbalanced Data

Q13: What is feature engineering? Give examples of powerful techniques.

Model Answer: "Feature engineering is creating new features from raw data to improve model performance. Techniques: (1) Temporal: day of week, time since last event, rolling averages. (2) Interaction features: price * quantity, ratio features. (3) Binning: age groups, income brackets (lets linear models capture non-linear effects; tree models already split on thresholds). (4) Encoding categoricals: one-hot for low cardinality, target encoding for high cardinality (computed on training folds only, with smoothing so rare categories don't leak the target). (5) Text: TF-IDF, word embeddings, character n-grams. (6) Aggregation: group-by statistics (mean, count per user). The best features come from domain knowledge."
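
One common way to smooth target encoding is to blend each category's mean with the global mean, weighted by how many rows the category has. The sketch below assumes that additive-smoothing form with a hypothetical strength parameter `m`; in practice the mapping would be fitted on training folds only.

```python
from collections import defaultdict

def smoothed_target_encoding(categories, targets, m=10.0):
    """Map each category to (sum_y + m * global_mean) / (count + m)."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    return {c: (sums[c] + m * global_mean) / (counts[c] + m) for c in counts}

# Made-up data: city "B" has only two rows, both positive.
cities = ["A"] * 8 + ["B"] * 2
churned = [1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
encoding = smoothed_target_encoding(cities, churned, m=2.0)
print(encoding)
```

Without smoothing, city "B" would be encoded as 1.0 on the strength of two rows; the blended value is pulled back toward the global rate.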

Scoring: Strong Hire = 4+ techniques with examples + target encoding leakage awareness. Lean Hire = lists a few techniques. No Hire = can't give examples.

Deep dive: Feature Engineering

Q14: Explain Bayes' theorem and one ML application.

Model Answer: "Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B). It updates prior beliefs given new evidence. In ML: Naive Bayes classification - P(spam|words) is proportional to P(words|spam) * P(spam). Despite the naive independence assumption (P(word1, word2|spam) = P(word1|spam) * P(word2|spam)), it works surprisingly well for text because classification only needs correct ranking, not calibrated probabilities. MAP estimation is another application: L2 regularization is equivalent to MAP estimation with a Gaussian prior on weights."
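
A one-word spam example makes the update concrete. All three input probabilities below are invented for illustration.

```python
def posterior_spam(p_spam, p_word_given_spam, p_word_given_ham):
    """P(spam | word) via Bayes' theorem, with P(word) expanded over both classes."""
    evidence = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)
    return p_word_given_spam * p_spam / evidence

# Hypothetical numbers: 20% of mail is spam; the word appears in 60% of spam, 5% of ham.
p = posterior_spam(p_spam=0.2, p_word_given_spam=0.6, p_word_given_ham=0.05)
print(p)
```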

Scoring: Strong Hire = formula + two applications + explains why NB works despite assumption. Lean Hire = knows the formula and NB. No Hire = can't state Bayes' theorem.

Deep dive: Probabilistic ML

Q15: How would you handle a dataset with 99:1 class imbalance?

Model Answer: "Four-pronged approach: (1) Metrics: use PR-AUC and F1, not accuracy. (2) Training: class weights (scale loss by inverse frequency) - simpler than resampling. SMOTE if data is low-dimensional with clear class separation. Avoid SMOTE for high-dimensional sparse data. (3) Threshold: tune on validation set - don't use 0.5. (4) Evaluation: stratified k-fold CV. For extreme imbalance (1:10000+), consider reframing as anomaly detection. In production, calibrate probabilities after any resampling."
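
Inverse-frequency class weights are easy to compute by hand. The sketch below uses the formula n_samples / (n_classes * count_c), which is the same heuristic scikit-learn applies for `class_weight="balanced"`; the label counts are illustrative.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

labels = [0] * 990 + [1] * 10  # 99:1 imbalance
weights = balanced_class_weights(labels)
print(weights)
```

The minority class ends up weighted 99 times more heavily than the majority class, mirroring the imbalance ratio.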

Scoring: Strong Hire = all four prongs + when SMOTE fails + calibration. Lean Hire = mentions 2-3 techniques. No Hire = "just oversample."

Deep dive: Handling Imbalanced Data

Section 2 - Technical Deep Dive Questions (Onsite Level)

These questions are asked in 45-60 minute onsite rounds. Expect 3-5 questions with follow-ups. The interviewer probes for depth - explain the why and when, not just the what.

Q16: Walk me through how you'd debug a model whose performance is declining in production.

Company Variation

Google (L4+): Expects systematic debugging framework. Amazon: Expects data-centric approach. Meta: Expects metric-level analysis.

Model Answer:

"I'd follow a systematic debugging workflow:

Step 1: Verify the metric isn't lying

  • Is the evaluation code correct? Did a code change affect metric computation?
  • Are we comparing on the same population? Seasonal changes might shift user demographics.

Step 2: Check for data drift

  • Compare feature distributions between training data and recent production data
  • Use statistical tests (KS test, PSI score) to detect distribution shifts
  • Check for missing values or schema changes in upstream data pipelines

Step 3: Check for concept drift

  • The relationship between features and target has changed (e.g., user behavior shifted)
  • Compare model performance across time windows - gradual degradation suggests concept drift

Step 4: Check for label drift

  • If labels come from human annotators, have annotation guidelines changed?
  • If labels are derived (e.g., clicks), has the product surface changed?

Step 5: Fix

  • Data drift: retrain on recent data, add new features that capture the shift
  • Concept drift: retrain more frequently, use online learning, add time-aware features
  • Label drift: recalibrate, update labeling guidelines, relabel a sample

Step 6: Prevent

  • Set up monitoring dashboards for feature distributions, prediction distributions, and metrics
  • Alert on significant distribution shifts before performance degrades
  • Implement automated retraining pipelines with performance gates"
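
The PSI check from Step 2 can be sketched in pure Python. This version assumes equal-width bins derived from the training data and a small floor `eps` for empty bins; both are implementation choices rather than part of any standard, and a commonly quoted rule of thumb treats PSI above roughly 0.25 as a significant shift.

```python
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference sample and a recent sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clip out-of-range values into edge bins
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Synthetic feature values: production data shifted upward by 0.3.
train_feature = [i / 100 for i in range(100)]
prod_same = list(train_feature)
prod_shifted = [x + 0.3 for x in train_feature]

print("no drift:", psi(train_feature, prod_same))
print("shifted:", psi(train_feature, prod_shifted))
```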

Scoring: Strong Hire = systematic framework covering data/concept/label drift + monitoring/prevention. Lean Hire = identifies drift but not systematic. No Hire = "just retrain the model."

Q17: Explain the difference between bagging and boosting mathematically. Why does Random Forest use deep trees while boosting uses shallow ones?

Model Answer:

"Bagging: Trains M models independently on bootstrap samples, averages predictions. For regression:

$$\text{Var}(\bar{f}) = \frac{1}{M^2} \sum_{i,j} \text{Cov}(f_i, f_j) \approx \rho \sigma^2 + \frac{(1-\rho)\sigma^2}{M}$$

Where rho is the average correlation between trees and sigma^2 is the variance of a single tree. As M increases, the second term vanishes, so the ensemble variance falls toward a floor of rho * sigma^2. Feature randomness (Random Forest) reduces rho and therefore lowers that floor.

Boosting: Trains models sequentially. Each new model fits the residuals of the previous ensemble. At step m:

$$F_m(x) = F_{m-1}(x) + \alpha_m h_m(x)$$

where h_m is trained on the pseudo-residuals (the negative gradient of the loss with respect to the current predictions; for squared error these are the ordinary residuals).

Why deep vs. shallow trees?

  • RF uses deep, unpruned trees because each tree needs low bias. The ensemble (averaging) handles variance reduction. Deep trees overfit individually but averaging cancels out the noise.
  • Boosting uses shallow trees (depth 3-8) because each tree only needs to make a small correction. High-bias base learners are fine because boosting reduces bias sequentially. Deep trees in boosting would overcorrect and overfit."
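
Plugging numbers into the bagging variance expression shows why correlation, not ensemble size, is the limiting factor. The sketch assumes every model has the same variance and the same pairwise correlation.

```python
def bagged_variance(sigma2, rho, n_models):
    """rho * sigma^2 + (1 - rho) * sigma^2 / M for equally correlated models."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_models

for rho in (0.0, 0.5, 0.9):
    values = [round(bagged_variance(1.0, rho, m), 3) for m in (1, 10, 100, 1000)]
    print(f"rho={rho}: {values}")
```

With rho = 0 the variance keeps shrinking as 1/M; with rho = 0.9 it barely moves below 0.9 no matter how many trees are added.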

Scoring: Strong Hire = mathematical formulations + clear explanation of why depth differs + rho in RF. Lean Hire = correct intuition without math. No Hire = can't explain the difference.

Deep dive: Ensemble Methods

Q18: You're told to use SMOTE on a text classification dataset with TF-IDF features (10K dimensions). What's wrong with this?

Model Answer:

"Three problems: (1) In 10K-dimensional sparse space, all points are roughly equidistant - the 'nearest neighbors' SMOTE relies on aren't truly similar documents. (2) TF-IDF vectors are sparse; interpolating two sparse vectors produces a dense vector that doesn't correspond to any real document - it has small non-zero weights on words from both parent documents. (3) The synthetic vectors occupy regions of feature space that real text never occupies, potentially confusing the model.

Alternatives: class weights (simplest), random oversampling (preserves real documents), text augmentation (back-translation, synonym replacement), or reduce to dense embeddings first (BERT features, 768-dim) where SMOTE might work better."
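
The densification problem can be seen with two tiny sparse vectors. The sketch applies the SMOTE interpolation step, x + lam * (neighbor - x), to hypothetical TF-IDF dictionaries; the words and weights are invented.

```python
import random

def smote_interpolate(x, neighbor, rng):
    """SMOTE-style synthetic point: x + lam * (neighbor - x), with lam drawn from U(0, 1)."""
    lam = rng.random()
    keys = set(x) | set(neighbor)
    return {k: x.get(k, 0.0) + lam * (neighbor.get(k, 0.0) - x.get(k, 0.0)) for k in keys}

# Two sparse "documents", each with three non-zero terms.
doc_a = {"refund": 0.7, "card": 0.5, "stolen": 0.5}
doc_b = {"refund": 0.6, "invoice": 0.6, "late": 0.5}

synthetic = smote_interpolate(doc_a, doc_b, random.Random(0))
nonzero = sum(1 for v in synthetic.values() if v != 0.0)
print(len(doc_a), len(doc_b), nonzero)
```

The synthetic vector carries weight on the union of both vocabularies, so it is denser than either parent.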

Scoring: Strong Hire = all three issues + multiple alternatives. Lean Hire = identifies dimensionality problem. No Hire = "SMOTE always works."

Deep dive: Handling Imbalanced Data

Q19: Explain nested cross-validation. When is it necessary?

Model Answer:

"Nested CV uses two loops: an outer loop for estimating generalization performance and an inner loop for hyperparameter tuning. The inner loop runs grid/random search within each outer fold's training set. The outer loop's test score is unbiased because the test data was never used for any decision.

It's necessary when you're doing model selection AND want an honest performance estimate. Without nesting, you're reporting the best inner CV score, which is optimistically biased - you selected hyperparameters that happened to work well on those specific folds. The size of the bias depends on dataset size and how wide the search is; a few percentage points is plausible on small datasets.

Cost: k_outer * k_inner * |param_grid| model fits. For 5x5 with 20 parameter combinations: 500 fits. Expensive but necessary when you need to compare algorithms fairly (e.g., Random Forest vs. XGBoost) and report trustworthy numbers."

Scoring: Strong Hire = explains the bias of non-nested CV + computational cost + when necessary. Lean Hire = describes the structure correctly. No Hire = never heard of it.

Deep dive: Cross-Validation

Q20: A model predicts house prices. Training RMSE is $15K, test RMSE is $45K. Diagnose and fix.

Model Answer:

"The large gap (15K vs 45K) indicates severe overfitting - high variance. Diagnostic steps:

  1. Check for data leakage: Is any feature derived from the target? (e.g., price_per_sqft already encodes price)
  2. Check data sizes: If training is small (<1K samples) with many features, the model has enough capacity to memorize
  3. Check model complexity: If using deep trees or many features, it might be too flexible

Fixes (in order):

  1. Regularization: L2 penalty, reduce max_depth for trees, dropout for NNs
  2. Feature reduction: Remove correlated or noisy features; use PCA or feature importance
  3. More data: Augmentation or gathering more samples
  4. Simpler model: From gradient boosting to regularized linear regression
  5. Cross-validation: Use k-fold to tune hyperparameters properly, not a single split
  6. Ensemble: Bagging reduces variance

I'd also check if the test set is from a different distribution (different geography, time period) - that would be dataset shift, not just overfitting."

Scoring: Strong Hire = systematic diagnosis + checks for leakage/distribution shift + ordered fixes. Lean Hire = identifies overfitting and lists some fixes. No Hire = only says "get more data."

Q21: Explain how you'd choose between PCA and UMAP for dimensionality reduction.

Model Answer:

"Decision factors:

  1. Goal: If preprocessing for ML, PCA first (linear, fast, deterministic, interpretable via loadings). If visualization, UMAP (preserves local and some global structure).

  2. Linearity: If relationships are linear, PCA is optimal. If there's manifold structure, UMAP captures it.

  3. Scale: PCA handles millions of samples easily. UMAP scales well too (O(n log n)) but is slower than PCA.

  4. New data: PCA has a closed-form projection for new data. UMAP also supports .transform() but it's an approximation.

  5. Determinism: PCA gives the same result every time. UMAP depends on random initialization.

  6. Interpretability: PCA loadings tell you which original features contribute to each component. UMAP components are uninterpretable.

My default: start with PCA. If downstream task performance is poor AND I suspect non-linear structure (check by comparing PCA reconstruction error with autoencoder reconstruction error), switch to UMAP or an autoencoder."

Scoring: Strong Hire = systematic comparison on 5+ dimensions + practical workflow. Lean Hire = knows the main differences. No Hire = "always use PCA" or "always use UMAP."

Deep dive: Dimensionality Reduction

Q22: What is the difference between aleatoric and epistemic uncertainty? How do you measure each?

Company Variation

Google (L5+): Expects implementation details. Autonomous driving companies: Expects safety-critical reasoning. OpenAI: Expects connection to calibration and RLHF.

Model Answer:

"Aleatoric uncertainty is inherent data noise - irreducible even with infinite data. Example: ambiguous image where even humans disagree. Epistemic uncertainty is model uncertainty from limited data - reducible by collecting more data. Example: predicting for a region with no training examples.

Measuring aleatoric: train the model to output both mean AND variance - the predicted variance is aleatoric uncertainty. Use a heteroscedastic loss (negative log-likelihood of a Gaussian).

Measuring epistemic: use Monte Carlo Dropout (run inference multiple times with dropout on) or deep ensembles (train 5 independent models). The variance of predictions across runs/models captures epistemic uncertainty.

Total uncertainty = aleatoric + epistemic. In safety-critical applications, high epistemic uncertainty means 'escalate to human' (the model hasn't seen this before). High aleatoric uncertainty means 'the data itself is ambiguous' (consider getting more measurements)."
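
For a deep ensemble whose members each predict a mean and a variance, the split can be computed directly. This is a minimal sketch assuming the usual variance decomposition (average predicted variance plus variance of predicted means); the five member outputs are made up.

```python
import statistics

def decompose_uncertainty(member_means, member_variances):
    """Split predictive variance for one input into aleatoric and epistemic parts."""
    aleatoric = statistics.fmean(member_variances)  # average predicted noise variance
    epistemic = statistics.pvariance(member_means)  # disagreement between ensemble members
    return aleatoric, epistemic, aleatoric + epistemic

# Hypothetical outputs of a 5-model ensemble for a single input.
means = [2.0, 2.2, 1.8, 2.1, 1.9]
variances = [0.5, 0.4, 0.6, 0.5, 0.5]
aleatoric, epistemic, total = decompose_uncertainty(means, variances)
print(aleatoric, epistemic, total)
```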

Scoring: Strong Hire = clear distinction + measurement methods for each + action implications. Lean Hire = knows the concepts but not how to measure. No Hire = never heard of the distinction.

Deep dive: Probabilistic ML

Q23: Explain focal loss. When and why would you use it instead of cross-entropy?

Model Answer:

"Focal loss is cross-entropy with a modulating factor: FL = -(1-p_t)^gamma * log(p_t), where gamma is typically 2. When the model correctly classifies a sample with high confidence (p_t = 0.9), the (1-0.9)^2 factor reduces the loss by 100x. For hard, misclassified samples (p_t = 0.1), the factor is (0.9)^2 = 0.81 - nearly full loss.

Use instead of cross-entropy when: (1) severe class imbalance, especially in dense detection tasks, where background candidates can outnumber objects by thousands to one (the setting focal loss was introduced for, with RetinaNet), (2) easy negatives dominate the gradient - focal loss lets the model focus on the hard, informative examples. Combine with alpha (class weighting) to address imbalance and focus on hard examples at the same time.

The key insight: class imbalance causes a flood of easy negatives whose cumulative loss overwhelms the rare, hard positives. Focal loss solves this by down-weighting the easy examples regardless of class."
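
The down-weighting factors quoted in the answer follow directly from the formula. A minimal sketch with gamma = 2 and no alpha term:

```python
import math

def cross_entropy(p_t):
    """Standard cross-entropy for the probability assigned to the true class."""
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0):
    """Cross-entropy scaled by the modulating factor (1 - p_t)^gamma."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy_ratio = focal_loss(0.9) / cross_entropy(0.9)  # well-classified example
hard_ratio = focal_loss(0.1) / cross_entropy(0.1)  # misclassified example
print(f"easy: {easy_ratio:.2f}, hard: {hard_ratio:.2f}")
```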

Scoring: Strong Hire = formula + numerical example + RetinaNet context + distinction from class weights. Lean Hire = knows the formula. No Hire = never heard of focal loss.

Deep dive: Handling Imbalanced Data, Loss Functions

Q24: How does XGBoost handle missing values?

Model Answer:

"XGBoost learns the optimal direction to send missing values at each tree split. During training, for each split, it tries sending all instances with missing values to both the left and right child, and picks the direction that minimizes the loss. This learned 'default direction' is then applied at inference time.

This is superior to imputation because: (1) the optimal handling might differ per feature and per split in the tree, (2) no preprocessing step needed, (3) the missingness pattern itself can be informative. LightGBM and CatBoost have similar capabilities.

Caveat: this works well when missing values have a consistent pattern between training and test data. If missingness changes significantly (a sensor starts failing differently), the learned default directions may be wrong."

Scoring: Strong Hire = explains the algorithm (try both directions at each split) + advantages over imputation + caveat. Lean Hire = knows XGBoost handles missing values natively. No Hire = "you need to impute first."

Deep dive: Ensemble Methods, Feature Engineering

Q25: Explain the connection between L2 regularization and Bayesian inference.

Model Answer:

"L2 regularization adds lambda * ||w||^2 to the loss. In Bayesian terms, this is equivalent to MAP estimation with a Gaussian prior on the weights: P(w) = N(0, sigma^2 I), where sigma^2 = 1/(2*lambda).

The MAP objective is: argmax [log P(D|w) + log P(w)]. The log-likelihood is the standard loss function. The log-prior becomes -w^T w / (2*sigma^2), which is proportional to -lambda * ||w||^2.

Large lambda = small sigma = tight prior = strong belief weights should be near zero. Small lambda = large sigma = weak prior = let the data speak.

Similarly, L1 regularization corresponds to a Laplace prior, which has heavier tails and a sharp peak at zero - explaining why L1 produces sparse solutions."
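
Spelled out step by step, and assuming the loss is the negative log-likelihood $-\log P(D \mid w)$, the derivation in the answer runs as follows:

```latex
\begin{aligned}
\hat{w}_{\text{MAP}}
  &= \arg\max_w \big[\log P(D \mid w) + \log P(w)\big] \\
  &= \arg\max_w \Big[\log P(D \mid w) - \frac{w^\top w}{2\sigma^2} + \text{const}\Big]
     \qquad \text{with } P(w) = \mathcal{N}(0, \sigma^2 I) \\
  &= \arg\min_w \Big[-\log P(D \mid w) + \lambda \lVert w \rVert^2\Big],
     \qquad \lambda = \frac{1}{2\sigma^2}
\end{aligned}
```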

Scoring: Strong Hire = derives the connection mathematically + interprets lambda as prior strength + L1/Laplace connection. Lean Hire = knows the connection exists but can't derive it. No Hire = doesn't see the connection.

Deep dive: Regularization, Probabilistic ML

Section 3 - Senior/Staff Level Questions

These questions are asked in senior (L5+) and staff (L6+) interviews. Expect 2-3 deep questions with extensive follow-ups. The interviewer expects you to drive the discussion, ask clarifying questions, consider tradeoffs, and propose complete systems.

Q26: Design a complete ML pipeline for a credit card fraud detection system at scale.

Company Variation

Amazon (L6): Expects cost analysis. Google (L5): Expects system design. Stripe/PayPal: Domain-specific expectations.

Strong Answer Framework:

  1. Data pipeline: Stream processing (Kafka), feature store for real-time features (transaction velocity, location anomalies), batch features (historical patterns)
  2. Labeling: Chargeback labels (delayed 30-90 days), investigation labels (faster but biased)
  3. Training: Time-based splits (never random), class weights over SMOTE (0.2% fraud = 500:1 ratio), optimize for PR-AUC
  4. Model: Two-stage: fast rule-based filter (blocks obvious fraud, <1ms), followed by ML model (XGBoost or neural net, <50ms latency)
  5. Threshold: Cost-sensitive: $1000 fraud loss vs. $10 investigation cost gives 100:1 cost ratio; tune threshold to minimize expected cost on validation set
  6. Monitoring: Track false positive rate (analyst capacity), false negative rate (delayed via chargebacks), prediction distribution drift, feature drift
  7. Retraining: Weekly retraining with expanding window, champion-challenger deployment
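
The cost-sensitive threshold from step 5 can be sketched as a search over validation scores. The sketch assumes a simplified cost model in which every flagged transaction costs one investigation and every unflagged fraud costs the full loss; the labels and probabilities are invented.

```python
def expected_cost(y_true, fraud_probs, threshold, fraud_loss=1000.0, review_cost=10.0):
    """Total cost when flagging every transaction with probability >= threshold."""
    cost = 0.0
    for y, p in zip(y_true, fraud_probs):
        if p >= threshold:
            cost += review_cost   # analyst investigates, fraud (if any) is stopped
        elif y == 1:
            cost += fraud_loss    # missed fraud
    return cost

# Hypothetical validation set: two fraud cases among nine transactions.
y_val = [0, 0, 0, 0, 0, 0, 1, 0, 1]
p_val = [0.001, 0.002, 0.004, 0.008, 0.02, 0.05, 0.08, 0.3, 0.9]

candidates = sorted(set(p_val))
best = min(candidates, key=lambda t: expected_cost(y_val, p_val, t))
print("best threshold:", best, "cost:", expected_cost(y_val, p_val, best))
print("cost at 0.5:", expected_cost(y_val, p_val, 0.5))
```

With a 100:1 cost ratio the cheapest threshold on this toy data is far below 0.5, because one missed fraud outweighs many investigations.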

Q27: You have 1M features (genomics data). How do you build a model?

Strong Answer Framework:

  1. First reaction: 1M features with likely <10K samples. Model will overfit without aggressive dimensionality reduction.
  2. Feature filtering: Remove near-zero variance features, correlated feature pairs (keep one)
  3. Dimensionality reduction: PCA to 100-500 components (denoising step)
  4. Feature selection: Stability selection with L1 (run Lasso 100 times on bootstrapped data, keep features selected >80% of the time) to find robust biomarkers
  5. Model: Elastic Net or sparse SVM for interpretability; Random Forest for performance baseline
  6. Validation: Repeated stratified k-fold (small dataset means high variance in CV estimates)
  7. Biological validation: Selected features should be biologically plausible; discuss with domain experts
  8. Comparison: Compare PCA + classifier vs. raw features + Elastic Net vs. autoencoder + classifier

Q28: How would you implement an A/B test for a recommendation model? What pitfalls exist?

Strong Answer Framework:

  1. Randomization unit: User-level (not session-level, to avoid within-user inconsistency)
  2. Metrics: Primary (engagement, revenue), guardrails (latency, user complaints), long-term proxy (retention)
  3. Duration: Power analysis to determine sample size. With 1% MDE and 5% significance: typically 2-4 weeks
  4. Pitfalls:
    • Novelty effect: users click more on anything new - wait for effect to stabilize
    • Network effects: if users interact, treatment can affect control (interference)
    • Simpson's paradox: aggregate improvement might hide degradation for a segment
    • Multiple comparisons: testing 10 metrics inflates false positive rate - use Bonferroni or FDR correction
    • P-value peeking: checking results daily inflates false positives - use sequential testing
  5. Bayesian alternative: Posterior probability that treatment is better, credible intervals on lift
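
The power analysis in step 3 can be approximated with the standard two-proportion normal formula. The sketch assumes a hypothetical 10% baseline conversion rate, reads the 1% MDE as an absolute lift to 11%, and uses 80% power, none of which are specified above.

```python
from statistics import NormalDist

def samples_per_arm(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users needed per arm for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return (z_alpha + z_beta) ** 2 * variance_sum / effect ** 2

n = samples_per_arm(0.10, 0.11)
print(round(n))
```

Dividing the result by daily eligible traffic per arm gives a rough test duration.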

Q29: Explain the curse of dimensionality and its practical implications for ML systems.

Strong Answer Framework:

  1. Mathematical: In d dimensions, volume of unit hypercube shell (within epsilon of surface) is $1 - (1-2\epsilon)^d$. For d=100, epsilon=0.01: 87% of volume is at the surface. All points become "edge cases."
  2. Distance convergence: Max distance / min distance ratio approaches 1. Distance-based methods (k-NN, RBF-kernel SVMs, clustering) degrade because similarity stops discriminating between points.
  3. Data sparsity: To maintain the same data density in d dimensions, you need exponentially more data ($n^d$ for n points per unit in 1D).
  4. Practical implications: Feature engineering must be thoughtful (not just throw everything in), regularization becomes critical, dimensionality reduction is a prerequisite for many algorithms, tree-based methods are somewhat robust (they partition one feature at a time)
  5. Johnson-Lindenstrauss: Random projections to O(log n / epsilon^2) dims preserve distances - a practical escape from the curse.
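
The first two points can be checked numerically. The sketch below evaluates the shell-volume formula and estimates the max/min distance ratio for uniformly random points; the helper names, the 200-point sample, and the fixed seed are assumptions chosen for illustration.

```python
import math
import random


def shell_fraction(d, eps):
    """Fraction of a unit hypercube's volume within eps of its surface."""
    return 1.0 - (1.0 - 2.0 * eps) ** d


def distance_ratio(d, n_points=200, seed=0):
    """Max/min Euclidean distance from one random query to n random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(d)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(d)]
        dists.append(math.dist(query, point))
    return max(dists) / min(dists)


print(f"Shell fraction, d=100, eps=0.01: {shell_fraction(100, 0.01):.3f}")
for dim in (2, 10, 100, 1000):
    print(f"d={dim:5d}  max/min distance ratio: {distance_ratio(dim):.2f}")
```

As the dimension grows, the ratio shrinks toward 1, which is the "distance convergence" effect described above.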

Deep dive: Dimensionality Reduction

Q30: Compare gradient boosting vs. deep learning for tabular data. When would you choose each?

Company Variation

Google: Published "TabNet" and research on deep learning for tabular data. Kaggle competitions: Gradient boosting still dominates. Meta: Uses both in production.

Strong Answer Framework:

| Factor | Gradient Boosting | Deep Learning |
| --- | --- | --- |
| Tabular data (<100 features) | Usually better | Competitive if very large dataset |
| Feature engineering | Handles raw features well | Can learn representations |
| Missing values | Native handling | Needs imputation |
| Training time | Minutes-hours | Hours-days |
| Interpretability | Feature importance, SHAP | Harder to interpret |
| Data size <100K rows | Clearly better | Overfits |
| Data size >10M rows | Still competitive | Can match or beat |
| Structured + unstructured | Can't handle images/text | Handles multi-modal |
| Hyperparameter sensitivity | Moderate | High |

Default: XGBoost/LightGBM for tabular. Deep learning when: multi-modal data (text + tabular), very large datasets (>10M rows), or need for learned representations that transfer to other tasks.

Q31 (Bonus): How do you decide between feature selection and feature extraction?

Strong Answer Framework:

"This depends on three factors: interpretability requirements, data structure, and downstream task.

Choose feature selection when:

  • Regulatory or business requirements demand interpretable features (credit scoring, healthcare)
  • You want to reduce data collection costs (fewer features to gather in production)
  • Domain knowledge suggests many features are irrelevant
  • Methods: Lasso (embedded), mutual information (filter), RFE (wrapper)

Choose feature extraction when:

  • Features are highly correlated (PCA decorrelates)
  • Non-linear relationships exist (kernel PCA, autoencoders)
  • You need a compact representation for downstream tasks
  • Interpretability is not a priority

Hybrid approach: Use feature selection to remove clearly irrelevant features, then PCA on the remaining features. This combines the benefits: reduced noise from selection + decorrelation from extraction.

Key insight: feature selection preserves original feature meaning; feature extraction creates new features that may capture more signal but lose interpretability."

Deep dive: Dimensionality Reduction

Q32 (Bonus): Explain calibration in ML. Why do modern neural networks tend to be poorly calibrated?

Strong Answer Framework:

"Calibration means predicted probabilities match actual frequencies - when a model says '80% chance,' it should be correct 80% of the time.

Why modern NNs are poorly calibrated:

  1. Overparameterization: Modern NNs have far more parameters than needed to fit the training data, allowing them to produce extreme logits (very confident predictions)
  2. Cross-entropy + softmax: The loss approaches zero only as logits approach infinity, incentivizing extreme confidence
  3. Lack of regularization: NNs trained without strong regularization learn to be overconfident
  4. Batch normalization: Changes the effective learning rate and can amplify confidence

Measuring calibration: Expected Calibration Error (ECE) - bin predictions by confidence, compute |accuracy - confidence| per bin, take weighted average. Visualize with reliability diagrams.

Fixing calibration:

  1. Temperature scaling: Divide logits by learned T (single parameter, fit on validation set). Simple and effective.
  2. Platt scaling: Fit logistic regression on raw model outputs
  3. Isotonic regression: Non-parametric, more flexible, needs more data
  4. Mixup training: Interpolate between training examples - naturally reduces overconfidence

In production, monitor calibration over time - it drifts as data distributions change."
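
A minimal sketch of ECE and temperature scaling in plain Python, assuming equal-width confidence bins and a simple grid search over T (practical implementations usually optimize T with gradient descent or L-BFGS). The function names and the grid range are illustrative choices.

```python
import math


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / len(confidences)) * abs(accuracy - avg_conf)
    return ece


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y]) for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature that minimizes validation NLL (grid search sketch)."""
    if grid is None:
        grid = [0.5 + 0.1 * i for i in range(96)]  # candidate T values from 0.5 to 10.0
    return min(grid, key=lambda t: mean_nll(logit_rows, labels, t))


# Toy validation set: the model is very confident in class 0 but right only 70% of the time.
val_logits = [[5.0, 0.0]] * 10
val_labels = [0] * 7 + [1] * 3
best_t = fit_temperature(val_logits, val_labels)
print(f"Fitted temperature: {best_t:.1f}")
```

A fitted T above 1 softens the probabilities, which is the expected direction for an overconfident model; T is fit on held-out data so the correction does not change the predicted class.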

Deep dive: Probabilistic ML

Section 4 - Company-Tagged Questions

Google

Q31: "Explain how you'd design the loss function for a multi-task learning system with shared representations." (L5+)

Model Answer: "Multi-task learning shares lower layers for representation learning and has separate heads per task. The loss is a weighted sum: $L = w_1 L_1 + w_2 L_2 + \dots$ The challenge is balancing task weights - tasks with larger loss magnitudes dominate gradients. Solutions: (1) Uncertainty weighting: learn task weights via homoscedastic uncertainty (Kendall et al.). (2) GradNorm: dynamically adjust weights to equalize gradient norms across tasks. (3) PCGrad: project conflicting gradients to avoid tasks hurting each other. I'd start with uncertainty weighting, monitor per-task metrics, and use GradNorm if tasks have very different scales."
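
As a rough sketch of uncertainty weighting, the function below combines per-task losses using one learnable log-variance per task, following the commonly used form 0.5 * exp(-s) * L + 0.5 * s with s = log(sigma^2). In a real system the log-variances would be trainable parameters updated by the optimizer; here they are plain floats, and the function name is an illustrative assumption.

```python
import math


def uncertainty_weighted_loss(task_losses, log_variances):
    """Combine task losses with learned homoscedastic-uncertainty weights.

    Each task contributes 0.5 * exp(-s) * loss + 0.5 * s, where s = log(sigma^2).
    The second term stops the model from inflating s to ignore a task.
    """
    total = 0.0
    for loss, s in zip(task_losses, log_variances):
        total += 0.5 * math.exp(-s) * loss + 0.5 * s
    return total


# With all log-variances at 0, every task has the same weight of 0.5.
print(uncertainty_weighted_loss([1.0, 3.0], [0.0, 0.0]))
# Raising the log-variance of the noisier second task shrinks its weight.
print(uncertainty_weighted_loss([1.0, 3.0], [0.0, 1.0]))
```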

Q32: "Walk me through the AdamW optimizer. Why was weight decay decoupled from L2 regularization?" (L4+)

Model Answer: "In standard Adam, L2 regularization adds lambda*w to the gradient, but Adam then divides by the square root of the second moment estimate, so the effective regularization varies per parameter - parameters with large gradients get less regularization. AdamW fixes this by decoupling: apply weight decay directly to the weights (w = w - lr*wd*w) separately from the gradient update. This ensures every parameter gets the same proportional decay regardless of gradient history. The practical difference is significant - AdamW often generalizes better in deep learning settings."
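
The difference is easiest to see on a single scalar parameter. The sketch below implements one update step of Adam with L2 folded into the gradient and one step of AdamW with decoupled decay; the function names and default hyperparameters are illustrative, and production code would use a framework optimizer instead.

```python
import math


def adam_l2_step(w, grad, m, v, t, lr=1e-3, wd=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step with L2 regularization added to the gradient."""
    g = grad + wd * w                      # penalty enters the gradient...
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # ...and is rescaled by Adam
    return w, m, v


def adamw_step(w, grad, m, v, t, lr=1e-3, wd=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """One AdamW step: weight decay is applied directly to the weight."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * wd * w
    return w, m, v


# With a zero task gradient, only the regularizer moves the weight.
w_l2, _, _ = adam_l2_step(1.0, 0.0, 0.0, 0.0, t=1)
w_decoupled, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0, t=1)
print(f"Adam + L2: {w_l2:.6f}   AdamW: {w_decoupled:.6f}")
```

In the Adam + L2 case the first step moves the weight by roughly lr no matter how small wd is, because the penalty is normalized by its own magnitude; AdamW shrinks the weight by exactly lr * wd * w.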

Q33: "How does label smoothing work and when does it help?" (L4)

Model Answer: "Instead of hard targets [0, 1], use soft targets [epsilon/K, 1-epsilon+epsilon/K] where K is the number of classes and epsilon is typically 0.1. This prevents the model from becoming overconfident - it can't drive logits to infinity to achieve zero loss. Benefits: better calibration, improved generalization, and more stable training. It helps most in classification tasks with potentially noisy labels. It can hurt when the true decision boundary is sharp and confidence matters (e.g., safety-critical binary decisions)."
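
A small sketch of the soft-target construction described above, using the same epsilon and K notation; the function name is an illustrative assumption.

```python
def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Return a smoothed target distribution for one example."""
    off_value = epsilon / num_classes            # mass given to every class
    targets = [off_value] * num_classes
    targets[true_class] = 1.0 - epsilon + off_value
    return targets


targets = smooth_labels(true_class=2, num_classes=5, epsilon=0.1)
print(targets)
```

Because the target for the true class is below 1, cross-entropy is minimized at finite logits, which is what removes the incentive for unbounded confidence.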

Meta

Q34: "You have 100M training examples. How do you do hyperparameter tuning efficiently?" (E5+)

Model Answer: "At this scale, full training for each hyperparameter configuration is prohibitive. Strategy: (1) Subsample: tune on a representative 1-5% subsample first to narrow the search space. (2) Multi-fidelity: train for fewer epochs to eliminate bad configs early (Hyperband/ASHA). (3) Bayesian optimization: use GP surrogate to model the objective and select promising configs. (4) Transfer from similar tasks: use hyperparameters that worked on related models as warm starts. (5) Parallel exploration: run 16-32 configs simultaneously on a cluster. Final step: validate the top 2-3 configs on the full dataset."
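
The multi-fidelity idea can be sketched as plain successive halving, the building block behind Hyperband and ASHA. The `toy_validation_loss` function below is a hypothetical stand-in for "train this config for a given budget and return validation loss"; the reduction factor of 3 is likewise an assumption.

```python
import math


def toy_validation_loss(learning_rate, budget):
    """Hypothetical objective: best at lr=1e-2, improves with more budget."""
    return (math.log10(learning_rate) + 2) ** 2 + 1.0 / budget


def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Repeatedly keep the best 1/eta of configs and multiply their budget by eta."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda cfg: evaluate(cfg, budget))
        survivors = ranked[:max(1, len(survivors) // eta)]  # drop the worst configs early
        budget *= eta                                       # spend more on the survivors
    return survivors[0]


candidate_lrs = [10 ** -e for e in range(1, 10)]  # 1e-1 down to 1e-9
best_lr = successive_halving(candidate_lrs, toy_validation_loss)
print(f"Selected learning rate: {best_lr}")
```

Most of the compute goes to the few configs that survive, which is what makes this affordable when a single full run on 100M examples is expensive.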

Q35: "Explain how you'd build a model to predict user engagement with <10ms latency." (E5)

Model Answer: "Latency constraint drives architecture choices. (1) Feature engineering: pre-compute heavy features offline (user embeddings, historical aggregates) and store in a feature store; only compute lightweight real-time features at serving time. (2) Model: distilled model - train a complex teacher (gradient boosting ensemble), distill into a small model (shallow NN or logistic regression). (3) Serving: batch predictions where possible, use model compilation (TorchScript, TensorRT, ONNX) for optimized inference. (4) Architecture: two-stage - fast first-pass model filters 90% of candidates, complex model scores remaining 10%. (5) Monitoring: track P99 latency, not just P50."

Q36: "What's the difference between offline and online metrics? When do they disagree?" (E4)

Model Answer: "Offline metrics (AUC, NDCG, RMSE) measure model quality on held-out data. Online metrics (CTR, revenue, retention) measure business impact via A/B tests. They disagree when: (1) the offline metric doesn't capture user experience (higher accuracy but worse UX), (2) there's a feedback loop (recommending popular items gets more clicks but reduces diversity), (3) presentation bias (position, UI changes affect online metrics independently), (4) long-term effects (short-term engagement up, long-term retention down). Always validate offline improvements with online experiments before full deployment."

Amazon

Q37: "How would you build a customer churn prediction model? Include the cost-benefit analysis." (L5+)

Model Answer: "Define churn operationally (no purchase in 90 days). Features: recency, frequency, monetary value (RFM), engagement metrics, support tickets, competitive pricing signals. Model: XGBoost with time-based CV (train on months 1-9, validate on 10, test on 11-12). Cost-benefit: the retention offer costs $C per customer, and a churned customer's lifetime value loss is $V. Assuming the offer retains every customer who would otherwise churn, expected profit from intervention = P(churn|intervene=no) * V - C. Only intervene when P(churn) > C/V. This gives a natural threshold. With a $50 retention offer and $500 CLV, threshold = 0.1. ROI = (prevented churns * $500 - interventions * $50) / (interventions * $50)."
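
The cost-benefit arithmetic can be written out directly. In the sketch below, `save_rate` (the fraction of would-be churners the offer actually retains) is an added assumption that defaults to 1.0 so the numbers match the answer above; the campaign figures in the ROI call are made up for illustration.

```python
def intervention_threshold(offer_cost, clv, save_rate=1.0):
    """Minimum churn probability at which an offer has positive expected profit."""
    return offer_cost / (clv * save_rate)


def expected_profit(p_churn, offer_cost, clv, save_rate=1.0):
    """Expected value saved by the offer minus what the offer costs."""
    return p_churn * save_rate * clv - offer_cost


def campaign_roi(prevented_churns, interventions, offer_cost, clv):
    """Return on spend for a retention campaign."""
    spend = interventions * offer_cost
    return (prevented_churns * clv - spend) / spend


print(intervention_threshold(offer_cost=50, clv=500))           # threshold from the answer
print(expected_profit(p_churn=0.3, offer_cost=50, clv=500))     # a clearly at-risk customer
print(campaign_roi(prevented_churns=200, interventions=1000,
                   offer_cost=50, clv=500))                     # hypothetical campaign
```

Lowering `save_rate` raises the threshold, which is a natural follow-up point: an offer that only works half the time should be reserved for customers with at least twice the churn risk.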

Q38: "Explain how you'd handle seasonality in a demand forecasting model." (L5)

Model Answer: "Decompose the signal: trend + seasonality + residual. Methods: (1) Feature engineering: day of week, month, holiday indicators, lagged features (same day last week/year). (2) Fourier features: sine/cosine terms at seasonal frequencies for smooth periodic patterns. (3) Prophet/ARIMA: explicit additive decomposition. (4) For ML models: include calendar features + lagged demand + rolling statistics. Key pitfall: the most recent data often doesn't include the same seasonal period (e.g., training in March can't learn December patterns from this year). Use at least 2 years of data, and weight recent data more heavily while preserving older seasonal patterns."
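
As an illustration of the Fourier-feature option, the sketch below builds sine/cosine pairs for a given seasonal period. The function name, the default yearly period of 365.25 days, and the order of 3 harmonics are illustrative assumptions.

```python
import math


def fourier_features(day_index, period=365.25, order=3):
    """Sine/cosine features for the first `order` harmonics of a seasonal period."""
    features = []
    for k in range(1, order + 1):
        angle = 2 * math.pi * k * day_index / period
        features.extend([math.sin(angle), math.cos(angle)])
    return features


# Weekly seasonality: day 3 and day 10 fall on the same weekday,
# so their features match up to floating-point error.
print(fourier_features(3, period=7, order=2))
print(fourier_features(10, period=7, order=2))
```

Higher `order` lets the model fit sharper seasonal shapes at the cost of more features; these columns are typically fed to the model alongside the calendar and lag features listed above.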

Q39: "Walk me through how you'd evaluate a recommendation system beyond accuracy." (L5)

Model Answer: "Offline: (1) Ranking metrics: NDCG@K, MAP, MRR for relevance ranking quality. (2) Coverage: what fraction of the item catalog appears in recommendations? Low coverage = popularity bias. (3) Diversity: intra-list diversity - are recommendations varied or repetitive? (4) Novelty: are we recommending items the user wouldn't have found on their own? (5) Fairness: are all item providers getting reasonable exposure? Online: (6) A/B test with engagement metrics (click-through, dwell time, conversion). (7) Long-term retention - does the recommendation system keep users coming back? (8) User satisfaction surveys."
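
Two of the offline metrics are short enough to sketch directly. The NDCG below uses linear gain (the relevance value itself rather than 2^rel - 1), which is one of several conventions; function names are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def catalog_coverage(recommendation_lists, catalog_size):
    """Fraction of the catalog that appears in at least one recommendation list."""
    shown = set()
    for recs in recommendation_lists:
        shown.update(recs)
    return len(shown) / catalog_size


print(ndcg_at_k([3, 2, 1], k=3))   # already in ideal order
print(ndcg_at_k([1, 2, 3], k=3))   # best item ranked last
print(catalog_coverage([[1, 2], [2, 3]], catalog_size=10))
```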

Apple

Q40: "How would you build a model that works well with differential privacy constraints?" (ICT4+)

Model Answer: "Differential privacy (DP) adds calibrated noise to guarantee that any individual's data doesn't significantly affect the model. DP-SGD: clip per-sample gradients to bound sensitivity, then add Gaussian noise scaled to the clipping norm, with the noise level set by the privacy budget (epsilon). Tradeoffs: smaller epsilon = more privacy = more noise = worse model quality. Strategies to mitigate accuracy loss: (1) larger batch sizes (noise is added once per batch, so averaging over more samples dilutes it), (2) pre-training on public data then fine-tuning with DP on private data, (3) use models with fewer parameters (the total noise norm grows with the number of parameters), (4) federated learning to keep data on-device."
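
The clip-then-noise step of DP-SGD can be sketched in a few lines. This illustration only shows the gradient computation for one batch; it does not track the privacy budget, which in practice requires a privacy accountant from a dedicated library. The function name and parameter names are assumptions.

```python
import math
import random


def dp_sgd_gradient(per_sample_grads, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """Clip each sample's gradient, sum, add Gaussian noise, then average."""
    rng = rng or random.Random(0)
    dim = len(per_sample_grads[0])
    summed = [0.0] * dim
    for grad in per_sample_grads:
        norm = math.sqrt(sum(x * x for x in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0  # bound each sample's influence
        for i, x in enumerate(grad):
            summed[i] += x * scale
    sigma = noise_multiplier * clip_norm        # noise is calibrated to the clipping norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / len(per_sample_grads) for x in noisy]


batch_grads = [[3.0, 4.0], [0.3, 0.4]]
print(dp_sgd_gradient(batch_grads, clip_norm=1.0, noise_multiplier=0.0))  # clipping only
print(dp_sgd_gradient(batch_grads, clip_norm=1.0, noise_multiplier=1.0))  # clipping + noise
```

Because the noise is added once to the sum and then divided by the batch size, doubling the batch halves the noise in the averaged gradient, which is the reason larger batches help.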

Q41: "Explain the tradeoffs between on-device and server-side ML." (ICT4)

Model Answer: "On-device: lower latency (no network), works offline, preserves privacy (data stays local), but constrained by device compute/memory (model must be small, quantized). Server-side: unlimited compute, can use large models, easier to update, but requires network, raises privacy concerns, and latency includes round-trip time. Hybrid approach: on-device for latency-sensitive inference (keyboard prediction, face detection), server-side for heavy computation (complex NLU, large-scale retrieval). On-device models need quantization (INT8, INT4), pruning, and distillation to fit. CoreML/TFLite for deployment."

OpenAI

Q42: "Explain the RLHF pipeline and how it differs from standard supervised fine-tuning." (Research Eng)

Model Answer: "RLHF has three stages: (1) SFT: supervised fine-tuning on high-quality demonstrations (standard). (2) Reward model training: human raters compare pairs of model outputs, a reward model learns to score outputs by human preference. (3) RL fine-tuning: use PPO to optimize the language model's policy against the reward model, with a KL penalty to prevent diverging too far from the SFT model. Key difference from SFT: SFT trains on 'correct' outputs, RLHF trains on 'preferred' outputs - it captures nuances (helpfulness, harmlessness, style) that are hard to specify in a loss function. Challenges: reward hacking, reward model quality, training instability."
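
Two pieces of this pipeline reduce to short formulas. The sketch below shows one common formulation: a pairwise (Bradley-Terry style) loss for the reward model, and a per-sample reward with a KL penalty estimated from log-probabilities. Function names and the beta value are illustrative assumptions, and real systems compute these over batches of token sequences.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def kl_penalized_reward(reward, logprob_policy, logprob_reference, beta=0.1):
    """Reward model score minus a penalty for drifting from the SFT reference model."""
    return reward - beta * (logprob_policy - logprob_reference)


print(pairwise_preference_loss(2.0, 0.0))   # reward model agrees with the human label
print(pairwise_preference_loss(0.0, 2.0))   # reward model disagrees: larger loss
print(kl_penalized_reward(1.0, logprob_policy=-1.0, logprob_reference=-2.0))
```

The penalty term grows when the policy assigns much higher probability to a response than the SFT model did, which is what limits reward hacking through degenerate outputs.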

Q43: "How would you detect and mitigate hallucinations in a language model?" (Research Eng)

Model Answer: "Detection: (1) Factual verification against a knowledge base or retrieval system. (2) Self-consistency: generate multiple responses and check agreement - high variance suggests hallucination. (3) Uncertainty estimation: high token-level entropy or low sequence-level confidence. (4) Entailment checking: verify that the output is entailed by the context/sources. Mitigation: (1) RAG: ground responses in retrieved documents. (2) Chain-of-thought: force step-by-step reasoning, making errors more detectable. (3) Fine-tuning to abstain: train the model to say 'I don't know' when uncertain. (4) Post-hoc filtering: a separate classifier or critic model that flags potentially hallucinated content."

Section 5 - Quick-Fire Round

20 rapid questions with 1-2 sentence answers. Practice answering each in under 15 seconds. This format is used in screening rounds to test breadth.

QF1: What's the difference between a generative and discriminative model? Generative models model P(x|y) and P(y) to compute P(y|x) - they model the data distribution. Discriminative models directly model P(y|x) - the decision boundary. Examples: naive Bayes (generative) vs. logistic regression (discriminative).

QF2: What is the kernel trick? It computes dot products in a high-dimensional feature space without explicitly transforming the data - enabling non-linear classifiers (SVM) for the cost of evaluating a kernel function on the original inputs.

QF3: What's the difference between precision and recall? Precision = TP/(TP+FP) - "of those predicted positive, how many are correct?" Recall = TP/(TP+FN) - "of those actually positive, how many did we catch?"

QF4: What is dropout? Randomly sets neuron activations to zero during training with probability p. Prevents co-adaptation of neurons and acts as approximate Bayesian inference (ensemble of sub-networks).

QF5: Why normalize/standardize features? Gradient-based optimizers converge faster with features on similar scales. Distance-based algorithms (k-NN, RBF-kernel SVM) and variance-based methods (PCA) are distorted by different scales.

QF6: What is the vanishing gradient problem? In deep networks, gradients can shrink exponentially through layers (especially with sigmoid/tanh activations), making early layers very slow or impossible to train. Mitigated with ReLU, residual connections, batch normalization.

QF7: Batch normalization - what does it do? Normalizes activations within a mini-batch to have zero mean and unit variance, then applies learned scale and shift. Stabilizes training, allows higher learning rates, acts as mild regularization.

QF8: What is transfer learning? Using a model pre-trained on one task (e.g., ImageNet classification) as a starting point for a related task (e.g., medical image classification). Fine-tune the last layers while freezing earlier layers.

QF9: ROC curve vs. precision-recall curve? ROC plots TPR vs. FPR - it can look optimistic under class imbalance because the many true negatives keep FPR low. PR curve plots precision vs. recall - more informative for imbalanced data because precision directly reflects false positives among predictions.

QF10: What is gradient clipping? Caps gradient magnitudes during backpropagation to prevent exploding gradients. Common in RNNs and transformer training. Typical max norm: 1.0 or 5.0.
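
A minimal sketch of clipping by global norm on a flat list of gradient values; frameworks apply the same rescaling across all parameter tensors. The function name is illustrative.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale gradients so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)              # already within bounds, leave untouched
    scale = max_norm / total_norm
    return [g * scale for g in grads]   # direction preserved, magnitude capped


print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))
print(clip_by_global_norm([0.1, 0.2], max_norm=1.0))
```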

QF11: Explain the difference between SGD and Adam. SGD uses a single global learning rate for all parameters. Adam adapts the step size per parameter using running averages of the first moment (momentum) and second moment (squared gradient), usually converging faster with less tuning.

QF12: What is early stopping? Stop training when validation loss stops improving for N epochs (patience). Prevents overfitting without explicitly setting regularization strength - the number of training steps acts as implicit regularization.
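
The patience rule can be sketched over a recorded list of validation losses. The function name and the `min_delta` parameter (minimum improvement that counts) are illustrative assumptions.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    stale_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, stale_epochs = loss, epoch, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                return epoch, best_epoch   # stop here, restore the best checkpoint
    return len(val_losses) - 1, best_epoch  # never triggered: trained to the end


history = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
stop_epoch, best_epoch = early_stopping_epoch(history, patience=3)
print(f"Stopped at epoch {stop_epoch}, best checkpoint from epoch {best_epoch}")
```

In practice the weights saved at `best_epoch` are restored, not the weights from the epoch where training stopped.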

QF13: What is multicollinearity and why does it matter? When features are highly correlated, regression coefficients become unstable (small data changes cause large coefficient swings). Doesn't affect predictions much but makes coefficients uninterpretable. Fix with PCA, dropping features, or Ridge regression.

QF14: What is a confusion matrix? A 2x2 (or NxN) table showing counts of True Positives, False Positives, True Negatives, and False Negatives. It's the basis for computing precision, recall, F1, and accuracy.

QF15: Parametric vs. non-parametric models? Parametric: fixed number of parameters regardless of data size (linear regression, neural nets). Non-parametric: complexity grows with data (k-NN, decision trees, Gaussian processes). Neither is inherently better - depends on data and constraints.

QF16: What is the reparameterization trick? Used in VAEs to allow gradients to flow through stochastic sampling. Instead of sampling z ~ N(mu, sigma^2) directly, compute z = mu + sigma * epsilon where epsilon ~ N(0, 1). Now z is differentiable w.r.t. mu and sigma.
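
A tiny sketch of why the rewrite helps: once the randomness is isolated in epsilon, the partial derivatives of z with respect to mu and sigma are simple closed forms. An autodiff framework would compute these automatically; here they are returned explicitly, and the function name is illustrative.

```python
import random


def sample_reparameterized(mu, sigma, rng):
    """Draw z = mu + sigma * eps and return it with dz/dmu and dz/dsigma."""
    eps = rng.gauss(0.0, 1.0)   # all randomness lives here, independent of mu and sigma
    z = mu + sigma * eps
    dz_dmu = 1.0
    dz_dsigma = eps
    return z, dz_dmu, dz_dsigma


rng = random.Random(0)
z, dz_dmu, dz_dsigma = sample_reparameterized(mu=2.0, sigma=0.5, rng=rng)
print(z, dz_dmu, dz_dsigma)
```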

QF17: What is target leakage? When a feature encodes information about the target that wouldn't be available at prediction time. Example: "treatment_outcome" as a feature for predicting whether a patient needs treatment.

QF18: What is a learning rate schedule? Changing the learning rate during training. Common: warmup (start low, increase), cosine annealing (decrease following cosine curve), step decay (drop by factor at fixed epochs). Helps escape plateaus and converge to better optima.

QF19: What is stratified sampling? Splitting data so each split preserves the original class distribution. Critical for imbalanced classification - prevents folds with zero minority samples.

QF20: What is the No Free Lunch theorem? No single algorithm is best for all problems. Any algorithm that excels on one class of problems must underperform on another. Implication: always try multiple approaches and validate empirically.

Cross-Reference Index

| Topic | Detailed Page |
| --- | --- |
| Bias-Variance Tradeoff | 01 - Bias-Variance |
| Loss Functions | 02 - Loss Functions |
| Regularization (L1, L2, Dropout, etc.) | 03 - Regularization |
| Optimization (SGD, Adam, LR schedules) | 04 - Optimization |
| Evaluation Metrics (F1, AUC, etc.) | 05 - Evaluation Metrics |
| Feature Engineering | 06 - Feature Engineering |
| Ensemble Methods (RF, XGBoost, etc.) | 07 - Ensemble Methods |
| Cross-Validation | 08 - Cross-Validation |
| Handling Imbalanced Data | 09 - Handling Imbalance |
| Dimensionality Reduction (PCA, UMAP, etc.) | 10 - Dimensionality Reduction |
| Probabilistic ML (Bayes, GPs, BNNs) | 11 - Probabilistic ML |

Interview Cheat Sheet

| Round Type | Questions | Time Per Q | Depth Expected |
| --- | --- | --- | --- |
| Phone Screen | 5-8 | 2-3 min | Definition + intuition + one example |
| Technical Deep Dive | 3-5 | 8-12 min | Full explanation + tradeoffs + follow-ups |
| Senior/Staff | 2-3 | 15-20 min | System design + mathematical depth + production considerations |
| Quick-Fire | 15-20 | 15-30 sec | Crisp 1-2 sentence answer |

Answer Framework for Every Question

  1. WHAT: Define the concept (1-2 sentences)
  2. WHY: Explain the intuition - why does this work? (2-3 sentences)
  3. WHEN: When to use it and when NOT to (specific scenarios)
  4. TRADE-OFFS: Limitations, alternatives, what you'd consider in production

Spaced Repetition Checkpoints

Day 0 - Initial Learning

  • Answer all 15 screening questions aloud (time yourself: 2 min each)
  • Grade yourself on each - identify any "No Hire" areas
  • Answer all 20 quick-fire questions in under 5 minutes total
  • Read the detailed pages for any topics where you scored "No Hire"

Day 3 - Recall

  • Without looking, answer Q1-Q10 aloud again
  • Attempt two technical deep dive questions (Q16-Q25)
  • Review your weakest 3 topics from Day 0
  • Practice the quick-fire round again - target under 30 seconds each

Day 7 - Application

  • Answer all screening questions without preparation
  • Attempt Q26 (fraud detection design) with full system thinking
  • Practice answering with follow-up questions (have a friend probe deeper)
  • Score yourself against the rubrics

Day 14 - Integration

  • Do a mock interview: 5 random questions from any section, 45 minutes total
  • Practice company-specific questions for your target companies
  • Attempt staff-level questions Q26-Q30
  • Identify gaps in your knowledge and fill them with the detailed pages

Day 21 - Mastery

  • Full mock interview with all difficulty levels mixed
  • Can you answer any question in this bank confidently?
  • Can you handle 2-3 levels of follow-up on each answer?
  • Practice system design questions (Q26-Q28) end-to-end in under 20 minutes each
© 2026 EngineersOfAI. All rights reserved.