The Write-Up - Turning Analysis Into a Hiring Decision
Reading time: ~40 min | Interview relevance: Critical | Roles: MLE, AI Eng, Data Scientist, Research Engineer
The Real Interview Moment
You are a hiring manager at Stripe. You have eight take-home submissions on your desk. It is 9 PM, and you have a hiring committee meeting at 9 AM tomorrow. You cannot run any code tonight - you can only read. You open the first submission: a raw Jupyter notebook with 60 code cells, no markdown, and a single print(classification_report(y_test, y_pred)) at the bottom. You spend four minutes scrolling, cannot find the punchline, and move on. You open the second submission: a clean write-up with a two-paragraph executive summary, three well-labeled figures, a methodology section that explains every decision, and a "Next Steps" section showing what the candidate would do with more time. You understand the entire analysis in three minutes. You write "advance to on-site" and move to the next submission.
The difference between these two candidates was not the quality of their models. It was the quality of their communication. The first candidate may have done better analysis, but you will never know - they did not make it readable. This page teaches you how to write take-home results that get read, understood, and remembered.
What You Will Master
- Structure a write-up that guides the evaluator from problem to conclusion in under five minutes
- Write an executive summary that captures the key insight in two paragraphs
- Build visualizations that answer questions instead of decorating pages
- Create a technical appendix that demonstrates depth without cluttering the main narrative
- Prepare a follow-up presentation that summarizes your work in 10-15 minutes
- Handle follow-up questions with structured, confident responses
- Adapt your write-up style for different company cultures and evaluator types
Self-Assessment: Where Are You Now?
| Skill | 1 -- Cannot | 2 -- Vaguely | 3 -- Can Do | 4 -- Consistently | 5 -- Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Write a two-paragraph executive summary | | | | | | ___ |
| Structure a write-up with clear sections | | | | | | ___ |
| Create visualizations that support decisions | | | | | | ___ |
| Explain methodology decisions concisely | | | | | | ___ |
| Write a technical appendix | | | | | | ___ |
| Present results in 10-15 minutes | | | | | | ___ |
| Handle adversarial questions calmly | | | | | | ___ |
| Adapt communication style to the audience | | | | | | ___ |
Target: All 4s and 5s before your interview.
Part 1 -- The Write-Up Structure
The Five-Section Framework
Every take-home write-up should follow a structure that mirrors how the evaluator thinks. They want to know, in order: What did you do? Why does it matter? How did you do it? What did you find? What would you do next?
The Two-Speed Reading Principle
Your write-up will be read at two speeds:
- Speed read (2-3 minutes): The evaluator reads only the executive summary, looks at the figures, and skims the results table. This is how most evaluators form their initial impression.
- Deep read (15-30 minutes): If the speed read is compelling, the evaluator reads the full methodology and may run your code. This is where they validate their initial impression.
Design your write-up so that both reads are satisfying. The speed read should convey the complete story. The deep read should reinforce it with rigor.
"I structure my write-ups with five sections: executive summary, problem understanding, methodology, results, and next steps. The executive summary is two paragraphs - the first states the problem and approach, the second states the key results and their business implications. Every figure has a clear takeaway in its title. The methodology section explains not just what I did but why I made each decision. The next steps section shows that I understand this is a starting point, not a finished product."
Part 2 -- The Executive Summary
The Two-Paragraph Formula
The executive summary is the most important part of your write-up. Many evaluators read only this section. It must be self-contained.
Paragraph 1: Problem and Approach
- One sentence: What is the problem?
- One sentence: What is the business context or impact?
- One sentence: What was your approach at a high level?
Paragraph 2: Results and Implications
- One sentence: What are the key quantitative results?
- One sentence: How do they compare to a baseline?
- One sentence: What is the actionable takeaway?
Example: Strong Executive Summary
## Executive Summary
This analysis addresses the problem of predicting customer churn for a
subscription-based SaaS product. With an 8% monthly churn rate costing an
estimated $2.4M annually in lost revenue, even modest improvements in
early identification can drive significant retention savings. I developed
a LightGBM classifier trained on RFM features, engagement velocity
metrics, and usage pattern features, evaluated using precision-recall AUC
to account for the severe class imbalance.
The final model achieves a PR-AUC of 0.43 (5.4x improvement over the
0.08 random baseline), identifying 62% of churners in the top decile of
risk scores. At a 30% precision threshold - where each intervention
costs roughly $50 in CSM time - the model would flag approximately 340
customers per month, of which ~100 would actually churn, yielding an
estimated $180K annual savings assuming a 30% save rate. Key predictive
signals are declining login frequency (past 14 days vs. prior 14 days),
days since last support ticket resolution, and contract renewal proximity.
Example: Weak Executive Summary
## Summary
I used LightGBM to predict churn. I tried several models including
random forest and logistic regression. LightGBM performed the best.
The AUC was 0.91. I used 5-fold cross-validation.
Do not report only ROC-AUC for imbalanced classification problems. An AUC of 0.91 sounds impressive but means little if the evaluator cannot translate it into a business decision, and with heavy class imbalance ROC-AUC is inflated by the large number of true negatives. Always pair statistical metrics with business-interpretable metrics: "identifies X% of churners in the top decile" or "at Z% precision, we would flag N customers per month."
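Both business-facing numbers can be computed directly from labels and risk scores. The sketch below is a dependency-free illustration rather than a prescribed implementation: the function names and the toy data are assumptions made for this example, and in a real notebook you would likely vectorize the same logic.

```python
from typing import List, Sequence, Tuple


def capture_rate_at_top_fraction(
    y_true: Sequence[int], y_scores: Sequence[float], fraction: float = 0.10
) -> float:
    """Share of all positives found in the top `fraction` of risk scores."""
    total_positives = sum(y_true)
    if total_positives == 0:
        raise ValueError("y_true contains no positives")
    # Rank customers from highest to lowest risk score.
    ranked = sorted(zip(y_scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    captured = sum(label for _, label in ranked[:n_top])
    return captured / total_positives


def flagged_and_precision(
    y_true: Sequence[int], y_scores: Sequence[float], threshold: float
) -> Tuple[int, float]:
    """Number of customers flagged at `threshold` and the precision among them."""
    flagged: List[int] = [
        label for score, label in zip(y_scores, y_true) if score >= threshold
    ]
    if not flagged:
        return 0, 0.0
    return len(flagged), sum(flagged) / len(flagged)


# Hypothetical toy data: 20 customers, 4 of whom churned.
y_scores = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50,
            0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05, 0.01]
y_true = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

capture = capture_rate_at_top_fraction(y_true, y_scores, fraction=0.10)
n_flagged, precision = flagged_and_precision(y_true, y_scores, threshold=0.60)
print(f"Top decile captures {capture:.0%} of churners")
print(f"At threshold 0.60 we flag {n_flagged} customers at {precision:.0%} precision")
```

Sentences built from these two numbers are what a non-technical reader can act on; the ROC-AUC or PR-AUC figure then serves as supporting evidence.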
The "So What?" Test
After writing your executive summary, read it and ask: "If I were a VP of Product, would I know what to do with this information?" If the answer is no, rewrite it. Technical correctness without actionability is a missed opportunity.
Part 3 -- Visualizations That Communicate
The Purpose-Driven Visualization Framework
Every figure in your write-up should answer a specific question. If you cannot state the question a figure answers, delete it.
| Question | Visualization | Example Title |
|---|---|---|
| What does the data look like? | Distribution plots, class balance bar | "Target class distribution: 8% churn rate creates a 12:1 imbalance" |
| Which features matter? | Feature importance bar chart | "Top 10 features: engagement velocity dominates, demographics contribute little" |
| How well does the model perform? | PR curve, calibration plot | "Precision-recall tradeoff: 62% recall at 30% precision threshold" |
| Where does the model fail? | Confusion matrix, error analysis | "False negatives cluster in recently onboarded users (< 30 days)" |
| How do models compare? | Grouped bar chart, comparison table | "LightGBM outperforms Random Forest by 12% PR-AUC across all folds" |
The Four Rules of Take-Home Figures
Rule 1: Title is the takeaway, not the description.
# BAD - describes the chart
fig.suptitle("Feature Importance Plot")
# GOOD - states the finding
fig.suptitle(
    "Login frequency decline is 3x more predictive than any demographic feature"
)
Rule 2: Label everything.
from typing import Optional

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import average_precision_score, precision_recall_curve


def plot_precision_recall_curve(
    y_true: pd.Series,
    y_scores: np.ndarray,
    model_name: str = "LightGBM",
    save_path: Optional[str] = None,
) -> None:
    """Plot PR curve with baseline and operating point annotation."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    pr_auc = average_precision_score(y_true, y_scores)
    # A random classifier's expected precision equals the positive rate
    baseline = y_true.mean()
    fig, ax = plt.subplots(figsize=(8, 6))
    # Main curve
    ax.plot(recall, precision, linewidth=2, label=f"{model_name} (PR-AUC={pr_auc:.3f})")
    # Baseline
    ax.axhline(y=baseline, color="red", linestyle="--",
               label=f"Random baseline ({baseline:.3f})")
    # Operating point annotation: the curve point whose precision is closest to the target
    target_precision = 0.30
    idx = np.argmin(np.abs(precision[:-1] - target_precision))
    ax.plot(recall[idx], precision[idx], "ko", markersize=10)
    ax.annotate(
        f"Operating point\nPrecision={precision[idx]:.2f}, Recall={recall[idx]:.2f}",
        xy=(recall[idx], precision[idx]),
        xytext=(recall[idx] + 0.1, precision[idx] + 0.1),
        arrowprops=dict(arrowstyle="->"),
        fontsize=10,
        bbox=dict(boxstyle="round,pad=0.3", facecolor="wheat"),
    )
    ax.set_xlabel("Recall", fontsize=12)
    ax.set_ylabel("Precision", fontsize=12)
    # Build the takeaway title from the computed values so it cannot drift from the data
    ax.set_title(
        f"Model identifies {recall[idx]:.0%} of churners at {precision[idx]:.0%} precision\n"
        f"({pr_auc / baseline:.1f}x PR-AUC improvement over random baseline)",
        fontsize=13,
        fontweight="bold",
    )
    ax.legend(fontsize=11)
    ax.set_xlim([0, 1])
    ax.set_ylim([0, 1])
    plt.tight_layout()
    if save_path:
        fig.savefig(save_path, dpi=150, bbox_inches="tight")
    plt.show()
Rule 3: Use consistent styling.
# Define a consistent color palette for your entire write-up
COLORS = {
    "primary": "#2563eb",
    "secondary": "#7c3aed",
    "positive": "#16a34a",
    "negative": "#dc2626",
    "neutral": "#6b7280",
    "baseline": "#dc2626",
}
# Apply consistent matplotlib style
plt.rcParams.update({
    "figure.figsize": (10, 6),
    "font.size": 11,
    "axes.titlesize": 13,
    "axes.labelsize": 12,
    "axes.grid": True,
    "grid.alpha": 0.3,
    "legend.fontsize": 10,
})
Rule 4: Less is more.
Include 4-6 figures maximum. Each one should earn its place.
Never include a default sns.heatmap(df.corr()) with 40+ features. It is unreadable, it does not answer a question, and it signals that you are filling space instead of thinking. If you must show correlations, show the top 10 feature pairs with a bar chart and explain why they matter.
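One way to produce the "top correlated pairs" list is sketched below. It is written in plain Python so it runs without dependencies; the function names and the toy feature values are assumptions for illustration, and the resulting pairs would feed a simple horizontal bar chart.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def top_correlated_pairs(
    features: Dict[str, List[float]], n: int = 10
) -> List[Tuple[str, str, float]]:
    """Return the n feature pairs with the largest absolute correlation."""
    pairs = [
        (a, b, pearson(features[a], features[b]))
        for a, b in combinations(features, 2)  # each unordered pair once
    ]
    pairs.sort(key=lambda pair: abs(pair[2]), reverse=True)
    return pairs[:n]


# Hypothetical toy features for five customers.
features = {
    "frequency_30d": [1, 2, 3, 4, 5],
    "frequency_60d": [2, 4, 6, 8, 11],
    "tenure_months": [5, 1, 4, 2, 3],
}
top_pairs = top_correlated_pairs(features, n=2)
for name_a, name_b, r in top_pairs:
    print(f"{name_a} / {name_b}: r = {r:+.2f}")
```

Ranking by absolute value keeps strong negative correlations in the list, which a color-only heatmap makes easy to overlook.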
Visualization Templates for Common Scenarios
Model Comparison Table (better than a chart for 2-4 models)
from typing import Dict

import pandas as pd


def create_model_comparison_table(
    results: Dict[str, Dict[str, float]],
) -> pd.DataFrame:
    """Create a model comparison table ranked by PR-AUC.

    Args:
        results: Dict mapping model names to metric dictionaries.

    Returns:
        DataFrame with one row per model, sorted by PR-AUC (best first),
        with a leading rank column.
    """
    # Transpose so that models are rows and metrics are columns
    comparison = pd.DataFrame(results).T
    comparison = comparison.round(4)
    comparison = comparison.sort_values("pr_auc", ascending=False)
    # Add rank column
    comparison.insert(0, "rank", range(1, len(comparison) + 1))
    return comparison


# Example usage
results = {
    "LightGBM": {"roc_auc": 0.912, "pr_auc": 0.431, "f1": 0.387, "train_time_sec": 12},
    "Random Forest": {"roc_auc": 0.889, "pr_auc": 0.382, "f1": 0.341, "train_time_sec": 45},
    "Logistic Reg.": {"roc_auc": 0.834, "pr_auc": 0.276, "f1": 0.289, "train_time_sec": 2},
    "Baseline (majority)": {"roc_auc": 0.500, "pr_auc": 0.080, "f1": 0.000, "train_time_sec": 0},
}
comparison = create_model_comparison_table(results)
Error Analysis Plot
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import average_precision_score


def plot_error_analysis(
    y_true: pd.Series,
    y_pred: np.ndarray,
    segment_col: pd.Series,
    segment_name: str = "Customer Segment",
) -> None:
    """Plot model performance broken down by a segment variable.

    This reveals where the model underperforms - critical for
    demonstrating analytical depth in a take-home.
    `y_pred` should hold predicted scores (probabilities), not hard labels.
    """
    df = pd.DataFrame({
        "y_true": y_true.values,
        "y_pred": y_pred,
        "segment": segment_col.values,
    })
    segment_metrics = []
    for segment in df["segment"].unique():
        mask = df["segment"] == segment
        # Skip tiny segments and segments with no positives (PR-AUC is undefined there)
        if mask.sum() < 10 or df.loc[mask, "y_true"].sum() == 0:
            continue
        segment_metrics.append({
            "segment": segment,
            "n_samples": mask.sum(),
            "pr_auc": average_precision_score(
                df.loc[mask, "y_true"], df.loc[mask, "y_pred"]
            ),
            "churn_rate": df.loc[mask, "y_true"].mean(),
        })
    metrics_df = pd.DataFrame(segment_metrics).sort_values("pr_auc")
    fig, ax = plt.subplots(figsize=(10, 6))
    # COLORS is the palette defined under Rule 3
    bars = ax.barh(
        metrics_df["segment"].astype(str), metrics_df["pr_auc"], color=COLORS["primary"]
    )
    # Color the worst-performing segments
    for i, bar in enumerate(bars):
        if metrics_df.iloc[i]["pr_auc"] < 0.3:
            bar.set_color(COLORS["negative"])
    ax.set_xlabel("PR-AUC")
    ax.set_ylabel(segment_name)
    # Replace this title with the finding your own data supports
    ax.set_title(
        "Model underperforms on new customers (< 30 days) - "
        "insufficient behavioral history",
        fontsize=13,
        fontweight="bold",
    )
    plt.tight_layout()
    plt.show()
Data science roles at consumer companies (Meta, Netflix, Spotify) weight visualization quality heavily - they want to see that you can communicate findings to product managers. ML engineering roles at infrastructure companies (Google, Amazon AWS) care less about pretty plots and more about rigorous evaluation methodology. Adjust your emphasis accordingly.
Part 4 -- The Methodology Section
Decisions, Not Descriptions
The methodology section is where most candidates lose points. They describe what they did but not why they did it. Every methodological choice should be accompanied by a rationale.
The Decision-Rationale Template
For each major decision, use this template:
Decision: [What you chose]
Rationale: [Why you chose it]
Alternative considered: [What else you thought about]
Why rejected: [Why the alternative was worse for this problem]
Example: Methodology Section
### Feature Engineering
**Decision:** Engineered RFM (Recency, Frequency, Monetary) features
and engagement velocity metrics rather than using raw transactional data.
**Rationale:** Raw transactions are at the event level (avg. 47 per
customer), while our prediction target is at the customer level.
Aggregation is necessary, and RFM features are an established framework
for capturing customer behavior patterns in churn prediction.
**Key features (8 total):**
1. **recency_days** - Days since last transaction (captures disengagement)
2. **frequency_30d** - Transactions in last 30 days (captures current engagement)
3. **monetary_avg** - Average transaction value (captures customer tier)
4. **login_velocity** - (logins in last 14 days) / (logins in prior 14 days).
Values < 1.0 indicate declining engagement. This feature has the highest
importance in the final model (gain = 0.31).
5-8. Rolling aggregates at 7, 14, 30, 60-day windows for login counts.
**Alternative considered:** Including raw demographic features (age, location,
plan type). Rejected after initial analysis showed < 2% importance in a
preliminary Random Forest. Demographics are poor predictors of churn timing
in this dataset, likely because the customer base is homogeneous.
### Model Selection
**Decision:** LightGBM with 5-fold stratified cross-validation,
optimized for PR-AUC.
**Rationale:**
- **LightGBM over Random Forest:** 12% higher PR-AUC (0.43 vs. 0.38)
with 4x faster training time, enabling more hyperparameter exploration.
- **LightGBM over Logistic Regression:** Non-linear feature interactions
(e.g., recency x frequency) are captured automatically. LR required
manual interaction terms and still underperformed by 15% PR-AUC.
- **Stratified CV:** Necessary because of 8% positive rate. Random splits
risk folds with < 5% positives, producing unstable estimates.
- **PR-AUC over ROC-AUC:** With 12:1 class imbalance, ROC-AUC is
inflated by the large number of true negatives. PR-AUC focuses on the
minority class, which is the actionable class.
**Hyperparameter tuning:** Bayesian optimization (Optuna, 50 trials) over
learning_rate, max_depth, num_leaves, min_child_samples, subsample, and
colsample_bytree. Best parameters listed in Appendix A.
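The login_velocity feature described in the example is simple enough to sketch in a few lines. The version below is one possible reading of the definition, not the only one: the function signature, the choice to return None when the prior window is empty, and the sample dates are all assumptions made for illustration.

```python
from datetime import date, timedelta
from typing import Iterable, Optional


def login_velocity(
    login_dates: Iterable[date], as_of: date, window_days: int = 14
) -> Optional[float]:
    """Logins in the last `window_days` divided by logins in the prior window.

    Returns None when the prior window has no logins, because the ratio
    is undefined there (assumption: the caller decides how to impute it).
    """
    dates = list(login_dates)
    recent_start = as_of - timedelta(days=window_days)
    prior_start = as_of - timedelta(days=2 * window_days)
    # Windows are half-open on the left: (start, end]
    recent = sum(1 for d in dates if recent_start < d <= as_of)
    prior = sum(1 for d in dates if prior_start < d <= recent_start)
    if prior == 0:
        return None
    return recent / prior


# Hypothetical customer: four logins in the prior window, two in the recent one.
as_of = date(2024, 3, 29)
logins = [
    date(2024, 3, 2), date(2024, 3, 5), date(2024, 3, 9), date(2024, 3, 12),
    date(2024, 3, 18), date(2024, 3, 25),
]
velocity = login_velocity(logins, as_of)
print(f"login_velocity = {velocity}")  # values below 1.0 indicate declining engagement
```

How the undefined case is handled (None, a sentinel value, or a smoothed ratio) is itself a methodology decision worth one line of rationale in the write-up.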
The candidates I advance to on-site are the ones who anticipate my questions. When I read "LightGBM over Logistic Regression" and they have already provided the PR-AUC comparison, I do not need to ask "did you try simpler models?" That saves time in the follow-up and signals thorough thinking.
The Methodology Pitfalls
| Pitfall | Example | Fix |
|---|---|---|
| Description without rationale | "I used LightGBM" | "I chose LightGBM because it outperformed RF by 12% PR-AUC" |
| Rationale without evidence | "LightGBM is the best for tabular data" | Show the comparison table with numbers |
| Missing baseline | "PR-AUC of 0.43" | "PR-AUC of 0.43 vs. 0.08 random baseline (5.4x improvement)" |
| Ignoring class imbalance | "Accuracy of 92%" | "92% accuracy is trivially achieved by predicting majority class. PR-AUC = 0.43." |
| No alternative models tried | "I used XGBoost" | "Compared LR, RF, XGBoost, LightGBM - see comparison table" |
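The "ignoring class imbalance" pitfall is easy to demonstrate with arithmetic. A minimal sketch, assuming a hypothetical sample of 100 customers at the 8% churn rate used on this page: a model that always predicts "no churn" reaches 92% accuracy while catching no churners.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of actual positives that were predicted positive."""
    positives = sum(y_true)
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return true_positives / positives


# Hypothetical sample: 8 churners, 92 non-churners.
y_true = [1] * 8 + [0] * 92
majority_pred = [0] * len(y_true)  # always predict "no churn"

majority_accuracy = accuracy(y_true, majority_pred)
majority_recall = recall(y_true, majority_pred)
print(f"Accuracy: {majority_accuracy:.0%}, churners caught: {majority_recall:.0%}")
```

This is why the fix column pairs the accuracy figure with PR-AUC: the headline number needs a baseline that a trivial model cannot match.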
Part 5 -- The Technical Appendix
What Goes in the Appendix
The appendix is for detail that supports your claims but would clutter the main narrative. Think of it as the "show your work" section for the deep reader.
Appendix Template
## Appendix A: Hyperparameter Optimization
Search method: Bayesian optimization (Optuna)
Trials: 50
Objective: 5-fold stratified CV PR-AUC (mean)
| Parameter | Search Range | Best Value | Default |
|-----------|-------------|------------|---------|
| learning_rate | [0.01, 0.3] | 0.047 | 0.1 |
| max_depth | [3, 10] | 6 | -1 |
| num_leaves | [15, 63] | 31 | 31 |
| min_child_samples | [5, 50] | 18 | 20 |
| subsample | [0.5, 1.0] | 0.82 | 1.0 |
| colsample_bytree | [0.5, 1.0] | 0.76 | 1.0 |
| reg_alpha | [0, 1.0] | 0.08 | 0.0 |
| reg_lambda | [0, 1.0] | 0.12 | 0.0 |
Tuned PR-AUC: 0.431 +/- 0.012
Default PR-AUC: 0.409 +/- 0.022
Improvement from tuning: +5.4%
## Appendix B: Per-Fold Cross-Validation Results
| Fold | Train PR-AUC | Val PR-AUC | Val ROC-AUC | n_positive | n_negative |
|------|-------------|------------|-------------|------------|------------|
| 1 | 0.891 | 0.442 | 0.918 | 312 | 3,588 |
| 2 | 0.887 | 0.419 | 0.904 | 308 | 3,592 |
| 3 | 0.893 | 0.448 | 0.921 | 315 | 3,585 |
| 4 | 0.889 | 0.427 | 0.911 | 310 | 3,590 |
| 5 | 0.885 | 0.420 | 0.907 | 305 | 3,595 |
| **Mean** | **0.889** | **0.431** | **0.912** | **310** | **3,590** |
| **Std** | **0.003** | **0.012** | **0.007** | **4** | **4** |
Observations:
- Low standard deviation (0.012) across folds indicates stable performance
- No outlier folds (all within 1.5 std of mean)
- Consistent train-val gap (~0.46) indicates substantial overfitting;
validation scores are stable, but stronger regularization is worth testing
## Appendix C: Feature Correlation Analysis
Top 3 correlated feature pairs (Pearson):
1. frequency_30d / frequency_60d: 0.89 - expected overlap in time windows
2. monetary_avg / monetary_total: 0.76 - kept both as they capture
different aspects (average ticket vs. lifetime value)
3. login_velocity / frequency_30d: 0.52 - moderate; velocity captures
trend while frequency captures level
Decision: No features removed due to multicollinearity. LightGBM handles
correlated features well via feature subsampling (colsample_bytree=0.76).
For take-homes with 4-hour time limits, an appendix is optional. Include one only if you have time after completing the main write-up. For 8-hour or weekend projects, an appendix is expected and demonstrates thoroughness. Even a brief appendix with the hyperparameter table and per-fold results adds value.
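The summary rows and observations in a per-fold table should be reproducible from the fold values themselves. The sketch below recomputes them for the Appendix B numbers using only the standard library; the variable names are illustrative.

```python
from statistics import mean, pstdev, stdev

# Per-fold values copied from the Appendix B table above.
train_pr_auc = [0.891, 0.887, 0.893, 0.889, 0.885]
val_pr_auc = [0.442, 0.419, 0.448, 0.427, 0.420]

val_mean = mean(val_pr_auc)
val_std_population = pstdev(val_pr_auc)  # divides by n
val_std_sample = stdev(val_pr_auc)       # divides by n - 1

# Largest deviation of any fold from the mean, in standard deviations.
max_abs_z = max(abs(v - val_mean) / val_std_population for v in val_pr_auc)

# Gap between mean training and mean validation performance.
train_val_gap = mean(train_pr_auc) - val_mean

print(f"Val PR-AUC: {val_mean:.3f} +/- {val_std_population:.3f}")
print(f"Sample std would be {val_std_sample:.3f}")
print(f"Largest fold deviation: {max_abs_z:.2f} std")
print(f"Train-val gap: {train_val_gap:.2f}")
```

The 0.012 in the table corresponds to the population standard deviation; the sample standard deviation of the same five folds rounds to 0.013. Stating which one you report avoids a follow-up question.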
Part 6 -- The Follow-Up Presentation
Presentation Structure (10-15 Minutes)
Many companies follow the take-home with a 30-45 minute session where you present your work (10-15 minutes) and then answer questions (15-30 minutes). This is where offers are won or lost.
Slide-by-Slide Guide
Slide 1: Problem Statement (30 seconds)
- Restate the problem in your own words
- State the business context: why this matters
- State the evaluation metric and why you chose it
Slide 2: Approach Overview (90 seconds)
- High-level pipeline: Data -> Features -> Model -> Evaluation
- State 2-3 key decisions at a high level (details on next slides)
- Mention what you did NOT do and why (scope management)
Slides 3-4: Key Methodology Decisions (3 minutes)
- Feature engineering: what features and why
- Model choice: what you compared and what won
- Show the model comparison table
- Show feature importance (top 5-10 features)
Slides 5-6: Results (3 minutes)
- Performance metrics with baseline comparison
- PR curve or most relevant performance visualization
- Business-interpretable result: "Top decile captures X% of churners"
- Confidence intervals or cross-validation stability
Slide 7: Error Analysis (2 minutes)
- Where does the model fail?
- What segments underperform?
- What types of errors are most costly?
- This slide separates good candidates from great candidates
Slide 8: Next Steps and Limitations (2 minutes)
- What would you do with one more day?
- What would you do with one more month?
- What are the known limitations?
- What assumptions did you make that might not hold?
Presentation Anti-Patterns
| Anti-Pattern | Why It Fails | Fix |
|---|---|---|
| Reading code on slides | Evaluators cannot parse code at presentation speed | Show results and key function signatures, not implementations |
| Walking through every EDA plot | Loses the audience in details | Show 1-2 key EDA insights that drove decisions |
| Apologizing for what you did not do | Undermines confidence | Frame limitations as "next steps" - forward-looking, not defensive |
| Skipping the baseline | Results have no context | Always show "X vs. baseline" comparisons |
| Going over time | Signals poor preparation | Practice with a timer; cut content rather than speed up |
Do not spend more than 20% of your presentation on EDA. The evaluator has already seen your notebook - they know what the data looks like. Spend 60% on methodology and results, and 20% on error analysis and next steps. The ratio of "what I decided" to "what I observed" should be at least 3:1.
Part 7 -- Handling Follow-Up Questions
The Question Taxonomy
Follow-up questions fall into five categories. Recognizing the category helps you structure your answer.
The STAR-T Framework for Technical Questions
Adapt the STAR framework for technical follow-ups:
- Situation: Acknowledge the question's context
- Thought process: Explain your reasoning framework
- Action: What you would do or did
- Result: Expected outcome or observed result
- Tradeoff: What you would give up and why it is acceptable
Example Q&A Exchanges
Question (Challenge): "You used PR-AUC as your primary metric. But your stakeholders care about customer retention rate. Why not optimize directly for a business metric?"
Strong answer: "Great question. I chose PR-AUC as the model-selection metric because it can be computed offline from held-out labels and scores, which business metrics like retention rate cannot - retention depends on the intervention strategy, not just the model. PR-AUC itself is not differentiable, so the model trains on a log-loss objective and I use PR-AUC to compare models and tune hyperparameters. However, I evaluate the business impact separately: at our chosen operating point of 30% precision, we flag 340 customers per month, of whom roughly 100 are true churners, and assuming a 30% save rate from proactive outreach, that translates to roughly 30 saved customers per month. I would propose an A/B test to validate this save rate before making retention claims. If we find the save rate differs by risk score tier, we could optimize the threshold per tier."
Weak answer: "PR-AUC is the standard metric for imbalanced classification."
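The arithmetic behind the strong answer is worth being able to reproduce on a whiteboard. A minimal sketch using the figures quoted on this page (340 flagged per month, 30% precision, 30% save rate, $50 per intervention); `value_per_saved_customer` is a hypothetical placeholder, since the revenue retained per saved customer is not stated here.

```python
from typing import Dict


def intervention_economics(
    flagged_per_month: int,
    precision: float,
    save_rate: float,
    cost_per_intervention: float,
    value_per_saved_customer: float,
) -> Dict[str, float]:
    """Translate an operating point into monthly cost and value estimates."""
    churners_reached = flagged_per_month * precision  # flagged customers who would churn
    customers_saved = churners_reached * save_rate    # of those, how many outreach retains
    monthly_cost = flagged_per_month * cost_per_intervention
    monthly_value = customers_saved * value_per_saved_customer
    return {
        "churners_reached": churners_reached,
        "customers_saved": customers_saved,
        "monthly_cost": monthly_cost,
        "monthly_value": monthly_value,
        "monthly_net": monthly_value - monthly_cost,
    }


economics = intervention_economics(
    flagged_per_month=340,
    precision=0.30,
    save_rate=0.30,
    cost_per_intervention=50.0,
    value_per_saved_customer=1_000.0,  # hypothetical placeholder, not from the analysis
)
for name, value in economics.items():
    print(f"{name}: {value:,.1f}")
```

Note that precision and save rate compound: about 102 of the 340 flagged customers are true churners, and about 31 of those are retained. The save rate is the least certain input, which is why the answer proposes validating it with an A/B test.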
Question (Extension): "How would you deploy this model in production?"
Strong answer: "I would break this into three phases. First, batch scoring: run the model weekly on a snapshot of current customer features, write risk scores to a database table, and have the CS team triage the top decile. This is the fastest path to value and lets us validate the model's utility before investing in infrastructure. Second, once we have validated the model drives retention, I would move to a daily batch pipeline - an Airflow DAG that computes features from the data warehouse, runs inference, and writes scores to an API-accessible store. Third, if we find that real-time signals like 'customer currently on cancellation page' are predictive, we would build a streaming feature pipeline and serve the model behind a low-latency API. Each phase has a clear ROI gate before proceeding."
Weak answer: "I would put it in a Docker container and deploy to Kubernetes."
Question (Stress Test): "What happens to your model when the product adds a major new feature that changes user engagement patterns?"
Strong answer: "This is a concept drift scenario - the relationship between my features and churn changes because user behavior changes. My login_velocity feature, which is the strongest predictor, would be most affected. Short-term, I would add monitoring: track the distribution of risk scores weekly and alert if the mean or variance shifts by more than two standard deviations. Medium-term, I would implement rolling retraining on a 90-day window, so the model adapts to new behavioral patterns within a quarter. Long-term, I would add product-specific features - engagement with the new feature specifically - which requires coordination with the product team to instrument the right events."
The follow-up Q&A is where I separate candidates who memorized solutions from candidates who think in frameworks. A candidate who says "I would retrain the model" in response to a drift question gets a neutral score. A candidate who distinguishes between covariate drift and concept drift, proposes monitoring, and suggests both short-term and long-term mitigations gets a strong hire. The depth of the answer matters more than the specific solution.
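The short-term monitoring step in the drift answer can be sketched as a simple z-score check. This is one interpretation of "shifts by more than two standard deviations": it compares the current weekly mean risk score against the history of previous weekly means. The function name, the threshold default, and the sample values are assumptions for illustration.

```python
from statistics import mean, stdev
from typing import Sequence


def score_shift_alert(
    reference_weekly_means: Sequence[float],
    current_weekly_mean: float,
    z_threshold: float = 2.0,
) -> bool:
    """Return True when the current weekly mean score is more than
    `z_threshold` standard deviations from the historical weekly means."""
    if len(reference_weekly_means) < 2:
        raise ValueError("need at least two reference weeks to estimate spread")
    ref_mean = mean(reference_weekly_means)
    ref_std = stdev(reference_weekly_means)
    if ref_std == 0:
        # No historical variation: any change at all counts as a shift.
        return current_weekly_mean != ref_mean
    z_score = (current_weekly_mean - ref_mean) / ref_std
    return abs(z_score) > z_threshold


# Hypothetical history of weekly mean risk scores.
history = [0.080, 0.082, 0.079, 0.081, 0.078, 0.080]
stable_week = score_shift_alert(history, current_weekly_mean=0.081)
shifted_week = score_shift_alert(history, current_weekly_mean=0.095)
print(f"Stable week alert: {stable_week}, shifted week alert: {shifted_week}")
```

A score-distribution alert like this flags that something changed, not what changed. Separating covariate drift from concept drift still requires comparing feature distributions and, once labels arrive, realized performance.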
Phrases That Hurt vs. Help
| Hurts | Helps |
|---|---|
| "I did not have time for that" | "Given more time, I would prioritize X because..." |
| "I do not know" (full stop) | "I have not implemented that, but my approach would be..." |
| "That is a good point, I did not think of that" | "That is a valid concern. The impact would be X, and I would address it by..." |
| "I just used the default parameters" | "I started with defaults as a baseline, then tuned the three most impactful parameters" |
| "The model is pretty good" | "The model achieves X, which represents a Y% improvement over the baseline, but underperforms on Z segment" |
Part 8 -- Adapting to Company Culture
Write-Up Styles by Company Type
Different companies value different aspects of your write-up. Adjust your emphasis accordingly.
| Company Type | Emphasis | Write-Up Style | Example |
|---|---|---|---|
| FAANG / Big Tech | Rigor, scalability, metrics | Formal, metric-heavy, production-aware | Google, Meta, Amazon |
| Growth-Stage Startup | Business impact, speed, pragmatism | Concise, action-oriented, ROI-focused | Stripe, Notion, Figma |
| Research Lab | Novelty, depth, ablation studies | Academic, thorough, with ablations | OpenAI, DeepMind, Anthropic |
| Consulting / Analytics | Storytelling, stakeholder communication | Narrative, polished visualizations, executive-friendly | McKinsey QuantumBlack |
| Fintech / Healthcare | Regulatory awareness, interpretability | Cautious, explainability-focused, bias-aware | Two Sigma, Tempus |
Tailoring the Executive Summary
For a FAANG role:
The final model achieves PR-AUC of 0.431 +/- 0.012 (5-fold stratified CV),
representing a 5.4x improvement over the random baseline. Feature ablation
shows that removing engagement velocity features reduces PR-AUC by 31%,
confirming they are the primary predictive signal. At scale, the model
scores 100K customers in < 2 seconds on a single CPU core, meeting the
latency requirements for daily batch scoring.
For a startup role:
The model identifies 62% of likely churners in the top risk decile,
enabling proactive outreach to the highest-risk customers. At our
recommended operating threshold, each month we would flag ~340 customers
for CSM intervention at a cost of ~$17K in CSM time, with an expected
return of ~$180K in retained revenue (assuming a conservative 30% save
rate). Recommended first step: A/B test proactive outreach on model-flagged
vs. randomly selected at-risk customers to validate the save rate.
For a research lab role:
We compare four model families (logistic regression, random forest,
gradient boosted trees, and neural network) on the churn prediction task,
evaluating each under stratified 5-fold CV with PR-AUC as the primary
metric. Gradient boosted trees (LightGBM) achieve the best performance
(0.431 +/- 0.012), followed by RF (0.382 +/- 0.019). Ablation over
feature groups reveals that temporal engagement features contribute 78%
of the total predictive signal, while demographic features contribute
< 2%. Analysis of calibration curves shows that LightGBM's probability
outputs are well-calibrated in the 0.1-0.5 range but overconfident above
0.5, suggesting Platt scaling would improve deployment utility.
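A calibration claim like the one in the research-lab summary rests on a reliability table: predictions are binned by predicted probability, and each bin's mean prediction is compared with its observed positive rate. The sketch below shows only that diagnostic, not the recalibration step; the function name, equal-width binning, and toy data are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

# (bin_low, bin_high, count, mean_predicted, observed_rate)
Row = Tuple[float, float, int, float, float]


def reliability_table(
    y_true: Sequence[int], y_prob: Sequence[float], n_bins: int = 5
) -> List[Row]:
    """Compare mean predicted probability with observed positive rate per bin."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for label, prob in zip(y_true, y_prob):
        # Equal-width bins; a probability of exactly 1.0 goes in the last bin.
        index = min(int(prob * n_bins), n_bins - 1)
        bins[index].append((prob, label))
    rows: List[Row] = []
    for i, items in enumerate(bins):
        if not items:
            continue  # skip empty bins rather than report undefined rates
        mean_predicted = sum(prob for prob, _ in items) / len(items)
        observed_rate = sum(label for _, label in items) / len(items)
        rows.append((i / n_bins, (i + 1) / n_bins, len(items), mean_predicted, observed_rate))
    return rows


# Hypothetical toy predictions: the high-score bin is overconfident.
y_prob = [0.15, 0.15, 0.15, 0.15, 0.85, 0.85, 0.85, 0.85]
y_true = [0, 0, 0, 1, 1, 1, 0, 0]
rows = reliability_table(y_true, y_prob)
for low, high, count, predicted, observed in rows:
    print(f"[{low:.1f}, {high:.1f}): n={count}, predicted={predicted:.2f}, observed={observed:.2f}")
```

A bin where the mean prediction sits well above the observed rate is overconfident, which is the pattern that motivates a recalibration step such as Platt scaling.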
Practice Problems
Problem 1: Write an Executive Summary
You completed a take-home for a recommendation system. Key facts:
- Task: Predict which products a user will purchase next
- Dataset: 500K users, 10K products, 12 months of purchase history
- Best model: Matrix factorization + LightGBM hybrid, Recall@10 = 0.23
- Baseline: Popularity-based, Recall@10 = 0.09
- Key insight: Recency-weighted purchase history outperforms raw frequency
- Deployment consideration: Must update recommendations daily for 500K users
Write a two-paragraph executive summary.
Hint 1 -- Direction
Paragraph 1: Problem, business context, approach. Paragraph 2: Results vs. baseline, key insight, deployment feasibility.
Hint 2 -- Key Elements to Include
- Business context: why recommendations matter (revenue per user, conversion rate)
- Quantify the improvement: 0.23 vs. 0.09 is a 2.6x improvement
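- Translate Recall@10 into business language: "top 10 recommendations contain at least one actual purchase for 23% of users" (strictly, this reading holds when each user has a single held-out purchase; otherwise Recall@10 is the share of purchased items recovered)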
- Address the daily update requirement with a concrete plan
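When translating a ranking metric into plain language, it helps to be precise about what the metric counts. The sketch below contrasts Recall@k (share of a user's purchased items that appear in the top k) with hit rate@k (whether at least one does). The function names and toy users are assumptions for illustration.

```python
from typing import List, Sequence, Set


def recall_at_k(recommended: Sequence[str], purchased: Set[str], k: int = 10) -> float:
    """Fraction of the user's purchased items that appear in the top-k list."""
    if not purchased:
        raise ValueError("user has no held-out purchases")
    top_k = set(recommended[:k])
    return len(top_k & purchased) / len(purchased)


def hit_at_k(recommended: Sequence[str], purchased: Set[str], k: int = 10) -> bool:
    """True when at least one purchased item appears in the top-k list."""
    return bool(set(recommended[:k]) & purchased)


# Hypothetical users: (ranked recommendations, held-out purchases).
users = [
    (["a", "b", "c"], {"a", "x"}),  # one of two purchases recovered
    (["d", "e", "f"], {"y"}),       # nothing recovered
]
recalls: List[float] = [recall_at_k(recs, bought, k=3) for recs, bought in users]
hits: List[bool] = [hit_at_k(recs, bought, k=3) for recs, bought in users]

mean_recall = sum(recalls) / len(recalls)
hit_rate = sum(hits) / len(hits)
print(f"Recall@3 = {mean_recall:.2f}, hit rate@3 = {hit_rate:.2f}")
```

The two numbers coincide only when every user has exactly one held-out purchase, so state the evaluation setup before writing the "X% of users" sentence.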
Hint 3 -- Strong Example
"This analysis develops a product recommendation system for a catalog of 10K items, using 12 months of purchase history from 500K users. Effective recommendations directly impact revenue through increased conversion rates and average order values. I developed a hybrid approach combining matrix factorization embeddings (capturing latent user-product affinities) with a LightGBM ranker (incorporating recency signals and contextual features), evaluated using Recall@10 to measure whether actual purchased items appear in the top-10 recommendations.
The hybrid model achieves Recall@10 of 0.23, a 2.6x improvement over the popularity baseline (0.09). In business terms, 23% of users would see at least one product they actually purchased in their top-10 recommendations, compared to 9% with a simple 'most popular items' approach. The key modeling insight is that recency-weighted purchase history (exponential decay with a 30-day half-life) outperforms raw purchase frequency by 18% in recall, suggesting that recent behavior is far more predictive of near-term purchases than lifetime history. For deployment, the embedding computation (batch matrix factorization) requires ~45 minutes, and the LightGBM scoring runs in ~8 minutes for 500K users, well within a nightly batch window."
Scoring Rubric:
- Strong Hire: Includes business context, quantifies improvement vs. baseline, translates metric into plain language, addresses deployment feasibility with specific numbers. Both paragraphs are self-contained and actionable.
- Lean Hire: Mentions key results but lacks business translation or deployment discussion.
- No Hire: Lists technical details without context ("I used matrix factorization, Recall@10 = 0.23").
Problem 2: Create a Presentation Outline
You have 12 minutes to present a fraud detection take-home. Key facts:
- 1M transactions, 0.3% fraud rate
- Compared LR, RF, XGBoost, neural network
- XGBoost won (PR-AUC 0.72)
- Key features: transaction velocity, device fingerprint mismatch, amount deviation
- Error analysis: model misses sophisticated fraud patterns (account takeover)
- Next step: graph features from transaction networks
Create a slide-by-slide outline with timing.
Hint 1 -- Direction
12 minutes = about 6-7 slides at 2 minutes each. Allocate time to results and error analysis, not EDA.
Hint 2 -- Time Allocation
- Problem + Approach: 2 min (1-2 slides)
- Feature engineering + Model selection: 3 min (2 slides)
- Results: 3 min (1-2 slides)
- Error analysis: 2 min (1 slide)
- Next steps: 2 min (1 slide)
Hint 3 -- Full Outline
Slide 1 (1 min): Problem Context
- Fraud detection: 0.3% fraud rate in 1M transactions
- Business: even 0.1% false negative improvement = significant loss prevention
- Metric choice: PR-AUC (not ROC-AUC) because of extreme class imbalance
- Evaluation focus: precision at high recall thresholds (catch fraud without blocking legitimate users)
Slide 2 (2 min): Feature Engineering
- Three feature categories: velocity (transactions per hour), identity (device/IP mismatch), amount (deviation from user's historical pattern)
- Show feature importance chart (top 5)
- Key insight: transaction velocity in last 1 hour is 4x more predictive than transaction amount
Slide 3 (2 min): Model Comparison
- Table: LR, RF, XGBoost, NN with PR-AUC, inference time, training time
- XGBoost wins: PR-AUC 0.72 vs NN 0.69 (comparable performance, 10x faster inference)
- Why not NN: it scores slightly lower (0.69 vs 0.72) and is roughly 10x slower at inference, which rules it out for real-time scoring
Slide 4 (2 min): Results Deep Dive
- PR curve with operating point annotated
- At 80% precision: recall = 0.58 (catches 58% of fraud)
- At 95% precision: recall = 0.31 (for automated blocking)
- Business translation: two operating modes - alert (high recall) and auto-block (high precision)
Slide 5 (2 min): Error Analysis
- Model misses account takeover fraud (legitimate device, unusual behavior)
- Confusion matrix segmented by fraud type
- 78% of false negatives are account takeover (vs. stolen card)
- Current features focus on device/velocity but miss behavioral anomalies
Slide 6 (2 min): Next Steps
- Short-term: add graph features (transaction network, merchant connections)
- Medium-term: sequence model (LSTM on transaction sequences) for behavioral patterns
- Monitoring: track precision/recall weekly, retrain monthly
- A/B test: deploy alongside rule-based system, measure incremental catch rate
Slide 7 (1 min): Summary
- XGBoost + velocity features achieve PR-AUC 0.72 on fraud detection
- Two operating modes for different use cases
- Key gap: account takeover fraud requires graph and sequence features
- Recommended deployment: batch scoring for alerts, real-time scoring later
Scoring Rubric:
- Strong Hire: Presentation has clear flow, spends majority of time on results and analysis (not EDA), includes error analysis with specific failure modes, has quantified next steps. Timing is realistic.
- Lean Hire: Covers main results but error analysis is thin and next steps are generic.
- No Hire: Spends 5+ minutes on EDA and data description, leaving 2 minutes for results.
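The "recall at 80% precision" and "recall at 95% precision" figures on Slide 4 are operating points read off the PR curve. A minimal sketch of how such a number could be computed is below, assuming you have true labels and model scores; the function name and the ten-row toy dataset are hypothetical and unrelated to the fraud numbers in the outline.

```python
def recall_at_precision(y_true, scores, target_precision):
    """Highest recall over all score thresholds whose precision >= target.

    Returns (recall, threshold), or (0.0, None) if no threshold qualifies.
    A row is flagged when its score is >= threshold.
    """
    total_pos = sum(y_true)
    if total_pos == 0:
        return (0.0, None)

    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    best = (0.0, None)
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every row tied at this score before evaluating the threshold.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        precision = tp / (tp + fp)
        recall = tp / total_pos
        if precision >= target_precision and recall > best[0]:
            best = (recall, threshold)
    return best


# Toy data: 10 transactions, 4 of them fraud (label 1).
y_true = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]

print(recall_at_precision(y_true, scores, 0.95))  # (0.5, 0.9)   auto-block mode
print(recall_at_precision(y_true, scores, 0.75))  # (0.75, 0.7)  alert mode
```

Reporting the threshold alongside the recall gives the evaluator something deployable: a concrete cutoff for each operating mode rather than a point on a chart.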
Problem 3: Handle These Questions
For each question, write a two-sentence answer that would satisfy a senior ML engineer.
- "Your model has a high AUC but low precision. Is that a problem?"
- "You used LightGBM. Have you considered a neural network?"
- "Your feature importance shows login_velocity is dominant. Is that a concern?"
- "What happens if the data distribution shifts next quarter?"
- "You only had 6 hours. What did you intentionally skip?"
Hint 1 -- Direction
Each answer should: (1) directly address the concern, and (2) demonstrate awareness of the tradeoff involved.
Hint 2 -- Key Principles
- AUC vs precision: depends on the operating point and cost asymmetry
- LightGBM vs NN: justify based on data size, interpretability, and training time
- Dominant feature: discuss single-point-of-failure risk and robustness
- Distribution shift: discuss monitoring, retraining, and robustness strategies
- Intentional skips: frame as prioritization, not omission
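For the distribution-shift question, it helps to have a concrete monitoring check in mind. The sketch below compares a current feature sample against a reference sample using KL divergence over shared histogram bins. It is an illustration under stated assumptions: the bin edges, the smoothing constant, the 0.1 alert threshold, and all function names are hypothetical and would need tuning per feature.

```python
import math
from bisect import bisect_right


def bin_shares(values, edges, eps=1e-6):
    """Smoothed share of values per bin; out-of-range values go to the end bins."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = min(max(bisect_right(edges, v) - 1, 0), len(counts) - 1)
        counts[idx] += 1
    smoothed = [c + eps for c in counts]  # avoid log(0) and division by zero
    total = sum(smoothed)
    return [s / total for s in smoothed]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def drifted(reference, current, edges, threshold=0.1):
    """Return (alert, kl), where kl = KL(current || reference)."""
    kl = kl_divergence(bin_shares(current, edges), bin_shares(reference, edges))
    return kl > threshold, kl


edges = [0, 1, 2, 3, 4]
reference = [0.5, 1.5, 2.5, 3.5] * 25  # training-time sample, uniform over bins
shifted = [3.5] * 100                  # hypothetical week where all mass moved to the top bin

print(drifted(reference, reference, edges))  # (False, 0.0)
print(drifted(reference, shifted, edges))    # alert, with KL close to ln(4)
```

KL divergence is asymmetric, so state which direction you compute. Here it is current against reference, which penalizes new mass appearing where the training data was sparse.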
Hint 3 -- Strong Answers
- "High AUC with low precision is expected in imbalanced classification - it means the model ranks positives above negatives well but the raw threshold needs calibration. I would adjust the decision threshold to match the business's cost asymmetry between false positives and false negatives, and report precision-recall at the specific operating point rather than at the default 0.5 threshold."
- "I considered a neural network and ran a preliminary comparison - it achieved comparable PR-AUC (0.42 vs 0.43) but required 8x longer training time, which limited my ability to iterate on hyperparameters and features within the time constraint. For tabular data of this size (20K samples, 25 features), gradient-boosted trees typically match or exceed neural networks, and the interpretability advantage (feature importance, SHAP values) made LightGBM the pragmatic choice."
- "A dominant feature is both a strength and a risk - it means we have found a strong signal, but the model is fragile if that signal degrades or becomes unavailable. I would run an ablation study removing login_velocity to measure the performance drop, and if the drop exceeds 20%, I would invest in finding alternative engagement signals that capture similar information through different data sources."
- "Distribution shift is the primary production risk for this model - I would implement weekly monitoring of feature distributions and prediction score distributions, with automated alerts when KL divergence exceeds a threshold. Additionally, I would set up rolling retraining on a 90-day window so the model adapts to gradual drift, with a manual retraining trigger for sudden shifts like a product launch or market event."
- "I intentionally skipped deep hyperparameter tuning, neural network architectures, and SHAP-based interpretability analysis. I prioritized feature engineering and model comparison because, in my experience, the feature set determines 80% of model performance while hyperparameters contribute at most 5-10%, so the time was better spent building strong features and validating with proper cross-validation."
Interview Cheat Sheet
| Concept | Key Practice | One-Liner | Red Flag |
|---|---|---|---|
| Executive summary | Two paragraphs: approach + results with business impact | The evaluator should understand everything from this alone | No summary, or a summary that says "I used LightGBM" |
| Visualizations | Title states the takeaway, not the chart type | Every figure answers a specific question | Correlation heatmap of 40 features, unlabeled axes |
| Methodology | Every decision has a rationale and alternative | "I chose X over Y because Z" | "I used X" without explanation |
| Baseline comparison | Every metric is compared to a meaningful baseline | Results without context are meaningless | "AUC of 0.91" with no baseline |
| Error analysis | Show where the model fails, not just where it succeeds | Error analysis separates good from great candidates | Only showing aggregate metrics |
| Next steps | Concrete, prioritized, forward-looking | Shows awareness that this is a starting point | "I would get more data" as the only next step |
| Presentation | 60% results/analysis, 20% method, 20% next steps | Do not narrate EDA - narrate decisions | 5 minutes of EDA, 2 minutes of results |
| Q&A handling | Acknowledge, analyze, address, tradeoff | Frame limitations as next steps, not failures | "I did not have time" or "I do not know" (full stop) |
| Business translation | Translate metrics into decisions and dollars | "Top decile captures X% of churners" | "PR-AUC is 0.43" without interpretation |
| Appendix | Supporting detail for the deep reader | Shows depth without cluttering the narrative | No supporting detail, or 20-page appendix |
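The "Business translation" row above suggests the sentence "Top decile captures X% of churners". A minimal sketch of how X could be computed from labels and scores follows; the function name and the 20-customer toy data are hypothetical.

```python
def top_decile_capture(y_true, scores):
    """Share of all positives that fall in the top 10% of rows by score."""
    n_top = max(1, len(scores) // 10)
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    positives_in_top = sum(label for _, label in ranked[:n_top])
    total_positives = sum(y_true)
    return positives_in_top / total_positives if total_positives else 0.0


# Toy data: 20 customers, 4 churners (label 1), scores strictly decreasing.
scores = [i / 20 for i in range(20, 0, -1)]
y_true = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

capture = top_decile_capture(y_true, scores)
print(f"Top decile captures {capture:.0%} of churners")  # Top decile captures 50% of churners
```

With a 20% base rate in this toy set, contacting a random 10% of customers would reach about 10% of churners, so the sentence can be extended to "5x better than random targeting", which is the comparison a business reader actually needs.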
Spaced Repetition Checkpoints
Day 0 -- Initial Learning
- Read this entire page
- Rewrite the executive summary of a past project using the two-paragraph formula
- Audit the visualizations in a past project - does each one answer a specific question?
- Complete the self-assessment
Day 3 -- First Recall
- Without looking, list the five sections of a write-up
- Write the "So What?" version of three technical results from your experience
- Practice the STAR-T framework on one follow-up question out loud
Day 7 -- Practice
- Do Practice Problem 1 (executive summary) without looking at hints
- Create a presentation outline for a past project (timed: 10 minutes)
- Answer the five questions from Problem 3 out loud, timed (2 minutes each)
Day 14 -- Application
- Do a full mock take-home write-up with all five sections (timed: 2 hours)
- Present it to a friend or mentor in 12 minutes
- Have them ask 5 follow-up questions and practice the STAR-T framework
Day 21 -- Mock Interview
- Present a take-home to someone unfamiliar with the problem
- Time the presentation (must be under 15 minutes)
- Ask them to evaluate: "Could you follow my reasoning without running the code?"
- Iterate on weak areas
Key Takeaways
- The write-up is your interview before the interview. Most evaluators form their opinion from the write-up alone, before any code is run. A clear, structured write-up with an actionable executive summary gets you to the on-site. A notebook dump does not.
- Every decision needs a rationale. "I used LightGBM" is a description. "I chose LightGBM over Random Forest because it achieved 12% higher PR-AUC with 4x faster training, enabling more iteration in the time constraint" is a rationale. Evaluators hire people who can explain their reasoning, not people who can call sklearn functions.
- Visualizations are arguments, not decorations. Each figure should have a takeaway in its title, not a description. "Login velocity is 3x more predictive than demographics" is a finding. "Feature Importance Plot" is a label. The evaluator should understand your key findings by reading only the figure titles.
- The follow-up Q&A is where offers are won. Preparing for five categories of questions - clarification, challenge, extension, depth, and stress test - means you are never caught off guard. Acknowledge the question, analyze the tradeoff, address the concern, and state what you would do next.
- Adapt to the audience. A startup VP wants business impact and deployment feasibility. A Google staff engineer wants rigorous metrics and scalability analysis. A research scientist wants ablation studies and methodological depth. One write-up format does not fit all.
