The Write-Up - Turning Analysis Into a Hiring Decision

Reading time: ~40 min | Interview relevance: Critical | Roles: MLE, AI Eng, Data Scientist, Research Engineer

The Real Interview Moment

You are a hiring manager at Stripe. You have eight take-home submissions on your desk. It is 9 PM, and you have a hiring committee meeting at 9 AM tomorrow. You cannot run any code tonight - you can only read. You open the first submission: a raw Jupyter notebook with 60 code cells, no markdown, and a single print(classification_report(y_test, y_pred)) at the bottom. You spend four minutes scrolling, cannot find the punchline, and move on. You open the second submission: a clean write-up with a two-paragraph executive summary, three well-labeled figures, a methodology section that explains every decision, and a "Next Steps" section showing what the candidate would do with more time. You understand the entire analysis in three minutes. You write "advance to on-site" and move to the next submission.

The difference between these two candidates was not the quality of their models. It was the quality of their communication. The first candidate may have done better analysis, but you will never know - they did not make it readable. This page teaches you how to write take-home results that get read, understood, and remembered.

What You Will Master

Structure a write-up that guides the evaluator from problem to conclusion in under five minutes
Write an executive summary that captures the key insight in two paragraphs
Build visualizations that answer questions instead of decorating pages
Create a technical appendix that demonstrates depth without cluttering the main narrative
Prepare a follow-up presentation that summarizes your work in 10-15 minutes
Handle follow-up questions with structured, confident responses
Adapt your write-up style for different company cultures and evaluator types

Self-Assessment: Where Are You Now?

Skill	1 -- Cannot	2 -- Vaguely	3 -- Can Do	4 -- Consistently	5 -- Can Teach	Your Score
Write a two-paragraph executive summary						___
Structure a write-up with clear sections						___
Create visualizations that support decisions						___
Explain methodology decisions concisely						___
Write a technical appendix						___
Present results in 10-15 minutes						___
Handle adversarial questions calmly						___
Adapt communication style to the audience						___

Target: All 4s and 5s before your interview.

Part 1 -- The Write-Up Structure

The Five-Section Framework

Every take-home write-up should follow a structure that mirrors how the evaluator thinks. They want to know, in order: What did you do? Why does it matter? How did you do it? What did you find? What would you do next?

Figure: Write-Up Structure - Five-Section Framework from Executive Summary to Next Steps

The Two-Speed Reading Principle

Your write-up will be read at two speeds:

Speed read (2-3 minutes): The evaluator reads only the executive summary, looks at the figures, and skims the results table. This is how most evaluators form their initial impression.
Deep read (15-30 minutes): If the speed read is compelling, the evaluator reads the full methodology and may run your code. This is where they validate their initial impression.

Design your write-up so that both reads are satisfying. The speed read should convey the complete story. The deep read should reinforce it with rigor.

60-Second Answer

"I structure my write-ups with five sections: executive summary, problem understanding, methodology, results, and next steps. The executive summary is two paragraphs - the first states the problem and approach, the second states the key results and their business implications. Every figure has a clear takeaway in its title. The methodology section explains not just what I did but why I made each decision. The next steps section shows that I understand this is a starting point, not a finished product."

Part 2 -- The Executive Summary

The Two-Paragraph Formula

The executive summary is the most important part of your write-up. Many evaluators read only this section. It must be self-contained.

Paragraph 1: Problem and Approach

One sentence: What is the problem?
One sentence: What is the business context or impact?
One sentence: What was your approach at a high level?

Paragraph 2: Results and Implications

One sentence: What are the key quantitative results?
One sentence: How do they compare to a baseline?
One sentence: What is the actionable takeaway?

Example: Strong Executive Summary

## Executive Summary

This analysis addresses the problem of predicting customer churn for a
subscription-based SaaS product. With an 8% monthly churn rate costing an
estimated $2.4M annually in lost revenue, even modest improvements in
early identification can drive significant retention savings. I developed
a LightGBM classifier trained on RFM features, engagement velocity
metrics, and usage pattern features, evaluated using precision-recall AUC
to account for the severe class imbalance.

The final model achieves a PR-AUC of 0.43 (5.4x improvement over the
0.08 random baseline), identifying 62% of churners in the top decile of
risk scores. At a 30% precision threshold - where each intervention
costs roughly $50 in CSM time - the model would flag approximately 340
customers per month, of which ~100 would actually churn, yielding an
estimated $180K annual savings assuming a 30% save rate. Key predictive
signals are declining login frequency (past 14 days vs. prior 14 days),
days since last support ticket resolution, and contract renewal proximity.

Example: Weak Executive Summary

## Summary

I used LightGBM to predict churn. I tried several models including
random forest and logistic regression. LightGBM performed the best.
The AUC was 0.91. I used 5-fold cross-validation.

Common Trap

Do not report only ROC-AUC for imbalanced classification problems. An AUC of 0.91 sounds impressive but means nothing if the evaluator cannot translate it into a business decision. Always pair statistical metrics with business-interpretable metrics: "identifies X% of churners in the top Y decile" or "at Z% precision, we would flag N customers per month."
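
One way to produce the business-interpretable numbers described above is sketched here. It is a minimal illustration, not the method behind the example summary: the function names and the toy data are hypothetical, and ties between scores are ignored.

```python
import numpy as np


def top_decile_capture(y_true, y_scores) -> float:
    """Share of all positives that land in the top 10% of risk scores."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(y_scores, dtype=float))  # highest score first
    n_top = max(1, len(y_true) // 10)
    return float(y_true[order[:n_top]].sum() / y_true.sum())


def flag_at_precision(y_true, y_scores, target_precision: float = 0.30) -> dict:
    """Largest top-k group whose precision still meets the target."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(y_scores, dtype=float))
    true_positives = np.cumsum(y_true[order])   # positives among the top k
    n_flagged = np.arange(1, len(y_true) + 1)   # k = 1..n
    precision = true_positives / n_flagged
    meets_target = np.flatnonzero(precision >= target_precision)
    if meets_target.size == 0:
        raise ValueError("No cutoff reaches the target precision.")
    i = meets_target[-1]  # largest k, i.e. highest recall at the target
    return {
        "n_flagged": int(n_flagged[i]),
        "precision": float(precision[i]),
        "recall": float(true_positives[i] / y_true.sum()),
    }


# Toy data (hypothetical): 10 customers, listed by descending score.
y_true = [1, 1, 0, 1, 0, 0, 0, 0, 0, 1]
y_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

capture = top_decile_capture(y_true, y_scores)
point = flag_at_precision(y_true, y_scores, target_precision=0.5)
print(f"Top decile captures {capture:.0%} of churners")
print(f"At {point['precision']:.0%} precision we flag {point['n_flagged']} "
      f"customers and reach {point['recall']:.0%} recall")
```

Both numbers can go straight into an executive summary sentence, next to the PR-AUC they accompany.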

The "So What?" Test

After writing your executive summary, read it and ask: "If I were a VP of Product, would I know what to do with this information?" If the answer is no, rewrite it. Technical correctness without actionability is a missed opportunity.
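
The arithmetic behind an actionable summary is simple enough to script, which also makes it easy to rerun when an assumption changes. The sketch below chains the figures from the strong example (340 flagged per month, 30% precision, 30% save rate, $50 per intervention). The class and property names are hypothetical, and revenue per saved customer is left out because the example does not state it.

```python
from dataclasses import dataclass


@dataclass
class OperatingPoint:
    flagged_per_month: int
    precision: float              # share of flagged customers who would churn
    save_rate: float              # share of reached churners who are retained
    cost_per_intervention: float  # dollars of CSM time per flagged customer

    @property
    def true_churners(self) -> float:
        return self.flagged_per_month * self.precision

    @property
    def saved_customers(self) -> float:
        return self.true_churners * self.save_rate

    @property
    def monthly_cost(self) -> float:
        return self.flagged_per_month * self.cost_per_intervention


op = OperatingPoint(
    flagged_per_month=340, precision=0.30, save_rate=0.30, cost_per_intervention=50.0
)
print(f"Expected churners among flagged: {op.true_churners:.0f}")
print(f"Expected saved customers per month: {op.saved_customers:.0f}")
print(f"Monthly intervention cost: ${op.monthly_cost:,.0f}")
```

Stating each assumption as a named field makes it obvious which number the evaluator should challenge first, usually the save rate.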

Figure: The "So What?" Test - From Technical Result to Business Context to Action to Impact

Part 3 -- Visualizations That Communicate

The Purpose-Driven Visualization Framework

Every figure in your write-up should answer a specific question. If you cannot state the question a figure answers, delete it.

Question	Visualization	Example Title
What does the data look like?	Distribution plots, class balance bar	"Target class distribution: 8% churn rate creates a 12:1 imbalance"
Which features matter?	Feature importance bar chart	"Top 10 features: engagement velocity dominates, demographics contribute little"
How well does the model perform?	PR curve, calibration plot	"Precision-recall tradeoff: 62% recall at 30% precision threshold"
Where does the model fail?	Confusion matrix, error analysis	"False negatives cluster in recently onboarded users (< 30 days)"
How do models compare?	Grouped bar chart, comparison table	"LightGBM outperforms random forest by 12% PR-AUC across all folds"

The Four Rules of Take-Home Figures

Rule 1: Title is the takeaway, not the description.

# BAD - describes the chart
fig.suptitle("Feature Importance Plot")

# GOOD - states the finding
fig.suptitle(
    "Login frequency decline is 3x more predictive than any demographic feature"
)

Rule 2: Label everything.

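from typing import Optional

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import average_precision_score, precision_recall_curve


def plot_precision_recall_curve(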
    y_true: pd.Series,
    y_scores: np.ndarray,
    model_name: str = "LightGBM",
    save_path: Optional[str] = None,
) -> None:
    """Plot PR curve with baseline and operating point annotation."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    pr_auc = average_precision_score(y_true, y_scores)
    baseline = y_true.mean()

    fig, ax = plt.subplots(figsize=(8, 6))

    # Main curve
    ax.plot(recall, precision, linewidth=2, label=f"{model_name} (PR-AUC={pr_auc:.3f})")

    # Baseline
    ax.axhline(y=baseline, color="red", linestyle="--",
               label=f"Random baseline ({baseline:.3f})")

    # Operating point annotation
    target_precision = 0.30
    idx = np.argmin(np.abs(precision[:-1] - target_precision))
    ax.plot(recall[idx], precision[idx], "ko", markersize=10)
    ax.annotate(
        f"Operating point\nPrecision={precision[idx]:.2f}, Recall={recall[idx]:.2f}",
        xy=(recall[idx], precision[idx]),
        xytext=(recall[idx] + 0.1, precision[idx] + 0.1),
        arrowprops=dict(arrowstyle="->"),
        fontsize=10,
        bbox=dict(boxstyle="round,pad=0.3", facecolor="wheat"),
    )

    ax.set_xlabel("Recall", fontsize=12)
    ax.set_ylabel("Precision", fontsize=12)
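    # Build the takeaway from the computed values so the title cannot
    # drift away from the data behind the plot.
    ax.set_title(
        f"Model identifies {recall[idx]:.0%} of churners at "
        f"{precision[idx]:.0%} precision\n"
        f"({pr_auc / baseline:.1f}x improvement over random baseline)",
        fontsize=13,
        fontweight="bold",
    )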
    ax.legend(fontsize=11)
    ax.set_xlim([0, 1])
    ax.set_ylim([0, 1])
    plt.tight_layout()

    if save_path:
        fig.savefig(save_path, dpi=150, bbox_inches="tight")

    plt.show()

Rule 3: Use consistent styling.

# Define a consistent color palette for your entire write-up
COLORS = {
    "primary": "#2563eb",
    "secondary": "#7c3aed",
    "positive": "#16a34a",
    "negative": "#dc2626",
    "neutral": "#6b7280",
    "baseline": "#dc2626",
}

# Apply consistent matplotlib style
plt.rcParams.update({
    "figure.figsize": (10, 6),
    "font.size": 11,
    "axes.titlesize": 13,
    "axes.labelsize": 12,
    "axes.grid": True,
    "grid.alpha": 0.3,
    "legend.fontsize": 10,
})

Rule 4: Less is more.

Include 4-6 figures maximum. Each one should earn its place.

Figure: Figure Selection Framework - Essential, Optional, and What to Avoid in Take-Home Write-Ups

Instant Rejection

Never include a default sns.heatmap(df.corr()) with 40+ features. It is unreadable, it does not answer a question, and it signals that you are filling space instead of thinking. If you must show correlations, show the top 10 feature pairs with a bar chart and explain why they matter.
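
A minimal sketch of the alternative suggested above: rank feature pairs by absolute Pearson correlation and keep only the strongest few. The function name and the toy DataFrame are hypothetical.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(df: pd.DataFrame, k: int = 10) -> pd.DataFrame:
    """Return the k feature pairs with the highest absolute correlation."""
    corr = df.select_dtypes("number").corr().abs()
    names = corr.columns
    # Strict upper triangle: each pair once, no self-correlations.
    rows, cols = np.triu_indices(len(names), k=1)
    pairs = pd.DataFrame({
        "feature_a": names[rows],
        "feature_b": names[cols],
        "abs_corr": corr.to_numpy()[rows, cols],
    })
    return (
        pairs.sort_values("abs_corr", ascending=False)
        .head(k)
        .reset_index(drop=True)
    )


# Toy data (hypothetical): b is a rescaled copy of a.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
})
pairs = top_correlated_pairs(toy, k=10)
print(pairs)
```

The resulting table feeds directly into a horizontal bar chart (for example `ax.barh` on a combined pair label), with a title that states why the top pairs matter.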

Visualization Templates for Common Scenarios

Model Comparison Table (better than a chart for 2-4 models)

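from typing import Dict

import pandas as pd


def create_model_comparison_table(
    results: Dict[str, Dict[str, float]],
) -> pd.DataFrame:
    """Create a model comparison table ranked by PR-AUC.

    Args:
        results: Dict mapping model names to metric dictionaries.
            Each metric dictionary must include a "pr_auc" key.

    Returns:
        DataFrame sorted by PR-AUC (best first) with a leading rank column.
    """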
    comparison = pd.DataFrame(results).T
    comparison = comparison.round(4)
    comparison = comparison.sort_values("pr_auc", ascending=False)

    # Add rank column
    comparison.insert(0, "rank", range(1, len(comparison) + 1))

    return comparison

# Example usage
results = {
    "LightGBM": {"roc_auc": 0.912, "pr_auc": 0.431, "f1": 0.387, "train_time_sec": 12},
    "Random Forest": {"roc_auc": 0.889, "pr_auc": 0.382, "f1": 0.341, "train_time_sec": 45},
    "Logistic Reg.": {"roc_auc": 0.834, "pr_auc": 0.276, "f1": 0.289, "train_time_sec": 2},
    "Baseline (majority)": {"roc_auc": 0.500, "pr_auc": 0.080, "f1": 0.000, "train_time_sec": 0},
}

comparison = create_model_comparison_table(results)

Error Analysis Plot

def plot_error_analysis(
    y_true: pd.Series,
    y_pred: np.ndarray,
    segment_col: pd.Series,
    segment_name: str = "Customer Segment",
) -> None:
    """Plot model performance broken down by a segment variable.

    This reveals where the model underperforms - critical for
    demonstrating analytical depth in a take-home. Pass predicted
    probabilities (not hard labels) as y_pred, because PR-AUC ranks scores.
    """
    df = pd.DataFrame({
        "y_true": y_true.values,
        "y_pred": y_pred,
        "segment": segment_col.values,
    })

    segment_metrics = []
    for segment in df["segment"].unique():
        mask = df["segment"] == segment
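        # Skip tiny segments and segments that lack one of the two classes,
        # where PR-AUC is unstable or undefined.
        if mask.sum() < 10 or df.loc[mask, "y_true"].nunique() < 2:
            continue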
        segment_metrics.append({
            "segment": segment,
            "n_samples": mask.sum(),
            "pr_auc": average_precision_score(
                df.loc[mask, "y_true"], df.loc[mask, "y_pred"]
            ),
            "churn_rate": df.loc[mask, "y_true"].mean(),
        })

    metrics_df = pd.DataFrame(segment_metrics).sort_values("pr_auc")

    fig, ax = plt.subplots(figsize=(10, 6))
    bars = ax.barh(metrics_df["segment"], metrics_df["pr_auc"], color=COLORS["primary"])

    # Color the worst-performing segments
    for i, bar in enumerate(bars):
        if metrics_df.iloc[i]["pr_auc"] < 0.3:
            bar.set_color(COLORS["negative"])

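    ax.set_xlabel("PR-AUC")
    ax.set_ylabel(segment_name)
    # Example takeaway title; replace it with the finding your data supports.
    ax.set_title(
        "Model underperforms on new customers (< 30 days) - "
        "insufficient behavioral history",
        fontsize=13,
        fontweight="bold",
    )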
    plt.tight_layout()
    plt.show()

Company Variation

Data science roles at consumer companies (Meta, Netflix, Spotify) weight visualization quality heavily - they want to see that you can communicate findings to product managers. ML engineering roles at infrastructure companies (Google, Amazon AWS) care less about pretty plots and more about rigorous evaluation methodology. Adjust your emphasis accordingly.

Part 4 -- The Methodology Section

Decisions, Not Descriptions

The methodology section is where most candidates lose points. They describe what they did but not why they did it. Every methodological choice should be accompanied by a rationale.

Figure: Methodology Section Pattern - What I Did, Why I Did It, What I Considered, Why I Rejected It

The Decision-Rationale Template

For each major decision, use this template:

Decision: [What you chose]
Rationale: [Why you chose it]
Alternative considered: [What else you thought about]
Why rejected: [Why the alternative was worse for this problem]
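
If you keep decisions as structured records while you work, the methodology section can be rendered from them at the end. The sketch below is one possible helper, assuming a hypothetical `Decision` class; the field values are taken from the example methodology in this section.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    title: str
    decision: str
    rationale: str
    alternative: str
    why_rejected: str

    def to_markdown(self) -> str:
        """Render the record in the four-part template."""
        return "\n".join([
            f"### {self.title}",
            "",
            f"**Decision:** {self.decision}",
            "",
            f"**Rationale:** {self.rationale}",
            "",
            f"**Alternative considered:** {self.alternative}",
            "",
            f"**Why rejected:** {self.why_rejected}",
        ])


model_choice = Decision(
    title="Model Selection",
    decision="LightGBM with 5-fold stratified cross-validation, optimized for PR-AUC.",
    rationale="12% higher PR-AUC than Random Forest (0.43 vs. 0.38) with 4x faster training.",
    alternative="Logistic regression with manual interaction terms.",
    why_rejected="Still underperformed LightGBM on PR-AUC.",
)
md = model_choice.to_markdown()
print(md)
```

Logging a record at the moment you make each choice means the rationale is written while you still remember the alternatives you rejected.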

Example: Methodology Section

### Feature Engineering

**Decision:** Engineered RFM (Recency, Frequency, Monetary) features
and engagement velocity metrics rather than using raw transactional data.

**Rationale:** Raw transactions are at the event level (avg. 47 per
customer), while our prediction target is at the customer level.
Aggregation is necessary, and RFM features are an established framework
for capturing customer behavior patterns in churn prediction.

**Key features (8 total):**
1. **recency_days** - Days since last transaction (captures disengagement)
2. **frequency_30d** - Transactions in last 30 days (captures current engagement)
3. **monetary_avg** - Average transaction value (captures customer tier)
4. **login_velocity** - (logins in last 14 days) / (logins in prior 14 days).
   Values < 1.0 indicate declining engagement. This feature has the highest
   importance in the final model (gain = 0.31).
5-8. Rolling aggregates at 7, 14, 30, 60-day windows for login counts.

**Alternative considered:** Including raw demographic features (age, location,
plan type). Rejected after initial analysis showed < 2% importance in a
preliminary Random Forest. Demographics are poor predictors of churn timing
in this dataset, likely because the customer base is homogeneous.

### Model Selection

**Decision:** LightGBM with 5-fold stratified cross-validation,
optimized for PR-AUC.

**Rationale:**
- **LightGBM over Random Forest:** 12% higher PR-AUC (0.43 vs. 0.38)
  with 4x faster training time, enabling more hyperparameter exploration.
- **LightGBM over Logistic Regression:** Non-linear feature interactions
  (e.g., recency x frequency) are captured automatically. LR required
  manual interaction terms and still trailed by about 0.15 PR-AUC
  (0.28 vs. 0.43).
- **Stratified CV:** Necessary because of 8% positive rate. Random splits
  risk folds with < 5% positives, producing unstable estimates.
- **PR-AUC over ROC-AUC:** With 12:1 class imbalance, ROC-AUC is
  inflated by the large number of true negatives. PR-AUC focuses on the
  minority class, which is the actionable class.

**Hyperparameter tuning:** Bayesian optimization (Optuna, 50 trials) over
learning_rate, max_depth, num_leaves, min_child_samples, subsample,
colsample_bytree, reg_alpha, and reg_lambda. Best parameters listed in
Appendix A.

Evaluator's Perspective

The candidates I advance to on-site are the ones who anticipate my questions. When I read "LightGBM over Logistic Regression" and they have already provided the PR-AUC comparison, I do not need to ask "did you try simpler models?" That saves time in the follow-up and signals thorough thinking.
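
The comparison numbers that pre-empt "did you try simpler models?" can come from a loop like the one sketched below. It is an illustration under stated assumptions: the data is synthetic (about 8% positives, generated with `make_classification`), and only scikit-learn models are compared, so LightGBM is left out to keep the example dependency-light.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the churn data: roughly 8% positives.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.92, 0.08], random_state=0,
)

# Stratification keeps the positive rate similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "Logistic Reg.": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

baseline = y.mean()  # a random ranker scores about the positive rate
cv_scores = {}
for name, model in models.items():
    # "average_precision" is scikit-learn's PR-AUC estimate.
    cv_scores[name] = cross_val_score(
        model, X, y, cv=cv, scoring="average_precision"
    )
    s = cv_scores[name]
    print(f"{name}: PR-AUC {s.mean():.3f} +/- {s.std():.3f} "
          f"(random baseline {baseline:.3f})")
```

Reporting mean and spread for every model, next to the baseline, gives the evaluator the comparison before they have to ask for it.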

The Methodology Pitfalls

Pitfall	Example	Fix
Description without rationale	"I used LightGBM"	"I chose LightGBM because it outperformed RF by 12% PR-AUC"
Rationale without evidence	"LightGBM is the best for tabular data"	Show the comparison table with numbers
Missing baseline	"PR-AUC of 0.43"	"PR-AUC of 0.43 vs. 0.08 random baseline (5.4x improvement)"
Ignoring class imbalance	"Accuracy of 92%"	"92% accuracy is trivially achieved by predicting majority class. PR-AUC = 0.43."
No alternative models tried	"I used XGBoost"	"Compared LR, RF, XGBoost, LightGBM - see comparison table"
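
The "Ignoring class imbalance" row is easy to demonstrate. In this sketch a majority-class predictor on a toy label vector with 8% positives (hypothetical data) scores high accuracy, while its PR-AUC collapses to the positive rate.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

y_true = np.array([1] * 8 + [0] * 92)                     # 8% positives
majority_labels = np.zeros_like(y_true)                   # always predict "no churn"
constant_scores = np.zeros(len(y_true), dtype=float)      # no ranking information

accuracy = accuracy_score(y_true, majority_labels)
pr_auc = average_precision_score(y_true, constant_scores)

print(f"Accuracy: {accuracy:.2f}")  # looks strong, but flags no churners
print(f"PR-AUC:   {pr_auc:.2f}")    # equals the positive rate
```

This is why the baseline row of a comparison table matters: it shows what a model with no signal scores on each metric.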

Part 5 -- The Technical Appendix

What Goes in the Appendix

The appendix is for detail that supports your claims but would clutter the main narrative. Think of it as the "show your work" section for the deep reader.

Figure: Technical Appendix Structure - Hyperparameter Results, Feature Importance, CV Details, EDA, Alternative Models

Appendix Template

## Appendix A: Hyperparameter Optimization

Search method: Bayesian optimization (Optuna)
Trials: 50
Objective: 5-fold stratified CV PR-AUC (mean)

| Parameter | Search Range | Best Value | Default |
|-----------|-------------|------------|---------|
| learning_rate | [0.01, 0.3] | 0.047 | 0.1 |
| max_depth | [3, 10] | 6 | -1 |
| num_leaves | [15, 63] | 31 | 31 |
| min_child_samples | [5, 50] | 18 | 20 |
| subsample | [0.5, 1.0] | 0.82 | 1.0 |
| colsample_bytree | [0.5, 1.0] | 0.76 | 1.0 |
| reg_alpha | [0, 1.0] | 0.08 | 0.0 |
| reg_lambda | [0, 1.0] | 0.12 | 0.0 |

Tuned PR-AUC: 0.431 +/- 0.012
Default PR-AUC: 0.409 +/- 0.022
Improvement from tuning: +5.4%

## Appendix B: Per-Fold Cross-Validation Results

| Fold | Train PR-AUC | Val PR-AUC | Val ROC-AUC | n_positive | n_negative |
|------|-------------|------------|-------------|------------|------------|
| 1 | 0.891 | 0.442 | 0.918 | 312 | 3,588 |
| 2 | 0.887 | 0.419 | 0.904 | 308 | 3,592 |
| 3 | 0.893 | 0.448 | 0.921 | 315 | 3,585 |
| 4 | 0.889 | 0.427 | 0.911 | 310 | 3,590 |
| 5 | 0.885 | 0.420 | 0.907 | 305 | 3,595 |
| **Mean** | **0.889** | **0.431** | **0.912** | **310** | **3,590** |
| **Std** | **0.003** | **0.012** | **0.007** | **4** | **4** |

Observations:
- Low standard deviation (0.012) across folds indicates stable performance
- No outlier folds (all within 1.5 std of mean)
- Consistent train-val gap (~0.46) indicates substantial overfitting to
  the training folds; stronger regularization is a candidate next step

## Appendix C: Feature Correlation Analysis

Top 3 correlated feature pairs (Pearson):
1. frequency_30d / frequency_60d: 0.89 - expected overlap in time windows
2. monetary_avg / monetary_total: 0.76 - kept both as they capture
   different aspects (average ticket vs. lifetime value)
3. login_velocity / frequency_30d: 0.52 - moderate; velocity captures
   trend while frequency captures level

Decision: No features removed due to multicollinearity. LightGBM handles
correlated features well via feature subsampling (colsample_bytree=0.76).

When to Skip the Appendix

For take-homes with 4-hour time limits, an appendix is optional. Include one only if you have time after completing the main write-up. For 8-hour or weekend projects, an appendix is expected and demonstrates thoroughness. Even a brief appendix with the hyperparameter table and per-fold results adds value.
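
A per-fold table takes only a few lines to generate, so it is cheap to include even under time pressure. The sketch below uses the validation PR-AUC values from the Appendix B example and assumes the population standard deviation; the sample standard deviation would be slightly larger.

```python
from statistics import mean, pstdev

# Validation PR-AUC per fold, copied from the Appendix B example.
val_pr_auc = [0.442, 0.419, 0.448, 0.427, 0.420]

fold_mean = mean(val_pr_auc)
fold_std = pstdev(val_pr_auc)  # population std (assumption)
# Largest distance of any fold from the mean, in standard deviations.
max_z = max(abs(v - fold_mean) / fold_std for v in val_pr_auc)

lines = ["| Fold | Val PR-AUC |", "|------|------------|"]
lines += [f"| {i} | {v:.3f} |" for i, v in enumerate(val_pr_auc, start=1)]
lines += [
    f"| **Mean** | **{fold_mean:.3f}** |",
    f"| **Std** | **{fold_std:.3f}** |",
]
print("\n".join(lines))
print(f"Largest deviation from the mean: {max_z:.2f} std")
```

Computing the outlier check in code, rather than eyeballing it, lets you back a statement such as "all folds within 1.5 std of the mean" with a number.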

Part 6 -- The Follow-Up Presentation

Presentation Structure (10-15 Minutes)

Many companies follow the take-home with a 30-45 minute session where you present your work (10-15 minutes) and then answer questions (15-30 minutes). This is where offers are won or lost.

Figure: Follow-Up Presentation Flow - 8 Slides in 10-15 Minutes Plus Q&A

Slide-by-Slide Guide

Slide 1: Problem Statement (30 seconds)

Restate the problem in your own words
State the business context: why this matters
State the evaluation metric and why you chose it

Slide 2: Approach Overview (90 seconds)

High-level pipeline: Data -> Features -> Model -> Evaluation
State 2-3 key decisions at a high level (details on next slides)
Mention what you did NOT do and why (scope management)

Slides 3-4: Key Methodology Decisions (3 minutes)

Feature engineering: what features and why
Model choice: what you compared and what won
Show the model comparison table
Show feature importance (top 5-10 features)

Slides 5-6: Results (3 minutes)

Performance metrics with baseline comparison
PR curve or most relevant performance visualization
Business-interpretable result: "Top decile captures X% of churners"
Confidence intervals or cross-validation stability

Slide 7: Error Analysis (2 minutes)

Where does the model fail?
What segments underperform?
What types of errors are most costly?
This slide separates good candidates from great candidates

Slide 8: Next Steps and Limitations (2 minutes)

What would you do with one more day?
What would you do with one more month?
What are the known limitations?
What assumptions did you make that might not hold?

Presentation Anti-Patterns

Anti-Pattern	Why It Fails	Fix
Reading code on slides	Evaluators cannot parse code at presentation speed	Show results and key function signatures, not implementations
Walking through every EDA plot	Loses the audience in details	Show 1-2 key EDA insights that drove decisions
Apologizing for what you did not do	Undermines confidence	Frame limitations as "next steps" - forward-looking, not defensive
Skipping the baseline	Results have no context	Always show "X vs. baseline" comparisons
Going over time	Signals poor preparation	Practice with a timer; cut content rather than speed up

Common Trap

Do not spend more than 20% of your presentation on EDA. The evaluator has already seen your notebook - they know what the data looks like. Spend 60% on methodology and results, and 20% on error analysis and next steps. The ratio of "what I decided" to "what I observed" should be at least 3:1.

Part 7 -- Handling Follow-Up Questions

The Question Taxonomy

Follow-up questions fall into five categories. Recognizing the category helps you structure your answer.

Figure: Follow-Up Question Taxonomy - Five Types: Clarification, Challenge, Extension, Depth, Stress Test

The STAR-T Framework for Technical Questions

Adapt the STAR framework for technical follow-ups:

Situation: Acknowledge the question's context
Thought process: Explain your reasoning framework
Action: What you would do or did
Result: Expected outcome or observed result
Tradeoff: What you would give up and why it is acceptable

Example Q&A Exchanges

Question (Challenge): "You used PR-AUC as your primary metric. But your stakeholders care about customer retention rate. Why not optimize directly for a business metric?"

Strong answer: "Great question. I chose PR-AUC as the model-selection metric because it can be computed offline from held-out labels and scores, which business metrics like retention rate cannot - retention depends on the intervention strategy, not just the model. However, I evaluate the business impact separately: at our chosen operating point of 30% precision, we flag 340 customers per month, of whom roughly 100 would actually churn; assuming a 30% save rate from proactive outreach, that translates to roughly 30 saved customers per month. I would propose an A/B test to validate this save rate before making retention claims. If we find the save rate differs by risk score tier, we could optimize the threshold per tier."

Weak answer: "PR-AUC is the standard metric for imbalanced classification."

Question (Extension): "How would you deploy this model in production?"

Strong answer: "I would break this into three phases. First, batch scoring: run the model weekly on a snapshot of current customer features, write risk scores to a database table, and have the CS team triage the top decile. This is the fastest path to value and lets us validate the model's utility before investing in infrastructure. Second, once we have validated the model drives retention, I would move to a daily batch pipeline - an Airflow DAG that computes features from the data warehouse, runs inference, and writes scores to an API-accessible store. Third, if we find that real-time signals like 'customer currently on cancellation page' are predictive, we would build a streaming feature pipeline and serve the model behind a low-latency API. Each phase has a clear ROI gate before proceeding."

Weak answer: "I would put it in a Docker container and deploy to Kubernetes."

Question (Stress Test): "What happens to your model when the product adds a major new feature that changes user engagement patterns?"

Strong answer: "This is a concept drift scenario - the relationship between my features and churn changes because user behavior changes. My login_velocity feature, which is the strongest predictor, would be most affected. Short-term, I would add monitoring: track the distribution of risk scores weekly and alert if the mean or variance shifts by more than two standard deviations. Medium-term, I would implement rolling retraining on a 90-day window, so the model adapts to new behavioral patterns within a quarter. Long-term, I would add product-specific features - engagement with the new feature specifically - which requires coordination with the product team to instrument the right events."
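
The short-term monitoring step in that answer can be sketched in a few lines. This is a simplified illustration, not a production monitor: the function name and the weekly values are hypothetical, and only the mean risk score is checked. The same function can be applied to a series of weekly variances.

```python
from statistics import mean, stdev
from typing import Sequence


def drift_alert(
    history: Sequence[float], current: float, n_sigmas: float = 2.0
) -> bool:
    """Flag the current weekly value if it sits more than n_sigmas
    standard deviations away from the historical weekly values."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > n_sigmas


# Hypothetical weekly mean risk scores from previous weeks.
weekly_means = [0.10, 0.11, 0.09, 0.10, 0.10]

stable_week = drift_alert(weekly_means, current=0.105)
shifted_week = drift_alert(weekly_means, current=0.13)
print(f"Alert at 0.105: {stable_week}")
print(f"Alert at 0.130: {shifted_week}")
```

Being able to describe the check at this level of detail (what is tracked, against what reference, with what threshold) is what turns "I would add monitoring" into a concrete answer.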

Evaluator's Perspective

The follow-up Q&A is where I separate candidates who memorized solutions from candidates who think in frameworks. A candidate who says "I would retrain the model" in response to a drift question gets a neutral score. A candidate who distinguishes between covariate drift and concept drift, proposes monitoring, and suggests both short-term and long-term mitigations gets a strong hire. The depth of the answer matters more than the specific solution.

Phrases That Hurt vs. Help

Hurts	Helps
"I did not have time for that"	"Given more time, I would prioritize X because..."
"I do not know" (full stop)	"I have not implemented that, but my approach would be..."
"That is a good point, I did not think of that"	"That is a valid concern. The impact would be X, and I would address it by..."
"I just used the default parameters"	"I started with defaults as a baseline, then tuned the three most impactful parameters"
"The model is pretty good"	"The model achieves X, which represents a Y% improvement over the baseline, but underperforms on Z segment"

Part 8 -- Adapting to Company Culture

Write-Up Styles by Company Type

Different companies value different aspects of your write-up. Adjust your emphasis accordingly.

Company Type	Emphasis	Write-Up Style	Example
FAANG / Big Tech	Rigor, scalability, metrics	Formal, metric-heavy, production-aware	Google, Meta, Amazon
Growth-Stage Startup	Business impact, speed, pragmatism	Concise, action-oriented, ROI-focused	Stripe, Notion, Figma
Research Lab	Novelty, depth, ablation studies	Academic, thorough, with ablations	OpenAI, DeepMind, Anthropic
Consulting / Analytics	Storytelling, stakeholder communication	Narrative, polished visualizations, executive-friendly	McKinsey QuantumBlack
Fintech / Healthcare	Regulatory awareness, interpretability	Cautious, explainability-focused, bias-aware	Two Sigma, Tempus

Tailoring the Executive Summary

For a FAANG role:

The final model achieves PR-AUC of 0.431 +/- 0.012 (5-fold stratified CV),
representing a 5.4x improvement over the random baseline. Feature ablation
shows that removing engagement velocity features reduces PR-AUC by 31%,
confirming they are the primary predictive signal. At scale, the model
scores 100K customers in < 2 seconds on a single CPU core, meeting the
latency requirements for daily batch scoring.

For a startup role:

The model identifies 62% of likely churners in the top risk decile,
enabling proactive outreach to the highest-risk customers. At our
recommended operating threshold, each month we would flag ~340 customers
for CSM intervention at a cost of ~$17K in CSM time, with an expected
return of ~$180K in retained revenue (assuming a conservative 30% save
rate). Recommended first step: A/B test proactive outreach on model-flagged
vs. randomly selected at-risk customers to validate the save rate.

For a research lab role:

We compare four model families (logistic regression, random forest,
gradient boosted trees, and neural network) on the churn prediction task,
evaluating each under stratified 5-fold CV with PR-AUC as the primary
metric. Gradient boosted trees (LightGBM) achieve the best performance
(0.431 +/- 0.012), followed by RF (0.382 +/- 0.019). Ablation over
feature groups reveals that temporal engagement features contribute 78%
of the total predictive signal, while demographic features contribute
< 2%. Analysis of calibration curves shows that LightGBM's probability
outputs are well-calibrated in the 0.1-0.5 range but overconfident above
0.5, suggesting Platt scaling would improve deployment utility.
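
If you mention Platt scaling in a summary, expect to be asked how you would apply it. A minimal sketch with scikit-learn follows, assuming synthetic data and a random forest as a stand-in for the model in the example; `method="sigmoid"` is scikit-learn's name for Platt scaling.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 8% positives.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.92, 0.08], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

raw = RandomForestClassifier(n_estimators=100, random_state=0)
raw.fit(X_train, y_train)

# cv=3 fits the sigmoid calibrator on folds the base model did not train on.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    method="sigmoid",
    cv=3,
)
calibrated.fit(X_train, y_train)

raw_probs = raw.predict_proba(X_test)[:, 1]
cal_probs = calibrated.predict_proba(X_test)[:, 1]

# Lower Brier score means probabilities closer to observed outcomes.
print(f"Brier score, raw:        {brier_score_loss(y_test, raw_probs):.4f}")
print(f"Brier score, calibrated: {brier_score_loss(y_test, cal_probs):.4f}")
```

Whether calibration helps depends on the model and data, so report the before-and-after scores rather than assuming an improvement.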

Practice Problems

Problem 1: Write an Executive Summary

You completed a take-home for a recommendation system. Key facts:

Task: Predict which products a user will purchase next
Dataset: 500K users, 10K products, 12 months of purchase history
Best model: Matrix factorization + LightGBM hybrid, Recall@10 = 0.23
Baseline: Popularity-based, Recall@10 = 0.09
Key insight: Recency-weighted purchase history outperforms raw frequency
Deployment consideration: Must update recommendations daily for 500K users

Write a two-paragraph executive summary.

Hint 1 -- Direction

Paragraph 1: Problem, business context, approach. Paragraph 2: Results vs. baseline, key insight, deployment feasibility.

Hint 2 -- Key Elements to Include

Business context: why recommendations matter (revenue per user, conversion rate)
Quantify the improvement: 0.23 vs. 0.09 is a 2.6x improvement
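Translate Recall@10 into business language: "top 10 recommendations contain at least one actual purchase for 23% of users" (this reading is exact when each user has a single held-out next purchase; with several held-out purchases, Recall@10 is the average share of them that appear in the top 10)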
Address the daily update requirement with a concrete plan
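
Before translating Recall@10 into plain language, check which quantity you actually computed. The sketch below contrasts per-user Recall@10 with hit rate@10 on hypothetical toy data; the two agree only when every user has exactly one held-out purchase.

```python
from typing import Dict, List, Set


def recall_at_k(recommended: List[int], purchased: Set[int], k: int = 10) -> float:
    """Share of a user's purchased items that appear in the top-k list."""
    return len(set(recommended[:k]) & purchased) / len(purchased)


def hit_at_k(recommended: List[int], purchased: Set[int], k: int = 10) -> bool:
    """True if at least one purchased item appears in the top-k list."""
    return bool(set(recommended[:k]) & purchased)


# Hypothetical toy data: two users, both shown items 1..10.
recs: Dict[str, List[int]] = {"u1": list(range(1, 11)), "u2": list(range(1, 11))}
bought: Dict[str, Set[int]] = {"u1": {3, 42}, "u2": {99}}

mean_recall = sum(recall_at_k(recs[u], bought[u]) for u in recs) / len(recs)
hit_rate = sum(hit_at_k(recs[u], bought[u]) for u in recs) / len(recs)

print(f"Mean Recall@10: {mean_recall:.2f}")
print(f"Hit rate@10:    {hit_rate:.2f}")
```

Here user u1 gets one of two purchases right, so recall is 0.5 while the hit is counted in full. State which metric you report so the plain-language sentence matches the number.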

Hint 3 -- Strong Example

"This analysis develops a product recommendation system for a catalog of 10K items, using 12 months of purchase history from 500K users. Effective recommendations directly impact revenue through increased conversion rates and average order values. I developed a hybrid approach combining matrix factorization embeddings (capturing latent user-product affinities) with a LightGBM ranker (incorporating recency signals and contextual features), evaluated using Recall@10 to measure whether actual purchased items appear in the top-10 recommendations.

The hybrid model achieves Recall@10 of 0.23, a 2.6x improvement over the popularity baseline (0.09). In business terms, 23% of users would see at least one product they actually purchased in their top-10 recommendations, compared to 9% with a simple 'most popular items' approach. The key modeling insight is that recency-weighted purchase history (exponential decay with a 30-day half-life) outperforms raw purchase frequency by 18% in recall, suggesting that recent behavior is far more predictive of near-term purchases than lifetime history. For deployment, the embedding computation (batch matrix factorization) requires ~45 minutes, and the LightGBM scoring runs in ~8 minutes for 500K users, well within a nightly batch window."

Scoring Rubric:

Strong Hire: Includes business context, quantifies improvement vs. baseline, translates metric into plain language, addresses deployment feasibility with specific numbers. Both paragraphs are self-contained and actionable.
Lean Hire: Mentions key results but lacks business translation or deployment discussion.
No Hire: Lists technical details without context ("I used matrix factorization, Recall@10 = 0.23").

Problem 2: Create a Presentation Outline

You have 12 minutes to present a fraud detection take-home. Key facts:

1M transactions, 0.3% fraud rate
Compared LR, RF, XGBoost, neural network
XGBoost won (PR-AUC 0.72)
Key features: transaction velocity, device fingerprint mismatch, amount deviation
Error analysis: model misses sophisticated fraud patterns (account takeover)
Next step: graph features from transaction networks

Create a slide-by-slide outline with timing.

Hint 1 -- Direction

12 minutes = about 6-7 slides at 1-2 minutes each. Allocate time to results and error analysis, not EDA.

Hint 2 -- Time Allocation

Problem + Approach: 2 min (1-2 slides)
Feature engineering + Model selection: 3 min (2 slides)
Results: 3 min (1-2 slides)
Error analysis: 2 min (1 slide)
Next steps: 2 min (1 slide)

Hint 3 -- Full Outline

Slide 1 (1 min): Problem Context

Fraud detection: 0.3% fraud rate in 1M transactions
Business: even a 0.1 percentage point drop in the false negative rate = significant loss prevention
Metric choice: PR-AUC (not ROC-AUC) because of extreme class imbalance
Evaluation focus: precision at high recall thresholds (catch fraud without blocking legitimate users)
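
If asked to justify the metric choice on this slide, a short base-rate calculation helps. The sketch below is a worked example under assumed numbers: the 0.3% fraud rate comes from the problem statement, while the true positive rate and false positive rate are hypothetical values chosen only to show the effect of class imbalance on precision.

```python
def precision_from_rates(prevalence, tpr, fpr):
    """Precision implied by a base rate, true positive rate, and false positive rate."""
    true_pos = prevalence * tpr
    false_pos = (1.0 - prevalence) * fpr
    return true_pos / (true_pos + false_pos)


fraud_rate = 0.003  # 0.3% fraud rate from the problem statement

# Hypothetical classifier: catches 80% of fraud and flags 1% of legitimate traffic.
imbalanced = precision_from_rates(fraud_rate, tpr=0.80, fpr=0.01)

# The same classifier on a balanced dataset, for contrast.
balanced = precision_from_rates(0.5, tpr=0.80, fpr=0.01)

print(f"precision at 0.3% prevalence: {imbalanced:.2f}")  # roughly 0.19
print(f"precision at 50% prevalence:  {balanced:.2f}")
```

A 1% false positive rate looks excellent on a ROC curve, yet at this base rate most flagged transactions are legitimate. That gap is why the outline reports PR-AUC instead.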

Slide 2 (2 min): Feature Engineering

Three feature categories: velocity (transactions per hour), identity (device/IP mismatch), amount (deviation from user's historical pattern)
Show feature importance chart (top 5)
Key insight: transaction velocity in last 1 hour is 4x more predictive than transaction amount

Slide 3 (2 min): Model Comparison

Table: LR, RF, XGBoost, NN with PR-AUC, inference time, training time
XGBoost wins: PR-AUC 0.72 vs NN 0.69 (comparable performance, 10x faster inference)
Why not NN: slightly lower PR-AUC and roughly 10x slower inference, which is hard to justify for real-time scoring

Slide 4 (2 min): Results Deep Dive

PR curve with operating point annotated
At 80% precision: recall = 0.58 (catches 58% of fraud)
At 95% precision: recall = 0.31 (for automated blocking)
Business translation: two operating modes - alert (high recall) and auto-block (high precision)

Slide 5 (2 min): Error Analysis

Model misses account takeover fraud (legitimate device, unusual behavior)
Confusion matrix segmented by fraud type
78% of false negatives are account takeover (vs. stolen card)
Current features focus on device/velocity but miss behavioral anomalies

Slide 6 (2 min): Next Steps

Short-term: add graph features (transaction network, merchant connections)
Medium-term: sequence model (LSTM on transaction sequences) for behavioral patterns
Monitoring: track precision/recall weekly, retrain monthly
A/B test: deploy alongside rule-based system, measure incremental catch rate

Slide 7 (1 min): Summary

XGBoost + velocity features achieve PR-AUC 0.72 on fraud detection
Two operating modes for different use cases
Key gap: account takeover fraud requires graph and sequence features
Recommended deployment: batch scoring for alerts, real-time scoring later
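
The two operating modes on Slide 4 come from reading thresholds off the precision-recall curve. The following is a minimal pure-Python sketch of that lookup; the function name and the toy labels and scores are assumptions for illustration, and in a real take-home you would more likely start from a library's precision-recall curve output.

```python
def operating_point(y_true, scores, min_precision):
    """Return (threshold, precision, recall) with the highest recall whose
    precision is at least `min_precision`, or None if no threshold qualifies."""
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    best = None
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before evaluating.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        precision = tp / (tp + fp)
        recall = tp / total_pos
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (threshold, precision, recall)
    return best


# Toy data (made up): 1 = fraud, 0 = legitimate.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30]

auto_block = operating_point(y_true, scores, min_precision=0.95)
alert = operating_point(y_true, scores, min_precision=0.70)
print("auto-block:", auto_block)
print("alert:", alert)
```

Annotating both points on the PR curve lets the audience see the trade directly: the stricter precision floor buys fewer false alarms at the price of lower recall.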

Scoring Rubric:

Strong Hire: Presentation has a clear flow, spends the majority of its time on results and analysis (not EDA), includes error analysis with specific failure modes, and has concrete next steps. Timing is realistic.
Lean Hire: Covers main results but error analysis is thin and next steps are generic.
No Hire: Spends 5+ minutes on EDA and data description, leaving 2 minutes for results.

Problem 3: Handle These Questions

For each question, write a two-sentence answer that would satisfy a senior ML engineer.

"Your model has a high AUC but low precision. Is that a problem?"
"You used LightGBM. Have you considered a neural network?"
"Your feature importance shows login_velocity is dominant. Is that a concern?"
"What happens if the data distribution shifts next quarter?"
"You only had 6 hours. What did you intentionally skip?"

Hint 1 -- Direction

Each answer should: (1) directly address the concern, and (2) demonstrate awareness of the tradeoff involved.

Hint 2 -- Key Principles

AUC vs precision: depends on the operating point and cost asymmetry
LightGBM vs NN: justify based on data size, interpretability, and training time
Dominant feature: discuss single-point-of-failure risk and robustness
Distribution shift: discuss monitoring, retraining, and robustness strategies
Intentional skips: frame as prioritization, not omission
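
For the first principle, it can help to show what "depends on the operating point and cost asymmetry" means in practice. The sketch below picks the threshold that minimizes total misclassification cost. The function name, the cost values, and the toy data are all hypothetical; real costs would come from the business.

```python
def cheapest_threshold(y_true, scores, fp_cost, fn_cost):
    """Return (threshold, total_cost) minimizing fp_cost * FP + fn_cost * FN.
    An example is flagged positive when its score is >= threshold."""
    # float("inf") covers the "flag nothing" option.
    candidates = sorted(set(scores)) + [float("inf")]
    best_threshold, best_cost = None, float("inf")
    for threshold in candidates:
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
        cost = fp_cost * fp + fn_cost * fn
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost


# Toy data (made up): labels and model scores.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.30, 0.35, 0.60, 0.70, 0.90]

# Missing a positive is 10x worse than a false alarm: threshold moves down.
miss_is_costly = cheapest_threshold(y_true, scores, fp_cost=1, fn_cost=10)

# A false alarm is 10x worse than a miss: threshold moves up.
alarm_is_costly = cheapest_threshold(y_true, scores, fp_cost=10, fn_cost=1)

print(miss_is_costly, alarm_is_costly)
```

The same model yields two different thresholds, and therefore two different precision values, depending only on the cost ratio. That is the tradeoff a strong answer should name.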

Hint 3 -- Strong Answers

"High AUC with low precision is expected in imbalanced classification - it means the model ranks positives above negatives well but the raw threshold needs calibration. I would adjust the decision threshold to match the business's cost asymmetry between false positives and false negatives, and report precision-recall at the specific operating point rather than at the default 0.5 threshold."
"I considered a neural network and ran a preliminary comparison - it achieved comparable PR-AUC (0.42 vs 0.43) but required 8x longer training time, which limited my ability to iterate on hyperparameters and features within the time constraint. For tabular data of this size (20K samples, 25 features), gradient-boosted trees typically match or exceed neural networks, and the interpretability advantage (feature importance, SHAP values) made LightGBM the pragmatic choice."
"A dominant feature is both a strength and a risk - it means we have found a strong signal, but the model is fragile if that signal degrades or becomes unavailable. I would run an ablation study removing login_velocity to measure the performance drop, and if the drop exceeds 20%, I would invest in finding alternative engagement signals that capture similar information through different data sources."
"Distribution shift is the primary production risk for this model - I would implement weekly monitoring of feature distributions and prediction score distributions, with automated alerts when KL divergence exceeds a threshold. Additionally, I would set up rolling retraining on a 90-day window so the model adapts to gradual drift, with a manual retraining trigger for sudden shifts like a product launch or market event."
"I intentionally skipped deep hyperparameter tuning, neural network architectures, and SHAP-based interpretability analysis. I prioritized feature engineering and model comparison because, in my experience, the feature set determines 80% of model performance while hyperparameters contribute at most 5-10%, so the time was better spent building strong features and validating with proper cross-validation."

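The drift monitoring described in the distribution-shift answer can be sketched in a few lines. This is an illustration under assumptions: the bin edges, the alert threshold of 0.1, and the function names are hypothetical, and the smoothing constant is one simple way to avoid division by zero in empty bins.

```python
import math


def histogram(values, edges):
    """Count values per bin; the last bin includes its right edge.
    Values outside the edges are ignored."""
    counts = [0] * (len(edges) - 1)
    last = len(edges) - 2
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1] or (i == last and v == edges[-1]):
                counts[i] += 1
                break
    return counts


def kl_divergence(p_counts, q_counts, eps=1e-6):
    """KL(P || Q) between two binned distributions, with additive smoothing."""
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + eps) / p_total
        q = (qc + eps) / q_total
        kl += p * math.log(p / q)
    return kl


edges = [0.0, 0.25, 0.5, 0.75, 1.0]
reference = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]      # training-time feature values
this_week = [0.8, 0.85, 0.9, 0.95, 0.7, 0.9, 0.8, 0.6]    # made-up production values

ref_counts = histogram(reference, edges)
cur_counts = histogram(this_week, edges)

drift = kl_divergence(cur_counts, ref_counts)
no_drift = kl_divergence(ref_counts, ref_counts)

ALERT_THRESHOLD = 0.1  # hypothetical; tune on historical week-to-week variation
print(f"drift = {drift:.3f}, alert = {drift > ALERT_THRESHOLD}")
```

In an interview, the threshold matters less than showing that you would calibrate it against normal week-to-week variation before turning on alerts.
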
Interview Cheat Sheet

Concept	Key Practice	One-Liner	Red Flag
Executive summary	Two paragraphs: approach + results with business impact	The evaluator should understand everything from this alone	No summary, or a summary that says "I used LightGBM"
Visualizations	Title states the takeaway, not the chart type	Every figure answers a specific question	Correlation heatmap of 40 features, unlabeled axes
Methodology	Every decision has a rationale and alternative	"I chose X over Y because Z"	"I used X" without explanation
Baseline comparison	Every metric is compared to a meaningful baseline	Results without context are meaningless	"AUC of 0.91" with no baseline
Error analysis	Show where the model fails, not just where it succeeds	Error analysis separates good from great candidates	Only showing aggregate metrics
Next steps	Concrete, prioritized, forward-looking	Shows awareness that this is a starting point	"I would get more data" as the only next step
Presentation	60% results/analysis, 20% method, 20% next steps	Do not narrate EDA - narrate decisions	5 minutes of EDA, 2 minutes of results
Q&A handling	Acknowledge, analyze, address, tradeoff	Frame limitations as next steps, not failures	"I did not have time" or "I do not know" (full stop)
Business translation	Translate metrics into decisions and dollars	"Top decile captures X% of churners"	"PR-AUC is 0.43" without interpretation
Appendix	Supporting detail for the deep reader	Shows depth without cluttering the narrative	No supporting detail, or 20-page appendix

Spaced Repetition Checkpoints

Day 0 -- Initial Learning

Read this entire page
Rewrite the executive summary of a past project using the two-paragraph formula
Audit the visualizations in a past project - does each one answer a specific question?
Complete the self-assessment

Day 3 -- First Recall

Without looking, list the five sections of a write-up
Write the "So What?" version of three technical results from your experience
Practice the STAR-T framework on one follow-up question out loud

Day 7 -- Practice

Do Practice Problem 1 (executive summary) without looking at hints
Create a presentation outline for a past project (timed: 10 minutes)
Answer the five questions from Problem 3 out loud, timed (2 minutes each)

Day 14 -- Application

Do a full mock take-home write-up with all five sections (timed: 2 hours)
Present it to a friend or mentor in 12 minutes
Have them ask 5 follow-up questions and practice the STAR-T framework

Day 21 -- Mock Interview

Present a take-home to someone unfamiliar with the problem
Time the presentation (must be under 15 minutes)
Ask them to evaluate: "Could you follow my reasoning without running the code?"
Iterate on weak areas

Key Takeaways

The write-up is your interview before the interview. Most evaluators form their opinion from the write-up alone, before any code is run. A clear, structured write-up with an actionable executive summary gets you to the on-site. A notebook dump does not.
Every decision needs a rationale. "I used LightGBM" is a description. "I chose LightGBM over Random Forest because it achieved 12% higher PR-AUC with 4x faster training, enabling more iteration in the time constraint" is a rationale. Evaluators hire people who can explain their reasoning, not people who can call sklearn functions.
Visualizations are arguments, not decorations. Each figure should have a takeaway in its title, not a description. "Login velocity is 3x more predictive than demographics" is a finding. "Feature Importance Plot" is a label. The evaluator should understand your key findings by reading only the figure titles.
The follow-up Q&A is where offers are won. Preparing for five categories of questions - clarification, challenge, extension, depth, and stress test - means you are never caught off guard. Acknowledge the question, analyze the tradeoff, address the concern, and state what you would do next.
Adapt to the audience. A startup VP wants business impact and deployment feasibility. A Google staff engineer wants rigorous metrics and scalability analysis. A research scientist wants ablation studies and methodological depth. One write-up format does not fit all.
