What Evaluators Actually Want - Inside the Scoring Room
Reading time: ~30 min | Interview relevance: Critical | Roles: MLE, Data Scientist, Applied Scientist, AI Engineer, Research Engineer
The Real Interview Moment
You submit your take-home project at 11:47 PM on a Sunday night - 13 minutes before the deadline. You spent 14 hours across the weekend, tried six different models, squeezed out a 0.89 AUC, and included three pages of hyperparameter tuning results. You feel confident.
Two weeks later, the rejection email arrives: "Thank you for your time. After careful review, we have decided to move forward with other candidates."
You are devastated. You know your model was strong. What happened?
What happened is this: the evaluator - a senior ML engineer who reviews 15 take-home submissions per hiring cycle - opened your notebook, saw no README, scrolled through 47 code cells with no markdown explanations, noticed you used train_test_split after feature scaling (data leakage), and closed the notebook within 8 minutes. Your 0.89 AUC was irrelevant because the evaluator never trusted the number.
Meanwhile, the candidate who advanced submitted a 0.84 AUC model with a clean README, well-documented EDA, a proper pipeline that prevented leakage, and a two-page write-up explaining their feature engineering rationale and what they would try with more time. That candidate got an offer.
This chapter reveals exactly what happens on the other side of the submission - how evaluators think, what they score, and what actually determines whether your take-home leads to the next round.
What You Will Master
- Real evaluation rubrics used at different company types
- The relative weight of code quality vs. modeling quality
- What "production readiness" means in a take-home context
- How communication in notebooks and READMEs is scored
- Common scoring criteria across the industry
- The signals that cause instant pass or instant rejection
Self-Assessment: Where Are You Now?
| Level | Description | Target |
|---|---|---|
| Beginner | "I assumed model accuracy was the main criterion" | Read all parts - your mental model needs recalibrating |
| Intermediate | "I know code quality matters, but I am unsure how it is weighted" | Focus on Parts 1-3 for the rubric details |
| Advanced | "I write clean code and document well, but I have still been rejected" | Jump to Part 4 (hidden signals) and the practice exercise |
Part 1 - Real Evaluation Rubrics
How Evaluators Actually Review Submissions
Before we discuss criteria, understand the mechanics of review:
-
First pass (2-3 minutes): The evaluator opens the README (if there is one), skims the notebook structure, and forms a first impression. If there is no README, no markdown, and the code is wall-to-wall, the impression is already negative.
-
Second pass (5-10 minutes): They look at the data handling, check for leakage, review the evaluation approach, and read any write-up. This is where most submissions are rejected.
-
Deep review (15-30 minutes): Only the top 20-30% of submissions reach this stage. The evaluator reads the code carefully, checks methodology, and forms their recommendation.
"Evaluators spend 10-15 minutes on most submissions, with only the top 20-30% receiving a deep review. The first 2-3 minutes are decisive - README quality, notebook structure, and first impression determine whether the evaluator approaches your work positively or is already looking for reasons to reject. This means your submission must communicate competence before the evaluator reads a single line of code."
The Standard Rubric (Composite from Multiple Companies)
After synthesizing rubrics from tech companies, startups, and financial firms, here is the composite scoring framework that most evaluators use (explicitly or implicitly):
| Criterion | Weight | Score 1 (Fail) | Score 3 (Pass) | Score 5 (Strong Pass) |
|---|---|---|---|---|
| Problem Understanding | 10% | Misinterprets the task or ignores requirements | Correctly identifies the task and addresses all requirements | Identifies ambiguities, states assumptions, goes beyond the literal prompt |
| Data Exploration | 15% | No EDA or only df.describe() | Basic visualizations and summary statistics | Insightful EDA that reveals patterns, informs feature engineering, documents findings |
| Data Handling | 10% | Ignores missing values, no preprocessing | Handles missing data, encodes categoricals, scales features | Thoughtful imputation, handles edge cases, documents decisions |
| Feature Engineering | 15% | Uses raw features only | Creates some derived features with rationale | Creative, domain-aware features with clear justification |
| Model Selection & Evaluation | 20% | Single model, no baseline, wrong metric | Multiple models compared, proper train/test split, appropriate metrics | Baseline comparison, cross-validation, multiple metrics, error analysis |
| Code Quality | 15% | Spaghetti code, no functions, unreproducible | Organized notebook, some functions, runs end-to-end | Modular code, type hints, tests, requirements file, clean abstractions |
| Communication | 15% | No write-up, no markdown, no README | Brief write-up, some markdown in notebook | Clear narrative, executive summary, honest about limitations, next steps |
Company-Specific Rubric Variations
Different companies weight these criteria differently based on what they value:
Google / DeepMind: Heavy emphasis on statistical rigor and evaluation methodology. They will check if your cross-validation strategy is sound and whether your metric choice matches the problem. Less emphasis on polished presentation.
Startups (Series A-C): Heavy emphasis on code quality and production readiness. They need people who can ship, not just analyze. A Dockerfile, proper error handling, and a Makefile will set you apart.
Meta / Applied ML: Balanced across all criteria but with extra weight on feature engineering. They want to see you think creatively about the data.
Finance (Two Sigma, Citadel): Extreme emphasis on proper evaluation and absence of leakage. Statistical rigor is non-negotiable. They also expect awareness of temporal dependencies.
Healthcare (Flatiron, Tempus): Extra weight on interpretability and communication. Can you explain your model's decisions to a clinician? Do you discuss false negative vs. false positive trade-offs in clinical context?
Part 2 - Code Quality vs. Modeling Quality
The Great Misconception
Most candidates believe this is the weighting:
Modeling Quality: 70%
Code Quality: 15%
Communication: 15%
The actual weighting is closer to:
Modeling Quality: 30-35%
Code Quality: 30-35%
Communication: 30-35%
This shocks candidates who come from academic backgrounds where the model result is everything. In industry, a model that works but cannot be understood, maintained, or reproduced is worthless.
What "Modeling Quality" Actually Means
Modeling quality is not about having the highest metric. It is about demonstrating sound methodology:
| Signal | What It Shows | How to Demonstrate |
|---|---|---|
| Baseline first | You understand relative improvement | Start with a dummy classifier or simple logistic regression |
| Appropriate model for data size | You will not overfit 500 rows with a neural net | Match model complexity to dataset size |
| Proper evaluation | You understand generalization | Cross-validation, held-out test set, appropriate metrics |
| Hyperparameter discipline | You do not randomly guess | Grid/random search with CV, document what you tried |
| Error analysis | You understand your model's weaknesses | Confusion matrix, per-class metrics, failure case examination |
| Honest reporting | You do not cherry-pick results | Report all models tried, including ones that did not work |
# WRONG: This is what most candidates do
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
model = XGBClassifier(n_estimators=1000, max_depth=10)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
# "I used XGBoost because it usually works best"
# RIGHT: This is what evaluators want to see
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (
classification_report,
roc_auc_score,
average_precision_score,
confusion_matrix,
)
from sklearn.model_selection import cross_val_score
# Step 1: Establish baseline
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_train, y_train, cv=5, scoring="roc_auc")
print(f"Baseline AUC (majority class): {baseline_scores.mean():.4f} +/- {baseline_scores.std():.4f}")
# Step 2: Simple model first
lr = LogisticRegression(max_iter=1000, random_state=42)
lr_scores = cross_val_score(lr, X_train, y_train, cv=5, scoring="roc_auc")
print(f"Logistic Regression AUC: {lr_scores.mean():.4f} +/- {lr_scores.std():.4f}")
# Step 3: More complex models only if justified
models = {
"Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
"Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}
results = {"Baseline": baseline_scores.mean(), "Logistic Regression": lr_scores.mean()}
for name, model in models.items():
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
results[name] = scores.mean()
print(f"{name} AUC: {scores.mean():.4f} +/- {scores.std():.4f}")
# Step 4: Final evaluation on held-out test set (ONE TIME ONLY)
best_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:, 1]
print("\n--- Final Test Set Evaluation ---")
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
print(f"Average Precision: {average_precision_score(y_test, y_prob):.4f}")
print(f"\nClassification Report:\n{classification_report(y_test, y_pred)}")
print(f"\nConfusion Matrix:\n{confusion_matrix(y_test, y_pred)}")
"I used XGBoost because it usually wins Kaggle competitions" is a red flag to evaluators. It signals that you do not think about why a model is appropriate for a given problem. Instead, say: "I started with logistic regression as a baseline (AUC 0.76), then tried random forest (AUC 0.81) and gradient boosting (AUC 0.84). The gradient boosting model showed meaningful improvement on the validation set, so I selected it for final evaluation. With more time, I would explore feature interactions that might help the linear model close the gap."
What "Code Quality" Actually Means
Code quality in a take-home is not about following every PEP 8 rule. It is about demonstrating that you write code that other people can understand and maintain.
| Quality Signal | Failing Example | Passing Example |
|---|---|---|
| Naming | df2, X_new, temp | df_customers, X_train_scaled, features_engineered |
| Functions | 200-line notebook cells | Reusable functions with docstrings |
| Comments | No comments or # increment i | Comments explaining why, not what |
| Organization | Random cell order, dead code | Clear sections with markdown headers |
| Reproducibility | No seeds, no requirements | random_state=42 everywhere, requirements.txt |
| Error handling | try: except: pass or none | Meaningful error messages, input validation |
# FAILING code quality
df = pd.read_csv('data.csv')
df = df.dropna()
df['f1'] = df['a'] / df['b']
X = df.drop('target', axis=1)
y = df['target']
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X) # LEAKAGE: scaling before split!
X_train, X_test, y_train, y_test = train_test_split(X, y)
# ... 300 more lines like this
# PASSING code quality
def load_and_validate_data(filepath: str) -> pd.DataFrame:
"""Load dataset and perform basic validation checks.
Args:
filepath: Path to the CSV data file.
Returns:
Validated DataFrame with basic type checks applied.
Raises:
FileNotFoundError: If the data file does not exist.
ValueError: If required columns are missing.
"""
REQUIRED_COLUMNS = ["customer_id", "tenure", "monthly_charges", "churn"]
df = pd.read_csv(filepath)
missing_cols = set(REQUIRED_COLUMNS) - set(df.columns)
if missing_cols:
raise ValueError(f"Missing required columns: {missing_cols}")
print(f"Loaded {len(df)} records with {len(df.columns)} columns")
print(f"Missing values: {df.isnull().sum().sum()} total")
return df
def create_features(df: pd.DataFrame) -> pd.DataFrame:
"""Engineer features from raw customer data.
Feature engineering rationale:
- charge_per_tenure: Captures customer value trajectory
- is_new_customer: Tenure < 6 months correlates with higher churn
- high_monthly_charge: Identifies price-sensitive segment
"""
df = df.copy() # Avoid modifying the original DataFrame
df["charge_per_tenure"] = df["monthly_charges"] / (df["tenure"] + 1) # +1 to avoid division by zero
df["is_new_customer"] = (df["tenure"] < 6).astype(int)
df["high_monthly_charge"] = (df["monthly_charges"] > df["monthly_charges"].median()).astype(int)
return df
Part 3 - The Production Readiness Signal
What Evaluators Mean by "Production Readiness"
When evaluators say they want "production-quality code," they do not expect a deployed microservice. They want evidence that you think about production concerns:
The Reproducibility Test
Many evaluators have a simple test: they clone your repository and try to run your code. If it does not work, you fail. Period.
# The reproducibility checklist evaluators mentally run:
reproducibility_checks = {
"Can I install dependencies?": [
"requirements.txt or environment.yml exists",
"Python version specified",
"All imports resolve after installing",
],
"Can I run the code?": [
"README has clear run instructions",
"No hardcoded absolute paths (/Users/john/data/...)",
"Data files included or download instructions provided",
"Single command to execute (python main.py or Run All in notebook)",
],
"Do I get the same results?": [
"Random seeds set (numpy, sklearn, torch, python hashseed)",
"No non-deterministic operations without seeds",
"Results match what is reported in the write-up",
],
}
Here are the fastest ways to fail the reproducibility test:
- Hardcoded paths:
pd.read_csv("/Users/yourname/Desktop/project/data.csv")- Use relative paths:pd.read_csv("data/dataset.csv") - Missing dependencies: You used
lightgbmbut it is not in your requirements file. - Notebook state pollution: Your notebook only works if cells are run in a specific non-linear order. Always restart kernel and run all before submitting.
- Missing data: You processed the data locally but only submitted the processed version with no way to reproduce the processing.
The Pipeline Test
Evaluators check whether you use proper ML pipelines to prevent data leakage:
# WRONG: Manual preprocessing invites leakage
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Scaling before splitting = DATA LEAKAGE
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Test data statistics leak into training
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=42)
# RIGHT: Pipeline prevents leakage by design
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
# Split FIRST
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)
# Pipeline ensures scaler is fit only on training data
pipeline = Pipeline([
("scaler", StandardScaler()),
("model", GradientBoostingClassifier(random_state=42)),
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
# The scaler was fit ONLY on X_train and applied to X_test - no leakage
Part 4 - Communication: The Hidden Differentiator
Why Communication Outweighs Model Performance
A senior ML engineer at a major tech company shared this perspective: "I have never rejected a candidate for having a model that was 2% worse than the best submission. I have rejected dozens of candidates who could not explain why they made the choices they made."
Communication demonstrates:
- Thinking process: How you approach ambiguity and make decisions
- Collaboration readiness: Can you explain your work to teammates and stakeholders?
- Self-awareness: Do you know the limitations of your approach?
- Senior-level judgment: Do you understand trade-offs, not just techniques?
The Notebook Communication Framework
Every markdown cell in your notebook should follow this pattern:
## Section Title
**What I am doing:** [One sentence describing the action]
**Why:** [One sentence explaining the reasoning]
[Code cell]
**What I found:** [Key takeaway from the results]
**Implication for next steps:** [How this finding informs the next decision]
README Structure That Passes
# [Project Title] - Take-Home Challenge
## Overview
[2-3 sentences: What is the problem? What approach did I take? What were the key results?]
## Setup
```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
How to Run
# Run the full analysis
jupyter notebook analysis.ipynb
# Or run as a script
python src/main.py
Project Structure
project/
├── data/ # Dataset files
├── notebooks/ # Analysis notebooks
│ └── analysis.ipynb # Main analysis (run this)
├── src/ # Source code
│ ├── features.py # Feature engineering
│ ├── models.py # Model training and evaluation
│ └── utils.py # Utility functions
├── results/ # Output figures and results
├── requirements.txt # Python dependencies
└── README.md # This file
Key Findings
- [Finding 1]
- [Finding 2]
- [Finding 3]
Approach
[3-5 sentences summarizing your methodology]
Results
| Model | AUC | Precision | Recall | F1 |
|---|---|---|---|---|
| Baseline (majority) | 0.500 | - | - | - |
| Logistic Regression | 0.764 | 0.42 | 0.68 | 0.52 |
| Gradient Boosting | 0.841 | 0.53 | 0.71 | 0.61 |
Limitations & Next Steps
- [Limitation 1 and what you would do about it]
- [Limitation 2 and what you would do about it]
- [Next step you would take with more time]
Time Spent
Approximately X hours over Y days.
:::tip[60-Second Answer]
"The README is the first thing evaluators see and the last thing most candidates write. A strong README takes 20 minutes to write and immediately signals professionalism. It should contain: a one-paragraph overview, setup instructions (copy-pasteable), project structure, key findings, results table, and limitations. Think of it as the executive summary for a busy reviewer who may never open your notebook."
:::
### Write-Up Quality Spectrum
| Level | Description | Example |
|-------|-------------|---------|
| **No write-up** | Just code with no explanation | (Instant red flag) |
| **Minimal** | "I used random forest and got 85% accuracy" | (Fails - no reasoning) |
| **Adequate** | Describes what was done, some reasoning for choices | (Passes at junior level) |
| **Good** | Clear methodology, explains trade-offs, documents assumptions | (Passes at mid-level) |
| **Excellent** | Narrative thread from problem to solution, honest about limitations, business context | (Passes at senior level) |
## Part 5 - Scoring Criteria Deep Dive
### Criterion 1: Data Handling (10-15% of score)
What evaluators check:
| Check | What They Look For | Red Flag |
|-------|-------------------|----------|
| **Missing data** | Thoughtful imputation strategy | `df.dropna()` without justification |
| **Data types** | Correct handling of categoricals, dates, text | Feeding string columns into a model |
| **Outliers** | Detection and reasoned handling | Ignoring clear outliers or blindly removing them |
| **Data leakage** | Proper temporal splits, no target leakage | Features derived from the target variable |
| **Train/test split** | Appropriate strategy for the data type | Random split on time-series data |
### Criterion 2: Feature Engineering (10-15% of score)
```python
# Examples of feature engineering that impress evaluators
# Domain-aware feature: customer tenure bands
def create_tenure_features(df: pd.DataFrame) -> pd.DataFrame:
"""Create tenure-based features with domain rationale.
Rationale: Customer churn research shows non-linear relationship
with tenure - highest churn in first 6 months (onboarding friction)
and after 24 months (seeking better deals). Creating bins captures
these distinct behavioral segments.
"""
df = df.copy()
df["tenure_band"] = pd.cut(
df["tenure"],
bins=[0, 6, 12, 24, 48, float("inf")],
labels=["new", "settling", "established", "loyal", "veteran"],
)
df["months_since_last_interaction"] = (
pd.to_datetime("today") - pd.to_datetime(df["last_interaction_date"])
).dt.days / 30
# Interaction features that capture customer engagement patterns
df["avg_monthly_spend"] = df["total_charges"] / (df["tenure"] + 1)
df["spend_trend"] = df["last_3mo_charges"] / (df["prev_3mo_charges"] + 1)
return df
Criterion 3: Model Selection & Evaluation (15-20% of score)
The most heavily weighted technical criterion. Evaluators look for:
- Baseline comparison: Did you establish what a naive model achieves?
- Appropriate metrics: Did you choose metrics that match the business problem?
- Proper validation: Cross-validation, not a single train/test split for model selection
- No data leakage: Preprocessing inside the validation loop
- Error analysis: Understanding where the model fails, not just overall performance
| Metric Choice | When Appropriate | When Inappropriate |
|---|---|---|
| Accuracy | Balanced classes, equal error costs | Imbalanced datasets (almost always inappropriate) |
| ROC-AUC | Ranking quality matters, threshold flexibility needed | When class distribution differs between train and production |
| Precision-Recall AUC | Imbalanced datasets, positive class is rare | When negative class errors are equally costly |
| F1 Score | Need a single threshold-dependent metric | When precision/recall trade-off matters (report both) |
| RMSE | Regression, errors in original units | When outlier sensitivity is a concern (use MAE) |
| Log Loss | Probability calibration matters | When only ranking matters (use AUC) |
Criterion 4: Code Quality (10-15% of score)
Evaluators mentally score code quality on these dimensions:
Criterion 5: Communication (15-20% of score)
This is the criterion that most candidates underweight and the one that most often separates "reject" from "advance":
| Communication Element | What Evaluators Want | Common Mistake |
|---|---|---|
| Problem statement | Restate in your own words, note ambiguities | Diving straight into code without framing |
| EDA narrative | Insights, not just plots | Showing 15 plots with no interpretation |
| Decision rationale | Why you chose this model/metric/approach | "I used XGBoost because it is popular" |
| Results presentation | Comparison tables, key metrics highlighted | Printing raw sklearn output |
| Limitations | Honest about what is not ideal | Claiming perfect results or ignoring weaknesses |
| Next steps | What you would do with more time | No mention of future improvements |
Evaluators can tell the difference between genuine next steps ("With more time, I would address the class imbalance using SMOTE and explore temporal features since the data has a time dimension I did not fully leverage") and fake next steps ("With more time, I would use deep learning and deploy the model as an API"). The former shows understanding of your solution's actual weaknesses. The latter shows you are just listing buzzwords.
Part 6 - The Hidden Signals
Positive Signals That Make Evaluators Excited
These are the things that make an evaluator think "I want to work with this person":
- Questioning the data: "I noticed 3% of records have negative transaction amounts. I assumed these are refunds and handled them as follows..."
- Principled metric choice: "Given the business context (churn prediction where false negatives are costly), I optimize for recall at a precision threshold of 0.5."
- Feature engineering rationale: "I created a 'recency' feature because customers who have not interacted in 30+ days show 3x higher churn rate in the EDA."
- Calibration awareness: "The model outputs probabilities of 0.7 that correspond to actual churn rates of ~0.65, suggesting slight overconfidence. With more time, I would apply Platt scaling."
- Honest limitations: "My model performs poorly on customers with < 3 months of history (recall drops to 0.32). This is likely because the features I engineered require sufficient behavioral data."
Negative Signals That Cause Instant Rejection
These signals cause evaluators to reject a submission immediately, regardless of model performance:
-
Data leakage: Scaling, encoding, or feature selection on the full dataset before splitting. This inflates all your metrics and tells the evaluator you do not understand ML fundamentals.
-
Copy-paste code: Obviously copied from Stack Overflow or tutorials with variable names like
df_exampleor comments that reference a different dataset. Evaluators have seen these patterns thousands of times. -
No evaluation: Reporting only training metrics, or using accuracy on an imbalanced dataset without acknowledging the issue.
-
Broken notebook: Code cells that produce errors, outputs from a different run than the code shown, or cells that only work in a specific execution order.
-
Ignoring the prompt: Building a regression model when classification was asked for, or answering questions that were not in the prompt while skipping ones that were.
The "30-Second Test"
Before submitting, perform this test. Open your submission as if you are the evaluator seeing it for the first time:
THIRTY_SECOND_TEST = {
"Second 0-5": "Is there a README? Can I understand the project in one paragraph?",
"Second 5-10": "Is the notebook named clearly? Is there a logical structure?",
"Second 10-15": "Does the first cell explain what this project does?",
"Second 15-20": "Can I see section headers? Is there a clear flow?",
"Second 20-25": "Are there visualizations? Do they have titles and labels?",
"Second 25-30": "Is there a conclusion/results section? Can I find the final answer?",
}
# If any answer is "no", fix it before submitting.
Part 7 - Practice Exercises
Exercise 1: Score a Mock Submission
Read the following mock submission details and assign scores (1-5) for each criterion:
Task: Predict customer churn from a 50,000-row dataset with 15 features.
What the candidate submitted:
- README: "Run analysis.ipynb" (one line)
- Notebook: 35 cells, 5 markdown cells, 30 code cells
- EDA: Three visualizations (histogram of target, correlation heatmap, scatter plot)
- Preprocessing: StandardScaler applied to all features, missing values filled with mean
- Models: Logistic regression and XGBoost, selected by test set accuracy
- Results: XGBoost accuracy 91%, logistic regression accuracy 88%
- Target distribution: 8% churn (imbalanced)
- Write-up: Two paragraphs at the bottom of the notebook
- No requirements.txt, no random seeds
Your scores:
| Criterion | Your Score (1-5) | Why? |
|---|---|---|
| Problem Understanding | ||
| Data Exploration | ||
| Data Handling | ||
| Feature Engineering | ||
| Model Selection & Evaluation | ||
| Code Quality | ||
| Communication |
Suggested Scoring
| Criterion | Score | Rationale |
|---|---|---|
| Problem Understanding | 3 | Identified the task correctly, but did not discuss class imbalance |
| Data Exploration | 2 | Basic plots but no insights documented, no target analysis by features |
| Data Handling | 2 | Mean imputation without justification, no mention of potential leakage in scaling |
| Feature Engineering | 1 | No features engineered, raw features only |
| Model Selection & Evaluation | 2 | Two models compared, but used accuracy on imbalanced data, selected on test set |
| Code Quality | 2 | Some structure but no seeds, no requirements, minimal documentation |
| Communication | 2 | Brief write-up exists but lacks reasoning and methodology explanation |
| Overall | Likely Reject | The 91% accuracy is misleading (majority class is 92%), suggesting the model is barely better than always predicting no churn |
Exercise 2: Improve the Mock Submission
Take the mock submission above and write a plan to improve it. For each criterion, list specific actions you would take. Compare your plan against the guidance in this chapter.
Exercise 3: Build Your Own Rubric
Design an evaluation rubric for a take-home project in your target role. Adjust the weights based on what you know about the company and role. Practice scoring your own past work against this rubric.
Interview Cheat Sheet
| Question | Key Points |
|---|---|
| "How do you approach a take-home project?" | Read prompt carefully, baseline first, proper evaluation, invest in communication |
| "What metrics would you use for this problem?" | Depends on class balance, error costs, and business context - never just accuracy |
| "Why did you choose this model?" | Started simple, compared multiple approaches, selected based on validation performance |
| "What would you do differently with more time?" | Specific improvements tied to observed weaknesses, not generic buzzwords |
| "How do you ensure reproducibility?" | Random seeds, requirements file, relative paths, clear run instructions |
| "What is data leakage and how do you prevent it?" | Information from test/future leaking into training; use pipelines, split before preprocessing |
| "How do you handle imbalanced data?" | Appropriate metrics (precision-recall, F1), stratified splits, possibly resampling with justification |
| "What makes a good take-home submission?" | Clean code, sound methodology, honest communication, answers the actual prompt |
Next Steps
Now that you understand what evaluators look for, you need reusable templates to structure your work efficiently. The next chapter, Project Templates, provides complete starter templates for every common take-home format - classification, NLP, time series, recommendation systems, computer vision, and LLM/RAG tasks.
