What Evaluators Actually Want - Inside the Scoring Room

Reading time: ~30 min | Interview relevance: Critical | Roles: MLE, Data Scientist, Applied Scientist, AI Engineer, Research Engineer

The Real Interview Moment

You submit your take-home project at 11:47 PM on a Sunday night - 13 minutes before the deadline. You spent 14 hours across the weekend, tried six different models, squeezed out a 0.89 AUC, and included three pages of hyperparameter tuning results. You feel confident.

Two weeks later, the rejection email arrives: "Thank you for your time. After careful review, we have decided to move forward with other candidates."

You are devastated. You know your model was strong. What happened?

What happened is this: the evaluator - a senior ML engineer who reviews 15 take-home submissions per hiring cycle - opened your notebook, saw no README, scrolled through 47 code cells with no markdown explanations, noticed you used train_test_split after feature scaling (data leakage), and closed the notebook within 8 minutes. Your 0.89 AUC was irrelevant because the evaluator never trusted the number.

Meanwhile, the candidate who advanced submitted a 0.84 AUC model with a clean README, well-documented EDA, a proper pipeline that prevented leakage, and a two-page write-up explaining their feature engineering rationale and what they would try with more time. That candidate got an offer.

This chapter reveals exactly what happens on the other side of the submission - how evaluators think, what they score, and what actually determines whether your take-home leads to the next round.

What You Will Master

  • Real evaluation rubrics used at different company types
  • The relative weight of code quality vs. modeling quality
  • What "production readiness" means in a take-home context
  • How communication in notebooks and READMEs is scored
  • Common scoring criteria across the industry
  • The signals that cause instant pass or instant rejection

Self-Assessment: Where Are You Now?

| Level | Description | Target |
|-------|-------------|--------|
| Beginner | "I assumed model accuracy was the main criterion" | Read all parts - your mental model needs recalibrating |
| Intermediate | "I know code quality matters, but I am unsure how it is weighted" | Focus on Parts 1-3 for the rubric details |
| Advanced | "I write clean code and document well, but I have still been rejected" | Jump to Part 6 (hidden signals) and the practice exercises in Part 7 |

Part 1 - Real Evaluation Rubrics

How Evaluators Actually Review Submissions

Before we discuss criteria, understand the mechanics of review:

  1. First pass (2-3 minutes): The evaluator opens the README (if there is one), skims the notebook structure, and forms a first impression. If there is no README, no markdown, and the code is wall-to-wall, the impression is already negative.

  2. Second pass (5-10 minutes): They look at the data handling, check for leakage, review the evaluation approach, and read any write-up. This is where most submissions are rejected.

  3. Deep review (15-30 minutes): Only the top 20-30% of submissions reach this stage. The evaluator reads the code carefully, checks methodology, and forms their recommendation.

[Figure: How evaluators review submissions - three-pass flow with decision points]

60-Second Answer

"Evaluators spend 10-15 minutes on most submissions, with only the top 20-30% receiving a deep review. The first 2-3 minutes are decisive - README quality, notebook structure, and first impression determine whether the evaluator approaches your work positively or is already looking for reasons to reject. This means your submission must communicate competence before the evaluator reads a single line of code."

The Standard Rubric (Composite from Multiple Companies)

After synthesizing rubrics from tech companies, startups, and financial firms, here is the composite scoring framework that most evaluators use (explicitly or implicitly):

| Criterion | Weight | Score 1 (Fail) | Score 3 (Pass) | Score 5 (Strong Pass) |
|-----------|--------|----------------|----------------|-----------------------|
| Problem Understanding | 10% | Misinterprets the task or ignores requirements | Correctly identifies the task and addresses all requirements | Identifies ambiguities, states assumptions, goes beyond the literal prompt |
| Data Exploration | 15% | No EDA or only `df.describe()` | Basic visualizations and summary statistics | Insightful EDA that reveals patterns, informs feature engineering, documents findings |
| Data Handling | 10% | Ignores missing values, no preprocessing | Handles missing data, encodes categoricals, scales features | Thoughtful imputation, handles edge cases, documents decisions |
| Feature Engineering | 15% | Uses raw features only | Creates some derived features with rationale | Creative, domain-aware features with clear justification |
| Model Selection & Evaluation | 20% | Single model, no baseline, wrong metric | Multiple models compared, proper train/test split, appropriate metrics | Baseline comparison, cross-validation, multiple metrics, error analysis |
| Code Quality | 15% | Spaghetti code, no functions, unreproducible | Organized notebook, some functions, runs end-to-end | Modular code, type hints, tests, requirements file, clean abstractions |
| Communication | 15% | No write-up, no markdown, no README | Brief write-up, some markdown in notebook | Clear narrative, executive summary, honest about limitations, next steps |
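
To make the weights concrete, here is a minimal sketch of how a composite score could be computed from the rubric. The weights come from the table above; the function name `weighted_rubric_score` and the two example submissions are hypothetical, and real evaluators rarely compute this so mechanically.

```python
from typing import Dict

# Weights copied from the composite rubric table (they sum to 1.0).
RUBRIC_WEIGHTS: Dict[str, float] = {
    "Problem Understanding": 0.10,
    "Data Exploration": 0.15,
    "Data Handling": 0.10,
    "Feature Engineering": 0.15,
    "Model Selection & Evaluation": 0.20,
    "Code Quality": 0.15,
    "Communication": 0.15,
}


def weighted_rubric_score(scores: Dict[str, int]) -> float:
    """Return the weight-averaged score on the rubric's 1-5 scale."""
    missing = set(RUBRIC_WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"Missing scores for: {sorted(missing)}")
    for criterion in RUBRIC_WEIGHTS:
        if not 1 <= scores[criterion] <= 5:
            raise ValueError(f"{criterion} score must be between 1 and 5")
    return sum(RUBRIC_WEIGHTS[c] * scores[c] for c in RUBRIC_WEIGHTS)


# Hypothetical submission A: excellent model, weak code and write-up.
strong_model_only = {
    "Problem Understanding": 3,
    "Data Exploration": 2,
    "Data Handling": 3,
    "Feature Engineering": 3,
    "Model Selection & Evaluation": 5,
    "Code Quality": 2,
    "Communication": 1,
}

# Hypothetical submission B: adequate model, strong everything else.
well_rounded = {
    "Problem Understanding": 4,
    "Data Exploration": 4,
    "Data Handling": 4,
    "Feature Engineering": 3,
    "Model Selection & Evaluation": 3,
    "Code Quality": 4,
    "Communication": 5,
}

for name, scores in [("A (model only)", strong_model_only), ("B (well rounded)", well_rounded)]:
    print(f"Submission {name}: {weighted_rubric_score(scores):.2f} / 5")
```

Under these made-up scores, submission A lands at 2.80 and submission B at 3.80, even though A has the top score on the most heavily weighted criterion.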

Company-Specific Rubric Variations

Different companies weight these criteria differently based on what they value:

Company Variation

Google / DeepMind: Heavy emphasis on statistical rigor and evaluation methodology. They will check if your cross-validation strategy is sound and whether your metric choice matches the problem. Less emphasis on polished presentation.

Startups (Series A-C): Heavy emphasis on code quality and production readiness. They need people who can ship, not just analyze. A Dockerfile, proper error handling, and a Makefile will set you apart.

Meta / Applied ML: Balanced across all criteria but with extra weight on feature engineering. They want to see you think creatively about the data.

Finance (Two Sigma, Citadel): Extreme emphasis on proper evaluation and absence of leakage. Statistical rigor is non-negotiable. They also expect awareness of temporal dependencies.

Healthcare (Flatiron, Tempus): Extra weight on interpretability and communication. Can you explain your model's decisions to a clinician? Do you discuss false negative vs. false positive trade-offs in clinical context?

Part 2 - Code Quality vs. Modeling Quality

The Great Misconception

Most candidates believe this is the weighting:

Modeling Quality: 70%
Code Quality: 15%
Communication: 15%

The actual weighting is closer to:

Modeling Quality: 30-35%
Code Quality: 30-35%
Communication: 30-35%

This shocks candidates who come from academic backgrounds where the model result is everything. In industry, a model that works but cannot be understood, maintained, or reproduced is worthless.

What "Modeling Quality" Actually Means

Modeling quality is not about having the highest metric. It is about demonstrating sound methodology:

| Signal | What It Shows | How to Demonstrate |
|--------|---------------|--------------------|
| Baseline first | You understand relative improvement | Start with a dummy classifier or simple logistic regression |
| Appropriate model for data size | You will not overfit 500 rows with a neural net | Match model complexity to dataset size |
| Proper evaluation | You understand generalization | Cross-validation, held-out test set, appropriate metrics |
| Hyperparameter discipline | You do not randomly guess | Grid/random search with CV, document what you tried |
| Error analysis | You understand your model's weaknesses | Confusion matrix, per-class metrics, failure case examination |
| Honest reporting | You do not cherry-pick results | Report all models tried, including ones that did not work |

# WRONG: This is what most candidates do
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

model = XGBClassifier(n_estimators=1000, max_depth=10)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
# "I used XGBoost because it usually works best"

# RIGHT: This is what evaluators want to see
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (
    classification_report,
    roc_auc_score,
    average_precision_score,
    confusion_matrix,
)
from sklearn.model_selection import cross_val_score

# Step 1: Establish baseline
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_train, y_train, cv=5, scoring="roc_auc")
print(f"Baseline AUC (majority class): {baseline_scores.mean():.4f} +/- {baseline_scores.std():.4f}")

# Step 2: Simple model first
lr = LogisticRegression(max_iter=1000, random_state=42)
lr_scores = cross_val_score(lr, X_train, y_train, cv=5, scoring="roc_auc")
print(f"Logistic Regression AUC: {lr_scores.mean():.4f} +/- {lr_scores.std():.4f}")

# Step 3: More complex models only if justified
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

results = {"Baseline": baseline_scores.mean(), "Logistic Regression": lr_scores.mean()}

for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
    results[name] = scores.mean()
    print(f"{name} AUC: {scores.mean():.4f} +/- {scores.std():.4f}")

# Step 4: Final evaluation on held-out test set (ONE TIME ONLY)
best_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:, 1]

print("\n--- Final Test Set Evaluation ---")
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
print(f"Average Precision: {average_precision_score(y_test, y_prob):.4f}")
print(f"\nClassification Report:\n{classification_report(y_test, y_pred)}")
print(f"\nConfusion Matrix:\n{confusion_matrix(y_test, y_pred)}")
Common Trap

"I used XGBoost because it usually wins Kaggle competitions" is a red flag to evaluators. It signals that you do not think about why a model is appropriate for a given problem. Instead, say: "I started with logistic regression as a baseline (AUC 0.76), then tried random forest (AUC 0.81) and gradient boosting (AUC 0.84). The gradient boosting model showed meaningful improvement on the validation set, so I selected it for final evaluation. With more time, I would explore feature interactions that might help the linear model close the gap."

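The "hyperparameter discipline" and "honest reporting" signals can be shown with very little code. The sketch below is one possible way to do it, assuming a synthetic dataset from `make_classification` and a deliberately small, arbitrary grid; the point is the table of everything that was tried, not the particular values.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption; use the take-home dataset instead).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, weights=[0.8, 0.2], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# A small grid chosen for illustration only.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

# The search uses cross-validation on the training split only;
# the test split stays untouched until the final evaluation.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=5,
)
search.fit(X_train, y_train)

# Report every configuration tried, not just the winner.
tried = (
    pd.DataFrame(search.cv_results_)[
        ["params", "mean_test_score", "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(tried.to_string(index=False))
print(f"Best params: {search.best_params_} (CV AUC {search.best_score_:.4f})")
```

Pasting the `tried` table into the write-up gives the evaluator direct evidence for the model-choice explanation, including the configurations that did not help.
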
What "Code Quality" Actually Means

Code quality in a take-home is not about following every PEP 8 rule. It is about demonstrating that you write code that other people can understand and maintain.

| Quality Signal | Failing Example | Passing Example |
|----------------|-----------------|-----------------|
| Naming | `df2`, `X_new`, `temp` | `df_customers`, `X_train_scaled`, `features_engineered` |
| Functions | 200-line notebook cells | Reusable functions with docstrings |
| Comments | No comments or `# increment i` | Comments explaining why, not what |
| Organization | Random cell order, dead code | Clear sections with markdown headers |
| Reproducibility | No seeds, no requirements | `random_state=42` everywhere, `requirements.txt` |
| Error handling | `try: except: pass` or none | Meaningful error messages, input validation |

# FAILING code quality
df = pd.read_csv('data.csv')
df = df.dropna()
df['f1'] = df['a'] / df['b']
X = df.drop('target', axis=1)
y = df['target']
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X) # LEAKAGE: scaling before split!
X_train, X_test, y_train, y_test = train_test_split(X, y)
# ... 300 more lines like this

# PASSING code quality
import pandas as pd


def load_and_validate_data(filepath: str) -> pd.DataFrame:
    """Load dataset and perform basic validation checks.

    Args:
        filepath: Path to the CSV data file.

    Returns:
        Validated DataFrame with basic type checks applied.

    Raises:
        FileNotFoundError: If the data file does not exist.
        ValueError: If required columns are missing.
    """
    REQUIRED_COLUMNS = ["customer_id", "tenure", "monthly_charges", "churn"]

    df = pd.read_csv(filepath)  # Raises FileNotFoundError if the file is missing

    missing_cols = set(REQUIRED_COLUMNS) - set(df.columns)
    if missing_cols:
        raise ValueError(f"Missing required columns: {missing_cols}")

    print(f"Loaded {len(df)} records with {len(df.columns)} columns")
    print(f"Missing values: {df.isnull().sum().sum()} total")

    return df


def create_features(df: pd.DataFrame) -> pd.DataFrame:
    """Engineer features from raw customer data.

    Feature engineering rationale:
    - charge_per_tenure: Captures customer value trajectory
    - is_new_customer: Tenure < 6 months correlates with higher churn
    - high_monthly_charge: Identifies price-sensitive segment
    """
    df = df.copy()  # Avoid modifying the original DataFrame

    df["charge_per_tenure"] = df["monthly_charges"] / (df["tenure"] + 1)  # +1 to avoid division by zero
    df["is_new_customer"] = (df["tenure"] < 6).astype(int)
    df["high_monthly_charge"] = (df["monthly_charges"] > df["monthly_charges"].median()).astype(int)

    return df

Part 3 - The Production Readiness Signal

What Evaluators Mean by "Production Readiness"

When evaluators say they want "production-quality code," they do not expect a deployed microservice. They want evidence that you think about production concerns:

[Figure: Production readiness signals - reproducibility, robustness, maintainability, scalability awareness]

The Reproducibility Test

Many evaluators have a simple test: they clone your repository and try to run your code. If it does not work, you fail. Period.

# The reproducibility checklist evaluators mentally run:

reproducibility_checks = {
    "Can I install dependencies?": [
        "requirements.txt or environment.yml exists",
        "Python version specified",
        "All imports resolve after installing",
    ],
    "Can I run the code?": [
        "README has clear run instructions",
        "No hardcoded absolute paths (/Users/john/data/...)",
        "Data files included or download instructions provided",
        "Single command to execute (python main.py or Run All in notebook)",
    ],
    "Do I get the same results?": [
        "Random seeds set (numpy, sklearn, torch, python hashseed)",
        "No non-deterministic operations without seeds",
        "Results match what is reported in the write-up",
    ],
}
Instant Rejection

Here are the fastest ways to fail the reproducibility test:

  1. Hardcoded paths: pd.read_csv("/Users/yourname/Desktop/project/data.csv") - Use relative paths: pd.read_csv("data/dataset.csv")
  2. Missing dependencies: You used lightgbm but it is not in your requirements file.
  3. Notebook state pollution: Your notebook only works if cells are run in a specific non-linear order. Always restart kernel and run all before submitting.
  4. Missing data: You processed the data locally but only submitted the processed version with no way to reproduce the processing.
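
Two of the failures above (hardcoded paths and unseeded randomness) can be guarded against with a few lines at the top of a project. This is a minimal sketch, assuming only NumPy and the standard library; the helper names `set_global_seeds` and `data_path` are hypothetical, and frameworks such as PyTorch need their own seeding calls in addition.

```python
import os
import random
from pathlib import Path
from typing import Optional

import numpy as np

SEED = 42


def set_global_seeds(seed: int = SEED) -> None:
    """Seed the random number generators this project relies on."""
    random.seed(seed)
    np.random.seed(seed)
    # PYTHONHASHSEED only affects interpreters started after it is set
    # (for example subprocesses), not the process that is already running.
    os.environ["PYTHONHASHSEED"] = str(seed)


def data_path(relative: str, root: Optional[Path] = None) -> Path:
    """Build a path under the project root and refuse absolute paths."""
    rel = Path(relative)
    if rel.is_absolute():
        raise ValueError(f"Use a path relative to the project root, got: {relative}")
    return (root if root is not None else Path.cwd()) / rel


set_global_seeds()
first_draw = np.random.rand(3)

set_global_seeds()
second_draw = np.random.rand(3)

# Re-seeding reproduces the same "random" numbers.
print("Draws identical after re-seeding:", bool((first_draw == second_draw).all()))
print("Dataset location:", data_path("data/dataset.csv", root=Path("project")))
```

Estimators and splitters in scikit-learn still need `random_state` passed explicitly, as in the other examples in this chapter; a global seed alone does not make them deterministic across different call orders.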

The Pipeline Test

Evaluators check whether you use proper ML pipelines to prevent data leakage:

# WRONG: Manual preprocessing invites leakage
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Scaling before splitting = DATA LEAKAGE
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Test data statistics leak into training
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=42)

# RIGHT: Pipeline prevents leakage by design
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Split FIRST
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)

# Pipeline ensures scaler is fit only on training data
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingClassifier(random_state=42)),
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
# The scaler was fit ONLY on X_train and applied to X_test - no leakage
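
The same idea extends to cross-validation and to mixed column types. The following sketch assumes a small synthetic churn-like table (the column names and the way the target is generated are invented for illustration) and puts imputation, scaling, and encoding inside the pipeline so that each fold fits its preprocessing on that fold's training portion only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data (an assumption, not a real dataset).
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "tenure": rng.integers(0, 72, size=n).astype(float),
    "monthly_charges": rng.normal(70, 20, size=n),
    "contract": rng.choice(["monthly", "annual"], size=n),
})
df.loc[rng.choice(n, size=25, replace=False), "monthly_charges"] = np.nan

# Invented target: short tenure and monthly contracts churn more often.
logit = -0.05 * df["tenure"] + 1.0 * (df["contract"] == "monthly")
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

numeric_cols = ["tenure", "monthly_charges"]
categorical_cols = ["contract"]

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# Every fold refits the imputer, scaler, and encoder on its own training rows.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, df, y, cv=cv, scoring="roc_auc")
print(f"CV AUC: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Passing the whole pipeline to `cross_val_score` is what keeps the validation rows out of the imputation medians and scaling statistics; preprocessing the full table first and then cross-validating only the model would reintroduce the leak.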

Part 4 - Communication: The Hidden Differentiator

Why Communication Outweighs Model Performance

A senior ML engineer at a major tech company shared this perspective: "I have never rejected a candidate for having a model that was 2% worse than the best submission. I have rejected dozens of candidates who could not explain why they made the choices they made."

Communication demonstrates:

  • Thinking process: How you approach ambiguity and make decisions
  • Collaboration readiness: Can you explain your work to teammates and stakeholders?
  • Self-awareness: Do you know the limitations of your approach?
  • Senior-level judgment: Do you understand trade-offs, not just techniques?

The Notebook Communication Framework

Every markdown cell in your notebook should follow this pattern:

## Section Title

**What I am doing:** [One sentence describing the action]

**Why:** [One sentence explaining the reasoning]

[Code cell]

**What I found:** [Key takeaway from the results]

**Implication for next steps:** [How this finding informs the next decision]

README Structure That Passes

````markdown
# [Project Title] - Take-Home Challenge

## Overview
[2-3 sentences: What is the problem? What approach did I take? What were the key results?]

## Setup
```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

## How to Run
```bash
# Run the full analysis
jupyter notebook notebooks/analysis.ipynb

# Or run as a script
python src/main.py
```

## Project Structure
```
project/
├── data/                   # Dataset files
├── notebooks/              # Analysis notebooks
│   └── analysis.ipynb      # Main analysis (run this)
├── src/                    # Source code
│   ├── main.py             # Script entry point
│   ├── features.py         # Feature engineering
│   ├── models.py           # Model training and evaluation
│   └── utils.py            # Utility functions
├── results/                # Output figures and results
├── requirements.txt        # Python dependencies
└── README.md               # This file
```

## Key Findings
- [Finding 1]
- [Finding 2]
- [Finding 3]

## Approach
[3-5 sentences summarizing your methodology]

## Results
| Model | AUC | Precision | Recall | F1 |
|-------|-----|-----------|--------|----|
| Baseline (majority) | 0.500 | - | - | - |
| Logistic Regression | 0.764 | 0.42 | 0.68 | 0.52 |
| Gradient Boosting | 0.841 | 0.53 | 0.71 | 0.61 |

## Limitations & Next Steps
- [Limitation 1 and what you would do about it]
- [Limitation 2 and what you would do about it]
- [Next step you would take with more time]

## Time Spent
Approximately X hours over Y days.
````


60-Second Answer

"The README is the first thing evaluators see and the last thing most candidates write. A strong README takes 20 minutes to write and immediately signals professionalism. It should contain: a one-paragraph overview, setup instructions (copy-pasteable), project structure, key findings, results table, and limitations. Think of it as the executive summary for a busy reviewer who may never open your notebook."

Write-Up Quality Spectrum

| Level | Description | Evaluator verdict |
|-------|-------------|---------|
| **No write-up** | Just code with no explanation | (Instant red flag) |
| **Minimal** | "I used random forest and got 85% accuracy" | (Fails - no reasoning) |
| **Adequate** | Describes what was done, some reasoning for choices | (Passes at junior level) |
| **Good** | Clear methodology, explains trade-offs, documents assumptions | (Passes at mid-level) |
| **Excellent** | Narrative thread from problem to solution, honest about limitations, business context | (Passes at senior level) |


Part 5 - Scoring Criteria Deep Dive

Criterion 1: Data Handling (10-15% of score)

What evaluators check:

| Check | What They Look For | Red Flag |
|-------|-------------------|----------|
| **Missing data** | Thoughtful imputation strategy | `df.dropna()` without justification |
| **Data types** | Correct handling of categoricals, dates, text | Feeding string columns into a model |
| **Outliers** | Detection and reasoned handling | Ignoring clear outliers or blindly removing them |
| **Data leakage** | Proper temporal splits, no target leakage | Features derived from the target variable |
| **Train/test split** | Appropriate strategy for the data type | Random split on time-series data |
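
For the last row of the table, the alternative to a random split on time-ordered data is a split that always validates on rows that come after the training rows. A minimal sketch using scikit-learn's `TimeSeriesSplit`, assuming twelve placeholder rows that are already sorted by time:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve placeholder observations, assumed to be sorted oldest to newest.
n_samples = 12
X = np.arange(n_samples).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    # Every training index precedes every validation index, so the model
    # is never trained on the future and evaluated on the past.
    assert train_idx.max() < test_idx.min()
    print(f"Fold {fold}: train rows {train_idx.min()}-{train_idx.max()}, "
          f"validate rows {test_idx.min()}-{test_idx.max()}")
```

The training window grows with each fold while the validation window moves forward, which mirrors how the model would be retrained and used over time.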

Criterion 2: Feature Engineering (10-15% of score)

```python
# Examples of feature engineering that impress evaluators
from typing import Optional

import pandas as pd


# Domain-aware feature: customer tenure bands
def create_tenure_features(
    df: pd.DataFrame, reference_date: Optional[str] = None
) -> pd.DataFrame:
    """Create tenure-based features with domain rationale.

    Rationale: Customer churn research shows non-linear relationship
    with tenure - highest churn in first 6 months (onboarding friction)
    and after 24 months (seeking better deals). Creating bins captures
    these distinct behavioral segments.
    """
    df = df.copy()

    # right=False gives bands [0, 6), [6, 12), ... so tenure 0 falls in
    # "new" instead of becoming NaN.
    df["tenure_band"] = pd.cut(
        df["tenure"],
        bins=[0, 6, 12, 24, 48, float("inf")],
        labels=["new", "settling", "established", "loyal", "veteran"],
        right=False,
    )

    # Pass a fixed reference_date (for example the data snapshot date) so
    # results are reproducible; falling back to today changes the feature
    # every time the code is re-run.
    ref = pd.Timestamp.today().normalize() if reference_date is None else pd.Timestamp(reference_date)
    df["months_since_last_interaction"] = (
        ref - pd.to_datetime(df["last_interaction_date"])
    ).dt.days / 30

    # Interaction features that capture customer engagement patterns
    df["avg_monthly_spend"] = df["total_charges"] / (df["tenure"] + 1)
    df["spend_trend"] = df["last_3mo_charges"] / (df["prev_3mo_charges"] + 1)

    return df
```

Criterion 3: Model Selection & Evaluation (15-20% of score)

The most heavily weighted technical criterion. Evaluators look for:

  1. Baseline comparison: Did you establish what a naive model achieves?
  2. Appropriate metrics: Did you choose metrics that match the business problem?
  3. Proper validation: Cross-validation, not a single train/test split for model selection
  4. No data leakage: Preprocessing inside the validation loop
  5. Error analysis: Understanding where the model fails, not just overall performance

| Metric Choice | When Appropriate | When Inappropriate |
|---------------|------------------|--------------------|
| Accuracy | Balanced classes, equal error costs | Imbalanced datasets (almost always inappropriate) |
| ROC-AUC | Ranking quality matters, threshold flexibility needed | When class distribution differs between train and production |
| Precision-Recall AUC | Imbalanced datasets, positive class is rare | When negative class errors are equally costly |
| F1 Score | Need a single threshold-dependent metric | When precision/recall trade-off matters (report both) |
| RMSE | Regression, errors in original units | When outlier sensitivity is a concern (use MAE) |
| Log Loss | Probability calibration matters | When only ranking matters (use AUC) |
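
The first rows of the table are easiest to appreciate by scoring a do-nothing model next to a real one. The sketch below assumes a synthetic dataset with roughly 8% positives and reports three metrics per model; the exact numbers will differ on real data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data (an assumption): about 8% positive class.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=4, weights=[0.92, 0.08], random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "roc_auc", "average_precision"]

models = {
    "Always majority": DummyClassifier(strategy="most_frequent"),
    "Logistic regression": LogisticRegression(max_iter=1000),
}

summary = {}
for name, model in models.items():
    cv_results = cross_validate(model, X, y, cv=cv, scoring=scoring)
    summary[name] = {metric: cv_results[f"test_{metric}"].mean() for metric in scoring}

for name, metrics in summary.items():
    formatted = ", ".join(f"{metric}={value:.3f}" for metric, value in metrics.items())
    print(f"{name}: {formatted}")
```

The majority-class model earns a high accuracy while its ROC-AUC sits at 0.5 and its average precision stays near the positive rate, which is why accuracy alone says little on data like this.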

Criterion 4: Code Quality (10-15% of score)

Evaluators mentally score code quality on these dimensions:

[Figure: Code quality scoring dimensions - readability, structure, reproducibility, robustness, with point values]

Criterion 5: Communication (15-20% of score)

This is the criterion that most candidates underweight and the one that most often separates "reject" from "advance":

| Communication Element | What Evaluators Want | Common Mistake |
|-----------------------|----------------------|----------------|
| Problem statement | Restate in your own words, note ambiguities | Diving straight into code without framing |
| EDA narrative | Insights, not just plots | Showing 15 plots with no interpretation |
| Decision rationale | Why you chose this model/metric/approach | "I used XGBoost because it is popular" |
| Results presentation | Comparison tables, key metrics highlighted | Printing raw sklearn output |
| Limitations | Honest about what is not ideal | Claiming perfect results or ignoring weaknesses |
| Next steps | What you would do with more time | No mention of future improvements |

Common Trap

Evaluators can tell the difference between genuine next steps ("With more time, I would address the class imbalance using SMOTE and explore temporal features since the data has a time dimension I did not fully leverage") and fake next steps ("With more time, I would use deep learning and deploy the model as an API"). The former shows understanding of your solution's actual weaknesses. The latter shows you are just listing buzzwords.

Part 6 - The Hidden Signals

Positive Signals That Make Evaluators Excited

These are the things that make an evaluator think "I want to work with this person":

  1. Questioning the data: "I noticed 3% of records have negative transaction amounts. I assumed these are refunds and handled them as follows..."
  2. Principled metric choice: "Given the business context (churn prediction where false negatives are costly), I maximize recall subject to a precision of at least 0.5."
  3. Feature engineering rationale: "I created a 'recency' feature because customers who have not interacted in 30+ days show 3x higher churn rate in the EDA."
  4. Calibration awareness: "The model outputs probabilities of 0.7 that correspond to actual churn rates of ~0.65, suggesting slight overconfidence. With more time, I would apply Platt scaling."
  5. Honest limitations: "My model performs poorly on customers with < 3 months of history (recall drops to 0.32). This is likely because the features I engineered require sufficient behavioral data."
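
The calibration signal in item 4 can be backed by a short check. This sketch assumes synthetic data and uses `GaussianNB` purely as a stand-in for a model whose probabilities may be poorly calibrated; `method="sigmoid"` in `CalibratedClassifierCV` is scikit-learn's implementation of Platt scaling.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data with redundant features (an assumption for illustration).
X, y = make_classification(
    n_samples=3000, n_features=20, n_informative=4, n_redundant=10, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

candidates = {
    "Uncalibrated": GaussianNB(),
    # Platt scaling fitted with internal 5-fold cross-validation.
    "Platt-scaled": CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5),
}

report = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    y_prob = model.predict_proba(X_test)[:, 1]
    # prob_pred: mean predicted probability per bin; prob_true: observed rate.
    prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)
    report[name] = {
        "brier": brier_score_loss(y_test, y_prob),
        "prob_true": prob_true,
        "prob_pred": prob_pred,
    }

for name, stats in report.items():
    print(f"{name}: Brier score {stats['brier']:.4f}")
    for predicted, observed in zip(stats["prob_pred"], stats["prob_true"]):
        print(f"  predicted {predicted:.2f} -> observed {observed:.2f}")
```

A bin where the predicted value sits well above the observed rate is the kind of evidence behind a statement like the one quoted in item 4; whether calibration actually helps should be read from the output rather than assumed.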

Negative Signals That Cause Instant Rejection

Instant Rejection

These signals cause evaluators to reject a submission immediately, regardless of model performance:

  1. Data leakage: Scaling, encoding, or feature selection on the full dataset before splitting. This inflates all your metrics and tells the evaluator you do not understand ML fundamentals.

  2. Copy-paste code: Obviously copied from Stack Overflow or tutorials with variable names like df_example or comments that reference a different dataset. Evaluators have seen these patterns thousands of times.

  3. No evaluation: Reporting only training metrics, or using accuracy on an imbalanced dataset without acknowledging the issue.

  4. Broken notebook: Code cells that produce errors, outputs from a different run than the code shown, or cells that only work in a specific execution order.

  5. Ignoring the prompt: Building a regression model when classification was asked for, or answering questions that were not in the prompt while skipping ones that were.

The "30-Second Test"

Before submitting, perform this test. Open your submission as if you are the evaluator seeing it for the first time:

THIRTY_SECOND_TEST = {
    "Second 0-5": "Is there a README? Can I understand the project in one paragraph?",
    "Second 5-10": "Is the notebook named clearly? Is there a logical structure?",
    "Second 10-15": "Does the first cell explain what this project does?",
    "Second 15-20": "Can I see section headers? Is there a clear flow?",
    "Second 20-25": "Are there visualizations? Do they have titles and labels?",
    "Second 25-30": "Is there a conclusion/results section? Can I find the final answer?",
}
# If any answer is "no", fix it before submitting.
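
Part of this test can be automated, because a `.ipynb` file is plain JSON with a list of cells. The sketch below is a rough pre-submission check, not a substitute for reading the notebook yourself; the function name `summarize_notebook` and the tiny in-memory notebook are hypothetical.

```python
import json
from typing import Any, Dict


def summarize_notebook(notebook: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize structure signals from a parsed .ipynb (nbformat 4) dict."""
    cells = notebook.get("cells", [])
    code_cells = [c for c in cells if c.get("cell_type") == "code"]
    markdown_cells = [c for c in cells if c.get("cell_type") == "markdown"]

    # Saved outputs of type "error" mean a cell failed in the submitted run.
    n_error_cells = sum(
        any(out.get("output_type") == "error" for out in c.get("outputs", []))
        for c in code_cells
    )

    # After "restart kernel and run all", execution counts increase top to bottom.
    counts = [c["execution_count"] for c in code_cells if c.get("execution_count") is not None]

    return {
        "n_markdown": len(markdown_cells),
        "n_code": len(code_cells),
        "first_cell_is_markdown": bool(cells) and cells[0].get("cell_type") == "markdown",
        "n_error_cells": n_error_cells,
        "ran_in_order": counts == sorted(counts),
    }


# Tiny in-memory notebook used only to demonstrate the function.
demo_notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Churn analysis"]},
        {"cell_type": "code", "execution_count": 1, "outputs": [], "source": []},
        {"cell_type": "code", "execution_count": 3,
         "outputs": [{"output_type": "error", "ename": "KeyError"}], "source": []},
        {"cell_type": "code", "execution_count": 2, "outputs": [], "source": []},
    ]
}

print(summarize_notebook(demo_notebook))

# For a real file, something like:
#     with open("notebooks/analysis.ipynb", encoding="utf-8") as f:
#         print(summarize_notebook(json.load(f)))
```

In the demo, the summary flags one cell with a saved error and an out-of-order run, both of which correspond to the "broken notebook" rejection signal described earlier.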

Part 7 - Practice Exercises

Exercise 1: Score a Mock Submission

Read the following mock submission details and assign scores (1-5) for each criterion:

Task: Predict customer churn from a 50,000-row dataset with 15 features.

What the candidate submitted:

  • README: "Run analysis.ipynb" (one line)
  • Notebook: 35 cells, 5 markdown cells, 30 code cells
  • EDA: Three visualizations (histogram of target, correlation heatmap, scatter plot)
  • Preprocessing: StandardScaler applied to all features, missing values filled with mean
  • Models: Logistic regression and XGBoost, selected by test set accuracy
  • Results: XGBoost accuracy 91%, logistic regression accuracy 88%
  • Target distribution: 8% churn (imbalanced)
  • Write-up: Two paragraphs at the bottom of the notebook
  • No requirements.txt, no random seeds

Your scores:

| Criterion | Your Score (1-5) | Why? |
|-----------|------------------|------|
| Problem Understanding | | |
| Data Exploration | | |
| Data Handling | | |
| Feature Engineering | | |
| Model Selection & Evaluation | | |
| Code Quality | | |
| Communication | | |

Suggested Scoring

| Criterion | Score | Rationale |
|-----------|-------|-----------|
| Problem Understanding | 3 | Identified the task correctly, but did not discuss class imbalance |
| Data Exploration | 2 | Basic plots but no insights documented, no target analysis by features |
| Data Handling | 2 | Mean imputation without justification, no mention of potential leakage in scaling |
| Feature Engineering | 1 | No features engineered, raw features only |
| Model Selection & Evaluation | 2 | Two models compared, but used accuracy on imbalanced data, selected on test set |
| Code Quality | 2 | Some structure but no seeds, no requirements, minimal documentation |
| Communication | 2 | Brief write-up exists but lacks reasoning and methodology explanation |
| Overall | Likely Reject | The 91% accuracy is misleading: with 8% churn, always predicting "no churn" scores about 92%, so on the candidate's own metric the model does not beat the majority-class baseline |
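
The arithmetic behind the "Overall" row is worth spelling out. This sketch assumes the evaluation data has the same 8% churn rate as the full 50,000-row dataset; the helper name `majority_baseline` is hypothetical.

```python
from typing import Dict


def majority_baseline(n_rows: int, positive_rate: float) -> Dict[str, float]:
    """Metrics for a model that always predicts the negative (majority) class."""
    n_positive = round(n_rows * positive_rate)
    n_negative = n_rows - n_positive
    return {
        "n_positive": n_positive,
        "n_negative": n_negative,
        # Every negative is right, every positive is wrong.
        "accuracy": n_negative / n_rows,
        # No churner is ever flagged.
        "recall": 0 / n_positive if n_positive else 0.0,
    }


baseline = majority_baseline(n_rows=50_000, positive_rate=0.08)
reported_xgboost_accuracy = 0.91  # From the mock submission

print(f"Churners: {baseline['n_positive']:,}, non-churners: {baseline['n_negative']:,}")
print(f"Always-predict-no-churn accuracy: {baseline['accuracy']:.2%}")
print(f"Always-predict-no-churn recall:   {baseline['recall']:.2%}")
print(f"Reported XGBoost accuracy:        {reported_xgboost_accuracy:.2%}")
print("Beats baseline on accuracy:", reported_xgboost_accuracy > baseline["accuracy"])
```

With 4,000 churners out of 50,000 rows, the trivial model reaches 92% accuracy while catching no churners at all, which is why the reported 91% gives an evaluator no reason to trust the model.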

Exercise 2: Improve the Mock Submission

Take the mock submission above and write a plan to improve it. For each criterion, list specific actions you would take. Compare your plan against the guidance in this chapter.

Exercise 3: Build Your Own Rubric

Design an evaluation rubric for a take-home project in your target role. Adjust the weights based on what you know about the company and role. Practice scoring your own past work against this rubric.

Interview Cheat Sheet

| Question | Key Points |
|----------|------------|
| "How do you approach a take-home project?" | Read prompt carefully, baseline first, proper evaluation, invest in communication |
| "What metrics would you use for this problem?" | Depends on class balance, error costs, and business context - never just accuracy |
| "Why did you choose this model?" | Started simple, compared multiple approaches, selected based on validation performance |
| "What would you do differently with more time?" | Specific improvements tied to observed weaknesses, not generic buzzwords |
| "How do you ensure reproducibility?" | Random seeds, requirements file, relative paths, clear run instructions |
| "What is data leakage and how do you prevent it?" | Information from test/future leaking into training; use pipelines, split before preprocessing |
| "How do you handle imbalanced data?" | Appropriate metrics (precision-recall, F1), stratified splits, possibly resampling with justification |
| "What makes a good take-home submission?" | Clean code, sound methodology, honest communication, answers the actual prompt |

Next Steps

Now that you understand what evaluators look for, you need reusable templates to structure your work efficiently. The next chapter, Project Templates, provides complete starter templates for every common take-home format - classification, NLP, time series, recommendation systems, computer vision, and LLM/RAG tasks.

© 2026 EngineersOfAI. All rights reserved.