What Evaluators Actually Want - Inside the Scoring Room

Reading time: ~30 min | Interview relevance: Critical | Roles: MLE, Data Scientist, Applied Scientist, AI Engineer, Research Engineer

The Real Interview Moment

You submit your take-home project at 11:47 PM on a Sunday night - 13 minutes before the deadline. You spent 14 hours across the weekend, tried six different models, squeezed out a 0.89 AUC, and included three pages of hyperparameter tuning results. You feel confident.

Two weeks later, the rejection email arrives: "Thank you for your time. After careful review, we have decided to move forward with other candidates."

You are devastated. You know your model was strong. What happened?

What happened is this: the evaluator - a senior ML engineer who reviews 15 take-home submissions per hiring cycle - opened your notebook, saw no README, scrolled through 47 code cells with no markdown explanations, noticed you used train_test_split after feature scaling (data leakage), and closed the notebook within 8 minutes. Your 0.89 AUC was irrelevant because the evaluator never trusted the number.

Meanwhile, the candidate who advanced submitted a 0.84 AUC model with a clean README, well-documented EDA, a proper pipeline that prevented leakage, and a two-page write-up explaining their feature engineering rationale and what they would try with more time. That candidate got an offer.

This chapter reveals exactly what happens on the other side of the submission - how evaluators think, what they score, and what actually determines whether your take-home leads to the next round.

What You Will Master

Real evaluation rubrics used at different company types
The relative weight of code quality vs. modeling quality
What "production readiness" means in a take-home context
How communication in notebooks and READMEs is scored
Common scoring criteria across the industry
The signals that cause instant pass or instant rejection

Self-Assessment: Where Are You Now?

Level	Description	Target
Beginner	"I assumed model accuracy was the main criterion"	Read all parts - your mental model needs recalibrating
Intermediate	"I know code quality matters, but I am unsure how it is weighted"	Focus on Parts 1-3 for the rubric details
Advanced	"I write clean code and document well, but I have still been rejected"	Jump to Part 4 (hidden signals) and the practice exercise

Part 1 - Real Evaluation Rubrics

How Evaluators Actually Review Submissions

Before we discuss criteria, understand the mechanics of review:

First pass (2-3 minutes): The evaluator opens the README (if there is one), skims the notebook structure, and forms a first impression. If there is no README, no markdown, and the code is wall-to-wall, the impression is already negative.
Second pass (5-10 minutes): They look at the data handling, check for leakage, review the evaluation approach, and read any write-up. This is where most submissions are rejected.
Deep review (15-30 minutes): Only the top 20-30% of submissions reach this stage. The evaluator reads the code carefully, checks methodology, and forms their recommendation.

[Figure: How Evaluators Review Submissions - three-pass flow with decision points]

60-Second Answer

"Evaluators spend 10-15 minutes on most submissions, with only the top 20-30% receiving a deep review. The first 2-3 minutes are decisive - README quality, notebook structure, and first impression determine whether the evaluator approaches your work positively or is already looking for reasons to reject. This means your submission must communicate competence before the evaluator reads a single line of code."

The Standard Rubric (Composite from Multiple Companies)

After synthesizing rubrics from tech companies, startups, and financial firms, here is the composite scoring framework that most evaluators use (explicitly or implicitly):

Criterion	Weight	Score 1 (Fail)	Score 3 (Pass)	Score 5 (Strong Pass)
Problem Understanding	10%	Misinterprets the task or ignores requirements	Correctly identifies the task and addresses all requirements	Identifies ambiguities, states assumptions, goes beyond the literal prompt
Data Exploration	15%	No EDA or only `df.describe()`	Basic visualizations and summary statistics	Insightful EDA that reveals patterns, informs feature engineering, documents findings
Data Handling	10%	Ignores missing values, no preprocessing	Handles missing data, encodes categoricals, scales features	Thoughtful imputation, handles edge cases, documents decisions
Feature Engineering	15%	Uses raw features only	Creates some derived features with rationale	Creative, domain-aware features with clear justification
Model Selection & Evaluation	20%	Single model, no baseline, wrong metric	Multiple models compared, proper train/test split, appropriate metrics	Baseline comparison, cross-validation, multiple metrics, error analysis
Code Quality	15%	Spaghetti code, no functions, unreproducible	Organized notebook, some functions, runs end-to-end	Modular code, type hints, tests, requirements file, clean abstractions
Communication	15%	No write-up, no markdown, no README	Brief write-up, some markdown in notebook	Clear narrative, executive summary, honest about limitations, next steps
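To make the weights concrete, here is a minimal sketch of how per-criterion scores could roll up into a single number. The weights are copied from the table above; the function name `weighted_rubric_score` and the example scores are hypothetical, and real evaluators may combine scores less mechanically than this.

```python
# Weights copied from the composite rubric above; they sum to 100%.
RUBRIC_WEIGHTS = {
    "Problem Understanding": 0.10,
    "Data Exploration": 0.15,
    "Data Handling": 0.10,
    "Feature Engineering": 0.15,
    "Model Selection & Evaluation": 0.20,
    "Code Quality": 0.15,
    "Communication": 0.15,
}


def weighted_rubric_score(scores: dict[str, int]) -> float:
    """Combine per-criterion scores (1-5) into one weighted score (1-5)."""
    missing = set(RUBRIC_WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"Missing scores for: {sorted(missing)}")
    for criterion in RUBRIC_WEIGHTS:
        if not 1 <= scores[criterion] <= 5:
            raise ValueError(f"Score for {criterion!r} must be between 1 and 5")
    return sum(weight * scores[name] for name, weight in RUBRIC_WEIGHTS.items())


# Hypothetical submission: strong modeling, weak code quality and communication.
example_scores = {
    "Problem Understanding": 3,
    "Data Exploration": 3,
    "Data Handling": 3,
    "Feature Engineering": 4,
    "Model Selection & Evaluation": 5,
    "Code Quality": 2,
    "Communication": 1,
}
print(f"Weighted score: {weighted_rubric_score(example_scores):.2f} / 5")
# Weighted score: 3.10 / 5
```

Even with a top score on modeling, this hypothetical submission lands barely above a bare pass, because 30% of the weight sits on code quality and communication.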

Company-Specific Rubric Variations

Different companies weight these criteria differently based on what they value:

Company Variation

Google / DeepMind: Heavy emphasis on statistical rigor and evaluation methodology. They will check if your cross-validation strategy is sound and whether your metric choice matches the problem. Less emphasis on polished presentation.

Startups (Series A-C): Heavy emphasis on code quality and production readiness. They need people who can ship, not just analyze. A Dockerfile, proper error handling, and a Makefile will set you apart.

Meta / Applied ML: Balanced across all criteria but with extra weight on feature engineering. They want to see you think creatively about the data.

Finance (Two Sigma, Citadel): Extreme emphasis on proper evaluation and absence of leakage. Statistical rigor is non-negotiable. They also expect awareness of temporal dependencies.

Healthcare (Flatiron, Tempus): Extra weight on interpretability and communication. Can you explain your model's decisions to a clinician? Do you discuss false negative vs. false positive trade-offs in clinical context?

Part 2 - Code Quality vs. Modeling Quality

The Great Misconception

Most candidates believe this is the weighting:

Modeling Quality: 70%
Code Quality: 15%
Communication: 15%

The actual weighting is closer to:

Modeling Quality: 30-35%
Code Quality: 30-35%
Communication: 30-35%

This shocks candidates who come from academic backgrounds where the model result is everything. In industry, a model that works but cannot be understood, maintained, or reproduced is worthless.

What "Modeling Quality" Actually Means

Modeling quality is not about having the highest metric. It is about demonstrating sound methodology:

Signal	What It Shows	How to Demonstrate
Baseline first	You understand relative improvement	Start with a dummy classifier or simple logistic regression
Appropriate model for data size	You will not overfit 500 rows with a neural net	Match model complexity to dataset size
Proper evaluation	You understand generalization	Cross-validation, held-out test set, appropriate metrics
Hyperparameter discipline	You do not randomly guess	Grid/random search with CV, document what you tried
Error analysis	You understand your model's weaknesses	Confusion matrix, per-class metrics, failure case examination
Honest reporting	You do not cherry-pick results	Report all models tried, including ones that did not work

# WRONG: This is what most candidates do
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

model = XGBClassifier(n_estimators=1000, max_depth=10)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
# "I used XGBoost because it usually works best"

# RIGHT: This is what evaluators want to see
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (
    classification_report,
    roc_auc_score,
    average_precision_score,
    confusion_matrix,
)
from sklearn.model_selection import cross_val_score

# Step 1: Establish baseline
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_train, y_train, cv=5, scoring="roc_auc")
print(f"Baseline AUC (majority class): {baseline_scores.mean():.4f} +/- {baseline_scores.std():.4f}")

# Step 2: Simple model first
lr = LogisticRegression(max_iter=1000, random_state=42)
lr_scores = cross_val_score(lr, X_train, y_train, cv=5, scoring="roc_auc")
print(f"Logistic Regression AUC: {lr_scores.mean():.4f} +/- {lr_scores.std():.4f}")

# Step 3: More complex models only if justified
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

results = {"Baseline": baseline_scores.mean(), "Logistic Regression": lr_scores.mean()}

for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
    results[name] = scores.mean()
    print(f"{name} AUC: {scores.mean():.4f} +/- {scores.std():.4f}")

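# Step 4: Final evaluation on held-out test set (ONE TIME ONLY)
# Select the model with the best cross-validated AUC rather than hardcoding it
candidates = {"Logistic Regression": lr, **models}
best_name = max(candidates, key=lambda name: results[name])
best_model = candidates[best_name]
print(f"Selected for final evaluation: {best_name}")
best_model.fit(X_train, y_train)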
y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:, 1]

print("\n--- Final Test Set Evaluation ---")
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
print(f"Average Precision: {average_precision_score(y_test, y_prob):.4f}")
print(f"\nClassification Report:\n{classification_report(y_test, y_pred)}")
print(f"\nConfusion Matrix:\n{confusion_matrix(y_test, y_pred)}")

Common Trap

"I used XGBoost because it usually wins Kaggle competitions" is a red flag to evaluators. It signals that you do not think about why a model is appropriate for a given problem. Instead, say: "I started with logistic regression as a baseline (AUC 0.76), then tried random forest (AUC 0.81) and gradient boosting (AUC 0.84). The gradient boosting model showed meaningful improvement on the validation set, so I selected it for final evaluation. With more time, I would explore feature interactions that might help the linear model close the gap."

What "Code Quality" Actually Means

Code quality in a take-home is not about following every PEP 8 rule. It is about demonstrating that you write code that other people can understand and maintain.

Quality Signal	Failing Example	Passing Example
Naming	`df2`, `X_new`, `temp`	`df_customers`, `X_train_scaled`, `features_engineered`
Functions	200-line notebook cells	Reusable functions with docstrings
Comments	No comments or `# increment i`	Comments explaining why, not what
Organization	Random cell order, dead code	Clear sections with markdown headers
Reproducibility	No seeds, no requirements	`random_state=42` everywhere, `requirements.txt`
Error handling	`try: except: pass` or none	Meaningful error messages, input validation

# FAILING code quality
df = pd.read_csv('data.csv')
df = df.dropna()
df['f1'] = df['a'] / df['b']
X = df.drop('target', axis=1)
y = df['target']
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X)  # LEAKAGE: scaling before split!
X_train, X_test, y_train, y_test = train_test_split(X, y)
# ... 300 more lines like this

# PASSING code quality
def load_and_validate_data(filepath: str) -> pd.DataFrame:
    """Load dataset and perform basic validation checks.

    Args:
        filepath: Path to the CSV data file.

    Returns:
        Validated DataFrame with basic type checks applied.

    Raises:
        FileNotFoundError: If the data file does not exist.
        ValueError: If required columns are missing.
    """
    REQUIRED_COLUMNS = ["customer_id", "tenure", "monthly_charges", "churn"]

    df = pd.read_csv(filepath)

    missing_cols = set(REQUIRED_COLUMNS) - set(df.columns)
    if missing_cols:
        raise ValueError(f"Missing required columns: {missing_cols}")

    print(f"Loaded {len(df)} records with {len(df.columns)} columns")
    print(f"Missing values: {df.isnull().sum().sum()} total")

    return df


def create_features(df: pd.DataFrame) -> pd.DataFrame:
    """Engineer features from raw customer data.

    Feature engineering rationale:
    - charge_per_tenure: Captures customer value trajectory
    - is_new_customer: Tenure < 6 months correlates with higher churn
    - high_monthly_charge: Identifies price-sensitive segment
    """
    df = df.copy()  # Avoid modifying the original DataFrame

    df["charge_per_tenure"] = df["monthly_charges"] / (df["tenure"] + 1)  # +1 to avoid division by zero
    df["is_new_customer"] = (df["tenure"] < 6).astype(int)
    df["high_monthly_charge"] = (df["monthly_charges"] > df["monthly_charges"].median()).astype(int)

    return df
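
The rubric lists tests as a strong-pass signal for code quality. The following is a minimal sketch of what a test for `create_features` might look like. The function body is repeated so the block runs on its own, and the test name and sample values are assumptions made for illustration.

```python
import pandas as pd


def create_features(df: pd.DataFrame) -> pd.DataFrame:
    """Same feature logic as above, repeated so this block is self-contained."""
    df = df.copy()
    df["charge_per_tenure"] = df["monthly_charges"] / (df["tenure"] + 1)
    df["is_new_customer"] = (df["tenure"] < 6).astype(int)
    df["high_monthly_charge"] = (
        df["monthly_charges"] > df["monthly_charges"].median()
    ).astype(int)
    return df


def test_create_features_handles_zero_tenure_and_keeps_input_intact() -> None:
    raw = pd.DataFrame({"tenure": [0, 3, 24], "monthly_charges": [20.0, 50.0, 80.0]})
    original_columns = list(raw.columns)

    out = create_features(raw)

    # Zero tenure must not produce inf/NaN thanks to the +1 in the denominator
    assert out["charge_per_tenure"].tolist() == [20.0, 12.5, 3.2]
    assert out["is_new_customer"].tolist() == [1, 1, 0]
    # Only values strictly above the median (50.0) are flagged
    assert out["high_monthly_charge"].tolist() == [0, 0, 1]
    # The input DataFrame is left untouched
    assert list(raw.columns) == original_columns


test_create_features_handles_zero_tenure_and_keeps_input_intact()
print("create_features checks passed")
```

Because the function name starts with `test_`, a runner such as pytest would also collect it if the code lived in a test module.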

Part 3 - The Production Readiness Signal

What Evaluators Mean by "Production Readiness"

When evaluators say they want "production-quality code," they do not expect a deployed microservice. They want evidence that you think about production concerns:

[Figure: Production Readiness Signals - reproducibility, robustness, maintainability, scalability awareness]

The Reproducibility Test

Many evaluators have a simple test: they clone your repository and try to run your code. If it does not work, you fail. Period.

# The reproducibility checklist evaluators mentally run:

reproducibility_checks = {
    "Can I install dependencies?": [
        "requirements.txt or environment.yml exists",
        "Python version specified",
        "All imports resolve after installing",
    ],
    "Can I run the code?": [
        "README has clear run instructions",
        "No hardcoded absolute paths (/Users/john/data/...)",
        "Data files included or download instructions provided",
        "Single command to execute (python main.py or Run All in notebook)",
    ],
    "Do I get the same results?": [
        "Random seeds set (numpy, sklearn, torch, python hashseed)",
        "No non-deterministic operations without seeds",
        "Results match what is reported in the write-up",
    ],
}

Instant Rejection

Here are the fastest ways to fail the reproducibility test:

Hardcoded paths: pd.read_csv("/Users/yourname/Desktop/project/data.csv") - Use relative paths: pd.read_csv("data/dataset.csv")
Missing dependencies: You used lightgbm but it is not in your requirements file.
Notebook state pollution: Your notebook only works if cells are run in a specific non-linear order. Always restart kernel and run all before submitting.
Missing data: You processed the data locally but only submitted the processed version with no way to reproduce the processing.

The Pipeline Test

Evaluators check whether you use proper ML pipelines to prevent data leakage:

# WRONG: Manual preprocessing invites leakage
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Scaling before splitting = DATA LEAKAGE
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Test data statistics leak into training
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=42)

# RIGHT: Pipeline prevents leakage by design
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Split FIRST
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)

# Pipeline ensures scaler is fit only on training data
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingClassifier(random_state=42)),
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
# The scaler was fit ONLY on X_train and applied to X_test - no leakage

Part 4 - Communication: The Hidden Differentiator

Why Communication Outweighs Model Performance

A senior ML engineer at a major tech company shared this perspective: "I have never rejected a candidate for having a model that was 2% worse than the best submission. I have rejected dozens of candidates who could not explain why they made the choices they made."

Communication demonstrates:

Thinking process: How you approach ambiguity and make decisions
Collaboration readiness: Can you explain your work to teammates and stakeholders?
Self-awareness: Do you know the limitations of your approach?
Senior-level judgment: Do you understand trade-offs, not just techniques?

The Notebook Communication Framework

Every markdown cell in your notebook should follow this pattern:

## Section Title

**What I am doing:** [One sentence describing the action]

**Why:** [One sentence explaining the reasoning]

[Code cell]

**What I found:** [Key takeaway from the results]

**Implication for next steps:** [How this finding informs the next decision]

README Structure That Passes

# [Project Title] - Take-Home Challenge

## Overview
[2-3 sentences: What is the problem? What approach did I take? What were the key results?]

## Setup
```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

## How to Run
```bash
# Run the full analysis
jupyter notebook notebooks/analysis.ipynb

# Or run as a script
python src/main.py
```

## Project Structure
```
project/
├── data/               # Dataset files
├── notebooks/          # Analysis notebooks
│   └── analysis.ipynb  # Main analysis (run this)
├── src/                # Source code
│   ├── main.py         # Script entry point
│   ├── features.py     # Feature engineering
│   ├── models.py       # Model training and evaluation
│   └── utils.py        # Utility functions
├── results/            # Output figures and results
├── requirements.txt    # Python dependencies
└── README.md           # This file
```

## Key Findings
- [Finding 1]
- [Finding 2]
- [Finding 3]

## Approach
[3-5 sentences summarizing your methodology]

## Results
| Model | AUC | Precision | Recall | F1 |
|-------|-----|-----------|--------|----|
| Baseline (majority) | 0.500 | - | - | - |
| Logistic Regression | 0.764 | 0.42 | 0.68 | 0.52 |
| Gradient Boosting | 0.841 | 0.53 | 0.71 | 0.61 |

## Limitations & Next Steps
- [Limitation 1 and what you would do about it]
- [Limitation 2 and what you would do about it]
- [Next step you would take with more time]

## Time Spent
Approximately X hours over Y days.

60-Second Answer

"The README is the first thing evaluators see and the last thing most candidates write. A strong README takes 20 minutes to write and immediately signals professionalism. It should contain: a one-paragraph overview, setup instructions (copy-pasteable), project structure, key findings, results table, and limitations. Think of it as the executive summary for a busy reviewer who may never open your notebook."

### Write-Up Quality Spectrum

| Level | Description | Example |
|-------|-------------|---------|
| **No write-up** | Just code with no explanation | (Instant red flag) |
| **Minimal** | "I used random forest and got 85% accuracy" | (Fails - no reasoning) |
| **Adequate** | Describes what was done, some reasoning for choices | (Passes at junior level) |
| **Good** | Clear methodology, explains trade-offs, documents assumptions | (Passes at mid-level) |
| **Excellent** | Narrative thread from problem to solution, honest about limitations, business context | (Passes at senior level) |


## Part 5 - Scoring Criteria Deep Dive

### Criterion 1: Data Handling (10-15% of score)

What evaluators check:

| Check | What They Look For | Red Flag |
|-------|-------------------|----------|
| **Missing data** | Thoughtful imputation strategy | `df.dropna()` without justification |
| **Data types** | Correct handling of categoricals, dates, text | Feeding string columns into a model |
| **Outliers** | Detection and reasoned handling | Ignoring clear outliers or blindly removing them |
| **Data leakage** | Proper temporal splits, no target leakage | Features derived from the target variable |
| **Train/test split** | Appropriate strategy for the data type | Random split on time-series data |
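
For the last row of the table, a time-ordered split is often only a few lines. The sketch below assumes a hypothetical `event_date` column and a made-up helper named `temporal_train_test_split`; the synthetic DataFrame only stands in for real data.

```python
import pandas as pd

# Hypothetical event log: one row per day, columns are illustrative only
df = pd.DataFrame({
    "event_date": pd.date_range("2023-01-01", periods=100, freq="D"),
    "feature": range(100),
    "target": [i % 2 for i in range(100)],
})


def temporal_train_test_split(
    df: pd.DataFrame, date_col: str, test_fraction: float = 0.2
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Hold out the most recent rows as the test set instead of a random sample."""
    ordered = df.sort_values(date_col).reset_index(drop=True)
    n_test = int(round(len(ordered) * test_fraction))
    split_idx = len(ordered) - n_test
    return ordered.iloc[:split_idx], ordered.iloc[split_idx:]


train_df, test_df = temporal_train_test_split(df, "event_date")
print(len(train_df), len(test_df))  # 80 20
# Every training row is older than every test row, so no future data leaks in
print(train_df["event_date"].max() < test_df["event_date"].min())  # True
```

For model selection on temporal data, scikit-learn's `TimeSeriesSplit` applies the same idea across several folds.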

### Criterion 2: Feature Engineering (10-15% of score)

```python
# Examples of feature engineering that impress evaluators
import pandas as pd


# Domain-aware feature: customer tenure bands
def create_tenure_features(df: pd.DataFrame, reference_date: str) -> pd.DataFrame:
    """Create tenure-based features with domain rationale.

    Rationale: Customer churn research shows non-linear relationship
    with tenure - highest churn in first 6 months (onboarding friction)
    and after 24 months (seeking better deals). Creating bins captures
    these distinct behavioral segments.

    Args:
        df: Raw customer data.
        reference_date: Fixed "as of" date for recency features (for
            example the data snapshot date). A fixed date keeps results
            reproducible, unlike computing against "today".
    """
    df = df.copy()
    df["tenure_band"] = pd.cut(
        df["tenure"],
        bins=[0, 6, 12, 24, 48, float("inf")],
        labels=["new", "settling", "established", "loyal", "veteran"],
        include_lowest=True,  # keep tenure == 0 in the "new" band instead of NaN
    )
    df["months_since_last_interaction"] = (
        pd.to_datetime(reference_date) - pd.to_datetime(df["last_interaction_date"])
    ).dt.days / 30

    # Ratio features that capture spending level and recent spending trend
    df["avg_monthly_spend"] = df["total_charges"] / (df["tenure"] + 1)  # +1 avoids division by zero
    df["spend_trend"] = df["last_3mo_charges"] / (df["prev_3mo_charges"] + 1)

    return df
```

Criterion 3: Model Selection & Evaluation (15-20% of score)

The most heavily weighted technical criterion. Evaluators look for:

Baseline comparison: Did you establish what a naive model achieves?
Appropriate metrics: Did you choose metrics that match the business problem?
Proper validation: Cross-validation, not a single train/test split for model selection
No data leakage: Preprocessing inside the validation loop
Error analysis: Understanding where the model fails, not just overall performance
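
The "preprocessing inside the validation loop" point can be sketched as follows. The data is synthetic (an assumption made so the block runs on its own), and logistic regression is used only because it is sensitive to feature scale. The key idea is that `cross_val_score` refits the whole pipeline, scaler included, on each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 1,000 rows, roughly 10% positives
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)

# The scaler lives inside the pipeline, so it is fit on each training fold only
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="roc_auc")
print(f"CV ROC-AUC: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Scaling `X` once before calling `cross_val_score` would let each validation fold influence the scaler's statistics, which is the leakage pattern evaluators look for.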

Metric Choice	When Appropriate	When Inappropriate
Accuracy	Balanced classes, equal error costs	Imbalanced datasets (almost always inappropriate)
ROC-AUC	Ranking quality matters, threshold flexibility needed	Heavily imbalanced data where the rare positive class is the focus (ROC-AUC can look optimistic; prefer precision-recall AUC)
Precision-Recall AUC	Imbalanced datasets, positive class is rare	When negative class errors are equally costly
F1 Score	Need a single threshold-dependent metric	When precision/recall trade-off matters (report both)
RMSE	Regression, errors in original units	When outlier sensitivity is a concern (use MAE)
Log Loss	Probability calibration matters	When only ranking matters (use AUC)
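
A small sketch shows why the first row of the table matters. The labels are synthetic (1,000 customers with 8% churn, an assumption chosen to mirror a typical imbalanced problem), and the "model" simply predicts the majority class for everyone.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Synthetic labels: 80 churners (1) and 920 non-churners (0)
y_true = np.array([1] * 80 + [0] * 920)
# A "model" that always predicts the majority class (no churn)
y_pred = np.zeros_like(y_true)

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")                    # 0.92
print(f"Recall:   {recall_score(y_true, y_pred, zero_division=0):.2f}")     # 0.00
print(f"F1:       {f1_score(y_true, y_pred, zero_division=0):.2f}")         # 0.00
```

The accuracy looks strong while the model finds no churners at all, which is why a headline accuracy on imbalanced data tells an evaluator very little.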

Criterion 4: Code Quality (10-15% of score)

Evaluators mentally score code quality on these dimensions:

[Figure: Code Quality Scoring Dimensions - readability, structure, reproducibility, robustness, with point values]

Criterion 5: Communication (15-20% of score)

This is the criterion that most candidates underweight and the one that most often separates "reject" from "advance":

Communication Element	What Evaluators Want	Common Mistake
Problem statement	Restate in your own words, note ambiguities	Diving straight into code without framing
EDA narrative	Insights, not just plots	Showing 15 plots with no interpretation
Decision rationale	Why you chose this model/metric/approach	"I used XGBoost because it is popular"
Results presentation	Comparison tables, key metrics highlighted	Printing raw sklearn output
Limitations	Honest about what is not ideal	Claiming perfect results or ignoring weaknesses
Next steps	What you would do with more time	No mention of future improvements

Common Trap

Evaluators can tell the difference between genuine next steps ("With more time, I would address the class imbalance using SMOTE and explore temporal features since the data has a time dimension I did not fully leverage") and fake next steps ("With more time, I would use deep learning and deploy the model as an API"). The former shows understanding of your solution's actual weaknesses. The latter shows you are just listing buzzwords.

Part 6 - The Hidden Signals

Positive Signals That Make Evaluators Excited

These are the things that make an evaluator think "I want to work with this person":

Questioning the data: "I noticed 3% of records have negative transaction amounts. I assumed these are refunds and handled them as follows..."
Principled metric choice: "Given the business context (churn prediction where false negatives are costly), I optimize for recall at a precision threshold of 0.5."
Feature engineering rationale: "I created a 'recency' feature because customers who have not interacted in 30+ days show 3x higher churn rate in the EDA."
Calibration awareness: "The model outputs probabilities of 0.7 that correspond to actual churn rates of ~0.65, suggesting slight overconfidence. With more time, I would apply Platt scaling."
Honest limitations: "My model performs poorly on customers with < 3 months of history (recall drops to 0.32). This is likely because the features I engineered require sufficient behavioral data."
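
The calibration signal above can be backed by a quick check. This is a rough sketch on synthetic data, not a recipe from any particular evaluator: it compares predicted probabilities with observed positive rates using `calibration_curve`, and applies Platt scaling through `CalibratedClassifierCV` with `method="sigmoid"`.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 2,000 rows, roughly 20% positives
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

raw_model = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
# method="sigmoid" is scikit-learn's name for Platt scaling; calibration is
# learned with internal cross-validation on the training data only
calibrated_model = CalibratedClassifierCV(
    GradientBoostingClassifier(random_state=42), method="sigmoid", cv=5
).fit(X_train, y_train)

brier_scores = {}
for name, model in [("raw", raw_model), ("calibrated", calibrated_model)]:
    y_prob = model.predict_proba(X_test)[:, 1]
    # frac_pos: observed positive rate per bin; mean_pred: mean predicted probability
    frac_pos, mean_pred = calibration_curve(y_test, y_prob, n_bins=5)
    brier_scores[name] = brier_score_loss(y_test, y_prob)
    print(f"{name}: Brier score = {brier_scores[name]:.4f}")
    for predicted, observed in zip(mean_pred, frac_pos):
        print(f"  predicted ~{predicted:.2f} -> observed {observed:.2f}")
```

Whether calibration actually improves the Brier score depends on the data, so the numbers should be reported as found rather than assumed.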

Negative Signals That Cause Instant Rejection

Instant Rejection

These signals cause evaluators to reject a submission immediately, regardless of model performance:

Data leakage: Scaling, encoding, or feature selection on the full dataset before splitting. This inflates all your metrics and tells the evaluator you do not understand ML fundamentals.
Copy-paste code: Obviously copied from Stack Overflow or tutorials with variable names like df_example or comments that reference a different dataset. Evaluators have seen these patterns thousands of times.
No evaluation: Reporting only training metrics, or using accuracy on an imbalanced dataset without acknowledging the issue.
Broken notebook: Code cells that produce errors, outputs from a different run than the code shown, or cells that only work in a specific execution order.
Ignoring the prompt: Building a regression model when classification was asked for, or answering questions that were not in the prompt while skipping ones that were.

The "30-Second Test"

Before submitting, perform this test. Open your submission as if you are the evaluator seeing it for the first time:

THIRTY_SECOND_TEST = {
    "Second 0-5": "Is there a README? Can I understand the project in one paragraph?",
    "Second 5-10": "Is the notebook named clearly? Is there a logical structure?",
    "Second 10-15": "Does the first cell explain what this project does?",
    "Second 15-20": "Can I see section headers? Is there a clear flow?",
    "Second 20-25": "Are there visualizations? Do they have titles and labels?",
    "Second 25-30": "Is there a conclusion/results section? Can I find the final answer?",
}
# If any answer is "no", fix it before submitting.
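
Some of these checks, together with parts of the reproducibility checklist from Part 3, can be automated before you submit. The script below is a hypothetical helper rather than a standard tool: `check_submission` and the path pattern are assumptions, and it only covers the mechanical items (README, dependency file, hardcoded paths).

```python
import re
import tempfile
from pathlib import Path

# Matches user-specific absolute paths such as /Users/john/... or C:\Users\...
ABSOLUTE_PATH_PATTERN = re.compile(r"/Users/|/home/|[A-Za-z]:\\")


def check_submission(project_dir: Path) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if not (project_dir / "README.md").is_file():
        problems.append("README.md is missing")
    dependency_files = ("requirements.txt", "environment.yml")
    if not any((project_dir / name).is_file() for name in dependency_files):
        problems.append("No requirements.txt or environment.yml")
    for source in sorted(project_dir.rglob("*")):
        if not source.is_file() or source.suffix not in {".py", ".ipynb"}:
            continue
        text = source.read_text(encoding="utf-8", errors="ignore")
        if ABSOLUTE_PATH_PATTERN.search(text):
            problems.append(
                f"Hardcoded absolute path in {source.relative_to(project_dir)}"
            )
    return problems


# Demo on a throwaway directory that mimics a weak submission
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "analysis.py").write_text(
        'df = pd.read_csv("/Users/yourname/Desktop/project/data.csv")\n',
        encoding="utf-8",
    )
    demo_problems = check_submission(project)

for problem in demo_problems:
    print(f"- {problem}")
```

The judgment calls in the test above (clear structure, labeled plots, a findable conclusion) still need a human read-through.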

Part 7 - Practice Exercises

Exercise 1: Score a Mock Submission

Read the following mock submission details and assign scores (1-5) for each criterion:

Task: Predict customer churn from a 50,000-row dataset with 15 features.

What the candidate submitted:

README: "Run analysis.ipynb" (one line)
Notebook: 35 cells, 5 markdown cells, 30 code cells
EDA: Three visualizations (histogram of target, correlation heatmap, scatter plot)
Preprocessing: StandardScaler applied to all features, missing values filled with mean
Models: Logistic regression and XGBoost, selected by test set accuracy
Results: XGBoost accuracy 91%, logistic regression accuracy 88%
Target distribution: 8% churn (imbalanced)
Write-up: Two paragraphs at the bottom of the notebook
No requirements.txt, no random seeds

Your scores:

Criterion	Your Score (1-5)	Why?
Problem Understanding
Data Exploration
Data Handling
Feature Engineering
Model Selection & Evaluation
Code Quality
Communication

Suggested Scoring

Criterion	Score	Rationale
Problem Understanding	3	Identified the task correctly, but did not discuss class imbalance
Data Exploration	2	Basic plots but no insights documented, no target analysis by features
Data Handling	2	Mean imputation without justification, no mention of potential leakage in scaling
Feature Engineering	1	No features engineered, raw features only
Model Selection & Evaluation	2	Two models compared, but used accuracy on imbalanced data, selected on test set
Code Quality	2	Some structure but no seeds, no requirements, minimal documentation
Communication	2	Brief write-up exists but lacks reasoning and methodology explanation
Overall	Likely Reject	The 91% accuracy is misleading: with 8% churn, always predicting "no churn" already scores 92%, so the reported model is no better than the majority-class baseline

Exercise 2: Improve the Mock Submission

Take the mock submission above and write a plan to improve it. For each criterion, list specific actions you would take. Compare your plan against the guidance in this chapter.

Exercise 3: Build Your Own Rubric

Design an evaluation rubric for a take-home project in your target role. Adjust the weights based on what you know about the company and role. Practice scoring your own past work against this rubric.

Interview Cheat Sheet

Question	Key Points
"How do you approach a take-home project?"	Read prompt carefully, baseline first, proper evaluation, invest in communication
"What metrics would you use for this problem?"	Depends on class balance, error costs, and business context - never just accuracy
"Why did you choose this model?"	Started simple, compared multiple approaches, selected based on validation performance
"What would you do differently with more time?"	Specific improvements tied to observed weaknesses, not generic buzzwords
"How do you ensure reproducibility?"	Random seeds, requirements file, relative paths, clear run instructions
"What is data leakage and how do you prevent it?"	Information from test/future leaking into training; use pipelines, split before preprocessing
"How do you handle imbalanced data?"	Appropriate metrics (precision-recall, F1), stratified splits, possibly resampling with justification
"What makes a good take-home submission?"	Clean code, sound methodology, honest communication, answers the actual prompt

Next Steps

Now that you understand what evaluators look for, you need reusable templates to structure your work efficiently. The next chapter, Project Templates, provides complete starter templates for every common take-home format - classification, NLP, time series, recommendation systems, computer vision, and LLM/RAG tasks.
