Code Quality Standards - The Silent Evaluator
Reading time: ~40 min | Interview relevance: Critical | Roles: MLE, AI Eng, Data Scientist, Research Engineer, MLOps
The Real Interview Moment
You are reviewing take-home submissions at a Series B startup. Two candidates solved the same churn prediction problem. Both achieved an AUC of 0.87. You open Candidate A's notebook: 47 cells, no markdown, variable names like df2, df_final_v3, temp, commented-out code scattered everywhere, and a single 200-line cell that does feature engineering, training, and evaluation in one block. You have no idea what is happening by cell 15. You close the notebook and write "No hire - cannot assess thought process."
You open Candidate B's notebook: a clean table of contents at the top, each section separated by markdown headers, functions with type hints and docstrings, a requirements.txt pinned to exact versions, a README.md explaining how to reproduce the results, and a final "Summary and Next Steps" section. You can follow the logic in five minutes. You write "Strong hire - clear thinker, production-ready habits."
Both candidates had the same model performance. One got hired. One did not. The difference was code quality. This page teaches you exactly how to be Candidate B.
What You Will Master
- Structure a Jupyter notebook with professional-grade organization and flow
- Decompose monolithic cells into clean, reusable functions
- Apply type hints and docstrings in data science code
- Handle errors and edge cases gracefully in exploratory code
- Guarantee reproducibility with seeds, dependency pinning, and environment management
- Write meaningful tests for data pipelines and model logic
- Follow clean code principles adapted for data science workflows
- Create a submission package that signals production readiness
Self-Assessment: Where Are You Now?
| Skill | 1 -- Cannot | 2 -- Vaguely | 3 -- Can Do | 4 -- Consistently | 5 -- Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Organize a notebook with clear sections | | | | | | ___ |
| Extract reusable functions from notebook cells | | | | | | ___ |
| Add type hints to data science functions | | | | | | ___ |
| Handle errors in data loading and preprocessing | | | | | | ___ |
| Set random seeds for full reproducibility | | | | | | ___ |
| Write unit tests for feature engineering | | | | | | ___ |
| Pin dependencies and document environment | | | | | | ___ |
| Create a professional README for a take-home | | | | | | ___ |
Target: All 4s and 5s before you submit any take-home.
Part 1 -- Notebook Organization
The Professional Notebook Structure
Every take-home notebook should follow a consistent structure that allows the evaluator to navigate your thought process in under two minutes.
Section 1: Header and Setup
The first cell of your notebook sets the tone. It should contain a title, your name, the date, and a brief problem statement. The second cell should contain all imports, grouped logically.
# Cell 1 - Markdown
"""
# Customer Churn Prediction - Take-Home Assessment
**Candidate:** Jane Smith
**Date:** 2026-03-07
**Time spent:** ~6 hours
## Problem Statement
Predict which customers will churn in the next 30 days using
transaction history, demographics, and engagement data.
## Approach Summary
1. EDA reveals class imbalance (8% churn rate) and strong temporal patterns
2. Feature engineering: RFM features, rolling aggregates, engagement velocity
3. Model: LightGBM with stratified 5-fold CV, optimized for PR-AUC
4. Final PR-AUC: 0.43 (baseline: 0.08)
"""
# Cell 2 - Imports (grouped by category)
# Standard library
import os
import json
import logging
from pathlib import Path
from datetime import datetime, timedelta
from typing import Tuple, Dict, List, Optional
# Data manipulation
import numpy as np
import pandas as pd
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# ML
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import (
    roc_auc_score,
    precision_recall_curve,
    average_precision_score,
    classification_report,
)
from sklearn.preprocessing import StandardScaler
import lightgbm as lgb
# Configuration
plt.style.use("seaborn-v0_8-whitegrid")
pd.set_option("display.max_columns", 50)
pd.set_option("display.max_rows", 100)
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
"I organize my notebooks into seven clear sections: header, data loading, EDA, feature engineering, modeling, evaluation, and summary. Each section starts with a markdown cell explaining what I am doing and why. All imports are in one cell at the top. Constants and configuration are defined once. This makes it possible for a reviewer to understand my approach in under two minutes without running any code."
Section Separators and Markdown
Every section should begin with a markdown cell that explains the purpose of that section and any key decisions. Think of markdown cells as the narration of your analysis - they tell the story that your code implements.
Bad notebook flow:
[Code cell] → [Code cell] → [Code cell] → [Code cell] → [Code cell]
Good notebook flow:
[Markdown: What and Why] → [Code cell] → [Markdown: Observation] →
[Code cell] → [Markdown: Decision and Rationale]
Do not over-narrate. Evaluators do not want a paragraph explaining what df.shape does. Save markdown for decisions, observations, and rationale. "I chose PR-AUC over ROC-AUC because the classes are heavily imbalanced (8% positive rate), and we care more about precision at low recall thresholds" is useful. "Now I will check the shape of the dataframe" is noise.
The "Cell Length" Rule
No single code cell should exceed 30 lines. If a cell is longer, it is doing too much. Extract a function, split the cell, or move logic to a utility module.
# BAD - 60-line monolithic cell
# ... loads data, cleans data, creates features, trains model ...
# GOOD - focused cells
# Cell: Load and validate raw data
raw_df = load_and_validate("data/transactions.csv")
# Cell: Create RFM features
rfm_features = create_rfm_features(raw_df, reference_date="2026-01-01")
# Cell: Create engagement features
engagement_features = create_engagement_features(raw_df, window_days=30)
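The 30-line rule can be checked mechanically before you submit. The sketch below is one possible approach, not part of any standard tool. It assumes the notebook is a regular .ipynb JSON file; `find_long_cells` and `MAX_CELL_LINES` are hypothetical names chosen for illustration.

```python
import json
from pathlib import Path
from typing import List, Tuple, Union

MAX_CELL_LINES = 30  # threshold from the rule above


def find_long_cells(
    notebook_path: Union[str, Path], max_lines: int = MAX_CELL_LINES
) -> List[Tuple[int, int]]:
    """Return (cell_index, line_count) for each code cell over max_lines."""
    notebook = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
    long_cells = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells are not subject to the rule
        source = cell.get("source", [])
        # The notebook format stores source as one string or a list of lines
        text = source if isinstance(source, str) else "".join(source)
        line_count = len(text.splitlines())
        if line_count > max_lines:
            long_cells.append((index, line_count))
    return long_cells
```

Calling `find_long_cells("solution.ipynb")` on your own notebook returns an empty list when every code cell is within the limit.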
Part 2 -- Function Decomposition
Why Functions Matter in Take-Homes
Evaluators are not just checking whether your code runs. They are checking whether you can write code that a team could maintain. Functions serve three purposes in a take-home:
- Readability - A function name like create_rfm_features() is self-documenting
- Reusability - The same function can be applied to train and test sets consistently
- Testability - Functions can be unit-tested; raw cells cannot
The Anatomy of a Well-Written Data Science Function
def create_rfm_features(
    transactions: pd.DataFrame,
    customer_id_col: str = "customer_id",
    date_col: str = "transaction_date",
    amount_col: str = "amount",
    reference_date: Optional[str] = None,
) -> pd.DataFrame:
    """Create Recency, Frequency, Monetary features per customer.

    Computes three features for each customer:
    - Recency: days since last transaction
    - Frequency: total number of transactions
    - Monetary: average transaction amount

    Args:
        transactions: Raw transaction DataFrame with at least customer_id,
            transaction_date, and amount columns.
        customer_id_col: Name of the customer identifier column.
        date_col: Name of the transaction date column.
        amount_col: Name of the transaction amount column.
        reference_date: Date to compute recency from. If None, uses max date
            in the dataset.

    Returns:
        DataFrame indexed by customer_id with columns:
        recency_days, frequency, monetary_avg.

    Raises:
        ValueError: If required columns are missing from the input DataFrame.
        ValueError: If transactions DataFrame is empty.

    Example:
        >>> rfm = create_rfm_features(transactions_df)
        >>> rfm.head()
                     recency_days  frequency  monetary_avg
        customer_id
        C001                    3         15         42.50
        C002                   45          2        120.00
    """
    # Validate inputs
    required_cols = {customer_id_col, date_col, amount_col}
    missing_cols = required_cols - set(transactions.columns)
    if missing_cols:
        raise ValueError(f"Missing columns: {missing_cols}")
    if transactions.empty:
        raise ValueError("Input DataFrame is empty")

    # Work on a copy so the caller's DataFrame is never mutated
    df = transactions.copy()
    df[date_col] = pd.to_datetime(df[date_col])

    if reference_date is None:
        ref_date = df[date_col].max()
    else:
        ref_date = pd.to_datetime(reference_date)

    rfm = (
        df.groupby(customer_id_col)
        .agg(
            recency_days=(date_col, lambda x: (ref_date - x.max()).days),
            frequency=(date_col, "count"),
            monetary_avg=(amount_col, "mean"),
        )
    )

    logger.info(
        f"Created RFM features for {len(rfm)} customers. "
        f"Recency range: [{rfm['recency_days'].min()}, {rfm['recency_days'].max()}]"
    )
    return rfm
When to Use Functions vs. Inline Code
| Use a Function When | Keep Inline When |
|---|---|
| Logic is reused on train AND test sets | One-off exploratory visualization |
| Logic exceeds 10 lines | Simple pandas one-liner (df.describe()) |
| Logic has clear input/output contract | Quick sanity check (print shape, dtypes) |
| Logic needs to be tested | Markdown-adjacent explanation code |
| Logic involves non-obvious transformations | Standard library calls with obvious intent |
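Never apply different preprocessing to train and test sets by writing the logic twice inline. This is a common source of train-test skew in take-homes. Extract a function and call it on both sets with the same parameters, and make sure any data-dependent values (bin edges, scaling statistics) come from the training set or a fixed constant rather than being recomputed on each set.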
# CATASTROPHIC - different logic for train and test
train_df["age_bin"] = pd.cut(train_df["age"], bins=5)
test_df["age_bin"] = pd.cut(test_df["age"], bins=4) # Different bins!
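# CORRECT - single function, same explicit bin edges for both sets
# (with an integer bins value, pd.cut derives the edges from each frame's own
# min/max, so train and test could still end up with different bins)
AGE_BIN_EDGES = [0, 25, 35, 50, 65, 120]  # illustrative edges
def bin_age(df: pd.DataFrame, bin_edges: List[int]) -> pd.DataFrame:
    df = df.copy()
    df["age_bin"] = pd.cut(df["age"], bins=bin_edges)
    return df
train_df = bin_age(train_df, AGE_BIN_EDGES)
test_df = bin_age(test_df, AGE_BIN_EDGES)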
Extracting Functions: A Step-by-Step Process
When you have a working monolithic cell, follow this process to refactor it:
- Identify the inputs - What data does this block need?
- Identify the outputs - What does it produce?
- Name the operation - What verb describes the transformation?
- Extract and parameterize - Move hardcoded values to parameters with defaults
- Add types and docstring - Document the contract
- Validate inputs - Add checks for common errors
- Test - Call it on a small sample and verify output
# BEFORE: Monolithic feature engineering cell (40 lines)
df["days_since_signup"] = (pd.Timestamp("2026-01-01") - df["signup_date"]).dt.days
df["log_revenue"] = np.log1p(df["total_revenue"])
df["orders_per_month"] = df["total_orders"] / (df["days_since_signup"] / 30)
df["avg_order_value"] = df["total_revenue"] / df["total_orders"].clip(lower=1)
# ... 30 more lines ...
# AFTER: Clean function calls
df = add_temporal_features(df, reference_date="2026-01-01")
df = add_revenue_features(df)
df = add_behavioral_features(df)
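As an illustration of the steps above, here is one way the two revenue lines from the BEFORE cell might be extracted. The original only names add_revenue_features, so the body below is an assumption, as are the column-name parameters.

```python
import numpy as np
import pandas as pd


def add_revenue_features(
    df: pd.DataFrame,
    revenue_col: str = "total_revenue",
    orders_col: str = "total_orders",
) -> pd.DataFrame:
    """Add log revenue and average order value columns.

    Hypothetical extraction of two lines from the monolithic cell.
    """
    missing = {revenue_col, orders_col} - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")

    result = df.copy()  # never mutate the caller's DataFrame
    result["log_revenue"] = np.log1p(result[revenue_col])
    # clip(lower=1) avoids division by zero for customers with no orders
    result["avg_order_value"] = result[revenue_col] / result[orders_col].clip(lower=1)
    return result


# Small sample, including the zero-order edge case
sample = pd.DataFrame({"total_revenue": [0.0, 100.0], "total_orders": [0, 4]})
print(add_revenue_features(sample))
```

Running the function on a tiny hand-built sample like this covers the last step of the process: you can verify the output by eye before applying it to the full dataset.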
Part 3 -- Type Hints and Docstrings
Type Hints for Data Science
Type hints are not just for software engineers. In data science code, they communicate the expected data contract at a glance.
from typing import Tuple, Dict, List, Optional, Union
import numpy as np
import pandas as pd
from numpy.typing import NDArray
def train_evaluate_model(
X_train: pd.DataFrame,
y_train: pd.Series,
X_val: pd.DataFrame,
y_val: pd.Series,
params: Dict[str, Union[int, float, str]],
feature_names: Optional[List[str]] = None,
) -> Tuple[lgb.Booster, Dict[str, float]]:
"""Train a LightGBM model and return it with evaluation metrics.
Args:
X_train: Training features.
y_train: Training labels (binary).
X_val: Validation features.
y_val: Validation labels (binary).
params: LightGBM hyperparameters.
feature_names: Subset of columns to use. If None, uses all columns.
Returns:
Tuple of (trained model, dict of evaluation metrics).
"""
if feature_names is not None:
X_train = X_train[feature_names]
X_val = X_val[feature_names]
train_data = lgb.Dataset(X_train, label=y_train)
val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)
model = lgb.train(
params,
train_data,
valid_sets=[val_data],
num_boost_round=1000,
callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)],
)
y_pred = model.predict(X_val)
metrics = {
"roc_auc": roc_auc_score(y_val, y_pred),
"pr_auc": average_precision_score(y_val, y_pred),
}
return model, metrics
Common Type Hint Patterns in Data Science
# DataFrames and Series
def process(df: pd.DataFrame) -> pd.DataFrame: ...
def get_labels(df: pd.DataFrame) -> pd.Series: ...
# NumPy arrays
def normalize(arr: NDArray[np.float64]) -> NDArray[np.float64]: ...
# Multiple return values
def split_data(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]: ...
# Optional parameters
def plot_results(
metrics: Dict[str, float],
save_path: Optional[str] = None,
figsize: Tuple[int, int] = (10, 6),
) -> None: ...
# Union types for flexibility
def load_data(source: Union[str, Path]) -> pd.DataFrame: ...
You do not need to type-hint every helper function in an exploratory notebook. Focus type hints on the core pipeline functions - data loading, feature engineering, model training, and evaluation. These are the functions the evaluator will read most carefully, and type hints there signal maturity without adding bureaucratic overhead everywhere.
Docstring Styles
Use the Google style for consistency. It is compact and readable in notebooks.
def compute_feature_importance(
model: lgb.Booster,
feature_names: List[str],
importance_type: str = "gain",
top_n: int = 20,
) -> pd.DataFrame:
"""Compute and format feature importance from a trained model.
Args:
model: Trained LightGBM Booster.
feature_names: List of feature names matching model input.
importance_type: Type of importance. One of 'gain', 'split'.
top_n: Number of top features to return.
Returns:
DataFrame with columns 'feature' and 'importance', sorted descending.
Raises:
ValueError: If importance_type is not 'gain' or 'split'.
"""
if importance_type not in ("gain", "split"):
raise ValueError(
f"importance_type must be 'gain' or 'split', got '{importance_type}'"
)
importance = model.feature_importance(importance_type=importance_type)
importance_df = (
pd.DataFrame({"feature": feature_names, "importance": importance})
.sort_values("importance", ascending=False)
.head(top_n)
.reset_index(drop=True)
)
return importance_df
Part 4 -- Error Handling
Defensive Coding in Data Science
Data is messy. Your code should handle the mess gracefully instead of crashing with an inscrutable traceback. Evaluators look for evidence that you anticipate real-world data problems.
The Three Layers of Data Validation
Layer 1: Schema Validation (on load)
def load_and_validate(
filepath: Union[str, Path],
required_columns: List[str],
date_columns: Optional[List[str]] = None,
) -> pd.DataFrame:
"""Load a CSV and validate its schema before any processing.
Args:
filepath: Path to the CSV file.
required_columns: Columns that must be present.
date_columns: Columns to parse as datetime.
Returns:
Validated DataFrame with correct dtypes.
Raises:
FileNotFoundError: If the file does not exist.
ValueError: If required columns are missing.
"""
filepath = Path(filepath)
if not filepath.exists():
raise FileNotFoundError(f"Data file not found: {filepath}")
df = pd.read_csv(filepath, parse_dates=date_columns)
missing = set(required_columns) - set(df.columns)
if missing:
raise ValueError(
f"Missing required columns: {missing}. "
f"Available columns: {list(df.columns)}"
)
logger.info(f"Loaded {len(df)} rows, {len(df.columns)} columns from {filepath}")
return df
Layer 2: Data Quality Checks (during EDA)
from typing import Any, Dict

def check_data_quality(df: pd.DataFrame) -> Dict[str, Any]:
    """Run data quality checks and return a summary report.

    Returns a dictionary with quality metrics. Does NOT raise errors -
    instead, logs warnings for issues that should be investigated.
    """
    report = {
        "n_rows": len(df),
        "n_cols": len(df.columns),
        "duplicate_rows": df.duplicated().sum(),
        "null_counts": df.isnull().sum().to_dict(),
        "null_pct": (df.isnull().sum() / len(df) * 100).to_dict(),
    }

    if report["duplicate_rows"] > 0:
        logger.warning(
            f"Found {report['duplicate_rows']} duplicate rows "
            f"({report['duplicate_rows']/len(df)*100:.1f}%)"
        )

    # Flag columns where more than half of the values are missing
    high_null_cols = {
        col: pct
        for col, pct in report["null_pct"].items()
        if pct > 50
    }
    if high_null_cols:
        logger.warning(f"Columns with >50% nulls: {high_null_cols}")

    return report
Layer 3: Output Validation (after transformation)
def validate_features(
features: pd.DataFrame,
expected_rows: int,
no_null_columns: Optional[List[str]] = None,
) -> None:
"""Validate feature DataFrame after engineering.
Args:
features: The feature DataFrame to validate.
expected_rows: Expected number of rows.
no_null_columns: Columns that must have zero nulls.
Raises:
AssertionError: If any validation check fails.
"""
assert len(features) == expected_rows, (
f"Row count mismatch: expected {expected_rows}, got {len(features)}"
)
inf_cols = [
col for col in features.select_dtypes(include=[np.number]).columns
if np.isinf(features[col]).any()
]
assert not inf_cols, f"Infinite values found in columns: {inf_cols}"
if no_null_columns:
null_cols = [
col for col in no_null_columns
if features[col].isnull().any()
]
assert not null_cols, f"Unexpected nulls in columns: {null_cols}"
Do not silently drop rows with missing values. Every row you drop should be logged with a reason. Evaluators who see df.dropna() without explanation will wonder whether you introduced survivorship bias.
# BAD - silent data loss
df = df.dropna()
# GOOD - explicit, logged, justified
n_before = len(df)
df = df.dropna(subset=["target_variable"])
n_after = len(df)
logger.info(
f"Dropped {n_before - n_after} rows with missing target "
f"({(n_before - n_after) / n_before * 100:.1f}%)"
)
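The logging pattern above can be wrapped in a small helper so that every drop is recorded the same way. This is only a sketch: `drop_missing_logged` and its `reason` parameter are hypothetical names, not from any library.

```python
import logging
from typing import List

import pandas as pd

logger = logging.getLogger(__name__)


def drop_missing_logged(
    df: pd.DataFrame, subset: List[str], reason: str
) -> pd.DataFrame:
    """Drop rows with nulls in `subset` and log how many were removed and why."""
    n_before = len(df)
    result = df.dropna(subset=subset)
    n_dropped = n_before - len(result)
    # Guard against an empty input so the percentage never divides by zero
    pct = n_dropped / n_before * 100 if n_before else 0.0
    logger.info(
        "Dropped %d of %d rows (%.1f%%) with nulls in %s: %s",
        n_dropped, n_before, pct, subset, reason,
    )
    return result


raw = pd.DataFrame({"target_variable": [1, None, 0], "age": [30, 41, None]})
clean = drop_missing_logged(
    raw, subset=["target_variable"], reason="cannot train without a label"
)
```

Requiring a `reason` argument forces the justification to be written down at the moment the rows are dropped.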
Part 5 -- Reproducibility
The Reproducibility Checklist
If an evaluator clones your repository and runs your notebook, they should get exactly the same results. This is non-negotiable.
Setting Random Seeds Properly
Setting np.random.seed(42) is not enough. You must seed every library that uses randomness.
import logging
import os
import random

import numpy as np

logger = logging.getLogger(__name__)

def set_all_seeds(seed: int = 42) -> None:
    """Set random seeds for full reproducibility.

    Sets seeds for Python's random module, NumPy, and optionally
    PyTorch and TensorFlow if they are available.

    Args:
        seed: The random seed value.
    """
    random.seed(seed)
    np.random.seed(seed)
    # Setting PYTHONHASHSEED here only affects subprocesses started later.
    # To fix hash randomization for this interpreter, set it before Python starts.
    os.environ["PYTHONHASHSEED"] = str(seed)

    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass

    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass

    logger.info(f"All random seeds set to {seed}")

# Call at the very top of your notebook
SEED = 42
set_all_seeds(SEED)
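One way to confirm that seeding works is to run a randomized step twice and compare the results. A minimal sketch, assuming only NumPy; `sample_indices` is a hypothetical stand-in for any randomized step such as a train/test split.

```python
import numpy as np


def sample_indices(n_rows: int, n_samples: int, seed: int) -> np.ndarray:
    """Hypothetical randomized step: pick row indices without replacement."""
    # A local Generator avoids depending on (or disturbing) global random state
    rng = np.random.default_rng(seed)
    return rng.choice(n_rows, size=n_samples, replace=False)


first_run = sample_indices(n_rows=1000, n_samples=10, seed=42)
second_run = sample_indices(n_rows=1000, n_samples=10, seed=42)

# Same seed, same draw: this is the property an evaluator relies on
assert np.array_equal(first_run, second_run)
print("Reproducible:", np.array_equal(first_run, second_run))
```

Passing the seed explicitly makes the randomness visible in the function signature. Global seeding is still needed for libraries that read global state.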
Dependency Pinning
Always include a requirements.txt with exact versions. Evaluators should be able to recreate your environment.
# requirements.txt - pinned for reproducibility
numpy==1.26.4
pandas==2.2.1
scikit-learn==1.4.1
lightgbm==4.3.0
matplotlib==3.8.3
seaborn==0.13.2
jupyter==1.0.0
Generate this automatically:
pip freeze > requirements.txt
Or better, use a minimal requirements file listing only what you directly import:
# Generate minimal requirements
pip install pipreqs
pipreqs . --force
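If you would rather not install another tool, the installed versions of the packages you import can be read with the standard library. A sketch under that assumption: `pinned_requirements` is a hypothetical helper, and it expects distribution names as published on PyPI (scikit-learn, not sklearn).

```python
from importlib import metadata
from typing import Iterable, List


def pinned_requirements(distributions: Iterable[str]) -> List[str]:
    """Return 'name==version' lines for the installed distributions.

    Distributions that are not installed are skipped with a printed note.
    """
    lines = []
    for name in distributions:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            print(f"Not installed, skipped: {name}")
    return lines


# Distribution names taken from the requirements.txt example above
direct_dependencies = ["numpy", "pandas", "scikit-learn", "lightgbm"]
print("\n".join(pinned_requirements(direct_dependencies)))
```

The printed lines can be pasted into requirements.txt. Keeping the list of direct dependencies by hand keeps the file short and readable.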
The README Template
Every take-home submission should include a README:
# README.md template for take-home submissions
README_TEMPLATE = """
# {Project Title} - Take-Home Assessment
## Quick Start
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\\Scripts\\activate
pip install -r requirements.txt
jupyter notebook solution.ipynb
```
## Project Structure
```
.
├── README.md # This file
├── requirements.txt # Pinned dependencies
├── solution.ipynb # Main analysis notebook
├── src/
│ ├── features.py # Feature engineering functions
│ ├── evaluation.py # Evaluation utilities
│ └── visualization.py # Plotting helpers
├── tests/
│ ├── test_features.py # Feature engineering tests
│ └── test_evaluation.py # Evaluation metric tests
├── data/
│ └── README.md # Data description and source
└── outputs/
├── figures/ # Generated plots
└── model/ # Saved model artifacts
```
## Approach Summary
{Brief 3-4 sentence summary of methodology and key results}
## Key Results
- Metric 1: value
- Metric 2: value
- Baseline comparison: improvement
## Reproducibility
- Python {version}
- All random seeds set to 42
- Expected runtime: ~{X} minutes on {hardware description}
## Assumptions and Limitations
{List of explicit assumptions and known limitations}
"""
A README tells me two things: (1) this candidate thinks about the person who has to read their code, and (2) they have experience working on teams where documentation matters. In a stack of 30 submissions, the one with a clear README gets read first.
Part 6 -- Testing in Take-Homes
Why Tests Matter (Even in Notebooks)
You do not need 100% test coverage. You need tests for the logic that is most likely to be wrong: feature engineering transformations, custom metrics, and data preprocessing steps.
What to Test
Writing Tests for Feature Engineering
# tests/test_features.py
import pytest
import pandas as pd
import numpy as np
from src.features import create_rfm_features
@pytest.fixture
def sample_transactions() -> pd.DataFrame:
"""Create a minimal transaction DataFrame for testing."""
return pd.DataFrame({
"customer_id": ["A", "A", "A", "B", "B"],
"transaction_date": pd.to_datetime([
"2026-01-01", "2026-01-15", "2026-02-01",
"2026-01-10", "2026-01-20",
]),
"amount": [100.0, 50.0, 75.0, 200.0, 30.0],
})
class TestRFMFeatures:
"""Tests for RFM feature engineering."""
def test_output_shape(self, sample_transactions: pd.DataFrame) -> None:
"""Output should have one row per customer."""
rfm = create_rfm_features(
sample_transactions, reference_date="2026-03-01"
)
assert len(rfm) == 2 # Two unique customers
def test_output_columns(self, sample_transactions: pd.DataFrame) -> None:
"""Output should contain exactly the expected columns."""
rfm = create_rfm_features(
sample_transactions, reference_date="2026-03-01"
)
expected_cols = {"recency_days", "frequency", "monetary_avg"}
assert set(rfm.columns) == expected_cols
def test_recency_values(self, sample_transactions: pd.DataFrame) -> None:
"""Recency should be days since last transaction."""
rfm = create_rfm_features(
sample_transactions, reference_date="2026-03-01"
)
# Customer A's last transaction: 2026-02-01, ref: 2026-03-01 = 28 days
assert rfm.loc["A", "recency_days"] == 28
def test_frequency_values(self, sample_transactions: pd.DataFrame) -> None:
"""Frequency should be count of transactions."""
rfm = create_rfm_features(
sample_transactions, reference_date="2026-03-01"
)
assert rfm.loc["A", "frequency"] == 3
assert rfm.loc["B", "frequency"] == 2
def test_monetary_values(self, sample_transactions: pd.DataFrame) -> None:
"""Monetary should be average transaction amount."""
rfm = create_rfm_features(
sample_transactions, reference_date="2026-03-01"
)
assert rfm.loc["A", "monetary_avg"] == pytest.approx(75.0)
assert rfm.loc["B", "monetary_avg"] == pytest.approx(115.0)
def test_missing_columns_raises(self) -> None:
"""Should raise ValueError for missing required columns."""
bad_df = pd.DataFrame({"wrong_col": [1, 2, 3]})
with pytest.raises(ValueError, match="Missing columns"):
create_rfm_features(bad_df)
def test_empty_dataframe_raises(self) -> None:
"""Should raise ValueError for empty input."""
empty_df = pd.DataFrame(
columns=["customer_id", "transaction_date", "amount"]
)
with pytest.raises(ValueError, match="empty"):
create_rfm_features(empty_df)
def test_no_nulls_in_output(self, sample_transactions: pd.DataFrame) -> None:
"""Output should contain no null values."""
rfm = create_rfm_features(
sample_transactions, reference_date="2026-03-01"
)
assert rfm.isnull().sum().sum() == 0
Running Tests in a Notebook
If you want to keep everything in a single notebook (some evaluators prefer this), you can run tests inline:
# Cell - Quick validation tests (run inline)
def run_quick_tests() -> None:
"""Run quick validation tests for key functions."""
# Test 1: Feature engineering produces correct shape
test_df = transactions.head(100)
test_features = create_rfm_features(test_df, reference_date="2026-03-01")
n_customers = test_df["customer_id"].nunique()
assert len(test_features) == n_customers, (
f"Expected {n_customers} rows, got {len(test_features)}"
)
# Test 2: No nulls in critical features
assert test_features.isnull().sum().sum() == 0, "Nulls found in features"
# Test 3: No infinite values
numeric_cols = test_features.select_dtypes(include=[np.number]).columns
for col in numeric_cols:
assert not np.isinf(test_features[col]).any(), f"Inf found in {col}"
# Test 4: Feature values are in reasonable ranges
assert (test_features["recency_days"] >= 0).all(), "Negative recency"
assert (test_features["frequency"] >= 1).all(), "Zero frequency"
assert (test_features["monetary_avg"] > 0).all(), "Non-positive monetary"
print("All quick tests passed!")
run_quick_tests()
Testing should take no more than 10-15% of your total time on a take-home. For a 6-hour project, that is roughly 35-55 minutes spent on tests for the 3-4 most critical functions. Do not aim for full coverage - aim for confidence in your core logic.
Part 7 -- Clean Code Principles for Data Science
Naming Conventions
# BAD names - what do these mean?
df2 = process(df1)
temp = df2.groupby("x").agg({"y": "mean"})
result = temp.merge(df3, on="id")
X = result.drop("target", axis=1)
# GOOD names - self-documenting
customer_features = engineer_features(raw_transactions)
avg_revenue_by_segment = customer_features.groupby("segment").agg(
{"revenue": "mean"}
)
enriched_customers = avg_revenue_by_segment.merge(
demographics, on="customer_id"
)
X_train = enriched_customers.drop("churn_label", axis=1)
The No Dead Code Rule
Remove all dead code before submission. Commented-out code, unused imports, and abandoned experiments make your notebook look messy and undermine confidence in your work.
# BAD - graveyard of abandoned experiments
# from sklearn.ensemble import RandomForestClassifier # tried this, didn't work
# model = RandomForestClassifier(n_estimators=100)
# model = RandomForestClassifier(n_estimators=500) # better but slow
# model = GradientBoostingClassifier() # keep this maybe?
model = lgb.LGBMClassifier() # final choice
# GOOD - clean, with rationale in markdown
# Markdown cell: "Chose LightGBM over Random Forest based on 5-fold CV
# comparison (LightGBM PR-AUC: 0.43 vs RF PR-AUC: 0.38). See Section 5
# for the full comparison table."
model = lgb.LGBMClassifier(
n_estimators=500,
learning_rate=0.05,
max_depth=6,
random_state=SEED,
)
Constants and Configuration
Define all constants in one place at the top of your notebook.
# === Configuration ===
SEED = 42
TEST_SIZE = 0.2
N_FOLDS = 5
TARGET_COL = "churned"
ID_COL = "customer_id"
DATE_COL = "event_date"
# Feature engineering parameters
RFM_REFERENCE_DATE = "2026-01-01"
ROLLING_WINDOWS = [7, 14, 30, 60]
MIN_TRANSACTIONS = 3
# Model hyperparameters
LGBM_PARAMS = {
    "objective": "binary",
    "metric": "average_precision",
    "learning_rate": 0.05,
    "max_depth": 6,
    "num_leaves": 31,
    "min_child_samples": 20,
    "subsample": 0.8,
    "subsample_freq": 1,  # row subsampling only takes effect when this is > 0
    "colsample_bytree": 0.8,
    "reg_alpha": 0.1,
    "reg_lambda": 0.1,
    "random_state": SEED,
    "verbose": -1,
}
The Pipeline Pattern
For multi-step transformations, use a pipeline pattern to keep the flow clear and consistent.
from typing import Callable, List
def build_feature_pipeline(
    steps: List[Callable[[pd.DataFrame], pd.DataFrame]],
) -> Callable[[pd.DataFrame], pd.DataFrame]:
    """Compose multiple feature engineering steps into a single function.

    Args:
        steps: List of functions, each taking and returning a DataFrame.

    Returns:
        A single function that applies all steps in order.

    Example:
        >>> pipeline = build_feature_pipeline([
        ...     add_temporal_features,
        ...     add_rfm_features,
        ...     add_engagement_features,
        ...     drop_raw_columns,
        ... ])
        >>> features = pipeline(raw_df)
    """
    def pipeline(df: pd.DataFrame) -> pd.DataFrame:
        result = df.copy()
        for step in steps:
            # Log the column count before and after each step
            n_before = len(result.columns)
            result = step(result)
            n_after = len(result.columns)
            logger.info(
                f"{step.__name__}: {n_before} -> {n_after} columns"
            )
        return result

    return pipeline
# Usage - apply the same pipeline to train and test
feature_pipeline = build_feature_pipeline([
add_temporal_features,
add_rfm_features,
add_engagement_features,
encode_categoricals,
drop_raw_columns,
])
train_features = feature_pipeline(train_df)
test_features = feature_pipeline(test_df)
When I see a pipeline pattern in a take-home, I know this candidate has worked on production ML systems. It shows they understand that the same transformations must apply to both training and serving data, which is a critical production requirement that most junior candidates miss.
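Each step in this pattern takes only a DataFrame, so a step that needs parameters has to be bound first. Below is a sketch of one way to do that with functools.partial. `add_ratio`, `step_name`, and `run_steps` are hypothetical names. Note that partial objects have no `__name__` attribute, so a logging line that reads `step.__name__` needs a fallback like the one shown.

```python
from functools import partial
from typing import Callable, List

import pandas as pd

Step = Callable[[pd.DataFrame], pd.DataFrame]


def add_ratio(
    df: pd.DataFrame, numerator: str, denominator: str, name: str
) -> pd.DataFrame:
    """Hypothetical parameterized step: add a ratio column."""
    result = df.copy()
    result[name] = result[numerator] / result[denominator].clip(lower=1)
    return result


def step_name(step: Step) -> str:
    """Readable name for plain functions and functools.partial objects."""
    if isinstance(step, partial):
        return step.func.__name__
    return getattr(step, "__name__", repr(step))


def run_steps(df: pd.DataFrame, steps: List[Step]) -> pd.DataFrame:
    """Apply each step in order, printing the column count after each one."""
    result = df.copy()
    for step in steps:
        result = step(result)
        print(f"{step_name(step)}: now {len(result.columns)} columns")
    return result


# Bind the parameters once, then reuse the same list for train and test
steps = [
    partial(add_ratio, numerator="revenue", denominator="orders", name="avg_order_value")
]
train = pd.DataFrame({"revenue": [100.0, 0.0], "orders": [4, 0]})
features = run_steps(train, steps)
```

Binding parameters once and reusing the same `steps` list for train and test keeps the consistency guarantee that motivates the pattern.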
Part 8 -- Project Structure for Multi-File Submissions
When to Go Beyond a Single Notebook
For take-homes that allow 8+ hours, consider splitting your code into modules. This demonstrates software engineering maturity.
The Standard Layout
The config.py Pattern
# src/config.py
"""Central configuration for the take-home project."""
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Config:
    """Immutable configuration for the analysis pipeline."""

    # Data
    raw_data_path: str = "data/transactions.csv"
    target_col: str = "churned"
    id_col: str = "customer_id"

    # Feature engineering
    reference_date: str = "2026-01-01"
    rolling_windows: Tuple[int, ...] = (7, 14, 30, 60)
    min_transactions: int = 3

    # Model
    seed: int = 42
    n_folds: int = 5
    test_size: float = 0.2

    # Outputs
    output_dir: str = "outputs"
    figures_dir: str = "outputs/figures"
    metrics_path: str = "outputs/metrics.json"

# Shared module-level instance
config = Config()
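A brief sketch of how a frozen config behaves in use. The trimmed-down Config here is a stand-in that keeps only two fields; `dataclasses.replace` and `FrozenInstanceError` are part of the standard library.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Config:
    """Trimmed-down stand-in for a full project config."""

    seed: int = 42
    n_folds: int = 5


config = Config()

# Accidental mutation fails loudly instead of silently changing an experiment
try:
    config.seed = 0
except FrozenInstanceError:
    print("Config is immutable")

# Variants are created explicitly, leaving the original untouched
quick_config = replace(config, n_folds=3)
print(config.n_folds, quick_config.n_folds)
```

The design choice is that every deviation from the default configuration is visible at the call site, which makes experiment variations easy to audit.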
Part 9 -- Code Quality Anti-Patterns
The Hall of Shame
These patterns will cost you the offer. Each one signals a different kind of immaturity.
| Anti-Pattern | What It Signals | Fix |
|---|---|---|
| df2, df_final, df_final_v2 | No naming discipline | Use descriptive names: customer_features, enriched_customers |
| # TODO: fix this later | Unfinished work left visible | Either fix it or remove the comment |
| Commented-out code blocks | Messy experimentation habits | Delete dead code; use git for history |
| import * | Does not understand namespaces | Import specific names |
| Hardcoded file paths (/Users/john/data/) | Not portable | Use relative paths or config |
| try: except: pass | Swallowing errors silently | Catch specific exceptions, log them |
| Print statements for debugging | Not using logging | Use logging module |
| Mixing tabs and spaces | Editor configuration issues | Use a linter (black, ruff) |
| No .gitignore | Committing data, caches, venvs | Add standard Python .gitignore |
| 500-line cells | Cannot decompose logic | Max 30 lines per cell |
Hardcoded absolute paths like /Users/yourname/Desktop/data.csv are an instant credibility killer. They tell the evaluator your code cannot run on any machine except yours. Always use relative paths or a configuration file.
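One way to avoid absolute paths is to resolve everything relative to the project root. A sketch, assuming the root can be identified by a marker file such as requirements.txt; `find_project_root` is a hypothetical helper, not a library function.

```python
from pathlib import Path
from typing import Optional


def find_project_root(
    start: Optional[Path] = None, marker: str = "requirements.txt"
) -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    current = (start or Path.cwd()).resolve()
    for candidate in [current, *current.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker} above {current}")


# In a notebook (not run here, since it depends on your directory layout):
# DATA_PATH = find_project_root() / "data" / "transactions.csv"
```

Because the lookup starts from the current working directory, the same notebook works whether the evaluator launches Jupyter from the project root or from a subfolder.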
Pre-Submission Checklist
Run through this checklist before submitting:
PRE_SUBMISSION_CHECKLIST = """
Code Quality Checklist - Run Before Submission
================================================
[ ] Notebook runs top-to-bottom without errors (Kernel > Restart & Run All)
[ ] All imports are at the top, no unused imports
[ ] No commented-out code blocks
[ ] No hardcoded absolute paths
[ ] No print() for debugging - use logging
[ ] All functions have type hints and docstrings
[ ] Constants defined in one place (top of notebook or config.py)
[ ] Random seeds set for all libraries
[ ] requirements.txt with pinned versions included
[ ] README.md with setup instructions included
[ ] No large files (data, model artifacts) committed to git
[ ] .gitignore includes __pycache__, .ipynb_checkpoints, data/, *.pkl
[ ] Cell outputs cleared and re-run (no stale outputs)
[ ] Variable names are descriptive (no df2, temp, x, result)
[ ] Markdown cells explain decisions and observations, not obvious code
[ ] Final section has summary, key results, and next steps
"""
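Several checklist items can be checked mechanically before you submit. The sketch below is one possible heuristic scanner, not a replacement for a real linter: the CHECKS patterns, scan_source, and scan_notebook are hypothetical names, and the regexes are deliberately simple, so expect both false positives and misses. It assumes the standard nbformat 4 JSON layout, where each cell has a cell_type and a source field.

```python
import json
import re
from pathlib import Path
from typing import Dict, List

# Heuristic patterns: each maps a checklist item to a regex over one source line.
CHECKS: Dict[str, re.Pattern] = {
    "hardcoded absolute path": re.compile(r"""["'](/Users/|/home/|[A-Za-z]:\\)"""),
    "wildcard import": re.compile(r"^\s*from\s+\S+\s+import\s+\*"),
    "leftover TODO": re.compile(r"#\s*TODO", re.IGNORECASE),
    "bare except": re.compile(r"^\s*except\s*:"),
}


def scan_source(lines: List[str]) -> List[str]:
    """Return one finding per (line, check) match."""
    findings = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in CHECKS.items():
            if pattern.search(line):
                findings.append(f"line {number}: {label}: {line.strip()}")
    return findings


def scan_notebook(path: Path) -> List[str]:
    """Scan every code cell of a .ipynb file and prefix findings with the cell index."""
    notebook = json.loads(path.read_text(encoding="utf-8"))
    findings = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", [])
        # "source" may be stored as one string or as a list of lines
        text = source if isinstance(source, str) else "".join(source)
        for finding in scan_source(text.splitlines()):
            findings.append(f"cell {index}, {finding}")
    return findings
```

An empty result does not prove the notebook is clean. It only means none of these four patterns matched, so the manual pass through the checklist is still needed.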
Practice Problems
Problem 1: Refactor This Cell
You receive the following cell in a take-home notebook. Refactor it into clean, well-structured code with proper functions, type hints, and error handling.
# Original messy cell
df = pd.read_csv("data.csv")
df = df.dropna()
df["date"] = pd.to_datetime(df["date"])
df["days"] = (pd.Timestamp("2026-01-01") - df["date"]).dt.days
df["log_amt"] = np.log(df["amount"])
df["amt_per_day"] = df["amount"] / df["days"]
df2 = df.groupby("user").agg({"days": "min", "amount": ["sum", "mean"], "log_amt": "mean"})
df2.columns = ["recency", "total_spend", "avg_spend", "avg_log_spend"]
X = df2.drop("churn", axis=1)
y = df2["churn"]
model = lgb.LGBMClassifier()
model.fit(X, y)
print(model.score(X, y))
Hint 1 -- Direction
Identify the separate concerns in this cell: data loading, feature engineering, model training, and evaluation. Each should be its own function. Look for bugs too - there is at least one (log of potentially zero/negative values, division by zero).
Hint 2 -- Key Issues
- np.log(df["amount"]) - returns -inf for zero amounts and NaN for negative amounts (with a runtime warning) instead of failing loudly. Use np.log1p on clipped values.
- df["amt_per_day"] = df["amount"] / df["days"] - division by zero when days = 0, which yields inf or NaN.
- df.dropna() - silent, unjustified data loss.
- df2["churn"] - where does this column come from? The groupby lost it.
- model.score(X, y) - evaluating on training data only.
- No random seed, no train-test split, no cross-validation.
Hint 3 -- Full Refactored Solution
import logging
from typing import Dict, Tuple

import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

logger = logging.getLogger(__name__)


def load_transactions(filepath: str) -> pd.DataFrame:
    """Load and validate transaction data."""
    df = pd.read_csv(filepath)
    required = {"user", "date", "amount"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    # Parse dates only after confirming the column exists
    df["date"] = pd.to_datetime(df["date"])
    n_nulls = int(df.isnull().sum().sum())
    if n_nulls > 0:
        logger.warning("Found %d null values; handle them explicitly before modeling", n_nulls)
    return df


def create_user_features(
    transactions: pd.DataFrame,
    reference_date: str = "2026-01-01",
) -> pd.DataFrame:
    """Create per-user features from transaction history."""
    df = transactions.copy()
    ref = pd.to_datetime(reference_date)
    df["days_since"] = (ref - df["date"]).dt.days
    # log1p of clipped amounts avoids -inf and NaN for zero or negative values
    df["log_amount"] = np.log1p(df["amount"].clip(lower=0))
    # Clip the denominator to at least 1 day to avoid division by zero
    df["amount_per_day"] = df["amount"] / df["days_since"].clip(lower=1)
    features = df.groupby("user").agg(
        recency=("days_since", "min"),
        total_spend=("amount", "sum"),
        avg_spend=("amount", "mean"),
        avg_log_spend=("log_amount", "mean"),
    )
    return features


def train_and_evaluate(
    X: pd.DataFrame,
    y: pd.Series,
    seed: int = 42,
    n_folds: int = 5,
) -> Tuple[lgb.LGBMClassifier, Dict[str, float]]:
    """Train with cross-validation and return model + metrics."""
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in cv.split(X, y):
        model = lgb.LGBMClassifier(random_state=seed, verbose=-1)
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        y_pred = model.predict_proba(X.iloc[val_idx])[:, 1]
        scores.append(roc_auc_score(y.iloc[val_idx], y_pred))
    metrics = {
        "cv_auc_mean": float(np.mean(scores)),
        "cv_auc_std": float(np.std(scores)),
    }
    # Refit on all data once the cross-validated estimate is recorded
    final_model = lgb.LGBMClassifier(random_state=seed, verbose=-1)
    final_model.fit(X, y)
    return final_model, metrics
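The refactored functions above leave one of the hinted issues open: where the churn label comes from after the groupby discards it. One possible way to wire it in is sketched below, assuming the labels arrive as a separate table with one row per user. The labels frame, its column layout, and the attach_labels helper are hypothetical and not part of the original solution.

```python
import logging
from typing import Tuple

import pandas as pd

logger = logging.getLogger(__name__)


def attach_labels(
    features: pd.DataFrame,
    labels: pd.DataFrame,
    label_col: str = "churn",
) -> Tuple[pd.DataFrame, pd.Series]:
    """Join per-user labels onto per-user features and split into X, y.

    Assumes `features` is indexed by user (as produced by a groupby on "user")
    and `labels` has the columns ["user", label_col].
    """
    if labels["user"].duplicated().any():
        raise ValueError("labels must contain at most one row per user")
    label_series = labels.set_index("user")[label_col]
    # Inner join keeps only users that have both features and a label
    labeled = features.join(label_series, how="inner")
    n_unlabeled = len(features) - len(labeled)
    if n_unlabeled:
        logger.warning("Dropped %d users without a %s label", n_unlabeled, label_col)
    return labeled.drop(columns=[label_col]), labeled[label_col]
```

The result plugs straight into the training step, for example X, y = attach_labels(create_user_features(transactions), labels) followed by train_and_evaluate(X, y). Logging the number of unlabeled users keeps the row loss visible instead of silent.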
Scoring Rubric:
- Strong Hire: Identifies all 6 bugs, separates concerns into 3+ functions, adds type hints and docstrings, uses log1p and clip, implements cross-validation, logs data quality issues.
- Lean Hire: Separates into functions and fixes 3-4 bugs, but misses subtle issues like the missing churn column after groupby.
- No Hire: Rearranges code without fixing bugs or adding structure.
Problem 2: Design a Test Suite
Write a test suite for a function encode_categoricals(df, columns, method="target") that performs target encoding on specified columns. Consider edge cases.
Hint 1 -- Direction
Think about: What happens with unseen categories at test time? What if a category has only one example? What about null values in categorical columns? Does the function leak target information?
Hint 2 -- Key Test Cases
- Normal case: known categories produce correct encoded values
- Unseen categories: should fall back to global mean, not crash
- Single-instance categories: should use smoothed estimate, not raw average
- Null categories: should handle gracefully
- Data leakage: encoding should be fit on train, applied to test
- Output shape: same number of rows, same or fewer columns
- Determinism: same input produces same output
Hint 3 -- Full Test Suite
import pandas as pd
import pytest

# encode_categoricals, fit_encoder, and apply_encoder are assumed to be
# imported from the module under test.


class TestTargetEncoding:
    @pytest.fixture
    def train_data(self):
        return pd.DataFrame({
            "city": ["NYC", "NYC", "LA", "LA", "CHI", "CHI"],
            "target": [1, 1, 0, 1, 0, 0],
        })

    def test_known_categories_encoded(self, train_data):
        result = encode_categoricals(train_data, ["city"], method="target")
        assert "city" in result.columns
        assert result["city"].dtype == float

    def test_output_shape_preserved(self, train_data):
        result = encode_categoricals(train_data, ["city"], method="target")
        assert len(result) == len(train_data)

    def test_unseen_category_falls_back(self, train_data):
        test_df = pd.DataFrame({"city": ["BOSTON"], "target": [0]})
        encoder = fit_encoder(train_data, ["city"])
        result = apply_encoder(test_df, encoder)
        global_mean = train_data["target"].mean()
        assert result.loc[0, "city"] == pytest.approx(global_mean)

    def test_null_category_handled(self, train_data):
        train_data.loc[0, "city"] = None
        result = encode_categoricals(train_data, ["city"], method="target")
        assert not result["city"].isnull().any()

    def test_no_data_leakage(self, train_data):
        """Each row's encoding should NOT include its own target."""
        result = encode_categoricals(
            train_data, ["city"], method="target", fold_aware=True
        )
        # Check LA rather than NYC: NYC targets are all 1, so its rows encode
        # to 1.0 even under a correct unsmoothed leave-one-out scheme.
        # LA targets are (0, 1) and the global mean is 0.5, so a leaky
        # in-sample mean gives 0.5 for both rows; leave-one-out does not.
        la_encoded = result.loc[train_data["city"] == "LA", "city"]
        assert not (la_encoded == 0.5).all()

    def test_deterministic(self, train_data):
        r1 = encode_categoricals(train_data, ["city"], method="target")
        r2 = encode_categoricals(train_data, ["city"], method="target")
        pd.testing.assert_frame_equal(r1, r2)
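The tests above call fit_encoder and apply_encoder without defining them. To make the fit-on-train, apply-to-test contract concrete, here is a minimal sketch of what such a pair could look like. The signatures, the "target" column default, the returned dictionary layout, and the smoothing constant are all assumptions made for illustration; the function under test may be implemented quite differently.

```python
from typing import Any, Dict, List

import pandas as pd


def fit_encoder(
    df: pd.DataFrame,
    columns: List[str],
    target: str = "target",
    smoothing: float = 10.0,
) -> Dict[str, Any]:
    """Learn smoothed per-category target means from training data only."""
    global_mean = float(df[target].mean())
    mappings = {}
    for col in columns:
        stats = df.groupby(col)[target].agg(["mean", "count"])
        # Shrink small categories toward the global mean:
        # (count * category_mean + smoothing * global_mean) / (count + smoothing)
        smoothed = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
            stats["count"] + smoothing
        )
        mappings[col] = smoothed.to_dict()
    return {"global_mean": global_mean, "mappings": mappings}


def apply_encoder(df: pd.DataFrame, encoder: Dict[str, Any]) -> pd.DataFrame:
    """Encode columns with a fitted encoder without touching the input frame.

    Unseen and null categories fall back to the training global mean.
    """
    result = df.copy()
    for col, mapping in encoder["mappings"].items():
        encoded = result[col].map(mapping).fillna(encoder["global_mean"])
        result[col] = encoded.astype(float)
    return result
```

Because apply_encoder never reads the target column, test rows cannot influence their own encoding. Leakage within the training set is a separate concern that this sketch does not address; that is what the fold-aware test is meant to catch.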
Problem 3: Code Review
Review the following submission excerpt. List every code quality issue and assign a severity (Critical, Major, Minor).
import pandas as pd, numpy as np, sklearn, lightgbm
from sklearn.model_selection import *
df = pd.read_csv("/Users/candidate/Downloads/interview_data.csv")
df_clean = df.drop_duplicates()
print(f"shape: {df_clean.shape}")
# feature eng
df_clean['f1'] = df_clean['col_a'] * df_clean['col_b']
df_clean['f2'] = df_clean['col_c'].apply(lambda x: 1 if x > 0 else 0)
# df_clean['f3'] = df_clean['col_d'].map(some_dict) # didn't work
# df_clean['f4'] = df_clean['col_e'] ** 2 # maybe later
X = df_clean.drop('target', axis=1)
y = df_clean['target']
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = lightgbm.LGBMClassifier(n_estimators=1000)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
Hint 1 -- Direction
Look for: import style, file paths, data handling, dead code, reproducibility, evaluation method, naming, and documentation.
Hint 2 -- Category of Issues
There are at least 12 issues spanning: imports (2), data loading (1), data handling (1), dead code (2), naming (2), reproducibility (2), evaluation (1), documentation (1).
Hint 3 -- Full Review
| Issue | Severity | Fix |
|---|---|---|
| from sklearn.model_selection import * - wildcard import | Major | Import specific: from sklearn.model_selection import train_test_split |
| import pandas as pd, numpy as np, sklearn, lightgbm - multiple imports on one line | Minor | One import per line |
| Hardcoded path /Users/candidate/Downloads/ | Critical | Use relative path or config |
| df.drop_duplicates() without explanation | Major | Log how many rows dropped and why |
| Two blocks of commented-out code | Major | Delete dead code |
| Variable names f1, f2 | Major | Use descriptive names: interaction_ab, col_c_positive |
| df_clean mutated in place | Minor | Use functions for feature engineering |
| No random seed in train_test_split | Critical | Add random_state=42 |
| No random seed in LGBMClassifier | Critical | Add random_state=42 |
| model.score() - reports accuracy only, which misleads if classes are imbalanced | Major | Use appropriate metric (AUC, PR-AUC) |
| No cross-validation | Major | Use StratifiedKFold |
| No markdown cells, no docstrings, no comments | Major | Add narrative structure |
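Two of the fixes in the review are easy to show in isolation: logging the effect of drop_duplicates, and replacing f1 and f2 with named features built inside a function. The sketch below reuses the column names from the excerpt (col_a, col_b, col_c) and the feature names suggested in the table; the function names drop_duplicate_rows and add_features are illustrative.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def drop_duplicate_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows and log how many were removed."""
    deduplicated = df.drop_duplicates()
    n_removed = len(df) - len(deduplicated)
    logger.info("Removed %d duplicate rows out of %d", n_removed, len(df))
    return deduplicated


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with descriptively named features; the input is not mutated."""
    result = df.copy()
    result["interaction_ab"] = result["col_a"] * result["col_b"]
    # A vectorized comparison replaces the row-wise lambda from the excerpt
    result["col_c_positive"] = (result["col_c"] > 0).astype(int)
    return result
```

Returning a new frame from each step means every intermediate result can be inspected or tested on its own, which also addresses the in-place mutation item in the table.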
Scoring Rubric:
- Strong Hire: Identifies 10+ issues with correct severity classifications. Provides specific fixes. Mentions that the entire structure needs reorganization, not just individual line fixes.
- Lean Hire: Identifies 6-9 issues, catches the critical ones (path, seeds) but misses evaluation and documentation issues.
- No Hire: Identifies fewer than 6 issues or misclassifies severities (e.g., calling dead code "minor").
Interview Cheat Sheet
| Concept | Key Practice | One-Liner | Red Flag |
|---|---|---|---|
| Notebook structure | 7 sections with markdown narrative | Clear sections = clear thinking | 50 cells with no markdown |
| Function decomposition | Extract reusable logic from cells | Same function for train and test | Copy-paste preprocessing for train vs test |
| Type hints | Annotate core pipeline functions | Types document the data contract | def f(x, y, z) with no hints on key functions |
| Error handling | Validate inputs, log warnings | Defensive code = production code | try: except: pass |
| Reproducibility | Seeds + pinned deps + README | Anyone can re-run and get same results | np.random.seed(42) only, no requirements.txt |
| Testing | Test feature engineering and metrics | Tests prove your logic is correct | No tests and no assertions anywhere |
| Naming | Descriptive variable names | Good names eliminate comments | df2, temp, result, X without context |
| Dead code | Remove all commented-out code | Clean submission = finished work | # TODO, # tried this, commented blocks |
| Constants | Define once at the top | One place to change parameters | Magic numbers scattered in code |
| Project structure | README + requirements + clean layout | Professional packaging = professional work | Single notebook with no supporting files |
Spaced Repetition Checkpoints
Day 0 -- Initial Learning
- Read this entire page
- Refactor one of your old notebooks using the 7-section structure
- Add type hints and docstrings to three functions in an existing project
- Complete the self-assessment
Day 3 -- First Recall
- Without looking, write the 7-section notebook template from memory
- Write a load_and_validate function with error handling from scratch
- Create a requirements.txt for a current project
Day 7 -- Practice
- Do Practice Problem 1 (refactoring) without looking at hints
- Take an old take-home or project and apply the pre-submission checklist
- Write three unit tests for one of your feature engineering functions
Day 14 -- Application
- Complete a mock take-home with full code quality standards in 4 hours
- Have a peer review your submission using the anti-pattern table
- Do Practice Problem 3 (code review) under timed conditions (10 minutes)
Day 21 -- Mock Review
- Submit a take-home to a friend or mentor for code quality feedback
- Time yourself applying the pre-submission checklist (should take < 15 minutes)
- Review any areas where you still default to bad habits
Key Takeaways
- Code quality is the tiebreaker. When two candidates have similar model performance, the one with cleaner code gets the offer. Evaluators hire people they want to work with, and messy code signals messy thinking.
- Structure your notebook like a document, not a scratchpad. Seven clear sections with markdown narrative between code cells lets the evaluator follow your logic without running anything.
- Extract functions for anything that touches both train and test data. This single practice eliminates the most common source of bugs in take-homes and demonstrates production awareness.
- Reproducibility is non-negotiable. Random seeds, pinned dependencies, relative paths, and a README that explains how to run your code. If the evaluator cannot reproduce your results, your results do not count.
- Tests are a signal of engineering maturity. You do not need 100% coverage. You need tests for the logic most likely to be wrong - feature engineering, custom metrics, and edge cases. Even 5 targeted tests set you apart from 90% of candidates.
