
Code Quality Standards - The Silent Evaluator

Reading time: ~40 min | Interview relevance: Critical | Roles: MLE, AI Eng, Data Scientist, Research Engineer, MLOps

The Real Interview Moment

You are reviewing take-home submissions at a Series B startup. Two candidates solved the same churn prediction problem. Both achieved an AUC of 0.87. You open Candidate A's notebook: 47 cells, no markdown, variable names like df2, df_final_v3, temp, commented-out code scattered everywhere, and a single 200-line cell that does feature engineering, training, and evaluation in one block. You have no idea what is happening by cell 15. You close the notebook and write "No hire - cannot assess thought process."

You open Candidate B's notebook: a clean table of contents at the top, each section separated by markdown headers, functions with type hints and docstrings, a requirements.txt pinned to exact versions, a README.md explaining how to reproduce the results, and a final "Summary and Next Steps" section. You can follow the logic in five minutes. You write "Strong hire - clear thinker, production-ready habits."

Both candidates had the same model performance. One got hired. One did not. The difference was code quality. This page teaches you exactly how to be Candidate B.

What You Will Master

  • Structure a Jupyter notebook with professional-grade organization and flow
  • Decompose monolithic cells into clean, reusable functions
  • Apply type hints and docstrings in data science code
  • Handle errors and edge cases gracefully in exploratory code
  • Guarantee reproducibility with seeds, dependency pinning, and environment management
  • Write meaningful tests for data pipelines and model logic
  • Follow clean code principles adapted for data science workflows
  • Create a submission package that signals production readiness

Self-Assessment: Where Are You Now?

| Skill | 1 -- Cannot | 2 -- Vaguely | 3 -- Can Do | 4 -- Consistently | 5 -- Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Organize a notebook with clear sections | | | | | | ___ |
| Extract reusable functions from notebook cells | | | | | | ___ |
| Add type hints to data science functions | | | | | | ___ |
| Handle errors in data loading and preprocessing | | | | | | ___ |
| Set random seeds for full reproducibility | | | | | | ___ |
| Write unit tests for feature engineering | | | | | | ___ |
| Pin dependencies and document environment | | | | | | ___ |
| Create a professional README for a take-home | | | | | | ___ |

Target: All 4s and 5s before you submit any take-home.

Part 1 -- Notebook Organization

The Professional Notebook Structure

Every take-home notebook should follow a consistent structure that allows the evaluator to navigate your thought process in under two minutes.

Professional Notebook Structure - Seven Sections from Header to Summary

Section 1: Header and Setup

The first cell of your notebook sets the tone. It should contain a title, your name, the date, and a brief problem statement. The second cell should contain all imports, grouped logically.

# Cell 1 - Markdown
"""
# Customer Churn Prediction - Take-Home Assessment
**Candidate:** Jane Smith
**Date:** 2026-03-07
**Time spent:** ~6 hours

## Problem Statement
Predict which customers will churn in the next 30 days using
transaction history, demographics, and engagement data.

## Approach Summary
1. EDA reveals class imbalance (8% churn rate) and strong temporal patterns
2. Feature engineering: RFM features, rolling aggregates, engagement velocity
3. Model: LightGBM with stratified 5-fold CV, optimized for PR-AUC
4. Final PR-AUC: 0.43 (baseline: 0.08)
"""
# Cell 2 - Imports (grouped by category)
# Standard library
import os
import json
import logging
from pathlib import Path
from datetime import datetime, timedelta
from typing import Tuple, Dict, List, Optional

# Data manipulation
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# ML
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import (
    roc_auc_score,
    precision_recall_curve,
    average_precision_score,
    classification_report,
)
from sklearn.preprocessing import StandardScaler
import lightgbm as lgb

# Configuration
plt.style.use("seaborn-v0_8-whitegrid")
pd.set_option("display.max_columns", 50)
pd.set_option("display.max_rows", 100)

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
60-Second Answer

"I organize my notebooks into seven clear sections: header, data loading, EDA, feature engineering, modeling, evaluation, and summary. Each section starts with a markdown cell explaining what I am doing and why. All imports are in one cell at the top. Constants and configuration are defined once. This makes it possible for a reviewer to understand my approach in under two minutes without running any code."

Section Separators and Markdown

Every section should begin with a markdown cell that explains the purpose of that section and any key decisions. Think of markdown cells as the narration of your analysis - they tell the story that your code implements.

Bad notebook flow:

[Code cell] → [Code cell] → [Code cell] → [Code cell] → [Code cell]

Good notebook flow:

[Markdown: What and Why] → [Code cell] → [Markdown: Observation] →
[Code cell] → [Markdown: Decision and Rationale]
Common Trap

Do not over-narrate. Evaluators do not want a paragraph explaining what df.shape does. Save markdown for decisions, observations, and rationale. "I chose PR-AUC over ROC-AUC because the classes are heavily imbalanced (8% positive rate), and we care more about precision at low recall thresholds" is useful. "Now I will check the shape of the dataframe" is noise.

The "Cell Length" Rule

No single code cell should exceed 30 lines. If a cell is longer, it is doing too much. Extract a function, split the cell, or move logic to a utility module.

# BAD - 60-line monolithic cell
# ... loads data, cleans data, creates features, trains model ...

# GOOD - focused cells
# Cell: Load and validate raw data
raw_df = load_and_validate(
    "data/transactions.csv",
    required_columns=["customer_id", "transaction_date", "amount"],
)

# Cell: Create RFM features
rfm_features = create_rfm_features(raw_df, reference_date="2026-01-01")

# Cell: Create engagement features
engagement_features = create_engagement_features(raw_df, window_days=30)
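The 30-line rule can be checked mechanically, because an .ipynb file is plain JSON. The following is a minimal sketch, not a standard tool: the names `find_long_cells`, `load_notebook`, and `MAX_CELL_LINES` are hypothetical, and it assumes the usual nbformat layout in which each cell has a `cell_type` and a `source` field.

```python
import json
from typing import Dict, List, Tuple

MAX_CELL_LINES = 30  # threshold taken from the rule above


def load_notebook(path: str) -> Dict:
    """Read an .ipynb file (plain JSON) into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def find_long_cells(
    notebook: Dict, max_lines: int = MAX_CELL_LINES
) -> List[Tuple[int, int]]:
    """Return (cell_index, line_count) for code cells longer than max_lines."""
    long_cells = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", [])
        # The source may be stored as a list of lines or as a single string.
        text = "".join(source) if isinstance(source, list) else source
        n_lines = len(text.splitlines())
        if n_lines > max_lines:
            long_cells.append((index, n_lines))
    return long_cells


# Tiny in-memory notebook for illustration; a real one would come from
# load_notebook("solution.ipynb").
demo_notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n"]},
        {"cell_type": "code", "source": ["x = 1\n"] * 45},
        {"cell_type": "code", "source": "y = 2\nz = 3\n"},
    ]
}
print(find_long_cells(demo_notebook))  # [(1, 45)]
```

Running a check like this just before submission is cheaper than scrolling through the notebook looking for oversized cells.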

Part 2 -- Function Decomposition

Why Functions Matter in Take-Homes

Evaluators are not just checking whether your code runs. They are checking whether you can write code that a team could maintain. Functions serve three purposes in a take-home:

  1. Readability - A function name like create_rfm_features() is self-documenting
  2. Reusability - The same function can be applied to train and test sets consistently
  3. Testability - Functions can be unit-tested; raw cells cannot

Function Decomposition - From Monolithic Cell to Named Function, Type Hints, Docstring, Error Handling, Unit Test

The Anatomy of a Well-Written Data Science Function

def create_rfm_features(
    transactions: pd.DataFrame,
    customer_id_col: str = "customer_id",
    date_col: str = "transaction_date",
    amount_col: str = "amount",
    reference_date: Optional[str] = None,
) -> pd.DataFrame:
    """Create Recency, Frequency, Monetary features per customer.

    Computes three features for each customer:
    - Recency: days since last transaction
    - Frequency: total number of transactions
    - Monetary: average transaction amount

    Args:
        transactions: Raw transaction DataFrame with at least customer_id,
            transaction_date, and amount columns.
        customer_id_col: Name of the customer identifier column.
        date_col: Name of the transaction date column.
        amount_col: Name of the transaction amount column.
        reference_date: Date to compute recency from. If None, uses max date
            in the dataset.

    Returns:
        DataFrame indexed by customer_id with columns:
        recency_days, frequency, monetary_avg.

    Raises:
        ValueError: If required columns are missing from the input DataFrame.
        ValueError: If transactions DataFrame is empty.

    Example:
        >>> rfm = create_rfm_features(transactions_df)
        >>> rfm.head()
                     recency_days  frequency  monetary_avg
        customer_id
        C001                    3         15         42.50
        C002                   45          2        120.00
    """
    # Validate inputs
    required_cols = {customer_id_col, date_col, amount_col}
    missing_cols = required_cols - set(transactions.columns)
    if missing_cols:
        raise ValueError(f"Missing columns: {missing_cols}")

    if transactions.empty:
        raise ValueError("Input DataFrame is empty")

    df = transactions.copy()
    df[date_col] = pd.to_datetime(df[date_col])

    if reference_date is None:
        ref_date = df[date_col].max()
    else:
        ref_date = pd.to_datetime(reference_date)

    rfm = (
        df.groupby(customer_id_col)
        .agg(
            recency_days=(date_col, lambda x: (ref_date - x.max()).days),
            frequency=(date_col, "count"),
            monetary_avg=(amount_col, "mean"),
        )
    )

    logger.info(
        f"Created RFM features for {len(rfm)} customers. "
        f"Recency range: [{rfm['recency_days'].min()}, {rfm['recency_days'].max()}]"
    )

    return rfm

When to Use Functions vs. Inline Code

| Use a Function When | Keep Inline When |
|---|---|
| Logic is reused on train AND test sets | One-off exploratory visualization |
| Logic exceeds 10 lines | Simple pandas one-liner (df.describe()) |
| Logic has clear input/output contract | Quick sanity check (print shape, dtypes) |
| Logic needs to be tested | Markdown-adjacent explanation code |
| Logic involves non-obvious transformations | Standard library calls with obvious intent |
Instant Rejection

Never apply different preprocessing to train and test sets by writing the logic twice inline. This is the number one source of train-test skew in take-homes. Extract a function and call it on both sets with the same parameters.

# CATASTROPHIC - different logic for train and test
train_df["age_bin"] = pd.cut(train_df["age"], bins=5)
test_df["age_bin"] = pd.cut(test_df["age"], bins=4) # Different bins!

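# CORRECT - single function, explicit bin edges shared by both sets.
# An integer `bins` makes pd.cut derive equal-width edges from each frame's
# own min/max, so train and test could still end up with different bins.
AGE_BIN_EDGES = [0, 25, 35, 50, 65, 120]  # illustrative edges

def bin_age(df: pd.DataFrame, bin_edges: List[float]) -> pd.DataFrame:
    df = df.copy()
    df["age_bin"] = pd.cut(df["age"], bins=bin_edges)
    return df

train_df = bin_age(train_df, AGE_BIN_EDGES)
test_df = bin_age(test_df, AGE_BIN_EDGES)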
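
One subtlety worth knowing here: when `bins` is an integer, `pd.cut` computes equal-width edges from the minimum and maximum of whatever data it receives, so two frames can get different edges even with the same integer. When sensible edges are not known in advance, one option is to learn them from the training data and reuse them everywhere. The sketch below assumes pandas and NumPy are installed; `fit_bin_edges` and `apply_bins` are hypothetical helper names, and opening the outer edges to infinity is a design choice, not the only reasonable one.

```python
import numpy as np
import pandas as pd


def fit_bin_edges(values: pd.Series, n_bins: int = 5) -> np.ndarray:
    """Learn equal-width bin edges from training data only."""
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Open the outer edges so values outside the training range still get a bin.
    edges[0], edges[-1] = -np.inf, np.inf
    return edges


def apply_bins(df: pd.DataFrame, column: str, edges: np.ndarray) -> pd.DataFrame:
    """Assign integer bin codes using previously fitted edges."""
    out = df.copy()
    out[f"{column}_bin"] = pd.cut(out[column], bins=edges, labels=False)
    return out


train = pd.DataFrame({"age": [20, 30, 40, 50, 60]})
test = pd.DataFrame({"age": [18, 35, 75]})

edges = fit_bin_edges(train["age"], n_bins=4)  # fitted on train only
train_binned = apply_bins(train, "age", edges)
test_binned = apply_bins(test, "age", edges)   # same edges reused

print(test_binned["age_bin"].tolist())  # [0, 1, 3]
```

The ages 18 and 75 fall outside the training range of 20 to 60, yet they still land in the first and last bins because the outer edges were opened.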

Extracting Functions: A Step-by-Step Process

When you have a working monolithic cell, follow this process to refactor it:

  1. Identify the inputs - What data does this block need?
  2. Identify the outputs - What does it produce?
  3. Name the operation - What verb describes the transformation?
  4. Extract and parameterize - Move hardcoded values to parameters with defaults
  5. Add types and docstring - Document the contract
  6. Validate inputs - Add checks for common errors
  7. Test - Call it on a small sample and verify output
# BEFORE: Monolithic feature engineering cell (40 lines)
df["days_since_signup"] = (pd.Timestamp("2026-01-01") - df["signup_date"]).dt.days
df["log_revenue"] = np.log1p(df["total_revenue"])
df["orders_per_month"] = df["total_orders"] / (df["days_since_signup"] / 30)
df["avg_order_value"] = df["total_revenue"] / df["total_orders"].clip(lower=1)
# ... 30 more lines ...

# AFTER: Clean function calls
df = add_temporal_features(df, reference_date="2026-01-01")
df = add_revenue_features(df)
df = add_behavioral_features(df)
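
To make the seven steps concrete, here is one possible body for `add_revenue_features`. The source only shows the call, so everything inside the function is an assumption: the column names come from the "BEFORE" cell, and the parameter names `revenue_col` and `orders_col` are hypothetical.

```python
import numpy as np
import pandas as pd


def add_revenue_features(
    df: pd.DataFrame,
    revenue_col: str = "total_revenue",
    orders_col: str = "total_orders",
) -> pd.DataFrame:
    """Add log-revenue and average-order-value columns (illustrative sketch).

    Raises:
        ValueError: If either input column is missing.
    """
    # Step 6: validate inputs before touching the data.
    missing = {revenue_col, orders_col} - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")

    out = df.copy()  # never mutate the caller's DataFrame
    out["log_revenue"] = np.log1p(out[revenue_col])
    # clip(lower=1) avoids dividing by zero for customers with no orders.
    out["avg_order_value"] = out[revenue_col] / out[orders_col].clip(lower=1)
    return out


# Step 7: call it on a small sample and verify the output.
sample = pd.DataFrame({"total_revenue": [0.0, 100.0], "total_orders": [0, 4]})
features = add_revenue_features(sample)
print(features["avg_order_value"].tolist())  # [0.0, 25.0]
```

Returning a copy keeps the function safe to call on both the train and test frames without side effects.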

Part 3 -- Type Hints and Docstrings

Type Hints for Data Science

Type hints are not just for software engineers. In data science code, they communicate the expected data contract at a glance.

from typing import Tuple, Dict, List, Optional, Union
import lightgbm as lgb
import numpy as np
import pandas as pd
from numpy.typing import NDArray
from sklearn.metrics import average_precision_score, roc_auc_score


def train_evaluate_model(
    X_train: pd.DataFrame,
    y_train: pd.Series,
    X_val: pd.DataFrame,
    y_val: pd.Series,
    params: Dict[str, Union[int, float, str]],
    feature_names: Optional[List[str]] = None,
) -> Tuple[lgb.Booster, Dict[str, float]]:
    """Train a LightGBM model and return it with evaluation metrics.

    Args:
        X_train: Training features.
        y_train: Training labels (binary).
        X_val: Validation features.
        y_val: Validation labels (binary).
        params: LightGBM hyperparameters.
        feature_names: Subset of columns to use. If None, uses all columns.

    Returns:
        Tuple of (trained model, dict of evaluation metrics).
    """
    if feature_names is not None:
        X_train = X_train[feature_names]
        X_val = X_val[feature_names]

    train_data = lgb.Dataset(X_train, label=y_train)
    val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)

    model = lgb.train(
        params,
        train_data,
        valid_sets=[val_data],
        num_boost_round=1000,
        callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)],
    )

    # Predicted probabilities for the positive class
    y_pred = model.predict(X_val)
    metrics = {
        "roc_auc": roc_auc_score(y_val, y_pred),
        "pr_auc": average_precision_score(y_val, y_pred),
    }

    return model, metrics

Common Type Hint Patterns in Data Science

# DataFrames and Series
def process(df: pd.DataFrame) -> pd.DataFrame: ...
def get_labels(df: pd.DataFrame) -> pd.Series: ...

# NumPy arrays
def normalize(arr: NDArray[np.float64]) -> NDArray[np.float64]: ...

# Multiple return values
def split_data(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]: ...

# Optional parameters
def plot_results(
    metrics: Dict[str, float],
    save_path: Optional[str] = None,
    figsize: Tuple[int, int] = (10, 6),
) -> None: ...

# Union types for flexibility
def load_data(source: Union[str, Path]) -> pd.DataFrame: ...
Practical Reality

You do not need to type-hint every helper function in an exploratory notebook. Focus type hints on the core pipeline functions - data loading, feature engineering, model training, and evaluation. These are the functions the evaluator will read most carefully, and type hints there signal maturity without adding bureaucratic overhead everywhere.

Docstring Styles

Use the Google style for consistency. It is compact and readable in notebooks.

def compute_feature_importance(
    model: lgb.Booster,
    feature_names: List[str],
    importance_type: str = "gain",
    top_n: int = 20,
) -> pd.DataFrame:
    """Compute and format feature importance from a trained model.

    Args:
        model: Trained LightGBM Booster.
        feature_names: List of feature names matching model input.
        importance_type: Type of importance. One of 'gain', 'split'.
        top_n: Number of top features to return.

    Returns:
        DataFrame with columns 'feature' and 'importance', sorted descending.

    Raises:
        ValueError: If importance_type is not 'gain' or 'split'.
    """
    if importance_type not in ("gain", "split"):
        raise ValueError(
            f"importance_type must be 'gain' or 'split', got '{importance_type}'"
        )

    importance = model.feature_importance(importance_type=importance_type)

    importance_df = (
        pd.DataFrame({"feature": feature_names, "importance": importance})
        .sort_values("importance", ascending=False)
        .head(top_n)
        .reset_index(drop=True)
    )

    return importance_df
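
A Google-style `Example:` section can double as a lightweight test, because the standard library's `doctest` module executes the `>>>` lines and compares the output. The sketch below uses a deliberately tiny, hypothetical function (`clip_rate`) so it runs without any data science dependencies.

```python
import doctest


def clip_rate(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero.

    Args:
        numerator: Count of events.
        denominator: Count of opportunities.

    Returns:
        The rate as a float, or 0.0 for an empty denominator.

    Example:
        >>> clip_rate(1, 4)
        0.25
        >>> clip_rate(3, 0)
        0.0
    """
    if denominator == 0:
        return 0.0
    return numerator / denominator


# Parse the docstring examples of this one function and execute them.
parser = doctest.DocTestParser()
examples = parser.get_doctest(
    clip_rate.__doc__, {"clip_rate": clip_rate}, "clip_rate", None, 0
)
results = doctest.DocTestRunner(verbose=False).run(examples)
print(results)  # TestResults(failed=0, attempted=2)
```

For code that lives in a module, `python -m doctest src/features.py` or pytest's `--doctest-modules` option runs the same kind of check without the manual parser setup.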

Part 4 -- Error Handling

Defensive Coding in Data Science

Data is messy. Your code should handle the mess gracefully instead of crashing with an inscrutable traceback. Evaluators look for evidence that you anticipate real-world data problems.

Defensive Coding Pattern - Validate Input, Process, Check Output, Handle Failures

The Three Layers of Data Validation

Layer 1: Schema Validation (on load)

def load_and_validate(
    filepath: Union[str, Path],
    required_columns: List[str],
    date_columns: Optional[List[str]] = None,
) -> pd.DataFrame:
    """Load a CSV and validate its schema before any processing.

    Args:
        filepath: Path to the CSV file.
        required_columns: Columns that must be present.
        date_columns: Columns to parse as datetime.

    Returns:
        Validated DataFrame with correct dtypes.

    Raises:
        FileNotFoundError: If the file does not exist.
        ValueError: If required columns are missing.
    """
    filepath = Path(filepath)
    if not filepath.exists():
        raise FileNotFoundError(f"Data file not found: {filepath}")

    df = pd.read_csv(filepath, parse_dates=date_columns)

    missing = set(required_columns) - set(df.columns)
    if missing:
        raise ValueError(
            f"Missing required columns: {missing}. "
            f"Available columns: {list(df.columns)}"
        )

    logger.info(f"Loaded {len(df)} rows, {len(df.columns)} columns from {filepath}")

    return df

Layer 2: Data Quality Checks (during EDA)

from typing import Any, Dict


def check_data_quality(df: pd.DataFrame) -> Dict[str, Any]:
    """Run data quality checks and return a summary report.

    Returns a dictionary with quality metrics. Does NOT raise errors -
    instead, logs warnings for issues that should be investigated.
    """
    report = {
        "n_rows": len(df),
        "n_cols": len(df.columns),
        "duplicate_rows": df.duplicated().sum(),
        "null_counts": df.isnull().sum().to_dict(),
        "null_pct": (df.isnull().sum() / len(df) * 100).to_dict(),
    }

    if report["duplicate_rows"] > 0:
        logger.warning(
            f"Found {report['duplicate_rows']} duplicate rows "
            f"({report['duplicate_rows']/len(df)*100:.1f}%)"
        )

    # Flag columns where more than half of the values are missing
    high_null_cols = {
        col: pct
        for col, pct in report["null_pct"].items()
        if pct > 50
    }
    if high_null_cols:
        logger.warning(f"Columns with >50% nulls: {high_null_cols}")

    return report

Layer 3: Output Validation (after transformation)

def validate_features(
    features: pd.DataFrame,
    expected_rows: int,
    no_null_columns: Optional[List[str]] = None,
) -> None:
    """Validate feature DataFrame after engineering.

    Args:
        features: The feature DataFrame to validate.
        expected_rows: Expected number of rows.
        no_null_columns: Columns that must have zero nulls.

    Raises:
        AssertionError: If any validation check fails.
    """
    assert len(features) == expected_rows, (
        f"Row count mismatch: expected {expected_rows}, got {len(features)}"
    )

    inf_cols = [
        col for col in features.select_dtypes(include=[np.number]).columns
        if np.isinf(features[col]).any()
    ]
    assert not inf_cols, f"Infinite values found in columns: {inf_cols}"

    if no_null_columns:
        null_cols = [
            col for col in no_null_columns
            if features[col].isnull().any()
        ]
        assert not null_cols, f"Unexpected nulls in columns: {null_cols}"
Common Trap

Do not silently drop rows with missing values. Every row you drop should be logged with a reason. Evaluators who see df.dropna() without explanation will wonder whether you introduced survivorship bias.

# BAD - silent data loss
df = df.dropna()

# GOOD - explicit, logged, justified
n_before = len(df)
df = df.dropna(subset=["target_variable"])
n_after = len(df)
logger.info(
    f"Dropped {n_before - n_after} rows with missing target "
    f"({(n_before - n_after) / n_before * 100:.1f}%)"
)
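
If rows are dropped in more than one place, the logging pattern above can be wrapped in a small helper so every drop is reported the same way. This is a sketch under assumptions: `drop_missing` is a hypothetical name, and the `reason` argument is simply free text that ends up in the log.

```python
import logging
from typing import List

import pandas as pd

logger = logging.getLogger(__name__)


def drop_missing(df: pd.DataFrame, subset: List[str], reason: str) -> pd.DataFrame:
    """Drop rows with nulls in `subset` and log how many were removed and why."""
    n_before = len(df)
    cleaned = df.dropna(subset=subset)
    n_dropped = n_before - len(cleaned)
    # Guard against an empty input so the percentage never divides by zero.
    pct = (n_dropped / n_before * 100) if n_before else 0.0
    logger.info("Dropped %d rows (%.1f%%): %s", n_dropped, pct, reason)
    return cleaned


raw = pd.DataFrame(
    {"target_variable": [1, None, 0, None], "age": [30, 41, None, 25]}
)
labeled = drop_missing(raw, subset=["target_variable"], reason="missing target")
print(len(labeled))  # 2
```

Only rows with a missing target are removed; the row with a missing `age` is kept, so that decision stays visible and can be made separately.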

Part 5 -- Reproducibility

The Reproducibility Checklist

If an evaluator clones your repository and runs your notebook, they should get exactly the same results. This is non-negotiable.

Reproducibility Checklist - Random Seeds, Dependency Pinning, Data Versioning, Environment Documentation

Setting Random Seeds Properly

Setting np.random.seed(42) is not enough. You must seed every library that uses randomness.

import random
import os
import numpy as np

def set_all_seeds(seed: int = 42) -> None:
    """Set random seeds for full reproducibility.

    Sets seeds for Python's random module, NumPy, and optionally
    PyTorch and TensorFlow if they are available.

    Args:
        seed: The random seed value.
    """
    random.seed(seed)
    np.random.seed(seed)
    # Note: this only affects Python processes started afterwards. To fix hash
    # randomization for the current kernel, set PYTHONHASHSEED before launching it.
    os.environ["PYTHONHASHSEED"] = str(seed)

    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed; nothing to seed

    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass  # TensorFlow not installed; nothing to seed

    logger.info(f"All random seeds set to {seed}")


# Call at the very top of your notebook
SEED = 42
set_all_seeds(SEED)
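
Global seeds depend on cells running in the same order every time. A complementary pattern is to pass the seed into each stochastic function and build a generator local to that call, so re-running a cell cannot shift the results of later ones. The sketch below uses only the standard library; `sample_customer_ids` is a hypothetical example function.

```python
import random
from typing import List


def sample_customer_ids(ids: List[str], k: int, seed: int) -> List[str]:
    """Draw k ids using a generator local to this call."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    return rng.sample(ids, k)


customer_ids = [f"C{i:03d}" for i in range(100)]

first_run = sample_customer_ids(customer_ids, k=5, seed=42)
second_run = sample_customer_ids(customer_ids, k=5, seed=42)

# Same seed and same inputs give the same sample, regardless of call order.
print(first_run == second_run)  # True
```

Comparing two runs like this is also a quick self-check before submission: if the comparison fails, some source of randomness is still unseeded.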

Dependency Pinning

Always include a requirements.txt with exact versions. Evaluators should be able to recreate your environment.

# requirements.txt - pinned for reproducibility
numpy==1.26.4
pandas==2.2.1
scikit-learn==1.4.1
lightgbm==4.3.0
matplotlib==3.8.3
seaborn==0.13.2
jupyter==1.0.0

Generate this automatically:

pip freeze > requirements.txt

Or better, use a minimal requirements file listing only what you directly import:

# Generate minimal requirements
pip install pipreqs
pipreqs . --force
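
Another way to build a minimal pinned list, shown here as a sketch, is to ask the running environment which version of each directly imported package is installed, using the standard library's `importlib.metadata`. The function name `pinned_requirements` and the `DIRECT_DEPENDENCIES` list are illustrative assumptions; the list should match whatever the notebook actually imports.

```python
from importlib import metadata
from typing import Iterable, List


def pinned_requirements(packages: Iterable[str]) -> List[str]:
    """Return 'name==version' lines for installed packages; flag missing ones."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Keep a visible marker instead of silently skipping the package.
            lines.append(f"# {name}: not installed in this environment")
    return lines


# Distribution names as published on PyPI (e.g. scikit-learn, not sklearn).
DIRECT_DEPENDENCIES = ["numpy", "pandas", "scikit-learn", "lightgbm"]
print("\n".join(pinned_requirements(DIRECT_DEPENDENCIES)))
```

The printed lines can be pasted into requirements.txt. Note that the lookup uses distribution names, which sometimes differ from import names.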

The README Template

Every take-home submission should include a README:

# README.md template for take-home submissions
README_TEMPLATE = """
# {Project Title} - Take-Home Assessment

## Quick Start
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\\Scripts\\activate
pip install -r requirements.txt
jupyter notebook solution.ipynb
```

## Project Structure
```
.
├── README.md # This file
├── requirements.txt # Pinned dependencies
├── solution.ipynb # Main analysis notebook
├── src/
│ ├── features.py # Feature engineering functions
│ ├── evaluation.py # Evaluation utilities
│ └── visualization.py # Plotting helpers
├── tests/
│ ├── test_features.py # Feature engineering tests
│ └── test_evaluation.py # Evaluation metric tests
├── data/
│ └── README.md # Data description and source
└── outputs/
├── figures/ # Generated plots
└── model/ # Saved model artifacts
```

## Approach Summary
{Brief 3-4 sentence summary of methodology and key results}

## Key Results
- Metric 1: value
- Metric 2: value
- Baseline comparison: improvement

## Reproducibility
- Python {version}
- All random seeds set to 42
- Expected runtime: ~{X} minutes on {hardware description}

## Assumptions and Limitations
{List of explicit assumptions and known limitations}
"""
Evaluator's Perspective

A README tells me two things: (1) this candidate thinks about the person who has to read their code, and (2) they have experience working on teams where documentation matters. In a stack of 30 submissions, the one with a clear README gets read first.

Part 6 -- Testing in Take-Homes

Why Tests Matter (Even in Notebooks)

You do not need 100% test coverage. You need tests for the logic that is most likely to be wrong: feature engineering transformations, custom metrics, and data preprocessing steps.

What to Test

What to Test - Feature Engineering, Custom Metrics, Data Transformations, and Edge Cases

Writing Tests for Feature Engineering

# tests/test_features.py
import pytest
import pandas as pd
import numpy as np
from src.features import create_rfm_features, create_engagement_features


@pytest.fixture
def sample_transactions() -> pd.DataFrame:
    """Create a minimal transaction DataFrame for testing."""
    return pd.DataFrame({
        "customer_id": ["A", "A", "A", "B", "B"],
        "transaction_date": pd.to_datetime([
            "2026-01-01", "2026-01-15", "2026-02-01",
            "2026-01-10", "2026-01-20",
        ]),
        "amount": [100.0, 50.0, 75.0, 200.0, 30.0],
    })


class TestRFMFeatures:
    """Tests for RFM feature engineering."""

    def test_output_shape(self, sample_transactions: pd.DataFrame) -> None:
        """Output should have one row per customer."""
        rfm = create_rfm_features(
            sample_transactions, reference_date="2026-03-01"
        )
        assert len(rfm) == 2  # Two unique customers

    def test_output_columns(self, sample_transactions: pd.DataFrame) -> None:
        """Output should contain exactly the expected columns."""
        rfm = create_rfm_features(
            sample_transactions, reference_date="2026-03-01"
        )
        expected_cols = {"recency_days", "frequency", "monetary_avg"}
        assert set(rfm.columns) == expected_cols

    def test_recency_values(self, sample_transactions: pd.DataFrame) -> None:
        """Recency should be days since last transaction."""
        rfm = create_rfm_features(
            sample_transactions, reference_date="2026-03-01"
        )
        # Customer A's last transaction: 2026-02-01, ref: 2026-03-01 = 28 days
        assert rfm.loc["A", "recency_days"] == 28

    def test_frequency_values(self, sample_transactions: pd.DataFrame) -> None:
        """Frequency should be count of transactions."""
        rfm = create_rfm_features(
            sample_transactions, reference_date="2026-03-01"
        )
        assert rfm.loc["A", "frequency"] == 3
        assert rfm.loc["B", "frequency"] == 2

    def test_monetary_values(self, sample_transactions: pd.DataFrame) -> None:
        """Monetary should be average transaction amount."""
        rfm = create_rfm_features(
            sample_transactions, reference_date="2026-03-01"
        )
        assert rfm.loc["A", "monetary_avg"] == pytest.approx(75.0)
        assert rfm.loc["B", "monetary_avg"] == pytest.approx(115.0)

    def test_missing_columns_raises(self) -> None:
        """Should raise ValueError for missing required columns."""
        bad_df = pd.DataFrame({"wrong_col": [1, 2, 3]})
        with pytest.raises(ValueError, match="Missing columns"):
            create_rfm_features(bad_df)

    def test_empty_dataframe_raises(self) -> None:
        """Should raise ValueError for empty input."""
        empty_df = pd.DataFrame(
            columns=["customer_id", "transaction_date", "amount"]
        )
        with pytest.raises(ValueError, match="empty"):
            create_rfm_features(empty_df)

    def test_no_nulls_in_output(self, sample_transactions: pd.DataFrame) -> None:
        """Output should contain no null values."""
        rfm = create_rfm_features(
            sample_transactions, reference_date="2026-03-01"
        )
        assert rfm.isnull().sum().sum() == 0

Running Tests in a Notebook

If you want to keep everything in a single notebook (some evaluators prefer this), you can run tests inline:

# Cell - Quick validation tests (run inline)
def run_quick_tests() -> None:
    """Run quick validation tests for key functions."""

    # Test 1: Feature engineering produces correct shape
    test_df = transactions.head(100)
    test_features = create_rfm_features(test_df, reference_date="2026-03-01")
    n_customers = test_df["customer_id"].nunique()
    assert len(test_features) == n_customers, (
        f"Expected {n_customers} rows, got {len(test_features)}"
    )

    # Test 2: No nulls in critical features
    assert test_features.isnull().sum().sum() == 0, "Nulls found in features"

    # Test 3: No infinite values
    numeric_cols = test_features.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        assert not np.isinf(test_features[col]).any(), f"Inf found in {col}"

    # Test 4: Feature values are in reasonable ranges
    assert (test_features["recency_days"] >= 0).all(), "Negative recency"
    assert (test_features["frequency"] >= 1).all(), "Zero frequency"
    assert (test_features["monetary_avg"] > 0).all(), "Non-positive monetary"

    print("All quick tests passed!")


run_quick_tests()
Time Budget

Testing should take no more than 10-15% of your total time on a take-home. For a 6-hour project, that is roughly 35-55 minutes, spent on tests for the 3-4 most critical functions. Do not aim for full coverage - aim for confidence in your core logic.

Part 7 -- Clean Code Principles for Data Science

Naming Conventions

# BAD names - what do these mean?
df2 = process(df1)
temp = df2.groupby("x").agg({"y": "mean"})
result = temp.merge(df3, on="id")
X = result.drop("target", axis=1)

# GOOD names - self-documenting
customer_features = engineer_features(raw_transactions)
avg_revenue_by_segment = customer_features.groupby("segment").agg(
    {"revenue": "mean"}
)
# Merge at customer level: the segment-level aggregate has no customer_id
enriched_customers = customer_features.merge(
    demographics, on="customer_id"
)
X_train = enriched_customers.drop("churn_label", axis=1)

The No Dead Code Rule

Remove all dead code before submission. Commented-out code, unused imports, and abandoned experiments make your notebook look messy and undermine confidence in your work.

# BAD - graveyard of abandoned experiments
# from sklearn.ensemble import RandomForestClassifier # tried this, didn't work
# model = RandomForestClassifier(n_estimators=100)
# model = RandomForestClassifier(n_estimators=500) # better but slow
# model = GradientBoostingClassifier() # keep this maybe?
model = lgb.LGBMClassifier() # final choice

# GOOD - clean, with rationale in markdown
# Markdown cell: "Chose LightGBM over Random Forest based on 5-fold CV
# comparison (LightGBM PR-AUC: 0.43 vs RF PR-AUC: 0.38). See Section 5
# for the full comparison table."
model = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    random_state=SEED,
)

Constants and Configuration

Define all constants in one place at the top of your notebook.

# === Configuration ===
SEED = 42
TEST_SIZE = 0.2
N_FOLDS = 5
TARGET_COL = "churned"
ID_COL = "customer_id"
DATE_COL = "event_date"

# Feature engineering parameters
RFM_REFERENCE_DATE = "2026-01-01"
ROLLING_WINDOWS = [7, 14, 30, 60]
MIN_TRANSACTIONS = 3

# Model hyperparameters
LGBM_PARAMS = {
    "objective": "binary",
    "metric": "average_precision",
    "learning_rate": 0.05,
    "max_depth": 6,
    "num_leaves": 31,
    "min_child_samples": 20,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "reg_alpha": 0.1,
    "reg_lambda": 0.1,
    "random_state": SEED,
    "verbose": -1,
}

The Pipeline Pattern

For multi-step transformations, use a pipeline pattern to keep the flow clear and consistent.

from typing import Callable, List

def build_feature_pipeline(
    steps: List[Callable[[pd.DataFrame], pd.DataFrame]],
) -> Callable[[pd.DataFrame], pd.DataFrame]:
    """Compose multiple feature engineering steps into a single function.

    Args:
        steps: List of functions, each taking and returning a DataFrame.

    Returns:
        A single function that applies all steps in order.

    Example:
        >>> pipeline = build_feature_pipeline([
        ...     add_temporal_features,
        ...     add_rfm_features,
        ...     add_engagement_features,
        ...     drop_raw_columns,
        ... ])
        >>> features = pipeline(raw_df)
    """
    def pipeline(df: pd.DataFrame) -> pd.DataFrame:
        result = df.copy()
        for step in steps:
            n_before = len(result.columns)
            result = step(result)
            n_after = len(result.columns)
            logger.info(
                f"{step.__name__}: {n_before} -> {n_after} columns"
            )
        return result

    return pipeline


# Usage - apply the same pipeline to train and test
feature_pipeline = build_feature_pipeline([
    add_temporal_features,
    add_rfm_features,
    add_engagement_features,
    encode_categoricals,
    drop_raw_columns,
])

train_features = feature_pipeline(train_df)
test_features = feature_pipeline(test_df)
Evaluator's Perspective

When I see a pipeline pattern in a take-home, I know this candidate has worked on production ML systems. It shows they understand that the same transformations must apply to both training and serving data, which is a critical production requirement that most junior candidates miss.

Part 8 -- Project Structure for Multi-File Submissions

When to Go Beyond a Single Notebook

For take-homes that allow 8+ hours, consider splitting your code into modules. This demonstrates software engineering maturity.

Project Structure by Time Allocation - Single Notebook, Notebook Plus Modules, Full Project

The Standard Layout

Standard Take-Home Project Layout - Code Quality

The config.py Pattern

# src/config.py
"""Central configuration for the take-home project."""
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Config:
    """Immutable configuration for the analysis pipeline."""

    # Data
    raw_data_path: str = "data/transactions.csv"
    target_col: str = "churned"
    id_col: str = "customer_id"

    # Feature engineering
    reference_date: str = "2026-01-01"
    rolling_windows: Tuple[int, ...] = (7, 14, 30, 60)
    min_transactions: int = 3

    # Model
    seed: int = 42
    n_folds: int = 5
    test_size: float = 0.2

    # Outputs
    output_dir: str = "outputs"
    figures_dir: str = "outputs/figures"
    metrics_path: str = "outputs/metrics.json"


# Singleton instance
config = Config()
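
The point of `frozen=True` is that the configuration cannot drift halfway through a notebook. The sketch below shows the behaviour on a cut-down stand-in class (`RunConfig` is a hypothetical name used so the example is self-contained): mutation is rejected, `dataclasses.replace` creates a modified copy, and `asdict` makes the settings easy to store alongside the metrics.

```python
import json
from dataclasses import FrozenInstanceError, asdict, dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    """Cut-down stand-in for the Config class above (hypothetical name)."""

    seed: int = 42
    n_folds: int = 5
    test_size: float = 0.2


base = RunConfig()

# Frozen instances reject in-place mutation.
try:
    base.seed = 7
    mutated = True
except FrozenInstanceError:
    mutated = False

# replace() builds a modified copy and leaves the original untouched.
quick_run = replace(base, n_folds=3)

# asdict() turns the settings into a plain dict that can be written to JSON.
serialized = json.dumps(asdict(quick_run), sort_keys=True)
print(serialized)  # {"n_folds": 3, "seed": 42, "test_size": 0.2}
```

Saving the serialized configuration next to the results lets a reviewer see exactly which settings produced each number.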

Part 9 -- Code Quality Anti-Patterns

The Hall of Shame

These patterns will cost you the offer. Each one signals a different kind of immaturity.

| Anti-Pattern | What It Signals | Fix |
|---|---|---|
| df2, df_final, df_final_v2 | No naming discipline | Use descriptive names: customer_features, enriched_customers |
| # TODO: fix this later | Unfinished work left visible | Either fix it or remove the comment |
| Commented-out code blocks | Messy experimentation habits | Delete dead code; use git for history |
| import * | Does not understand namespaces | Import specific names |
| Hardcoded file paths (/Users/john/data/) | Not portable | Use relative paths or config |
| try: except: pass | Swallowing errors silently | Catch specific exceptions, log them |
| Print statements for debugging | Not using logging | Use logging module |
| Mixing tabs and spaces | Editor configuration issues | Use a formatter or linter (black, ruff) |
| No .gitignore | Committing data, caches, venvs | Add standard Python .gitignore |
| 500-line cells | Cannot decompose logic | Max 30 lines per cell |
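
As an illustration of the "catch specific exceptions, log them" fix, here is a minimal sketch that replaces a bare `try: except: pass`. The function `parse_metrics` is hypothetical; the pattern is what matters: name the exceptions you expect, log what went wrong, and return an explicit fallback.

```python
import json
import logging
from typing import Dict, Optional

logger = logging.getLogger(__name__)


def parse_metrics(raw_text: str) -> Optional[Dict[str, float]]:
    """Parse a JSON metrics string, logging (not hiding) malformed input."""
    try:
        parsed = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        # Only the error we expect is caught; anything else still surfaces.
        logger.warning("Could not parse metrics JSON: %s", exc)
        return None

    if not isinstance(parsed, dict):
        logger.warning("Expected a JSON object, got %s", type(parsed).__name__)
        return None

    try:
        return {key: float(value) for key, value in parsed.items()}
    except (TypeError, ValueError) as exc:
        logger.warning("Non-numeric metric value: %s", exc)
        return None


print(parse_metrics('{"pr_auc": 0.43}'))  # {'pr_auc': 0.43}
print(parse_metrics("not json"))          # None
```

Unexpected failures such as a `KeyboardInterrupt` or a genuine bug still propagate, which a bare `except` would have hidden.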
Instant Rejection

Hardcoded absolute paths like /Users/yourname/Desktop/data.csv are an instant credibility killer. They tell the evaluator your code cannot run on any machine except yours. Always use relative paths or a configuration file.
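
One way to avoid absolute paths, sketched below, is to locate the project root at runtime by walking upward until a marker file is found, then build every path relative to it. The helper name `find_project_root` and the choice of requirements.txt as the marker are assumptions for illustration; the demo creates a throwaway directory so the example runs anywhere.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_project_root(
    start: Optional[Path] = None, marker: str = "requirements.txt"
) -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    current = (start or Path.cwd()).resolve()
    for candidate in [current, *current.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker!r} above {current}")


# Demo in a temporary directory that mimics a small project layout.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp).resolve()
    (demo_root / "requirements.txt").write_text("")
    nested_dir = demo_root / "notebooks" / "drafts"
    nested_dir.mkdir(parents=True)

    found_root = find_project_root(nested_dir)
    data_path = found_root / "data" / "transactions.csv"
    # Path relative to the project root, independent of the machine.
    relative_data_path = data_path.relative_to(found_root).as_posix()

print(relative_data_path)  # data/transactions.csv
```

In a notebook, calling `find_project_root()` with no arguments starts from the current working directory, so the same code works whether the notebook sits at the top level or in a subfolder.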

Pre-Submission Checklist

Run through this checklist before submitting:

PRE_SUBMISSION_CHECKLIST = """
Code Quality Checklist - Run Before Submission
================================================

[ ] Notebook runs top-to-bottom without errors (Kernel > Restart & Run All)
[ ] All imports are at the top, no unused imports
[ ] No commented-out code blocks
[ ] No hardcoded absolute paths
[ ] No print() for debugging - use logging
[ ] All functions have type hints and docstrings
[ ] Constants defined in one place (top of notebook or config.py)
[ ] Random seeds set for all libraries
[ ] requirements.txt with pinned versions included
[ ] README.md with setup instructions included
[ ] No large files (data, model artifacts) committed to git
[ ] .gitignore includes __pycache__, .ipynb_checkpoints, data/, *.pkl
[ ] Cell outputs cleared and re-run (no stale outputs)
[ ] Variable names are descriptive (no df2, temp, x, result)
[ ] Markdown cells explain decisions and observations, not obvious code
[ ] Final section has summary, key results, and next steps
"""

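A few checklist items can be checked mechanically. The sketch below scans the code cells of an already-parsed notebook for hardcoded absolute paths, bare `except:` clauses, and oversized cells. The function name `audit_notebook` and the regular expressions are assumptions made for this example; they are heuristics, not a complete linter.

```python
import re
from typing import Any, Dict, List

MAX_CELL_LINES = 30  # mirrors the "max 30 lines per cell" guideline

# Heuristic patterns (assumptions for this sketch, not an exhaustive list).
ABSOLUTE_PATH = re.compile(r"""["'](/Users/|/home/|[A-Za-z]:\\)""")
BARE_EXCEPT = re.compile(r"^\s*except\s*:", re.MULTILINE)


def audit_notebook(notebook: Dict[str, Any]) -> List[str]:
    """Return human-readable findings for the code cells of a parsed .ipynb."""
    findings: List[str] = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # Notebook JSON stores source as one string or as a list of lines.
        text = "".join(source) if isinstance(source, list) else source
        if ABSOLUTE_PATH.search(text):
            findings.append(f"cell {index}: hardcoded absolute path")
        if BARE_EXCEPT.search(text):
            findings.append(f"cell {index}: bare except")
        if len(text.splitlines()) > MAX_CELL_LINES:
            findings.append(f"cell {index}: more than {MAX_CELL_LINES} lines")
    return findings


sample = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Load data"]},
        {"cell_type": "code", "source": ['df = pd.read_csv("/Users/john/data/x.csv")\n']},
        {"cell_type": "code", "source": "try:\n    run()\nexcept:\n    pass\n"},
    ]
}
print(audit_notebook(sample))
```

In practice you would load the notebook file with `json.load` and pass the resulting dictionary in. An automated pass like this complements Restart & Run All; it does not replace it.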
Practice Problems

Problem 1: Refactor This Cell

You receive the following cell in a take-home notebook. Refactor it into clean, well-structured code with proper functions, type hints, and error handling.

# Original messy cell
df = pd.read_csv("data.csv")
df = df.dropna()
df["date"] = pd.to_datetime(df["date"])
df["days"] = (pd.Timestamp("2026-01-01") - df["date"]).dt.days
df["log_amt"] = np.log(df["amount"])
df["amt_per_day"] = df["amount"] / df["days"]
df2 = df.groupby("user").agg({"days": "min", "amount": ["sum", "mean"], "log_amt": "mean"})
df2.columns = ["recency", "total_spend", "avg_spend", "avg_log_spend"]
X = df2.drop("churn", axis=1)
y = df2["churn"]
model = lgb.LGBMClassifier()
model.fit(X, y)
print(model.score(X, y))
Hint 1 -- Direction

Identify the separate concerns in this cell: data loading, feature engineering, model training, and evaluation. Each should be its own function. Look for bugs too: there are at least two numerical ones (log of zero or negative values, division by zero).

Hint 2 -- Key Issues
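  1. np.log(df["amount"]) - returns -inf for zero amounts and NaN for negative amounts, with at most a RuntimeWarning. Use np.log1p on clipped values.
  2. df["amt_per_day"] = df["amount"] / df["days"] - division by zero when days = 0, which silently yields inf or NaN.
  3. df.dropna() - silent, unjustified data loss.
  4. df2["churn"] - where does this column come from? It is never part of the aggregation, so df2.drop("churn", axis=1) raises a KeyError.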
  5. model.score(X, y) - evaluating on training data only.
  6. No random seed, no train-test split, no cross-validation.
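
Issues 1 and 2 are easy to underestimate because nothing crashes. A small NumPy demonstration, using made-up values, shows the bad numbers that would otherwise flow straight into the model. Clipping negative amounts to zero is a modeling assumption here; in a real submission you would justify it (for example, refunds) in a markdown cell.

```python
import numpy as np

# Made-up illustrative values: one zero amount, one negative amount, one zero-day gap.
amounts = np.array([0.0, -5.0, 100.0])
days = np.array([3, 10, 0])

# Suppress the RuntimeWarnings so the silently produced bad values are easy to see.
with np.errstate(divide="ignore", invalid="ignore"):
    naive_log = np.log(amounts)   # -inf for 0, nan for the negative amount
    naive_rate = amounts / days   # inf where days == 0

# Safer variants: clip before transforming, and floor the denominator at 1.
safe_log = np.log1p(np.clip(amounts, 0, None))
safe_rate = amounts / np.clip(days, 1, None)

print(naive_log, naive_rate)
print(safe_log, safe_rate)
```

The naive arrays contain -inf, nan, and inf; the safe arrays are finite everywhere.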
Hint 3 -- Full Refactored Solution
import logging
from typing import Dict, Tuple

import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

logger = logging.getLogger(__name__)


def load_transactions(filepath: str) -> pd.DataFrame:
    """Load and validate transaction data."""
    df = pd.read_csv(filepath)
    required = {"user", "date", "amount"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    # Parse dates only after confirming the column exists.
    df["date"] = pd.to_datetime(df["date"])
    n_nulls = int(df.isnull().sum().sum())
    if n_nulls > 0:
        logger.warning("Found %d null values; investigate before dropping rows", n_nulls)
    return df


def create_user_features(
    transactions: pd.DataFrame,
    reference_date: str = "2026-01-01",
) -> pd.DataFrame:
    """Create per-user features from transaction history."""
    df = transactions.copy()
    ref = pd.to_datetime(reference_date)
    df["days_since"] = (ref - df["date"]).dt.days
    # Clip before log1p so zero and negative amounts cannot produce -inf or NaN.
    df["log_amount"] = np.log1p(df["amount"].clip(lower=0))
    # Floor the denominator at 1 day to avoid division by zero.
    # Row-level feature; like the original cell, it is not aggregated below.
    df["amount_per_day"] = df["amount"] / df["days_since"].clip(lower=1)

    features = df.groupby("user").agg(
        recency=("days_since", "min"),
        total_spend=("amount", "sum"),
        avg_spend=("amount", "mean"),
        avg_log_spend=("log_amount", "mean"),
    )
    return features


def train_and_evaluate(
    X: pd.DataFrame,
    y: pd.Series,
    seed: int = 42,
    n_folds: int = 5,
) -> Tuple[lgb.LGBMClassifier, Dict[str, float]]:
    """Train with cross-validation and return model + metrics."""
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in cv.split(X, y):
        model = lgb.LGBMClassifier(random_state=seed, verbose=-1)
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        y_pred = model.predict_proba(X.iloc[val_idx])[:, 1]
        scores.append(roc_auc_score(y.iloc[val_idx], y_pred))
    metrics = {
        "cv_auc_mean": float(np.mean(scores)),
        "cv_auc_std": float(np.std(scores)),
    }
    # Refit on all data once the cross-validated estimate is recorded.
    final_model = lgb.LGBMClassifier(random_state=seed, verbose=-1)
    final_model.fit(X, y)
    return final_model, metrics
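
The solution above returns features only, so Issue 4 (where the churn label comes from) still needs an answer before `train_and_evaluate` can run. Here is a minimal sketch, assuming labels arrive in a separate table with one row per user and columns `user` and `churn`. That table layout and the helper name `attach_labels` are hypothetical; the problem statement does not say where labels live.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def attach_labels(features: pd.DataFrame, labels: pd.DataFrame) -> pd.DataFrame:
    """Join one churn label per user onto user-indexed features."""
    if labels["user"].duplicated().any():
        raise ValueError("labels must contain exactly one row per user")
    label_series = labels.set_index("user")["churn"]
    joined = features.join(label_series, how="inner")
    n_dropped = len(features) - len(joined)
    if n_dropped:
        # Surface the loss instead of hiding it behind a silent inner join.
        logger.warning("Dropped %d users without a label", n_dropped)
    return joined


# Made-up illustrative data: user "c" has features but no label.
features = pd.DataFrame(
    {"recency": [3, 10, 1], "total_spend": [120.0, 40.0, 15.0]},
    index=pd.Index(["a", "b", "c"], name="user"),
)
labels = pd.DataFrame({"user": ["a", "b"], "churn": [0, 1]})

labelled = attach_labels(features, labels)
X = labelled.drop(columns="churn")
y = labelled["churn"]
```

Keeping the join in its own function makes the label source explicit and gives you one obvious place to test for duplicated or missing users.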

Scoring Rubric:

  • Strong Hire: Identifies all 6 bugs, separates concerns into 3+ functions, adds type hints and docstrings, uses log1p and clip, implements cross-validation, logs data quality issues.
  • Lean Hire: Separates into functions and fixes 3-5 bugs, but misses subtle issues like the missing churn column after groupby.
  • No Hire: Rearranges code without fixing bugs or adding structure.

Problem 2: Design a Test Suite

Write a test suite for a function encode_categoricals(df, columns, method="target") that performs target encoding on specified columns. Consider edge cases.

Hint 1 -- Direction

Think about: What happens with unseen categories at test time? What if a category has only one example? What about null values in categorical columns? Does the function leak target information?

Hint 2 -- Key Test Cases
  1. Normal case: known categories produce correct encoded values
  2. Unseen categories: should fall back to global mean, not crash
  3. Single-instance categories: should use smoothed estimate, not raw average
  4. Null categories: should handle gracefully
  5. Data leakage: encoding should be fit on train, applied to test
  6. Output shape: same number of rows, same or fewer columns
  7. Determinism: same input produces same output
Hint 3 -- Full Test Suite
import pandas as pd
import pytest

# encode_categoricals, fit_encoder, and apply_encoder are the functions under
# test; import them from wherever your implementation lives.


class TestTargetEncoding:
    @pytest.fixture
    def train_data(self):
        return pd.DataFrame({
            "city": ["NYC", "NYC", "LA", "LA", "CHI", "CHI"],
            "target": [1, 1, 0, 1, 0, 0],
        })

    def test_known_categories_encoded(self, train_data):
        result = encode_categoricals(train_data, ["city"], method="target")
        assert "city" in result.columns
        assert result["city"].dtype == float

    def test_output_shape_preserved(self, train_data):
        result = encode_categoricals(train_data, ["city"], method="target")
        assert len(result) == len(train_data)

    def test_unseen_category_falls_back(self, train_data):
        test_df = pd.DataFrame({"city": ["BOSTON"], "target": [0]})
        encoder = fit_encoder(train_data, ["city"])
        result = apply_encoder(test_df, encoder)
        global_mean = train_data["target"].mean()
        assert result.loc[0, "city"] == pytest.approx(global_mean)

    def test_null_category_handled(self, train_data):
        train_data.loc[0, "city"] = None
        result = encode_categoricals(train_data, ["city"], method="target")
        assert not result["city"].isnull().any()

    def test_no_data_leakage(self, train_data):
        """Each row's encoding should NOT include its own target."""
        result = encode_categoricals(
            train_data, ["city"], method="target", fold_aware=True
        )
        # LA has targets [0, 1]: a leaky in-sample mean encodes both rows as
        # 0.5, whereas leave-one-out encodes them as 1.0 and 0.0. NYC is
        # uninformative here because both of its targets are 1.
        la_encoded = result.loc[train_data["city"] == "LA", "city"]
        assert not (la_encoded == 0.5).all()

    def test_deterministic(self, train_data):
        r1 = encode_categoricals(train_data, ["city"], method="target")
        r2 = encode_categoricals(train_data, ["city"], method="target")
        pd.testing.assert_frame_equal(r1, r2)
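
Several of these tests rely on a fit/apply split and on smoothing. To make that concrete, here is a dependency-free sketch of one common formulation: each category mean is shrunk toward the global mean by a `smoothing` weight. The names `fit_target_encoder` and `apply_target_encoder`, and the exact smoothing formula, are assumptions for illustration and not the implementation under test.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple

Encoder = Tuple[Dict[Hashable, float], float]  # (per-category value, global mean)


def fit_target_encoder(
    categories: Sequence[Hashable],
    targets: Sequence[float],
    smoothing: float = 1.0,
) -> Encoder:
    """Learn smoothed per-category target means from training data only."""
    if len(categories) != len(targets):
        raise ValueError("categories and targets must have the same length")
    if len(targets) == 0:
        raise ValueError("cannot fit on empty data")
    global_mean = sum(targets) / len(targets)
    sums: Dict[Hashable, float] = defaultdict(float)
    counts: Dict[Hashable, int] = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    # Shrink each category mean toward the global mean; rare categories are
    # pulled hardest, which covers the single-instance case.
    mapping = {
        category: (sums[category] + smoothing * global_mean) / (counts[category] + smoothing)
        for category in counts
    }
    return mapping, global_mean


def apply_target_encoder(categories: Sequence[Hashable], encoder: Encoder) -> List[float]:
    """Encode categories; values never seen during fit fall back to the global mean."""
    mapping, global_mean = encoder
    return [mapping.get(category, global_mean) for category in categories]


cities = ["NYC", "NYC", "LA", "LA", "CHI", "CHI"]
churned = [1, 1, 0, 1, 0, 0]
encoder = fit_target_encoder(cities, churned, smoothing=1.0)
encoded_test = apply_target_encoder(["BOSTON", "NYC"], encoder)
```

Fitting on train and applying to test prevents test-set leakage. It does not prevent a training row from seeing its own target; that still needs the leave-one-out or fold-based variant exercised by test_no_data_leakage.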

Problem 3: Code Review

Review the following submission excerpt. List every code quality issue and assign a severity (Critical, Major, Minor).

import pandas as pd, numpy as np, sklearn, lightgbm
from sklearn.model_selection import *

df = pd.read_csv("/Users/candidate/Downloads/interview_data.csv")
df_clean = df.drop_duplicates()
print(f"shape: {df_clean.shape}")

# feature eng
df_clean['f1'] = df_clean['col_a'] * df_clean['col_b']
df_clean['f2'] = df_clean['col_c'].apply(lambda x: 1 if x > 0 else 0)
# df_clean['f3'] = df_clean['col_d'].map(some_dict) # didn't work
# df_clean['f4'] = df_clean['col_e'] ** 2 # maybe later

X = df_clean.drop('target', axis=1)
y = df_clean['target']
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = lightgbm.LGBMClassifier(n_estimators=1000)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
Hint 1 -- Direction

Look for: import style, file paths, data handling, dead code, reproducibility, evaluation method, naming, and documentation.

Hint 2 -- Category of Issues

There are at least 12 issues spanning: imports (2), data loading (1), data handling (1), dead code (2), naming (2), reproducibility (2), evaluation (1), documentation (1).

Hint 3 -- Full Review
| Issue | Severity | Fix |
|---|---|---|
| from sklearn.model_selection import * - wildcard import | Major | Import specific names: from sklearn.model_selection import train_test_split |
| import pandas as pd, numpy as np, sklearn, lightgbm - multiple imports on one line | Minor | One import per line |
| Hardcoded path /Users/candidate/Downloads/ | Critical | Use relative path or config |
| df.drop_duplicates() without explanation | Major | Log how many rows dropped and why |
| Two blocks of commented-out code | Major | Delete dead code |
| Variable names f1, f2 | Major | Use descriptive names: interaction_ab, col_c_positive |
| df_clean mutated in place | Minor | Use functions for feature engineering |
| No random seed in train_test_split | Critical | Add random_state=42 |
| No random seed in LGBMClassifier | Critical | Add random_state=42 |
| model.score() - reports accuracy only, which misleads if classes are imbalanced | Major | Use an appropriate metric (AUC, PR-AUC) |
| No cross-validation | Major | Use StratifiedKFold |
| No markdown cells, no docstrings, no comments | Major | Add narrative structure |
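
As a partial illustration of the fixes in this review, the sketch below rewrites only the data handling and feature engineering lines as small functions with descriptive names. The function names and the tiny example frame are assumptions made for illustration; the feature names follow the suggestions in the review.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def drop_exact_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Remove fully duplicated rows and log how many were dropped."""
    deduplicated = df.drop_duplicates().reset_index(drop=True)
    logger.info("Dropped %d duplicate rows of %d", len(df) - len(deduplicated), len(df))
    return deduplicated


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with descriptively named features; the input is untouched."""
    return df.assign(
        interaction_ab=df["col_a"] * df["col_b"],
        col_c_positive=(df["col_c"] > 0).astype(int),
    )


# Made-up illustrative data: the first two rows are exact duplicates.
raw = pd.DataFrame(
    {"col_a": [1, 1, 2], "col_b": [3, 3, 4], "col_c": [-1.0, -1.0, 0.5]}
)
clean = drop_exact_duplicates(raw)
featured = add_engineered_features(clean)
```

Because `assign` returns a new frame, `clean` stays unchanged, so the cell is safe to re-run and the same function can be applied to held-out data.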

Scoring Rubric:

  • Strong Hire: Identifies 10+ issues with correct severity classifications. Provides specific fixes. Mentions that the entire structure needs reorganization, not just individual line fixes.
  • Lean Hire: Identifies 6-9 issues, catches the critical ones (path, seeds) but misses evaluation and documentation issues.
  • No Hire: Identifies fewer than 6 issues or misclassifies severities (e.g., calling dead code "minor").

Interview Cheat Sheet

| Concept | Key Practice | One-Liner | Red Flag |
|---|---|---|---|
| Notebook structure | 7 sections with markdown narrative | Clear sections = clear thinking | 50 cells with no markdown |
| Function decomposition | Extract reusable logic from cells | Same function for train and test | Copy-paste preprocessing for train vs test |
| Type hints | Annotate core pipeline functions | Types document the data contract | def f(x, y, z) with no hints on key functions |
| Error handling | Validate inputs, log warnings | Defensive code = production code | try: except: pass |
| Reproducibility | Seeds + pinned deps + README | Anyone can re-run and get same results | np.random.seed(42) only, no requirements.txt |
| Testing | Test feature engineering and metrics | Tests prove your logic is correct | No tests and no assertions anywhere |
| Naming | Descriptive variable names | Good names eliminate comments | df2, temp, result, X without context |
| Dead code | Remove all commented-out code | Clean submission = finished work | # TODO, # tried this, commented blocks |
| Constants | Define once at the top | One place to change parameters | Magic numbers scattered in code |
| Project structure | README + requirements + clean layout | Professional packaging = professional work | Single notebook with no supporting files |

Spaced Repetition Checkpoints

Day 0 -- Initial Learning

  • Read this entire page
  • Refactor one of your old notebooks using the 7-section structure
  • Add type hints and docstrings to three functions in an existing project
  • Complete the self-assessment

Day 3 -- First Recall

  • Without looking, write the 7-section notebook template from memory
  • Write a load_and_validate function with error handling from scratch
  • Create a requirements.txt for a current project

Day 7 -- Practice

  • Do Practice Problem 1 (refactoring) without looking at hints
  • Take an old take-home or project and apply the pre-submission checklist
  • Write three unit tests for one of your feature engineering functions

Day 14 -- Application

  • Complete a mock take-home with full code quality standards in 4 hours
  • Have a peer review your submission using the anti-pattern table
  • Do Practice Problem 3 (code review) under timed conditions (10 minutes)

Day 21 -- Mock Review

  • Submit a take-home to a friend or mentor for code quality feedback
  • Time yourself applying the pre-submission checklist (should take < 15 minutes)
  • Review any areas where you still default to bad habits

Key Takeaways

  1. Code quality is the tiebreaker. When two candidates have similar model performance, the one with cleaner code gets the offer. Evaluators hire people they want to work with, and messy code signals messy thinking.

  2. Structure your notebook like a document, not a scratchpad. Seven clear sections with markdown narrative between code cells lets the evaluator follow your logic without running anything.

  3. Extract functions for anything that touches both train and test data. This single practice eliminates the most common source of bugs in take-homes and demonstrates production awareness.

  4. Reproducibility is non-negotiable. Random seeds, pinned dependencies, relative paths, and a README that explains how to run your code. If the evaluator cannot reproduce your results, your results do not count.

  5. Tests are a signal of engineering maturity. You do not need 100% coverage. You need tests for the logic most likely to be wrong - feature engineering, custom metrics, and edge cases. Even 5 targeted tests set you apart from 90% of candidates.

© 2026 EngineersOfAI. All rights reserved.