Code Quality Standards - The Silent Evaluator

Reading time: ~40 min | Interview relevance: Critical | Roles: MLE, AI Eng, Data Scientist, Research Engineer, MLOps

The Real Interview Moment

You are reviewing take-home submissions at a Series B startup. Two candidates solved the same churn prediction problem. Both achieved an AUC of 0.87. You open Candidate A's notebook: 47 cells, no markdown, variable names like df2, df_final_v3, temp, commented-out code scattered everywhere, and a single 200-line cell that does feature engineering, training, and evaluation in one block. You have no idea what is happening by cell 15. You close the notebook and write "No hire - cannot assess thought process."

You open Candidate B's notebook: a clean table of contents at the top, each section separated by markdown headers, functions with type hints and docstrings, a requirements.txt pinned to exact versions, a README.md explaining how to reproduce the results, and a final "Summary and Next Steps" section. You can follow the logic in five minutes. You write "Strong hire - clear thinker, production-ready habits."

Both candidates had the same model performance. One got hired. One did not. The difference was code quality. This page teaches you exactly how to be Candidate B.

What You Will Master

Structure a Jupyter notebook with professional-grade organization and flow
Decompose monolithic cells into clean, reusable functions
Apply type hints and docstrings in data science code
Handle errors and edge cases gracefully in exploratory code
Guarantee reproducibility with seeds, dependency pinning, and environment management
Write meaningful tests for data pipelines and model logic
Follow clean code principles adapted for data science workflows
Create a submission package that signals production readiness

Self-Assessment: Where Are You Now?

Skill	1 -- Cannot	2 -- Vaguely	3 -- Can Do	4 -- Consistently	5 -- Can Teach	Your Score
Organize a notebook with clear sections						___
Extract reusable functions from notebook cells						___
Add type hints to data science functions						___
Handle errors in data loading and preprocessing						___
Set random seeds for full reproducibility						___
Write unit tests for feature engineering						___
Pin dependencies and document environment						___
Create a professional README for a take-home						___

Target: All 4s and 5s before you submit any take-home.

Part 1 -- Notebook Organization

The Professional Notebook Structure

Every take-home notebook should follow a consistent structure that allows the evaluator to navigate your thought process in under two minutes.

[Figure: Professional Notebook Structure - Seven Sections from Header to Summary]

Section 1: Header and Setup

The first cell of your notebook sets the tone. It should contain a title, your name, the date, and a brief problem statement. The second cell should contain all imports, grouped logically.

# Cell 1 - Markdown
"""
# Customer Churn Prediction - Take-Home Assessment
**Candidate:** Jane Smith
**Date:** 2026-03-07
**Time spent:** ~6 hours

## Problem Statement
Predict which customers will churn in the next 30 days using
transaction history, demographics, and engagement data.

## Approach Summary
1. EDA reveals class imbalance (8% churn rate) and strong temporal patterns
2. Feature engineering: RFM features, rolling aggregates, engagement velocity
3. Model: LightGBM with stratified 5-fold CV, optimized for PR-AUC
4. Final PR-AUC: 0.43 (baseline: 0.08)
"""

# Cell 2 - Imports (grouped by category)
# Standard library
import os
import json
import logging
from pathlib import Path
from datetime import datetime, timedelta
from typing import Tuple, Dict, List, Optional

# Data manipulation
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# ML
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import (
    roc_auc_score,
    precision_recall_curve,
    average_precision_score,
    classification_report,
)
from sklearn.preprocessing import StandardScaler
import lightgbm as lgb

# Configuration
plt.style.use("seaborn-v0_8-whitegrid")
pd.set_option("display.max_columns", 50)
pd.set_option("display.max_rows", 100)

SEED = 42
np.random.seed(SEED)

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

60-Second Answer

"I organize my notebooks into seven clear sections: header, data loading, EDA, feature engineering, modeling, evaluation, and summary. Each section starts with a markdown cell explaining what I am doing and why. All imports are in one cell at the top. Constants and configuration are defined once. This makes it possible for a reviewer to understand my approach in under two minutes without running any code."

Section Separators and Markdown

Every section should begin with a markdown cell that explains the purpose of that section and any key decisions. Think of markdown cells as the narration of your analysis - they tell the story that your code implements.

Bad notebook flow:

[Code cell] → [Code cell] → [Code cell] → [Code cell] → [Code cell]

Good notebook flow:

[Markdown: What and Why] → [Code cell] → [Markdown: Observation] →
[Code cell] → [Markdown: Decision and Rationale]

Common Trap

Do not over-narrate. Evaluators do not want a paragraph explaining what df.shape does. Save markdown for decisions, observations, and rationale. "I chose PR-AUC over ROC-AUC because the classes are heavily imbalanced (8% positive rate), and we care more about precision at low recall thresholds" is useful. "Now I will check the shape of the dataframe" is noise.

The "Cell Length" Rule

No single code cell should exceed 30 lines. If a cell is longer, it is doing too much. Extract a function, split the cell, or move logic to a utility module.

# BAD - 60-line monolithic cell
# ... loads data, cleans data, creates features, trains model ...

# GOOD - focused cells
# Cell: Load and validate raw data
raw_df = load_and_validate(
    "data/transactions.csv",
    required_columns=["customer_id", "transaction_date", "amount"],
)

# Cell: Create RFM features
rfm_features = create_rfm_features(raw_df, reference_date="2026-01-01")

# Cell: Create engagement features
engagement_features = create_engagement_features(raw_df, window_days=30)
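
One rough way to check the 30-line guideline before submitting is to scan the notebook file itself. The sketch below assumes the standard .ipynb JSON layout (a top-level "cells" list whose entries carry "cell_type" and "source"). The names find_long_cells, check_notebook_file, and MAX_CELL_LINES are hypothetical helpers, not part of Jupyter, and counting only non-blank lines is a choice made here.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Tuple

MAX_CELL_LINES = 30  # hypothetical constant mirroring the cell-length guideline


def find_long_cells(
    notebook: Dict[str, Any], max_lines: int = MAX_CELL_LINES
) -> List[Tuple[int, int]]:
    """Return (cell_index, line_count) for code cells longer than max_lines."""
    long_cells = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells are not subject to the rule
        source = cell.get("source", [])
        # The notebook format stores source as one string or as a list of lines
        text = "".join(source) if isinstance(source, list) else source
        # Count non-blank lines only (an assumption of this sketch)
        n_lines = sum(1 for line in text.splitlines() if line.strip())
        if n_lines > max_lines:
            long_cells.append((index, n_lines))
    return long_cells


def check_notebook_file(path: str) -> List[Tuple[int, int]]:
    """Load a notebook from disk and report its overly long code cells."""
    notebook = json.loads(Path(path).read_text(encoding="utf-8"))
    return find_long_cells(notebook)
```

Calling check_notebook_file("solution.ipynb") as a final pre-submission step gives a list of cells worth splitting; an empty list means every code cell is within the limit.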

Part 2 -- Function Decomposition

Why Functions Matter in Take-Homes

Evaluators are not just checking whether your code runs. They are checking whether you can write code that a team could maintain. Functions serve three purposes in a take-home:

Readability - A function name like create_rfm_features() is self-documenting
Reusability - The same function can be applied to train and test sets consistently
Testability - Functions can be unit-tested; raw cells cannot

[Figure: Function Decomposition - From Monolithic Cell to Named Function, Type Hints, Docstring, Error Handling, Unit Test]

The Anatomy of a Well-Written Data Science Function

def create_rfm_features(
    transactions: pd.DataFrame,
    customer_id_col: str = "customer_id",
    date_col: str = "transaction_date",
    amount_col: str = "amount",
    reference_date: Optional[str] = None,
) -> pd.DataFrame:
    """Create Recency, Frequency, Monetary features per customer.

    Computes three features for each customer:
    - Recency: days since last transaction
    - Frequency: total number of transactions
    - Monetary: average transaction amount

    Args:
        transactions: Raw transaction DataFrame with at least customer_id,
            transaction_date, and amount columns.
        customer_id_col: Name of the customer identifier column.
        date_col: Name of the transaction date column.
        amount_col: Name of the transaction amount column.
        reference_date: Date to compute recency from. If None, uses max date
            in the dataset.

    Returns:
        DataFrame indexed by customer_id with columns:
        recency_days, frequency, monetary_avg.

    Raises:
        ValueError: If required columns are missing from the input DataFrame.
        ValueError: If transactions DataFrame is empty.

    Example:
        >>> rfm = create_rfm_features(transactions_df)
        >>> rfm.head()
                     recency_days  frequency  monetary_avg
        customer_id
        C001                    3         15        42.50
        C002                   45          2       120.00
    """
    # Validate inputs
    required_cols = {customer_id_col, date_col, amount_col}
    missing_cols = required_cols - set(transactions.columns)
    if missing_cols:
        raise ValueError(f"Missing columns: {missing_cols}")

    if transactions.empty:
        raise ValueError("Input DataFrame is empty")

    df = transactions.copy()
    df[date_col] = pd.to_datetime(df[date_col])

    if reference_date is None:
        ref_date = df[date_col].max()
    else:
        ref_date = pd.to_datetime(reference_date)

    rfm = (
        df.groupby(customer_id_col)
        .agg(
            recency_days=(date_col, lambda x: (ref_date - x.max()).days),
            frequency=(date_col, "count"),
            monetary_avg=(amount_col, "mean"),
        )
    )

    logger.info(
        f"Created RFM features for {len(rfm)} customers. "
        f"Recency range: [{rfm['recency_days'].min()}, {rfm['recency_days'].max()}]"
    )

    return rfm

When to Use Functions vs. Inline Code

Use a Function When	Keep Inline When
Logic is reused on train AND test sets	One-off exploratory visualization
Logic exceeds 10 lines	Simple pandas one-liner (df.describe())
Logic has clear input/output contract	Quick sanity check (print shape, dtypes)
Logic needs to be tested	Markdown-adjacent explanation code
Logic involves non-obvious transformations	Standard library calls with obvious intent

Instant Rejection

Never apply different preprocessing to train and test sets by writing the logic twice inline. This is the number one source of train-test skew in take-homes. Extract a function and call it on both sets with the same parameters.

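# CATASTROPHIC - different logic for train and test
train_df["age_bin"] = pd.cut(train_df["age"], bins=5)
test_df["age_bin"] = pd.cut(test_df["age"], bins=4)  # Different bins!

# CORRECT - single function, explicit bin edges, consistent application
# An integer `bins` makes pd.cut derive edges from each frame's own min/max,
# so train and test would still get different intervals. Fix the edges instead.
AGE_BIN_EDGES = (0, 18, 30, 45, 60, 120)  # illustrative edges, adjust to your data

def bin_age(
    df: pd.DataFrame, bin_edges: Tuple[float, ...] = AGE_BIN_EDGES
) -> pd.DataFrame:
    df = df.copy()
    df["age_bin"] = pd.cut(df["age"], bins=list(bin_edges))
    return df

train_df = bin_age(train_df)
test_df = bin_age(test_df)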
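
Calling the same function on both sets is necessary but may not be sufficient when a transformation learns something from the data. With an integer bins argument, pd.cut derives its edges from the minimum and maximum of whichever frame it receives. One option, sketched below under the assumption that quantile bins are acceptable for the feature, is to learn the edges on the training set once and reuse them everywhere. fit_bin_edges and apply_bin_edges are hypothetical helper names.

```python
import numpy as np
import pandas as pd


def fit_bin_edges(train_values: pd.Series, n_bins: int = 4) -> np.ndarray:
    """Learn quantile bin edges from training data only."""
    _, edges = pd.qcut(train_values, q=n_bins, retbins=True, duplicates="drop")
    edges = np.asarray(edges, dtype=float)
    # Open the outer edges so values outside the training range still get a bin
    edges[0], edges[-1] = -np.inf, np.inf
    return edges


def apply_bin_edges(values: pd.Series, edges: np.ndarray) -> pd.Series:
    """Assign each value the integer index of its bin, using fixed edges."""
    return pd.cut(values, bins=edges, labels=False)


train_ages = pd.Series(range(20, 60))
test_ages = pd.Series([5, 30, 45, 95])

age_edges = fit_bin_edges(train_ages)                # learned once, on train
train_bins = apply_bin_edges(train_ages, age_edges)
test_bins = apply_bin_edges(test_ages, age_edges)    # same edges reused on test
```

The split into a "fit" step and an "apply" step is the same idea scikit-learn transformers follow with fit and transform.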

Extracting Functions: A Step-by-Step Process

When you have a working monolithic cell, follow this process to refactor it:

Identify the inputs - What data does this block need?
Identify the outputs - What does it produce?
Name the operation - What verb describes the transformation?
Extract and parameterize - Move hardcoded values to parameters with defaults
Add types and docstring - Document the contract
Validate inputs - Add checks for common errors
Test - Call it on a small sample and verify output

# BEFORE: Monolithic feature engineering cell (40 lines)
df["days_since_signup"] = (pd.Timestamp("2026-01-01") - df["signup_date"]).dt.days
df["log_revenue"] = np.log1p(df["total_revenue"])
df["orders_per_month"] = df["total_orders"] / (df["days_since_signup"] / 30)
df["avg_order_value"] = df["total_revenue"] / df["total_orders"].clip(lower=1)
# ... 30 more lines ...

# AFTER: Clean function calls
df = add_temporal_features(df, reference_date="2026-01-01")
df = add_revenue_features(df)
df = add_behavioral_features(df)
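
As an illustration of the seven steps, here is one possible shape for add_revenue_features, built from the two revenue lines in the "before" cell. The column-name parameters and their defaults are assumptions for this sketch, not a prescribed signature.

```python
import numpy as np
import pandas as pd


def add_revenue_features(
    df: pd.DataFrame,
    revenue_col: str = "total_revenue",
    orders_col: str = "total_orders",
) -> pd.DataFrame:
    """Add log_revenue and avg_order_value columns to a copy of df.

    Raises:
        ValueError: If the revenue or orders column is missing.
    """
    # Step 6: validate inputs before doing any work
    missing = {revenue_col, orders_col} - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")

    result = df.copy()  # never mutate the caller's frame
    result["log_revenue"] = np.log1p(result[revenue_col])
    # clip(lower=1) avoids division by zero for customers with no orders
    result["avg_order_value"] = result[revenue_col] / result[orders_col].clip(lower=1)
    return result


# Step 7: call it on a small sample and inspect the output
sample = pd.DataFrame({"total_revenue": [100.0, 0.0], "total_orders": [4, 0]})
sample_features = add_revenue_features(sample)
```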

Part 3 -- Type Hints and Docstrings

Type Hints for Data Science

Type hints are not just for software engineers. In data science code, they communicate the expected data contract at a glance.

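from pathlib import Path
from typing import Tuple, Dict, List, Optional, Union

import lightgbm as lgb
import numpy as np
import pandas as pd
from numpy.typing import NDArray
from sklearn.metrics import average_precision_score, roc_auc_score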


def train_evaluate_model(
    X_train: pd.DataFrame,
    y_train: pd.Series,
    X_val: pd.DataFrame,
    y_val: pd.Series,
    params: Dict[str, Union[int, float, str]],
    feature_names: Optional[List[str]] = None,
) -> Tuple[lgb.Booster, Dict[str, float]]:
    """Train a LightGBM model and return it with evaluation metrics.

    Args:
        X_train: Training features.
        y_train: Training labels (binary).
        X_val: Validation features.
        y_val: Validation labels (binary).
        params: LightGBM hyperparameters.
        feature_names: Subset of columns to use. If None, uses all columns.

    Returns:
        Tuple of (trained model, dict of evaluation metrics).
    """
    if feature_names is not None:
        X_train = X_train[feature_names]
        X_val = X_val[feature_names]

    train_data = lgb.Dataset(X_train, label=y_train)
    val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)

    model = lgb.train(
        params,
        train_data,
        valid_sets=[val_data],
        num_boost_round=1000,
        callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)],
    )

    y_pred = model.predict(X_val)
    metrics = {
        "roc_auc": roc_auc_score(y_val, y_pred),
        "pr_auc": average_precision_score(y_val, y_pred),
    }

    return model, metrics

Common Type Hint Patterns in Data Science

# DataFrames and Series
def process(df: pd.DataFrame) -> pd.DataFrame: ...
def get_labels(df: pd.DataFrame) -> pd.Series: ...

# NumPy arrays
def normalize(arr: NDArray[np.float64]) -> NDArray[np.float64]: ...

# Multiple return values
def split_data(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]: ...

# Optional parameters
def plot_results(
    metrics: Dict[str, float],
    save_path: Optional[str] = None,
    figsize: Tuple[int, int] = (10, 6),
) -> None: ...

# Union types for flexibility
def load_data(source: Union[str, Path]) -> pd.DataFrame: ...

Practical Reality

You do not need to type-hint every helper function in an exploratory notebook. Focus type hints on the core pipeline functions - data loading, feature engineering, model training, and evaluation. These are the functions the evaluator will read most carefully, and type hints there signal maturity without adding bureaucratic overhead everywhere.

Docstring Styles

Use the Google style for consistency. It is compact and readable in notebooks.

def compute_feature_importance(
    model: lgb.Booster,
    feature_names: List[str],
    importance_type: str = "gain",
    top_n: int = 20,
) -> pd.DataFrame:
    """Compute and format feature importance from a trained model.

    Args:
        model: Trained LightGBM Booster.
        feature_names: List of feature names matching model input.
        importance_type: Type of importance. One of 'gain', 'split'.
        top_n: Number of top features to return.

    Returns:
        DataFrame with columns 'feature' and 'importance', sorted descending.

    Raises:
        ValueError: If importance_type is not 'gain' or 'split'.
    """
    if importance_type not in ("gain", "split"):
        raise ValueError(
            f"importance_type must be 'gain' or 'split', got '{importance_type}'"
        )

    importance = model.feature_importance(importance_type=importance_type)

    importance_df = (
        pd.DataFrame({"feature": feature_names, "importance": importance})
        .sort_values("importance", ascending=False)
        .head(top_n)
        .reset_index(drop=True)
    )

    return importance_df

Part 4 -- Error Handling

Defensive Coding in Data Science

Data is messy. Your code should handle the mess gracefully instead of crashing with an inscrutable traceback. Evaluators look for evidence that you anticipate real-world data problems.

[Figure: Defensive Coding Pattern - Validate Input, Process, Check Output, Handle Failures]

The Three Layers of Data Validation

Layer 1: Schema Validation (on load)

def load_and_validate(
    filepath: Union[str, Path],
    required_columns: List[str],
    date_columns: Optional[List[str]] = None,
) -> pd.DataFrame:
    """Load a CSV and validate its schema before any processing.

    Args:
        filepath: Path to the CSV file.
        required_columns: Columns that must be present.
        date_columns: Columns to parse as datetime.

    Returns:
        Validated DataFrame with correct dtypes.

    Raises:
        FileNotFoundError: If the file does not exist.
        ValueError: If required columns are missing.
    """
    filepath = Path(filepath)
    if not filepath.exists():
        raise FileNotFoundError(f"Data file not found: {filepath}")

    df = pd.read_csv(filepath, parse_dates=date_columns)

    missing = set(required_columns) - set(df.columns)
    if missing:
        raise ValueError(
            f"Missing required columns: {missing}. "
            f"Available columns: {list(df.columns)}"
        )

    logger.info(f"Loaded {len(df)} rows, {len(df.columns)} columns from {filepath}")

    return df

Layer 2: Data Quality Checks (during EDA)

from typing import Any


def check_data_quality(df: pd.DataFrame) -> Dict[str, Any]:
    """Run data quality checks and return a summary report.

    Returns a dictionary with quality metrics. Does NOT raise errors -
    instead, logs warnings for issues that should be investigated.
    """
    report = {
        "n_rows": len(df),
        "n_cols": len(df.columns),
        "duplicate_rows": df.duplicated().sum(),
        "null_counts": df.isnull().sum().to_dict(),
        "null_pct": (df.isnull().sum() / len(df) * 100).to_dict(),
    }

    if report["duplicate_rows"] > 0:
        logger.warning(
            f"Found {report['duplicate_rows']} duplicate rows "
            f"({report['duplicate_rows']/len(df)*100:.1f}%)"
        )

    high_null_cols = {
        col: pct
        for col, pct in report["null_pct"].items()
        if pct > 50
    }
    if high_null_cols:
        logger.warning(f"Columns with >50% nulls: {high_null_cols}")

    return report

Layer 3: Output Validation (after transformation)

def validate_features(
    features: pd.DataFrame,
    expected_rows: int,
    no_null_columns: Optional[List[str]] = None,
) -> None:
    """Validate feature DataFrame after engineering.

    Args:
        features: The feature DataFrame to validate.
        expected_rows: Expected number of rows.
        no_null_columns: Columns that must have zero nulls.

    Raises:
        AssertionError: If any validation check fails.
    """
    assert len(features) == expected_rows, (
        f"Row count mismatch: expected {expected_rows}, got {len(features)}"
    )

    inf_cols = [
        col for col in features.select_dtypes(include=[np.number]).columns
        if np.isinf(features[col]).any()
    ]
    assert not inf_cols, f"Infinite values found in columns: {inf_cols}"

    if no_null_columns:
        null_cols = [
            col for col in no_null_columns
            if features[col].isnull().any()
        ]
        assert not null_cols, f"Unexpected nulls in columns: {null_cols}"
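
Python removes assert statements when it runs with the -O flag, so assert-based validation can silently disappear outside a notebook. A variant that collects problems and raises ValueError is sketched below. find_feature_problems and require_valid_features are hypothetical names, and the checks shown are a subset of the ones above.

```python
from typing import List

import numpy as np
import pandas as pd


def find_feature_problems(features: pd.DataFrame, expected_rows: int) -> List[str]:
    """Return a list of human-readable problems; empty means the frame is valid."""
    problems = []
    if len(features) != expected_rows:
        problems.append(
            f"Row count mismatch: expected {expected_rows}, got {len(features)}"
        )
    numeric = features.select_dtypes(include=[np.number])
    inf_cols = [col for col in numeric.columns if np.isinf(numeric[col]).any()]
    if inf_cols:
        problems.append(f"Infinite values found in columns: {inf_cols}")
    return problems


def require_valid_features(features: pd.DataFrame, expected_rows: int) -> None:
    """Raise ValueError listing every failed check, regardless of the -O flag."""
    problems = find_feature_problems(features, expected_rows)
    if problems:
        raise ValueError("; ".join(problems))
```

Reporting all failures at once, rather than stopping at the first, tends to shorten the debug loop when several columns are affected.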

Common Trap

Do not silently drop rows with missing values. Every row you drop should be logged with a reason. Evaluators who see df.dropna() without explanation will wonder whether you introduced survivorship bias.

# BAD - silent data loss
df = df.dropna()

# GOOD - explicit, logged, justified
n_before = len(df)
df = df.dropna(subset=["target_variable"])
n_after = len(df)
logger.info(
    f"Dropped {n_before - n_after} rows with missing target "
    f"({(n_before - n_after) / n_before * 100:.1f}%)"
)
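
If rows are dropped in several places, the logging pattern can be wrapped in a small helper so that no drop goes unrecorded. This is a sketch; drop_missing_with_log and its reason parameter are hypothetical, and the logger is assumed to be configured as in the setup cell.

```python
import logging
from typing import List

import pandas as pd

logger = logging.getLogger(__name__)


def drop_missing_with_log(
    df: pd.DataFrame, subset: List[str], reason: str
) -> pd.DataFrame:
    """Drop rows with nulls in `subset` and log how many were removed and why."""
    n_before = len(df)
    result = df.dropna(subset=subset)
    n_dropped = n_before - len(result)
    # Guard against division by zero on an empty input frame
    pct = n_dropped / n_before * 100 if n_before else 0.0
    logger.info(
        "Dropped %d of %d rows (%.1f%%): %s", n_dropped, n_before, pct, reason
    )
    return result
```

A call such as drop_missing_with_log(df, subset=["target_variable"], reason="missing target") leaves nulls in other columns untouched, which keeps the decision about each column explicit.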

Part 5 -- Reproducibility

The Reproducibility Checklist

If an evaluator clones your repository and runs your notebook, they should get exactly the same results. This is non-negotiable.

[Figure: Reproducibility Checklist - Random Seeds, Dependency Pinning, Data Versioning, Environment Documentation]

Setting Random Seeds Properly

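Setting np.random.seed(42) is not enough. Seed every library that uses randomness, and pass the seed explicitly (for example through random_state) to any estimator or splitter that accepts one.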

import random
import os
import numpy as np

def set_all_seeds(seed: int = 42) -> None:
    """Set random seeds for full reproducibility.

    Sets seeds for Python's random module, NumPy, and optionally
    PyTorch and TensorFlow if they are available.

    Args:
        seed: The random seed value.
    """
    random.seed(seed)
    np.random.seed(seed)
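    # Note: this only affects subprocesses started from here. Hash randomization
    # for the current interpreter is fixed at startup, so also export
    # PYTHONHASHSEED before launching Jupyter if hash ordering matters.
    os.environ["PYTHONHASHSEED"] = str(seed)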

    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass

    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass

    logger.info(f"All random seeds set to {seed}")


# Call at the very top of your notebook
SEED = 42
set_all_seeds(SEED)
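
Global seeding depends on cells being run in order: re-running one sampling cell changes every draw after it. NumPy's documentation recommends the Generator API over the legacy global seed, and passing a generator explicitly avoids the ordering problem. The sketch below uses a hypothetical sample_indices helper to show the pattern.

```python
import numpy as np

SEED = 42


def sample_indices(
    n_rows: int, n_samples: int, rng: np.random.Generator
) -> np.ndarray:
    """Draw row indices without replacement from an explicitly passed generator."""
    return rng.choice(n_rows, size=n_samples, replace=False)


# Two generators built from the same seed produce the same draws,
# no matter what other random calls ran earlier in the notebook.
first_draw = sample_indices(1000, 5, np.random.default_rng(SEED))
second_draw = sample_indices(1000, 5, np.random.default_rng(SEED))
```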

Dependency Pinning

Always include a requirements.txt with exact versions. Evaluators should be able to recreate your environment.

# requirements.txt - pinned for reproducibility
numpy==1.26.4
pandas==2.2.1
scikit-learn==1.4.2
lightgbm==4.3.0
matplotlib==3.8.3
seaborn==0.13.2
jupyter==1.0.0

Generate this automatically:

pip freeze > requirements.txt

Or better, use a minimal requirements file listing only what you directly import:

# Generate minimal requirements
pip install pipreqs
pipreqs . --force
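
A third option is to pin only the packages you list yourself, reading the installed versions from the standard library's importlib.metadata. This is a minimal sketch; pinned_requirements is a hypothetical helper, and the names passed in must be distribution names as published on PyPI (for example "scikit-learn", not "sklearn").

```python
from importlib import metadata
from typing import List


def pinned_requirements(packages: List[str]) -> List[str]:
    """Return 'name==version' lines for installed packages.

    Packages that are not installed are emitted as comments so the gap is
    visible instead of silently skipped.
    """
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"# {name}: not installed")
    return lines


# Example: print lines ready to paste into requirements.txt
for line in pinned_requirements(["numpy", "pandas", "scikit-learn", "lightgbm"]):
    print(line)
```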

The README Template

Every take-home submission should include a README:

# README.md template for take-home submissions
README_TEMPLATE = """
# {Project Title} - Take-Home Assessment

## Quick Start
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\\Scripts\\activate
pip install -r requirements.txt
jupyter notebook solution.ipynb
```

## Project Structure
```
.
├── README.md                  # This file
├── requirements.txt           # Pinned dependencies
├── solution.ipynb             # Main analysis notebook
├── src/
│   ├── features.py            # Feature engineering functions
│   ├── evaluation.py          # Evaluation utilities
│   └── visualization.py       # Plotting helpers
├── tests/
│   ├── test_features.py       # Feature engineering tests
│   └── test_evaluation.py     # Evaluation metric tests
├── data/
│   └── README.md              # Data description and source
└── outputs/
    ├── figures/               # Generated plots
    └── model/                 # Saved model artifacts
```

## Approach Summary
{Brief 3-4 sentence summary of methodology and key results}

## Key Results
- Metric 1: value
- Metric 2: value
- Baseline comparison: improvement

## Reproducibility
- Python {version}
- All random seeds set to 42
- Expected runtime: ~{X} minutes on {hardware description}

## Assumptions and Limitations
{List of explicit assumptions and known limitations}
"""

Evaluator's Perspective

A README tells me two things: (1) this candidate thinks about the person who has to read their code, and (2) they have experience working on teams where documentation matters. In a stack of 30 submissions, the one with a clear README gets read first.

Part 6 -- Testing in Take-Homes

Why Tests Matter (Even in Notebooks)

You do not need 100% test coverage. You need tests for the logic that is most likely to be wrong: feature engineering transformations, custom metrics, and data preprocessing steps.

What to Test

[Figure: What to Test - Feature Engineering, Custom Metrics, Data Transformations, and Edge Cases]

Writing Tests for Feature Engineering

# tests/test_features.py
import pytest
import pandas as pd
from src.features import create_rfm_features


@pytest.fixture
def sample_transactions() -> pd.DataFrame:
    """Create a minimal transaction DataFrame for testing."""
    return pd.DataFrame({
        "customer_id": ["A", "A", "A", "B", "B"],
        "transaction_date": pd.to_datetime([
            "2026-01-01", "2026-01-15", "2026-02-01",
            "2026-01-10", "2026-01-20",
        ]),
        "amount": [100.0, 50.0, 75.0, 200.0, 30.0],
    })


class TestRFMFeatures:
    """Tests for RFM feature engineering."""

    def test_output_shape(self, sample_transactions: pd.DataFrame) -> None:
        """Output should have one row per customer."""
        rfm = create_rfm_features(
            sample_transactions, reference_date="2026-03-01"
        )
        assert len(rfm) == 2  # Two unique customers

    def test_output_columns(self, sample_transactions: pd.DataFrame) -> None:
        """Output should contain exactly the expected columns."""
        rfm = create_rfm_features(
            sample_transactions, reference_date="2026-03-01"
        )
        expected_cols = {"recency_days", "frequency", "monetary_avg"}
        assert set(rfm.columns) == expected_cols

    def test_recency_values(self, sample_transactions: pd.DataFrame) -> None:
        """Recency should be days since last transaction."""
        rfm = create_rfm_features(
            sample_transactions, reference_date="2026-03-01"
        )
        # Customer A's last transaction: 2026-02-01, ref: 2026-03-01 = 28 days
        assert rfm.loc["A", "recency_days"] == 28

    def test_frequency_values(self, sample_transactions: pd.DataFrame) -> None:
        """Frequency should be count of transactions."""
        rfm = create_rfm_features(
            sample_transactions, reference_date="2026-03-01"
        )
        assert rfm.loc["A", "frequency"] == 3
        assert rfm.loc["B", "frequency"] == 2

    def test_monetary_values(self, sample_transactions: pd.DataFrame) -> None:
        """Monetary should be average transaction amount."""
        rfm = create_rfm_features(
            sample_transactions, reference_date="2026-03-01"
        )
        assert rfm.loc["A", "monetary_avg"] == pytest.approx(75.0)
        assert rfm.loc["B", "monetary_avg"] == pytest.approx(115.0)

    def test_missing_columns_raises(self) -> None:
        """Should raise ValueError for missing required columns."""
        bad_df = pd.DataFrame({"wrong_col": [1, 2, 3]})
        with pytest.raises(ValueError, match="Missing columns"):
            create_rfm_features(bad_df)

    def test_empty_dataframe_raises(self) -> None:
        """Should raise ValueError for empty input."""
        empty_df = pd.DataFrame(
            columns=["customer_id", "transaction_date", "amount"]
        )
        with pytest.raises(ValueError, match="empty"):
            create_rfm_features(empty_df)

    def test_no_nulls_in_output(self, sample_transactions: pd.DataFrame) -> None:
        """Output should contain no null values."""
        rfm = create_rfm_features(
            sample_transactions, reference_date="2026-03-01"
        )
        assert rfm.isnull().sum().sum() == 0

Running Tests in a Notebook

If you want to keep everything in a single notebook (some evaluators prefer this), you can run tests inline:

# Cell - Quick validation tests (run inline)
def run_quick_tests() -> None:
    """Run quick validation tests for key functions."""

    # Test 1: Feature engineering produces correct shape
    test_df = transactions.head(100)
    test_features = create_rfm_features(test_df, reference_date="2026-03-01")
    n_customers = test_df["customer_id"].nunique()
    assert len(test_features) == n_customers, (
        f"Expected {n_customers} rows, got {len(test_features)}"
    )

    # Test 2: No nulls in critical features
    assert test_features.isnull().sum().sum() == 0, "Nulls found in features"

    # Test 3: No infinite values
    numeric_cols = test_features.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        assert not np.isinf(test_features[col]).any(), f"Inf found in {col}"

    # Test 4: Feature values are in reasonable ranges
    assert (test_features["recency_days"] >= 0).all(), "Negative recency"
    assert (test_features["frequency"] >= 1).all(), "Zero frequency"
    assert (test_features["monetary_avg"] > 0).all(), "Non-positive monetary"

    print("All quick tests passed!")


run_quick_tests()
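
Custom metrics deserve the same treatment, because a subtly wrong metric invalidates every number in the summary. The sketch below defines a hypothetical precision_at_k metric (not taken from any library) and checks it against values small enough to verify by hand.

```python
import numpy as np


def precision_at_k(y_true, y_score, k: int) -> float:
    """Fraction of positives among the k highest-scored examples."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    if y_true.shape != y_score.shape:
        raise ValueError("y_true and y_score must have the same shape")
    if not 1 <= k <= len(y_true):
        raise ValueError(f"k must be between 1 and {len(y_true)}, got {k}")
    # Stable sort on negated scores: highest first, ties keep input order
    top_k = np.argsort(-y_score, kind="stable")[:k]
    return float(y_true[top_k].mean())


def test_precision_at_k() -> None:
    """Hand-computed cases: scores rank the rows as 0, 1, 2, 3."""
    y_true = [1, 0, 1, 0]
    y_score = [0.9, 0.8, 0.7, 0.1]
    assert precision_at_k(y_true, y_score, k=1) == 1.0      # top 1: [1]
    assert precision_at_k(y_true, y_score, k=2) == 0.5      # top 2: [1, 0]
    assert abs(precision_at_k(y_true, y_score, k=3) - 2 / 3) < 1e-12


test_precision_at_k()
```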

Time Budget

Testing should take no more than 10-15% of your total time on a take-home. For a 6-hour project, that is roughly 35-55 minutes, spent on tests for the 3-4 most critical functions. Do not aim for full coverage - aim for confidence in your core logic.

Part 7 -- Clean Code Principles for Data Science

Naming Conventions

# BAD names - what do these mean?
df2 = process(df1)
temp = df2.groupby("x").agg({"y": "mean"})
result = temp.merge(df3, on="id")
X = result.drop("target", axis=1)

# GOOD names - self-documenting
customer_features = engineer_features(raw_transactions)
avg_revenue_by_segment = customer_features.groupby("segment").agg(
    {"revenue": "mean"}
)
# Merge at the customer level: the segment-level aggregate has no customer_id
enriched_customers = customer_features.merge(
    demographics, on="customer_id"
)
X_train = enriched_customers.drop(columns="churn_label")

The No Dead Code Rule

Remove all dead code before submission. Commented-out code, unused imports, and abandoned experiments make your notebook look messy and undermine confidence in your work.

# BAD - graveyard of abandoned experiments
# from sklearn.ensemble import RandomForestClassifier  # tried this, didn't work
# model = RandomForestClassifier(n_estimators=100)
# model = RandomForestClassifier(n_estimators=500)  # better but slow
# model = GradientBoostingClassifier()  # keep this maybe?
model = lgb.LGBMClassifier()  # final choice

# GOOD - clean, with rationale in markdown
# Markdown cell: "Chose LightGBM over Random Forest based on 5-fold CV
# comparison (LightGBM PR-AUC: 0.43 vs RF PR-AUC: 0.38). See Section 5
# for the full comparison table."
model = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    random_state=SEED,
)

Constants and Configuration

Define all constants in one place at the top of your notebook.

# === Configuration ===
SEED = 42
TEST_SIZE = 0.2
N_FOLDS = 5
TARGET_COL = "churned"
ID_COL = "customer_id"
DATE_COL = "event_date"

# Feature engineering parameters
RFM_REFERENCE_DATE = "2026-01-01"
ROLLING_WINDOWS = [7, 14, 30, 60]
MIN_TRANSACTIONS = 3

# Model hyperparameters
LGBM_PARAMS = {
    "objective": "binary",
    "metric": "average_precision",
    "learning_rate": 0.05,
    "max_depth": 6,
    "num_leaves": 31,
    "min_child_samples": 20,
    "subsample": 0.8,
    "subsample_freq": 1,  # row subsampling only activates when this is > 0
    "colsample_bytree": 0.8,
    "reg_alpha": 0.1,
    "reg_lambda": 0.1,
    "random_state": SEED,
    "verbose": -1,
}

The Pipeline Pattern

For multi-step transformations, use a pipeline pattern to keep the flow clear and consistent.

from typing import Callable, List

def build_feature_pipeline(
    steps: List[Callable[[pd.DataFrame], pd.DataFrame]],
) -> Callable[[pd.DataFrame], pd.DataFrame]:
    """Compose multiple feature engineering steps into a single function.

    Args:
        steps: List of functions, each taking and returning a DataFrame.

    Returns:
        A single function that applies all steps in order.

    Example:
        >>> pipeline = build_feature_pipeline([
        ...     add_temporal_features,
        ...     add_rfm_features,
        ...     add_engagement_features,
        ...     drop_raw_columns,
        ... ])
        >>> features = pipeline(raw_df)
    """
    def pipeline(df: pd.DataFrame) -> pd.DataFrame:
        result = df.copy()
        for step in steps:
            n_before = len(result.columns)
            result = step(result)
            n_after = len(result.columns)
            logger.info(
                f"{step.__name__}: {n_before} -> {n_after} columns"
            )
        return result

    return pipeline


# Usage - apply the same pipeline to train and test
feature_pipeline = build_feature_pipeline([
    add_temporal_features,
    add_rfm_features,
    add_engagement_features,
    encode_categoricals,
    drop_raw_columns,
])

train_features = feature_pipeline(train_df)
test_features = feature_pipeline(test_df)
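
One caveat with stateless steps: a step such as encode_categoricals can produce different columns on train and test if a category appears in only one of them. A possible remedy, sketched here with a hypothetical encode_categoricals_aligned helper, is to treat the training columns as the contract and reindex the test frame to match.

```python
from typing import List, Tuple

import pandas as pd


def encode_categoricals_aligned(
    train: pd.DataFrame, test: pd.DataFrame, columns: List[str]
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """One-hot encode both frames, forcing test to use the training columns."""
    train_encoded = pd.get_dummies(train, columns=columns, dtype=int)
    test_encoded = pd.get_dummies(test, columns=columns, dtype=int)
    # Categories unseen in train are dropped; categories missing from test
    # are added as all-zero columns, in the training column order.
    test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
    return train_encoded, test_encoded


train_sample = pd.DataFrame({"age": [30, 40, 50], "plan": ["a", "b", "c"]})
test_sample = pd.DataFrame({"age": [25, 35], "plan": ["a", "d"]})
train_encoded, test_encoded = encode_categoricals_aligned(
    train_sample, test_sample, columns=["plan"]
)
```

Here the test row with the unseen plan "d" ends up with zeros in every plan column, which is a deliberate and visible choice rather than a shape mismatch at prediction time.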

Evaluator's Perspective

When I see a pipeline pattern in a take-home, I know this candidate has worked on production ML systems. It shows they understand that the same transformations must apply to both training and serving data, which is a critical production requirement that most junior candidates miss.

Part 8 -- Project Structure for Multi-File Submissions

When to Go Beyond a Single Notebook

For take-homes that allow 8+ hours, consider splitting your code into modules. This demonstrates software engineering maturity.

Figure: Project structure by time allocation (single notebook, notebook plus modules, full project)

The Standard Layout

Figure: Standard take-home project layout

The config.py Pattern

# src/config.py
"""Central configuration for the take-home project."""
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Config:
    """Immutable configuration for the analysis pipeline."""

    # Data
    raw_data_path: str = "data/transactions.csv"
    target_col: str = "churned"
    id_col: str = "customer_id"

    # Feature engineering
    reference_date: str = "2026-01-01"
    rolling_windows: Tuple[int, ...] = (7, 14, 30, 60)
    min_transactions: int = 3

    # Model
    seed: int = 42
    n_folds: int = 5
    test_size: float = 0.2

    # Outputs
    output_dir: str = "outputs"
    figures_dir: str = "outputs/figures"
    metrics_path: str = "outputs/metrics.json"


# Singleton instance
config = Config()
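A short sketch of what `frozen=True` buys you in practice. The `Config` here is a trimmed stand-in with only two of the fields above, and `experiment_config` is a hypothetical name; the point is that accidental mutation fails loudly, while deliberate variations go through `dataclasses.replace`.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Config:
    """Trimmed stand-in for the full config, just enough for the demo."""

    seed: int = 42
    n_folds: int = 5


config = Config()

# Frozen dataclasses reject attribute assignment
try:
    config.seed = 7  # type: ignore[misc]
    mutation_blocked = False
except FrozenInstanceError:
    mutation_blocked = True

# replace() returns a new instance and leaves the original untouched
experiment_config = replace(config, n_folds=10)
```

This keeps every experiment's parameters explicit: the base config stays the single source of truth, and each variation is a named object rather than an in-place edit.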

Part 9 -- Code Quality Anti-Patterns

The Hall of Shame

These patterns will cost you the offer. Each one signals a different kind of immaturity.

Anti-Pattern	What It Signals	Fix
`df2`, `df_final`, `df_final_v2`	No naming discipline	Use descriptive names: `customer_features`, `enriched_customers`
`# TODO: fix this later`	Unfinished work left visible	Either fix it or remove the comment
Commented-out code blocks	Messy experimentation habits	Delete dead code; use git for history
`import *`	Does not understand namespaces	Import specific names
Hardcoded file paths (`/Users/john/data/`)	Not portable	Use relative paths or config
`try: except: pass`	Swallowing errors silently	Catch specific exceptions, log them
Print statements for debugging	Not using logging	Use `logging` module
Mixing tabs and spaces	Editor configuration issues	Use a formatter or linter (black, ruff)
No .gitignore	Committing data, caches, venvs	Add standard Python .gitignore
500-line cells	Cannot decompose logic	Max 30 lines per cell
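As an illustration of the `try: except: pass` fix in the table, here is a minimal sketch that catches only the failures it expects and logs each one. `load_metrics` is a hypothetical helper, not something defined elsewhere on this page.

```python
import json
import logging
from pathlib import Path
from typing import Any, Dict, Optional

logger = logging.getLogger(__name__)


def load_metrics(path: Path) -> Optional[Dict[str, Any]]:
    """Read a metrics JSON file, returning None (with a log entry) on known failures."""
    try:
        return json.loads(path.read_text())
    except FileNotFoundError:
        logger.warning("Metrics file not found: %s", path)
    except json.JSONDecodeError as exc:
        logger.error("Metrics file %s is not valid JSON: %s", path, exc)
    return None
```

Anything outside those two exceptions still propagates, which is usually what you want: an unexpected error should stop the notebook rather than be hidden.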

Instant Rejection

Hardcoded absolute paths like /Users/yourname/Desktop/data.csv are an instant credibility killer. They tell the evaluator your code cannot run on any machine except yours. Always use relative paths or a configuration file.
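One possible way to enforce this is to resolve every data path against a project root and reject absolute inputs outright. The helper name `resolve_data_path` and the use of the current working directory as the root are assumptions made for this sketch.

```python
from pathlib import Path


def resolve_data_path(relative_path: str, project_root: Path) -> Path:
    """Join a relative path onto the project root, refusing absolute paths."""
    candidate = Path(relative_path)
    if candidate.is_absolute():
        raise ValueError(
            f"Expected a path relative to the project root, got: {relative_path}"
        )
    return project_root / candidate


# Assumption: the notebook is launched from the project root
PROJECT_ROOT = Path.cwd()
data_path = resolve_data_path("data/transactions.csv", PROJECT_ROOT)
```

In a multi-file project, the root could instead be derived from the location of a module such as config.py, so the code behaves the same regardless of where it is launched.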

Pre-Submission Checklist

Run through this checklist before submitting:

PRE_SUBMISSION_CHECKLIST = """
Code Quality Checklist - Run Before Submission
================================================

[ ] Notebook runs top-to-bottom without errors (Kernel > Restart & Run All)
[ ] All imports are at the top, no unused imports
[ ] No commented-out code blocks
[ ] No hardcoded absolute paths
[ ] No print() for debugging - use logging
[ ] All functions have type hints and docstrings
[ ] Constants defined in one place (top of notebook or config.py)
[ ] Random seeds set for all libraries
[ ] requirements.txt with pinned versions included
[ ] README.md with setup instructions included
[ ] No large files (data, model artifacts) committed to git
[ ] .gitignore includes __pycache__, .ipynb_checkpoints, data/, *.pkl
[ ] Cell outputs cleared and re-run (no stale outputs)
[ ] Variable names are descriptive (no df2, temp, x, result)
[ ] Markdown cells explain decisions and observations, not obvious code
[ ] Final section has summary, key results, and next steps
"""

Practice Problems

Problem 1: Refactor This Cell

You receive the following cell in a take-home notebook. Refactor it into clean, well-structured code with proper functions, type hints, and error handling.

# Original messy cell
df = pd.read_csv("data.csv")
df = df.dropna()
df["date"] = pd.to_datetime(df["date"])
df["days"] = (pd.Timestamp("2026-01-01") - df["date"]).dt.days
df["log_amt"] = np.log(df["amount"])
df["amt_per_day"] = df["amount"] / df["days"]
df2 = df.groupby("user").agg({"days": "min", "amount": ["sum", "mean"], "log_amt": "mean"})
df2.columns = ["recency", "total_spend", "avg_spend", "avg_log_spend"]
X = df2.drop("churn", axis=1)
y = df2["churn"]
model = lgb.LGBMClassifier()
model.fit(X, y)
print(model.score(X, y))

Hint 1 -- Direction

Identify the separate concerns in this cell: data loading, feature engineering, model training, and evaluation. Each should be its own function. Look for bugs too - there are several, including the log of zero or negative values and division by zero.

Hint 2 -- Key Issues

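np.log(df["amount"]) - silently yields -inf for zero amounts and NaN for negative ones, with at most a RuntimeWarning. Use np.log1p on amounts clipped at zero.
df["amt_per_day"] = df["amount"] / df["days"] - division by zero when days = 0, which pandas turns into inf (or NaN for 0/0) rather than an error.
df.dropna() - silent, unjustified data loss.
df2["churn"] - where does this column come from? The groupby never aggregated it, so df2.drop("churn", axis=1) raises a KeyError.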
model.score(X, y) - evaluating on training data only.
No random seed, no train-test split, no cross-validation.
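The numerical issues in this list are easy to verify directly. The sketch below uses made-up arrays (not the exercise data) to show what NumPy returns for the raw expressions and for the clipped alternatives.

```python
import numpy as np

# Illustrative values: one normal, one zero, one negative amount
amounts = np.array([100.0, 0.0, -5.0])
days_since = np.array([0.0, 2.0, 4.0])

# Suppress the RuntimeWarnings so the bad values can be inspected quietly
with np.errstate(divide="ignore", invalid="ignore"):
    raw_log = np.log(amounts)          # zero gives -inf, negative gives nan
    raw_ratio = amounts / days_since   # dividing by zero days gives inf

# Clipping keeps every value finite
safe_log = np.log1p(np.clip(amounts, 0.0, None))
safe_ratio = amounts / np.clip(days_since, 1.0, None)
```

None of the raw expressions raise an exception, which is what makes them dangerous: the bad values flow straight into the groupby and the model unless you validate or clip first.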

Hint 3 -- Full Refactored Solution

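import logging
from typing import Dict, Tuple

import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

logger = logging.getLogger(__name__)


def load_transactions(filepath: str) -> pd.DataFrame: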
    """Load and validate transaction data."""
    df = pd.read_csv(filepath, parse_dates=["date"])
    required = {"user", "date", "amount"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    n_nulls = df.isnull().sum().sum()
    if n_nulls > 0:
        logger.warning(f"Found {n_nulls} null values; inspect before imputing or dropping")
    return df


def create_user_features(
    transactions: pd.DataFrame,
    reference_date: str = "2026-01-01",
) -> pd.DataFrame:
    """Create per-user features from transaction history."""
    df = transactions.copy()
    ref = pd.to_datetime(reference_date)
    df["days_since"] = (ref - df["date"]).dt.days
    df["log_amount"] = np.log1p(df["amount"].clip(lower=0))
    df["amount_per_day"] = df["amount"] / df["days_since"].clip(lower=1)

    features = df.groupby("user").agg(
        recency=("days_since", "min"),
        total_spend=("amount", "sum"),
        avg_spend=("amount", "mean"),
        avg_log_spend=("log_amount", "mean"),
    )
    return features


def train_and_evaluate(
    X: pd.DataFrame,
    y: pd.Series,
    seed: int = 42,
    n_folds: int = 5,
) -> Tuple[lgb.LGBMClassifier, Dict[str, float]]:
    """Train with cross-validation and return model + metrics."""
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in cv.split(X, y):
        model = lgb.LGBMClassifier(random_state=seed, verbose=-1)
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        y_pred = model.predict_proba(X.iloc[val_idx])[:, 1]
        scores.append(roc_auc_score(y.iloc[val_idx], y_pred))
    metrics = {"cv_auc_mean": np.mean(scores), "cv_auc_std": np.std(scores)}
    final_model = lgb.LGBMClassifier(random_state=seed, verbose=-1)
    final_model.fit(X, y)
    return final_model, metrics
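The refactor above still leaves open where the churn label comes from. A minimal sketch of one option, assuming labels arrive as a separate table with one row per user (the problem statement does not say this, and `attach_labels` and the toy frames are hypothetical):

```python
from typing import Tuple

import pandas as pd


def attach_labels(
    features: pd.DataFrame,
    labels: pd.DataFrame,
    id_col: str = "user",
    target_col: str = "churn",
) -> Tuple[pd.DataFrame, pd.Series]:
    """Align a per-user label table to the feature index and return (X, y)."""
    label_series = labels.set_index(id_col)[target_col]
    if not label_series.index.is_unique:
        raise ValueError("Expected exactly one label per user")
    missing = features.index.difference(label_series.index)
    if len(missing) > 0:
        raise ValueError(f"{len(missing)} users have features but no label")
    # Reorder labels to match the feature rows exactly
    y = label_series.loc[features.index]
    return features, y


# Toy inputs shaped like the output of create_user_features
features = pd.DataFrame(
    {"total_spend": [120.0, 40.0]},
    index=pd.Index(["u2", "u1"], name="user"),
)
labels = pd.DataFrame({"user": ["u1", "u2", "u3"], "churn": [1, 0, 0]})
X, y = attach_labels(features, labels)
```

Failing loudly on missing or duplicated labels is the design choice here: a silent inner join would shrink the training set without any trace in the notebook.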

Scoring Rubric:

Strong Hire: Identifies all 6 bugs, separates concerns into 3+ functions, adds type hints and docstrings, uses log1p and clip, implements cross-validation, logs data quality issues.
Lean Hire: Separates into functions and fixes 3-4 bugs, but misses subtle issues like the missing churn column after groupby.
No Hire: Rearranges code without fixing bugs or adding structure.

Problem 2: Design a Test Suite

Write a test suite for a function encode_categoricals(df, columns, method="target") that performs target encoding on specified columns. Consider edge cases.

Hint 1 -- Direction

Think about: What happens with unseen categories at test time? What if a category has only one example? What about null values in categorical columns? Does the function leak target information?

Hint 2 -- Key Test Cases

Normal case: known categories produce correct encoded values
Unseen categories: should fall back to global mean, not crash
Single-instance categories: should use smoothed estimate, not raw average
Null categories: should handle gracefully
Data leakage: encoding should be fit on train, applied to test
Output shape: same number of rows, same or fewer columns
Determinism: same input produces same output

Hint 3 -- Full Test Suite

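import pandas as pd
import pytest

# encode_categoricals, fit_encoder, and apply_encoder come from the module under test


class TestTargetEncoding: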
    @pytest.fixture
    def train_data(self):
        return pd.DataFrame({
            "city": ["NYC", "NYC", "LA", "LA", "CHI", "CHI"],
            "target": [1, 1, 0, 1, 0, 0],
        })

    def test_known_categories_encoded(self, train_data):
        result = encode_categoricals(train_data, ["city"], method="target")
        assert "city" in result.columns
        assert result["city"].dtype == float

    def test_output_shape_preserved(self, train_data):
        result = encode_categoricals(train_data, ["city"], method="target")
        assert len(result) == len(train_data)

    def test_unseen_category_falls_back(self, train_data):
        test_df = pd.DataFrame({"city": ["BOSTON"], "target": [0]})
        encoder = fit_encoder(train_data, ["city"])
        result = apply_encoder(test_df, encoder)
        global_mean = train_data["target"].mean()
        assert result.loc[0, "city"] == pytest.approx(global_mean)

    def test_null_category_handled(self, train_data):
        train_data.loc[0, "city"] = None
        result = encode_categoricals(train_data, ["city"], method="target")
        assert not result["city"].isnull().any()

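    def test_no_data_leakage(self, train_data):
        """Each row's encoding should NOT include its own target."""
        # fold_aware is assumed to switch on leave-one-out encoding
        result = encode_categoricals(
            train_data, ["city"], method="target", fold_aware=True
        )
        # LA has targets [0, 1], so the naive in-sample mean is 0.5 for both rows.
        # Leave-one-out drops each row's own target, so the rows should differ
        # from 0.5. NYC is a poor probe: with targets [1, 1], leave-one-out
        # also yields 1.0.
        la_encoded = result.loc[train_data["city"] == "LA", "city"]
        assert not all(abs(value - 0.5) < 1e-9 for value in la_encoded)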

    def test_deterministic(self, train_data):
        r1 = encode_categoricals(train_data, ["city"], method="target")
        r2 = encode_categoricals(train_data, ["city"], method="target")
        pd.testing.assert_frame_equal(r1, r2)
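To make the fit/apply tests concrete, here is one possible shape for the `fit_encoder` and `apply_encoder` helpers they call. This is a sketch under assumptions: the `TargetEncoder` container, the `target` and `smoothing` parameters, and the additive smoothing formula are illustrative choices, not the implementation the tests were written against, and the fold-aware variant is not covered.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

import pandas as pd


@dataclass(frozen=True)
class TargetEncoder:
    """Fitted state: the global target mean plus one mapping per column."""

    global_mean: float
    mappings: Dict[str, Dict[Any, float]]


def fit_encoder(
    train: pd.DataFrame,
    columns: List[str],
    target: str = "target",
    smoothing: float = 1.0,
) -> TargetEncoder:
    """Learn smoothed per-category target means from training data only."""
    global_mean = float(train[target].mean())
    mappings: Dict[str, Dict[Any, float]] = {}
    for column in columns:
        stats = train.groupby(column)[target].agg(["sum", "count"])
        # Shrink small categories toward the global mean
        smoothed = (stats["sum"] + smoothing * global_mean) / (stats["count"] + smoothing)
        mappings[column] = smoothed.to_dict()
    return TargetEncoder(global_mean=global_mean, mappings=mappings)


def apply_encoder(df: pd.DataFrame, encoder: TargetEncoder) -> pd.DataFrame:
    """Encode columns with fitted means; unseen or null categories get the global mean."""
    out = df.copy()
    for column, mapping in encoder.mappings.items():
        encoded = out[column].map(mapping).fillna(encoder.global_mean)
        out[column] = encoded.astype(float)
    return out


# Same toy data as the train_data fixture
train_data = pd.DataFrame({
    "city": ["NYC", "NYC", "LA", "LA", "CHI", "CHI"],
    "target": [1, 1, 0, 1, 0, 0],
})
encoder = fit_encoder(train_data, ["city"])
train_encoded = apply_encoder(train_data, encoder)
unseen_encoded = apply_encoder(pd.DataFrame({"city": ["BOSTON", None]}), encoder)
```

Separating fitted state from application is what lets the unseen-category and null-category tests pass: the fallback is a property of `apply_encoder`, not of whatever data happens to be in the frame.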

Problem 3: Code Review

Review the following submission excerpt. List every code quality issue and assign a severity (Critical, Major, Minor).

import pandas as pd, numpy as np, sklearn, lightgbm
from sklearn.model_selection import *

df = pd.read_csv("/Users/candidate/Downloads/interview_data.csv")
df_clean = df.drop_duplicates()
print(f"shape: {df_clean.shape}")

# feature eng
df_clean['f1'] = df_clean['col_a'] * df_clean['col_b']
df_clean['f2'] = df_clean['col_c'].apply(lambda x: 1 if x > 0 else 0)
# df_clean['f3'] = df_clean['col_d'].map(some_dict)  # didn't work
# df_clean['f4'] = df_clean['col_e'] ** 2  # maybe later

X = df_clean.drop('target', axis=1)
y = df_clean['target']
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = lightgbm.LGBMClassifier(n_estimators=1000)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

Hint 1 -- Direction

Look for: import style, file paths, data handling, dead code, reproducibility, evaluation method, naming, and documentation.

Hint 2 -- Category of Issues

There are at least 12 issues spanning: imports (2), data loading (1), data handling (1), dead code (2), naming (2), reproducibility (2), evaluation (1), documentation (1).

Hint 3 -- Full Review

Issue	Severity	Fix
`from sklearn.model_selection import *` - wildcard import	Major	Import specific: `from sklearn.model_selection import train_test_split`
`import pandas as pd, numpy as np, sklearn, lightgbm` - multiple imports on one line	Minor	One import per line
Hardcoded path `/Users/candidate/Downloads/`	Critical	Use relative path or config
`df.drop_duplicates()` without explanation	Major	Log how many rows dropped and why
Two blocks of commented-out code	Major	Delete dead code
Variable names `f1`, `f2`	Major	Use descriptive names: `interaction_ab`, `col_c_positive`
`df_clean` mutated in place	Minor	Use functions for feature engineering
No random seed in `train_test_split`	Critical	Add `random_state=42`
No random seed in LGBMClassifier	Critical	Add `random_state=42`
`model.score()` - reports accuracy, which is misleading if classes are imbalanced	Major	Use appropriate metric (AUC, PR-AUC)
No cross-validation	Major	Use StratifiedKFold
No markdown cells, no docstrings, no comments	Major	Add narrative structure

Scoring Rubric:

Strong Hire: Identifies 10+ issues with correct severity classifications. Provides specific fixes. Mentions that the entire structure needs reorganization, not just individual line fixes.
Lean Hire: Identifies 6-9 issues, catches the critical ones (path, seeds) but misses evaluation and documentation issues.
No Hire: Identifies fewer than 6 issues or misclassifies severities (e.g., calling dead code "minor").

Interview Cheat Sheet

Concept	Key Practice	One-Liner	Red Flag
Notebook structure	7 sections with markdown narrative	Clear sections = clear thinking	50 cells with no markdown
Function decomposition	Extract reusable logic from cells	Same function for train and test	Copy-paste preprocessing for train vs test
Type hints	Annotate core pipeline functions	Types document the data contract	`def f(x, y, z)` with no hints on key functions
Error handling	Validate inputs, log warnings	Defensive code = production code	`try: except: pass`
Reproducibility	Seeds + pinned deps + README	Anyone can re-run and get same results	`np.random.seed(42)` only, no requirements.txt
Testing	Test feature engineering and metrics	Tests prove your logic is correct	No tests and no assertions anywhere
Naming	Descriptive variable names	Good names eliminate comments	`df2`, `temp`, `result`, `X` without context
Dead code	Remove all commented-out code	Clean submission = finished work	`# TODO`, `# tried this`, commented blocks
Constants	Define once at the top	One place to change parameters	Magic numbers scattered in code
Project structure	README + requirements + clean layout	Professional packaging = professional work	Single notebook with no supporting files

Spaced Repetition Checkpoints

Day 0 -- Initial Learning

Read this entire page
Refactor one of your old notebooks using the 7-section structure
Add type hints and docstrings to three functions in an existing project
Complete the self-assessment

Day 3 -- First Recall

Without looking, write the 7-section notebook template from memory
Write a load_and_validate function with error handling from scratch
Create a requirements.txt for a current project

Day 7 -- Practice

Do Practice Problem 1 (refactoring) without looking at hints
Take an old take-home or project and apply the pre-submission checklist
Write three unit tests for one of your feature engineering functions

Day 14 -- Application

Complete a mock take-home with full code quality standards in 4 hours
Have a peer review your submission using the anti-pattern table
Do Practice Problem 3 (code review) under timed conditions (10 minutes)

Day 21 -- Mock Review

Submit a take-home to a friend or mentor for code quality feedback
Time yourself applying the pre-submission checklist (should take < 15 minutes)
Review any areas where you still default to bad habits

Key Takeaways

Code quality is the tiebreaker. When two candidates have similar model performance, the one with cleaner code gets the offer. Evaluators hire people they want to work with, and messy code signals messy thinking.
Structure your notebook like a document, not a scratchpad. Seven clear sections with markdown narrative between code cells lets the evaluator follow your logic without running anything.
Extract functions for anything that touches both train and test data. This single practice eliminates the most common source of bugs in take-homes and demonstrates production awareness.
Reproducibility is non-negotiable. Random seeds, pinned dependencies, relative paths, and a README that explains how to run your code. If the evaluator cannot reproduce your results, your results do not count.
Tests are a signal of engineering maturity. You do not need 100% coverage. You need tests for the logic most likely to be wrong - feature engineering, custom metrics, and edge cases. Even 5 targeted tests set you apart from most candidates.
