Model Selection Strategy - From Baseline to Best Model
Reading time: ~30 min | Interview relevance: Critical | Roles: MLE, Data Scientist, Applied Scientist, AI Engineer
The Real Interview Moment
You are presenting your take-home results to a panel of three ML engineers. You walk through your EDA, your feature engineering, and then reveal your model: a fine-tuned XGBoost with 500 estimators, max_depth of 8, a learning rate of 0.01, and 47 hyperparameters tuned via Bayesian optimization. Your test AUC is 0.87.
The senior engineer asks a simple question: "What was your baseline?"
You hesitate. You did not build a baseline. You went straight to XGBoost because you knew it would be competitive.
She follows up: "So how do you know that 0.87 is good? What would a logistic regression achieve on this data? What about just predicting the majority class?"
You realize that without a baseline, your 0.87 AUC has no reference point. It clearly beats random guessing (AUC 0.50), but it could be within 1% of a simple linear model that trains in 100 milliseconds. Without that context, the evaluator cannot assess your model's value - and neither can you.
Model selection in a take-home is not about finding the best possible model. It is about demonstrating a principled process that starts simple, adds complexity only when justified, and communicates the reasoning at every step.
What You Will Master
- Why baselines are non-negotiable and how to build them
- A decision framework for choosing models based on data characteristics
- How to compare models fairly using proper cross-validation
- Hyperparameter tuning strategies that maximize signal per hour invested
- When to stop iterating and ship your solution
- How to document model selection rationale for evaluators
Self-Assessment: Where Are You Now?
| Level | Description | Target |
|---|---|---|
| Beginner | "I use XGBoost for everything" | Read all parts - you need the full framework |
| Intermediate | "I try multiple models but am unsure about proper comparison" | Focus on Parts 3-4 (comparison and tuning) |
| Advanced | "I have a good process but want to optimize for time pressure" | Jump to Parts 5 and 7 (stopping criteria and documentation) |
Part 1 - The Baseline Imperative
Why Baselines Are Non-Negotiable
A baseline is the simplest possible model that establishes the performance floor. Without it, you cannot answer the fundamental question: "Is my model adding value?"
Building Baselines for Every Task Type
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.metrics import (
    roc_auc_score, average_precision_score, f1_score,
    mean_squared_error, mean_absolute_error, r2_score,
)
import numpy as np


def build_classification_baselines(X_train, y_train, X_test, y_test):
    """Build baseline models for classification tasks.

    These baselines establish the performance floor.
    Any model that does not beat these is adding no value.
    """
    baselines = {
        "Always majority class": DummyClassifier(strategy="most_frequent"),
        "Class proportions": DummyClassifier(strategy="stratified", random_state=42),
        "Always positive": DummyClassifier(strategy="constant", constant=1),
    }
    print("BASELINE MODELS")
    print("=" * 70)
    results = {}
    for name, model in baselines.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        # For AUC, we need probability estimates
        try:
            y_prob = model.predict_proba(X_test)[:, 1]
            auc = roc_auc_score(y_test, y_prob)
        except Exception:
            auc = 0.5  # Random baseline AUC
        f1 = f1_score(y_test, y_pred, zero_division=0)
        accuracy = (y_pred == y_test).mean()
        results[name] = {"accuracy": accuracy, "f1": f1, "auc": auc}
        print(f" {name:30s} Accuracy={accuracy:.4f} F1={f1:.4f} AUC={auc:.4f}")
    print("\n Any useful model MUST beat these baselines.")
    return results


def build_regression_baselines(X_train, y_train, X_test, y_test):
    """Build baseline models for regression tasks."""
    baselines = {
        "Predict mean": DummyRegressor(strategy="mean"),
        "Predict median": DummyRegressor(strategy="median"),
    }
    print("BASELINE MODELS")
    print("=" * 70)
    results = {}
    for name, model in baselines.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        mae = mean_absolute_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)
        results[name] = {"rmse": rmse, "mae": mae, "r2": r2}
        print(f" {name:30s} RMSE={rmse:.4f} MAE={mae:.4f} R2={r2:.4f}")
    return results
"Every take-home project must start with a baseline. For classification, the baseline is the majority-class predictor (DummyClassifier). For regression, it is predicting the mean or median. The baseline tells you two things: (1) the minimum performance any useful model must exceed, and (2) how much room for improvement exists. If your fancy model only beats the baseline by 1%, that is important context - it means the problem is either very hard or your features are not informative enough. Evaluators always check for a baseline; not having one is a red flag."
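To make the point about baselines concrete, here is a minimal sketch on synthetic data. The 95/5 class split and every variable name are illustrative assumptions, not part of the chapter's dataset. It shows why a single accuracy number means little without a reference: the majority-class predictor already scores high accuracy while its AUC stays at chance level.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced problem (assumed for illustration only)
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The majority-class predictor never looks at the features
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
baseline_auc = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])

print(f"Majority-class accuracy: {baseline_acc:.3f}")
print(f"Majority-class AUC:      {baseline_auc:.3f}")
```

On imbalanced data, quoting accuracy alone would make this do-nothing model look strong, which is one reason to report the baseline on the same metric you use for the real models.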
Part 2 - Choosing the Right Model
The Model Selection Decision Tree
Model Selection Guide by Problem Type
| Problem Type | Start With | Then Try | Avoid Unless Justified |
|---|---|---|---|
| Binary classification (tabular) | Logistic Regression | Random Forest, XGBoost/LightGBM | Neural networks on small data |
| Multi-class classification | Logistic Regression (OVR) | Random Forest, XGBoost | SVM (slow on large data) |
| Regression (tabular) | Linear Regression / Ridge | Random Forest, XGBoost/LightGBM | Deep learning on tabular |
| Text classification | TF-IDF + Logistic Regression | TF-IDF + SVM, DistilBERT | GPT-4 for simple classification |
| Time series forecasting | Simple exponential smoothing | ARIMA/SARIMA, Prophet, LightGBM | LSTM on < 1000 data points |
| Image classification | Transfer learning (ResNet/EfficientNet) | Fine-tuning pre-trained model | Training from scratch (unless huge data) |
| Recommendation | Popularity baseline | Collaborative filtering, ALS | Graph neural networks |
| Anomaly detection | Isolation Forest | Local Outlier Factor, Autoencoders | Supervised classifiers when you have no labels |
"I used XGBoost because it wins Kaggle competitions" is a statement that evaluators hear constantly and view negatively. It signals that you apply the same tool to every problem without thinking. What evaluators want to hear: "I started with logistic regression as a baseline because it is fast, interpretable, and establishes a performance floor. I then tried gradient boosting to capture potential non-linear interactions. Gradient boosting improved AUC by 0.04, which I consider meaningful for this problem."
When Simple Models Win
There are several situations where simple models outperform complex ones - and recognizing them is a sign of maturity:
| Situation | Why Simple Wins | What to Do |
|---|---|---|
| Small dataset (< 1K rows) | Complex models overfit; not enough data to learn patterns | Stick with logistic regression or regularized models |
| Nearly linear relationship | A linear model already captures the signal; extra complexity only adds variance | Linear model with good features |
| Many irrelevant features | Tree ensembles can be distracted by noise features | Feature selection + simple model |
| Interpretability required | Stakeholders need to understand predictions | Linear model with inspectable coefficients |
| Time-critical inference | Predictions needed in < 1ms | Logistic regression or shallow tree |
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.ensemble import (
    RandomForestClassifier, GradientBoostingClassifier,
    RandomForestRegressor, GradientBoostingRegressor,
)
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
import time


def systematic_model_comparison(
    X_train, y_train,
    task: str = "classification",
    cv: int = 5,
    scoring: str = None,
) -> dict:
    """Compare models systematically with proper cross-validation.

    Starts simple and adds complexity. Documents training time for each model.
    """
    if scoring is None:
        scoring = "roc_auc" if task == "classification" else "neg_root_mean_squared_error"
    # Preprocessing lives inside the pipeline so it is re-fit within each CV fold
    preprocessor = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ])
    if task == "classification":
        models = {
            "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
            "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
            "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
        }
    else:
        models = {
            "Ridge Regression": Ridge(alpha=1.0),
            "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
            "Gradient Boosting": GradientBoostingRegressor(n_estimators=100, random_state=42),
        }
    results = {}
    print(f"MODEL COMPARISON ({cv}-fold CV, metric={scoring})")
    print("=" * 70)
    for name, model in models.items():
        pipeline = Pipeline([("prep", preprocessor), ("model", model)])
        start_time = time.time()
        scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring=scoring)
        train_time = time.time() - start_time  # Total wall time across all folds
        results[name] = {
            "mean": scores.mean(),
            "std": scores.std(),
            "scores": scores,
            "train_time": train_time,
        }
        print(f" {name:30s} {scoring}={scores.mean():.4f} (+/- {scores.std():.4f}) Time: {train_time:.1f}s")
    # Highlight best model (sklearn scorers are higher-is-better, including neg_* metrics)
    best = max(results.items(), key=lambda x: x[1]["mean"])
    print(f"\n Best: {best[0]} ({scoring}={best[1]['mean']:.4f})")
    return results
Part 3 - Comparing Models Fairly
Cross-Validation Strategy
The single most important principle: never select a model based on test set performance. Use cross-validation on the training set for model selection, then evaluate the final chosen model on the test set exactly once.
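The principle above can be sketched as a three-step workflow: hold out a test set, pick the model using cross-validation on the training portion only, then score the chosen model on the test set a single time. The synthetic dataset and the two candidate models are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Step 1: hold out a test set and do not touch it during selection
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 2: select using cross-validation on the training data only
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
cv_means = {
    name: cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
chosen = max(cv_means, key=cv_means.get)

# Step 3: refit the chosen model on all training data, evaluate on test once
final_model = candidates[chosen].fit(X_train, y_train)
test_auc = roc_auc_score(y_test, final_model.predict_proba(X_test)[:, 1])

print(f"CV means: {cv_means}")
print(f"Chosen: {chosen}, test AUC (reported once): {test_auc:.3f}")
```

If the test AUC disappoints, the honest move is to report it as is rather than go back and pick a different model, since that would turn the test set into a second validation set.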
from sklearn.model_selection import StratifiedKFold, KFold, TimeSeriesSplit


def choose_cv_strategy(task_type: str, is_time_series: bool = False, n_samples: int = 1000):
    """Select the appropriate cross-validation strategy.

    Returns:
        Cross-validation splitter object
    """
    if is_time_series:
        # NEVER shuffle time series data
        return TimeSeriesSplit(n_splits=5)
    if task_type == "classification":
        # Stratified preserves class distribution in each fold
        return StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    # Regression
    if n_samples < 500:
        # Fewer folds keep each validation fold reasonably large on small data;
        # repeated K-fold is an alternative if the estimate is too noisy
        return KFold(n_splits=3, shuffle=True, random_state=42)
    return KFold(n_splits=5, shuffle=True, random_state=42)
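To see why the time series branch matters, the short sketch below prints the index ranges that TimeSeriesSplit produces on a toy array of 12 ordered observations (the array is an assumption for illustration). In every fold the training indices end before the validation indices begin, so the model is never fit on data from the future.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve observations in chronological order (toy data)
X = np.arange(12).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)
folds = list(splitter.split(X))

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(
        f"Fold {i}: train={train_idx.min()}..{train_idx.max()} "
        f"validate={test_idx.min()}..{test_idx.max()}"
    )
```

Note that the training window grows from fold to fold, so early folds are trained on less data and tend to score lower than later ones.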
These cross-validation mistakes will sink your submission:
- Selecting models based on test set performance. The test set is for final evaluation only. Using it for model selection means your test metric is optimistically biased.
- Not stratifying for imbalanced classification. If your target is 95/5, random splits can create folds with no positive examples.
- Random splits on time series. Shuffling lets future observations leak into the training set.
- Preprocessing outside the CV loop. Scaling, encoding, or feature selection must happen inside each fold. Use `Pipeline` to enforce this.
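The last mistake in the list is easy to underestimate, so here is a hedged demonstration on pure noise. The features and labels are random (an assumption made so that the true achievable accuracy is about 50%). Selecting features on the full dataset before cross-validation produces an inflated score, while doing the same selection inside a Pipeline gives an honest one.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Pure noise: no feature carries any information about the label
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))
y = rng.integers(0, 2, size=100)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# WRONG: the selector sees the labels of every row, including validation rows
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=cv
).mean()

# RIGHT: the selector is re-fit on the training part of each fold
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("model", LogisticRegression(max_iter=1000)),
])
honest_acc = cross_val_score(pipe, X, y, cv=cv).mean()

print(f"Selection outside CV (leaky): {leaky_acc:.3f}")
print(f"Selection inside Pipeline:    {honest_acc:.3f}")
```

The same reasoning applies to scalers, encoders, and imputers, although their leakage is usually smaller than that of label-aware steps such as feature selection or target encoding.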
Statistical Comparison of Models
When two models have similar cross-validation scores, how do you determine if one is actually better?
import numpy as np
from scipy import stats


def compare_models_statistically(
    scores_a: np.ndarray,
    scores_b: np.ndarray,
    name_a: str = "Model A",
    name_b: str = "Model B",
) -> dict:
    """Statistically compare two models using paired t-test on CV scores.

    Both score arrays must come from the same CV folds, in the same order.
    Note: The paired t-test on CV scores is imperfect (it underestimates
    variance due to overlapping training sets), but it is a common
    approach and better than just comparing means.
    """
    diff = scores_a - scores_b
    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
    print(f"MODEL COMPARISON: {name_a} vs {name_b}")
    print("=" * 60)
    print(f" {name_a}: {scores_a.mean():.4f} +/- {scores_a.std():.4f}")
    print(f" {name_b}: {scores_b.mean():.4f} +/- {scores_b.std():.4f}")
    print(f" Difference: {diff.mean():.4f} +/- {diff.std():.4f}")
    print(f" Paired t-test: t={t_stat:.3f}, p={p_value:.4f}")
    if p_value < 0.05:
        better = name_a if diff.mean() > 0 else name_b
        print(f" Conclusion: {better} is significantly better (p < 0.05)")
    else:
        print(f" Conclusion: No significant difference (p = {p_value:.3f})")
        print(f" Recommendation: Choose the simpler/faster model")
    return {"t_stat": t_stat, "p_value": p_value, "difference": diff.mean()}
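The docstring above notes that the plain paired t-test underestimates variance because CV training sets overlap. One commonly cited remedy, which this chapter does not use, is the corrected resampled t-test of Nadeau and Bengio, which inflates the variance term by n_test / n_train. The sketch below is an illustration under that assumption; the function name, the fold scores, and the fold sizes are all hypothetical.

```python
import numpy as np
from scipy import stats


def corrected_paired_ttest(scores_a, scores_b, n_train: int, n_test: int) -> dict:
    """Corrected resampled t-test (Nadeau-Bengio style) on paired CV scores.

    n_train and n_test are the number of rows in each fold's training and
    validation split. Hypothetical helper, not part of the chapter's code.
    """
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    k = diff.size
    var = diff.var(ddof=1)  # Sample variance of the per-fold differences
    if var == 0:
        raise ValueError("All fold differences are identical; the test is undefined.")
    # The plain paired test uses var / k; the correction adds n_test / n_train
    t_stat = diff.mean() / np.sqrt((1.0 / k + n_test / n_train) * var)
    p_value = 2 * stats.t.sf(abs(t_stat), df=k - 1)
    return {"t_stat": float(t_stat), "p_value": float(p_value)}


# Made-up 5-fold scores for two models evaluated on the same folds
scores_a = np.array([0.84, 0.85, 0.83, 0.86, 0.84])
scores_b = np.array([0.82, 0.84, 0.82, 0.83, 0.83])

naive_t, naive_p = stats.ttest_rel(scores_a, scores_b)
corrected = corrected_paired_ttest(scores_a, scores_b, n_train=800, n_test=200)

print(f"Plain paired t-test: t={naive_t:.3f}, p={naive_p:.4f}")
print(f"Corrected t-test:    t={corrected['t_stat']:.3f}, p={corrected['p_value']:.4f}")
```

The corrected test is always more conservative than the plain one, so a borderline "significant" difference from the plain test is a good candidate for falling back to the simpler model.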
The Simplicity Principle
When two models perform similarly, always choose the simpler one. Document why.
| When Gradient Boosting AUC = 0.85 and Logistic Regression AUC = 0.83 | Choose |
|---|---|
| Difference is statistically significant | Gradient Boosting, with documented justification |
| Difference is NOT statistically significant | Logistic Regression - simpler, faster, more interpretable |
| Interpretability is required | Logistic Regression - even if GB is significantly better |
| Production latency matters | Logistic Regression - unless the 0.02 AUC matters for the business |
Part 4 - Hyperparameter Tuning
How Much Tuning Is Expected?
| Take-Home Duration | Tuning Expectation | Approach |
|---|---|---|
| 4 hours | Minimal | Default parameters + 1-2 key adjustments |
| 8 hours | Light | RandomizedSearchCV with 20-50 iterations |
| Weekend | Moderate | RandomizedSearchCV + manual refinement of best |
| 1 week | Thorough | Full search for top 2 models, with documentation |
Practical Tuning Approach
import pandas as pd
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform, randint, uniform


def tune_model(
    pipeline,
    X_train, y_train,
    param_distributions: dict,
    n_iter: int = 50,
    cv: int = 5,
    scoring: str = "roc_auc",
) -> dict:
    """Tune model hyperparameters using randomized search.

    Why RandomizedSearchCV over GridSearchCV:
    - More efficient for high-dimensional parameter spaces
    - Often finds good parameters in fewer evaluations than a full grid
    - Allows specifying a compute budget (n_iter caps the number of candidates)
    """
    search = RandomizedSearchCV(
        pipeline,
        param_distributions=param_distributions,
        n_iter=n_iter,
        cv=cv,
        scoring=scoring,
        random_state=42,
        n_jobs=-1,
        verbose=0,
        return_train_score=True,
    )
    search.fit(X_train, y_train)
    print("HYPERPARAMETER TUNING RESULTS")
    print("=" * 60)
    print(f" Best {scoring}: {search.best_score_:.4f}")
    print(f" Best parameters:")
    for param, value in search.best_params_.items():
        print(f" {param}: {value}")
    # Check for overfitting
    results = pd.DataFrame(search.cv_results_)
    best_idx = search.best_index_
    train_score = results.loc[best_idx, "mean_train_score"]
    val_score = results.loc[best_idx, "mean_test_score"]
    gap = train_score - val_score
    print(f"\n Train score: {train_score:.4f}")
    print(f" Validation score: {val_score:.4f}")
    print(f" Gap: {gap:.4f}", end="")
    if gap > 0.05:
        print(" <-- Possible overfitting")
    else:
        print(" (acceptable)")
    return {
        "best_params": search.best_params_,
        "best_score": search.best_score_,
        "best_estimator": search.best_estimator_,
        "cv_results": results,
    }


# Example parameter distributions for common models.
# Keys use the "model__" prefix, so the pipeline's final step must be named "model".
# scipy's uniform(loc, scale) samples from [loc, loc + scale];
# randint(low, high) samples integers from low up to high - 1.
PARAM_DISTRIBUTIONS = {
    "logistic_regression": {
        # Regularization strength is searched on a log scale
        "model__C": loguniform(0.01, 10),
        "model__penalty": ["l1", "l2"],
        "model__solver": ["saga"],  # saga supports both l1 and l2
    },
    "random_forest": {
        "model__n_estimators": randint(50, 300),
        "model__max_depth": [None, 5, 10, 20, 30],
        "model__min_samples_split": randint(2, 20),
        "model__min_samples_leaf": randint(1, 10),
        "model__max_features": ["sqrt", "log2", None],
    },
    "gradient_boosting": {
        "model__n_estimators": randint(50, 300),
        "model__max_depth": randint(3, 10),
        "model__learning_rate": uniform(0.01, 0.3),  # [0.01, 0.31]
        "model__min_samples_split": randint(2, 20),
        "model__subsample": uniform(0.6, 0.4),  # [0.6, 1.0]
    },
    "xgboost": {
        "model__n_estimators": randint(50, 300),
        "model__max_depth": randint(3, 10),
        "model__learning_rate": uniform(0.01, 0.3),  # [0.01, 0.31]
        "model__subsample": uniform(0.6, 0.4),  # [0.6, 1.0]
        "model__colsample_bytree": uniform(0.6, 0.4),  # [0.6, 1.0]
        "model__reg_alpha": uniform(0, 1),
        "model__reg_lambda": uniform(0, 1),
    },
}
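The parameter names above all start with model__, which only works if the pipeline's final step is literally named "model". The following self-contained sketch shows that wiring on a small synthetic dataset. The data, the reduced search space, and the tiny n_iter are assumptions chosen to keep the example fast, not recommended settings.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# The step name "model" is what the "model__" prefix refers to
pipeline = Pipeline([
    ("prep", SimpleImputer(strategy="median")),
    ("model", RandomForestClassifier(random_state=42)),
])

# A deliberately small search space for a quick demonstration
param_distributions = {
    "model__n_estimators": randint(20, 60),
    "model__max_depth": [None, 5, 10],
    "model__min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="roc_auc",
    random_state=42,
)
search.fit(X, y)

print(f"Best CV AUC: {search.best_score_:.3f}")
print(f"Best params: {search.best_params_}")
```

If a step is renamed, for example to "clf", every key must change to the clf__ prefix, otherwise scikit-learn raises an error about an invalid parameter.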
Do not spend 3 hours tuning XGBoost hyperparameters to squeeze out 0.002 AUC improvement. Evaluators do not care about marginal gains from tuning. They care about whether you understand what the hyperparameters control and why you chose certain values. Saying "I increased min_samples_leaf to 5 to reduce overfitting, which I observed from the train-validation gap" is worth more than a 50-iteration randomized search with no explanation.
What Hyperparameters Actually Control
| Hyperparameter | What It Controls | Increase Effect | Decrease Effect |
|---|---|---|---|
| n_estimators (ensembles) | Number of trees | More capacity, slower | Less capacity, faster |
| max_depth (trees) | Tree complexity | More overfitting risk | More underfitting risk |
| learning_rate (boosting) | Step size per tree | Faster convergence | Better generalization |
| min_samples_split/leaf | Split constraints | More regularization | More flexibility |
| subsample (boosting) | Fraction of rows sampled per tree | Toward 1.0: less randomness, more deterministic | Below 1.0: more randomness, acts as regularization |
| C (logistic/SVM) | Inverse regularization strength | Less regularization | More regularization |
| alpha/lambda (regularization) | L1/L2 penalty strength | More sparsity/shrinkage | Less regularization |
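One row of the table can be checked empirically. The sketch below uses scikit-learn's validation_curve to measure how the train-validation gap changes with max_depth for a single decision tree. The noisy synthetic dataset and the depth grid are assumptions for illustration; the pattern to look for is the gap widening as depth grows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data so that deep trees have something to overfit
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)

depths = [1, 3, 5, 10, 20]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0),
    X, y,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="roc_auc",
)

# Average over folds, then compare training and validation performance
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
gaps = train_mean - val_mean

for depth, tr, va, gap in zip(depths, train_mean, val_mean, gaps):
    print(f"max_depth={depth:2d} train={tr:.3f} val={va:.3f} gap={gap:.3f}")
```

Quoting a small table like this in a write-up is a compact way to justify a chosen value instead of only reporting the search winner.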
Part 5 - When to Stop Iterating
The Diminishing Returns Curve
The Stopping Decision Framework
Stop iterating on models when any of these conditions are met:
| Condition | Why Stop | What to Do Instead |
|---|---|---|
| Diminishing returns | Last iteration improved < 0.5% | Invest time in write-up and code quality |
| Train-val gap growing | You are overfitting | Simplify model, add regularization |
| Time budget exhausted | Cannot afford more iterations | Document what you would try next |
| Performance plateau | Multiple approaches yield similar results | Choose simplest model, explain the plateau |
| Leaderboard-quality reached | Model is competitive with known benchmarks | Shift to communication and polish |
def should_stop_iterating(results: dict, time_remaining_hours: float) -> dict:
    """Decision framework for when to stop model iteration.

    Args:
        results: Dict of model_name -> {"mean": cv_score, "std": ...}
        time_remaining_hours: Hours left before deadline

    Returns:
        Decision dict with recommendation and reasoning
    """
    scores = [(name, r["mean"]) for name, r in results.items()]
    scores.sort(key=lambda x: x[1], reverse=True)
    best_name, best_score = scores[0]
    second_name, second_score = scores[1] if len(scores) > 1 else ("N/A", 0)
    improvement = best_score - second_score
    decision = {
        "best_model": best_name,
        "best_score": best_score,
        "improvement_over_next": improvement,
    }
    if time_remaining_hours < 1.5:
        decision["action"] = "STOP - insufficient time for another iteration"
        decision["focus"] = "Write-up, code cleanup, and final submission prep"
    elif improvement < 0.005:
        decision["action"] = "STOP - diminishing returns"
        decision["focus"] = "Choose simpler model, invest in documentation"
    elif improvement > 0.02:
        decision["action"] = "Consider one more iteration"
        decision["focus"] = "Feature engineering likely has more impact than more models"
    else:
        decision["action"] = "STOP after current model"
        decision["focus"] = "Error analysis on current best model"
    return decision
"The most common mistake in take-home projects is spending too much time on modeling and not enough on everything else. The 80/20 rule applies: 80% of your model performance comes from the first 20% of your modeling effort. Once you have a well-tuned model that beats the baseline meaningfully, shift your time to error analysis, code quality, and the write-up. A well-presented 0.85 AUC beats a poorly-presented 0.87 AUC every time."
Part 6 - Error Analysis
Going Beyond Aggregate Metrics
After selecting your model, analyze where it fails. This demonstrates depth of understanding that most candidates lack.
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix


def perform_error_analysis(
    model, X_test, y_test, feature_names: list[str], n_examples: int = 10
) -> None:
    """Analyze model errors to understand failure patterns (binary classification)."""
    y_true = np.asarray(y_test)  # Works for both pandas Series and NumPy arrays
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    # Confusion matrix breakdown
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("ERROR ANALYSIS")
    print("=" * 60)
    print(f" True Positives: {tp} (correctly predicted positive)")
    print(f" True Negatives: {tn} (correctly predicted negative)")
    print(f" False Positives: {fp} (predicted positive, actually negative)")
    print(f" False Negatives: {fn} (predicted negative, actually positive)")
    # Analyze false negatives (often most costly)
    test_df = pd.DataFrame(X_test, columns=feature_names)
    test_df["y_true"] = y_true
    test_df["y_pred"] = y_pred
    test_df["y_prob"] = y_prob
    fn_mask = (test_df["y_true"] == 1) & (test_df["y_pred"] == 0)
    tp_mask = (test_df["y_true"] == 1) & (test_df["y_pred"] == 1)
    print(f"\n False Negative Analysis (n={fn_mask.sum()}):")
    if fn_mask.sum() > 0:
        # Compare false negatives to correctly predicted positives.
        # Only the first 5 features are checked; order feature_names by
        # importance beforehand to inspect the most important ones.
        for col in feature_names[:5]:
            fn_mean = test_df.loc[fn_mask, col].mean()
            tp_mean = test_df.loc[tp_mask, col].mean()
            diff = abs(fn_mean - tp_mean)
            if diff > test_df[col].std() * 0.3:
                print(f" {col}: FN mean={fn_mean:.3f}, TP mean={tp_mean:.3f} (notable difference)")
    # Precision among predictions at or above each confidence threshold
    print(f"\n Prediction Confidence Analysis:")
    for threshold in [0.3, 0.5, 0.7, 0.9]:
        above = (y_prob >= threshold)
        if above.sum() > 0:
            precision = (y_true[above] == 1).mean()
            print(f" P(positive | prob >= {threshold}): {precision:.3f} (n={above.sum()})")
    # Error analysis is a strong signal to evaluators that you
    # understand your model's limitations and can debug in production
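The threshold loop in perform_error_analysis reports precision above a cutoff. A complementary view, offered here as an optional sketch rather than part of the chapter's function, is a per-bin reliability check with scikit-learn's calibration_curve, which compares the mean predicted probability in each bin with the observed positive rate. The synthetic data and the logistic regression model are assumptions for illustration.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]

# prob_true: observed positive rate per bin
# prob_pred: mean predicted probability per bin
prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=5)

for observed, predicted in zip(prob_true, prob_pred):
    print(f"mean predicted={predicted:.2f} observed positive rate={observed:.2f}")

# Largest absolute disagreement between predicted and observed rates
max_gap = float(np.max(np.abs(prob_true - prob_pred)))
print(f"Largest bin gap: {max_gap:.3f}")
```

Large gaps in the high-probability bins are worth mentioning in the write-up, since they indicate the scores should not be read as literal probabilities.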
Part 7 - Documenting Model Selection
The Documentation Template
Your notebook should contain a clear summary of the model selection process:
## Model Selection Summary
### Models Evaluated
| Model | CV AUC (mean +/- std) | Training Time | Notes |
|-------|----------------------|---------------|-------|
| Baseline (majority) | 0.500 +/- 0.000 | < 1s | Performance floor |
| Logistic Regression | 0.762 +/- 0.018 | 2s | Strong baseline, interpretable |
| Random Forest | 0.801 +/- 0.022 | 12s | Better, but higher variance |
| **Gradient Boosting** | **0.841 +/- 0.015** | 28s | **Best performance, selected** |
### Selection Rationale
I selected Gradient Boosting as the final model because:
1. It achieves the highest cross-validation AUC (0.841)
2. The improvement over Logistic Regression (+0.079) is statistically significant (paired t-test, p = 0.003)
3. The improvement over Random Forest (+0.040) is borderline significant (p = 0.048) but consistent across folds
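4. The train-validation gap (0.92 vs 0.84) is above my 0.05 rule of thumb, which points to some overfitting; I accept it because the CV scores are stable across folds (std 0.015), and I flag it as a risk to confirm on the test set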
### Hyperparameter Tuning
Key parameters tuned via RandomizedSearchCV (50 iterations, 5-fold CV):
- `n_estimators`: 200 (increased from default 100 for more capacity)
- `max_depth`: 5 (raised from scikit-learn's default of 3 to capture feature interactions, checked against the train-validation gap)
- `learning_rate`: 0.08 (reduced for better generalization)
- `min_samples_leaf`: 4 (increased for regularization)
### What I Would Try With More Time
1. **Feature engineering:** Interaction features between top predictors
2. **Stacking:** Combine LR and GB predictions with a meta-learner
3. **Calibration:** Apply Platt scaling to improve probability estimates
4. **Feature selection:** Use SHAP values to prune low-importance features
Practice Exercises
Exercise 1: Build the Full Pipeline
Using any classification dataset, implement the complete model selection pipeline:
- Build 3 baselines
- Train 3 real models with pipelines (prevent leakage)
- Compare with cross-validation
- Tune the best model
- Evaluate on the test set (once)
- Write the model selection summary
Time yourself. Target: 90 minutes.
Exercise 2: Defend Your Choice
After completing Exercise 1, write answers to these evaluator questions:
- "Why did you choose this model over the alternatives?"
- "How do you know you are not overfitting?"
- "What would you do differently with 10x more data?"
- "What would you do differently with 10x less data?"
Exercise 3: The Simplicity Challenge
Take a problem where you would normally use XGBoost. Build only a logistic regression with thoughtful feature engineering. See how close you can get to the XGBoost performance. Document the comparison.
Interview Cheat Sheet
| Question | Key Points |
|---|---|
| "How do you select a model?" | Start with baseline, try simple first, add complexity only when justified by CV improvement |
| "What is your baseline?" | Task-dependent: majority class (classification), mean (regression), popularity (recsys) |
| "How do you compare models fairly?" | Same cross-validation folds, same metrics, statistical significance test for close results |
| "When do you choose simple over complex?" | When improvement is not significant, interpretability needed, small data, tight latency requirements |
| "How do you tune hyperparameters?" | RandomizedSearchCV with meaningful parameter ranges; understand what each parameter controls |
| "How do you know you are not overfitting?" | Train-validation gap < 0.05, stable CV scores across folds, performance holds on test set |
| "When do you stop iterating?" | Diminishing returns, time constraints, or performance plateau - document what you would try next |
| "What is the most important hyperparameter?" | Depends on model: regularization (linear), max_depth (trees), learning_rate (boosting) |
Next Steps
You now have a principled model selection process. But the model is only as good as the code it lives in. The next chapter, Code Quality Standards, covers how to write production-quality code for take-home projects - notebook organization, function decomposition, testing, and reproducibility.
