Technical Debt in ML Systems
Reading time: 35–40 min | Relevance: ML Engineer, MLOps Engineer, Senior Data Scientist, Engineering Manager
The Six-Month "Two-Week" Project
It started as a simple task. The team needed to retrain the churn prediction model - the one that had been running successfully in production for 18 months. The data had changed, the business had evolved, and the model's performance had declined. A straightforward retraining job. The tech lead estimated two weeks.
Six months later, they were still working on it.
Week one: they discovered nobody knew where the original training data was. The data scientist who built it had left the company. Her laptop had been wiped. There was a reference to an S3 path in a Slack message from 14 months ago, but the bucket had been reorganized twice since then. The data was gone.
Week two: they found the training script - in a personal Git branch that was never merged, with 23 uncommitted local changes. The script had hardcoded paths to the former data scientist's local machine. It imported a library that had since released a breaking version change. It referenced a "utils.py" file that didn't exist in the repository at all.
Week three: they reconstructed enough to run a training job. The results were worse than the original model despite using more recent data. After two days of debugging, they found the issue: the feature engineering pipeline in production had been quietly updated by a data engineer six months ago to fix a timezone bug, but the training pipeline had never been updated. Training and serving were computing the same features differently. The model they trained couldn't match production feature distributions.
Month two: they discovered the model had three undocumented "consumers" - downstream services that called it directly and depended on its output format. One was a reporting service. One was a pricing engine. One was a customer segmentation tool that had been built by a contractor and was no longer maintained. Changing the model's output format to accommodate the new architecture would break all three.
Month three through six: untangling the mess. Writing contracts, migrating consumers, rebuilding the feature pipeline from scratch, documenting everything that should have been documented at the start.
The two-week job took six months because the team had accumulated eighteen months of ML technical debt with no plan to pay it down. This lesson is about how that debt accrues - and how to stop it.
Why This Exists
Ward Cunningham coined the term "technical debt" in 1992 to describe the long-term maintenance cost of short-term coding decisions. The analogy is financial: you can borrow against the future (ship code faster now, clean it up later), but interest accrues. The longer you wait to pay the debt, the more you pay.
ML systems accrue technical debt in all the same ways traditional software does - plus a set of additional failure modes that are unique to machine learning. In 2015, a team of Google engineers published "Hidden Technical Debt in Machine Learning Systems" (Sculley et al., NeurIPS 2015), one of the most influential papers in applied ML engineering. Their central claim: the code you write to train and serve a model is typically a small fraction of the total system. The majority of an ML system is glue, pipelines, configuration, data management, and infrastructure - and all of it accrues debt in ways that are invisible until they catastrophically fail.
Understanding these debt categories is the difference between building an ML system that runs well for three years and one that collapses under its own weight after six months.
Historical Context: The Sculley et al. Paper
The 2015 Google paper opens with a striking diagram: a small box labeled "ML Code" surrounded by a much larger collection of boxes labeled "Data Collection," "Feature Extraction," "Data Verification," "Process Management Tools," "Analysis Tools," "Serving Infrastructure," and "Configuration." The ML code is the visible tip of an iceberg. Everything else is submerged - and in most teams, poorly engineered, poorly documented, and poorly maintained.
The paper identified seven distinct categories of technical debt specific to ML systems. In the decade since, practitioners have validated and extended these categories. They remain the canonical taxonomy for thinking about ML system health.
The paper's key provocation, paraphrased: it is remarkably easy to incur technical debt in machine learning systems, and remarkably difficult to pay it down. Every shortcut that speeds up model development - hardcoded paths, undocumented feature assumptions, skipped validation, copy-pasted pipeline code - is a future maintenance burden that compounds over time.
The Seven Categories of ML Technical Debt
Category 1: Entanglement (The CACE Principle)
CACE: Changing Anything Changes Everything.
This is the most fundamental form of ML technical debt, and the hardest to avoid. In a trained model, all features are entangled. You cannot improve one feature in isolation. Changing one feature changes the model's internal representation of all other features.
A concrete example: your fraud detection model has 50 features. A data scientist discovers that feature #23 (a rolling 7-day transaction amount) can be made more robust to outliers by switching its aggregation from a mean to a median. They make the change and retrain. Precision improves by 2%, but recall drops by 4%. Feature importance rankings shift. The model now handles high-volume accounts differently. The previously calibrated decision threshold is now wrong.
None of this was intended. The "simple" improvement to one feature propagated through the entire model.
# Illustrating CACE: a change to one feature cascades everywhere

# Original feature: mean amount over each user's last 7 rows
# (this is a 7-day window only if the data has exactly one row per user per day)
df["tx_mean_7d"] = df.groupby("user_id")["amount"].transform(
    lambda x: x.rolling(7).mean()
)

# "Improvement": switch to median (more robust to outliers)
# Note that the column keeps its old name even though it now holds a median
df["tx_mean_7d"] = df.groupby("user_id")["amount"].transform(
    lambda x: x.rolling(7).median()
)

# What seems like a 1-line change actually:
# 1. Changes the feature distribution (median is not mean)
# 2. Changes correlations between this feature and all correlated features
# 3. Shifts the model's decision boundary in all subspaces involving this feature
# 4. Invalidates any hyperparameter tuning done with the old feature
# 5. Potentially invalidates the threshold calibration
# 6. Changes feature importance rankings for all features

# The correct approach: treat feature changes as model version bumps
# with full re-evaluation, not as small patches
The deeper CACE problem: this entanglement applies not just to features but to any component of the ML system. Changing the objective function changes which examples the model cares about. Changing the data split changes what the model considers its training distribution. Changing the preprocessing order can change feature scaling. No change to a trained model system is truly isolated.
Debt mitigation: evaluate every change to features, preprocessing, or model configuration as a full retraining cycle with complete evaluation. Build ablation testing infrastructure that can compare "new feature" vs "old feature" with statistical rigor. Never treat a feature change as a "quick patch."
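One way to give the "old feature vs new feature" comparison some statistical rigor is a paired bootstrap over the validation set. The sketch below is a minimal illustration, not a prescribed method: the function names, the choice of precision as the metric, and the 95% interval are all assumptions made for the example. Both models are scored on the same resampled rows each time, so the interval describes the difference between them rather than the noise in either one.

```python
import random
from typing import Callable, Sequence


def precision(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predicted positives that are correct (0.0 if nothing is predicted positive)."""
    predicted_pos = sum(1 for p in y_pred if p == 1)
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return true_pos / predicted_pos if predicted_pos else 0.0


def paired_bootstrap_delta(
    y_true: Sequence[int],
    pred_old: Sequence[int],
    pred_new: Sequence[int],
    metric: Callable[[Sequence[int], Sequence[int]], float] = precision,
    n_resamples: int = 2000,
    seed: int = 0,
) -> dict:
    """Bootstrap a 95% interval for metric(new) - metric(old) on shared resamples."""
    rng = random.Random(seed)
    n = len(y_true)
    deltas = []
    for _ in range(n_resamples):
        # Resample row indices with replacement; reuse them for both models.
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        new_score = metric(yt, [pred_new[i] for i in idx])
        old_score = metric(yt, [pred_old[i] for i in idx])
        deltas.append(new_score - old_score)
    deltas.sort()
    ci_low = deltas[int(0.025 * n_resamples)]
    ci_high = deltas[int(0.975 * n_resamples) - 1]
    return {
        "observed_delta": metric(y_true, pred_new) - metric(y_true, pred_old),
        "ci_low": ci_low,
        "ci_high": ci_high,
        # The change is only called significant if the interval excludes zero.
        "significant": ci_low > 0 or ci_high < 0,
    }
```

Running the same comparison for recall and for the calibrated threshold would surface the kind of side effect described above, where one metric improves while another quietly regresses.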
Category 2: Hidden Feedback Loops
This is perhaps the most insidious form of ML technical debt because it can be entirely invisible during development - and it grows worse over time.
A hidden feedback loop occurs when a model's predictions influence the future training data that will be used to retrain the model. The model is no longer learning from the world - it is partly learning from its own past decisions.
Classic examples:
Fraud detection: the model flags 100 transactions as fraudulent, and human reviewers only investigate those 100. The remaining transactions (some of which are actual fraud) are never reviewed, so they never appear as fraud in the labels. The next version of the model is trained on data where "fraud" only appears in cases where the previous model was already suspicious. The model learns to agree with itself more than to detect fraud accurately.
Content recommendation: a model recommends video A over video B to 1 million users. Video A gets 50,000 more views and 2,000 more comments. The model retrains on this data and concludes video A is intrinsically more engaging. But the engagement difference was caused by the recommendation - not by inherent quality. The model is amplifying its own initial (possibly arbitrary) preferences.
Loan approval: a model denies loans to applicants from a specific demographic group. Those applicants don't get loans. They don't build credit history. Future models see them as higher risk because of their thin credit files - a risk profile partly created by the previous model's denials.
# Detecting feedback loop risk: audit the label generation process
def audit_label_source(label_metadata: dict) -> dict:
    """
    Checks whether label generation could be influenced by model predictions.
    Returns a risk assessment.
    """
    feedback_risks = []

    # Risk 1: labels only generated when model raises an alert
    if label_metadata.get("label_trigger") == "model_alert":
        feedback_risks.append({
            "risk": "CRITICAL",
            "description": "Labels only exist for cases the model flagged. "
                           "Negative cases (model said 'no') are never labeled. "
                           "Model will learn to agree with itself.",
            "mitigation": "Implement random sampling: label N% of cases "
                          "regardless of model prediction to maintain unbiased ground truth."
        })

    # Risk 2: model output influences business action that creates labels
    if label_metadata.get("model_output_affects_outcome"):
        feedback_risks.append({
            "risk": "HIGH",
            "description": "Model decisions change what happens to cases, "
                           "which then determines the label. Self-reinforcing loop.",
            "mitigation": "Use counterfactual evaluation or holdout groups "
                          "that bypass the model's intervention."
        })

    # Risk 3: long label lag creates temporal confounding
    if label_metadata.get("label_lag_days", 0) > 30:
        feedback_risks.append({
            "risk": "MEDIUM",
            "description": f"Label lag of {label_metadata['label_lag_days']} days "
                           "means training data from period X is labeled partly by "
                           "events from period X+30 that may have been influenced by "
                           "model actions during the lag window.",
            "mitigation": "Be explicit about the temporal cutoff for labels. "
                          "Never use labels that could be influenced by model actions "
                          "after the prediction time."
        })

    # Overall rating: any CRITICAL or HIGH finding makes the overall risk HIGH
    severities = {r["risk"] for r in feedback_risks}
    if severities & {"CRITICAL", "HIGH"}:
        overall = "HIGH"
    elif feedback_risks:
        overall = "MEDIUM"
    else:
        overall = "LOW"

    return {
        "feedback_loop_risk": overall,
        "issues": feedback_risks
    }
Debt mitigation: document the label generation process for every model. Explicitly ask: "could the model's current predictions change what labels we collect?" Implement random sampling of cases that bypass the model's decision (holdout groups) to maintain an unbiased ground truth. Monitor for the "model agreement" metric - if the model's predictions correlate too strongly with historical model predictions in the training data, you have a feedback loop.
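The random-sampling holdout can be sketched as a routing rule that sends a small, score-independent fraction of cases to review. Everything here is illustrative: `route_for_review`, `estimate_base_rate`, the `explore_rate` parameter, and the record fields are hypothetical names, and a real system would also need to store the routing decision alongside each label.

```python
import random


def route_for_review(
    score: float, threshold: float, explore_rate: float, rng: random.Random
) -> dict:
    """Decide whether a case goes to human review, and record why."""
    # Exploration is checked first so that selection does not depend on the score.
    if rng.random() < explore_rate:
        return {"review": True, "source": "random_holdout"}
    if score >= threshold:
        return {"review": True, "source": "model_alert"}
    return {"review": False, "source": None}


def estimate_base_rate(reviewed_cases: list[dict]) -> float:
    """Estimate the population positive rate from random-holdout labels only.

    Labels that exist because the model raised an alert are excluded,
    since they are a biased sample of the population.
    """
    holdout = [c for c in reviewed_cases if c["source"] == "random_holdout"]
    if not holdout:
        raise ValueError("no random-holdout labels available")
    return sum(c["label"] for c in holdout) / len(holdout)
```

Because the holdout cases are chosen without looking at the score, their labels give an estimate of how much fraud the model is missing, which alert-triggered labels alone cannot provide.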
Category 3: Undeclared Consumers
A model is deployed. Over the following months, other teams discover it and start using it - not through a formal API contract, but by calling the endpoint directly or reading from the prediction table. Nobody documents these consumers. The model's output format is not treated as a contract.
This is exactly the situation the team in the opening scenario discovered: three downstream services depending on their model, one of which had been built by a contractor who had left the company. Changing the model to improve it would break services they didn't know existed.
# What undeclared consumer debt looks like in practice
# Model originally returns:
# {"fraud_score": 0.87, "label": "FRAUD"}
# Downstream service A (documented, known):
score = response["fraud_score"] # uses the score correctly
# Downstream service B (undocumented, discovered months later):
is_fraud = response["label"] == "FRAUD" # depends on exact string value
# Downstream service C (completely unknown until breakage):
# Parses the raw JSON and hardcodes field position in a legacy integration
import json
raw = json.loads(response_body)
fields = list(raw.keys())
score_field = fields[0] # assumes "fraud_score" is always first key
# Prevention: explicit consumer registry
# Every consumer must register to receive model output schema change notifications
model_consumer_registry = {
    "model_id": "fraud_detector_v3",
    "declared_consumers": [
        {
            "service": "payment_gateway",
            "team": "payments",
            "fields_used": ["fraud_score"],
            "threshold_dependency": 0.85,  # their decision threshold
            "registered_date": "2024-01-15"
        },
        {
            "service": "fraud_review_dashboard",
            "team": "risk",
            "fields_used": ["fraud_score", "label", "top_features"],
            "threshold_dependency": None,
            "registered_date": "2024-02-01"
        }
    ],
    "schema_version": "2.1",
    "schema": {
        "fraud_score": "float [0.0, 1.0]",
        "label": "string enum ['FRAUD', 'LEGITIMATE']",
        "top_features": "list of {name: string, importance: float}"
    },
    "schema_change_policy": "30 days notice via registered contacts before any breaking change"
}
Debt mitigation: publish a consumer registry for every production model. Require registration as a prerequisite for accessing the model's output. Treat the output schema as a versioned API contract. Announce schema changes with minimum 30-day notice to all registered consumers.
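Treating the schema as a contract becomes practical once a proposed change can be checked against the registry automatically. The following is a rough sketch that assumes the registry shape shown earlier (`schema` and `declared_consumers` with `fields_used`); the helper names `diff_schema` and `affected_consumers` are invented for this example, and type compatibility is reduced to simple string equality.

```python
def diff_schema(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify field-level differences between two output schemas."""
    return {
        "removed": sorted(set(old) - set(new)),
        "added": sorted(set(new) - set(old)),
        "retyped": sorted(k for k in set(old) & set(new) if old[k] != new[k]),
    }


def affected_consumers(registry: dict, new_schema: dict[str, str]) -> dict[str, list[str]]:
    """Map each registered service to the fields it uses that would break."""
    diff = diff_schema(registry["schema"], new_schema)
    # Removed and retyped fields are breaking; purely added fields are not.
    breaking = set(diff["removed"]) | set(diff["retyped"])
    impact = {}
    for consumer in registry["declared_consumers"]:
        hit = sorted(breaking & set(consumer["fields_used"]))
        if hit:
            impact[consumer["service"]] = hit
    return impact


# Hypothetical example with a trimmed-down registry.
registry = {
    "schema": {"fraud_score": "float", "label": "string"},
    "declared_consumers": [
        {"service": "payment_gateway", "fields_used": ["fraud_score"]},
        {"service": "fraud_review_dashboard", "fields_used": ["fraud_score", "label"]},
    ],
}
proposed = {"fraud_score": "float", "risk_band": "string"}  # drops "label"
print(affected_consumers(registry, proposed))
# {'fraud_review_dashboard': ['label']}
```

A check like this only protects registered consumers, which is why registration has to be a prerequisite for access rather than a courtesy.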
Category 4: Data Dependencies
Software dependencies (libraries, services) are typically declared, versioned, and managed. Data dependencies in ML systems are almost never managed with the same rigor - and they are far more dangerous.
A model has three classes of data dependencies:
Unstable data dependencies: input features that are computed by an upstream pipeline that can change silently. If the upstream team adds a normalization step, renames a field, or changes the time window of an aggregation, the model's input distribution shifts without any signal to the ML team.
Underutilized data dependencies: features that are in the model but contribute essentially nothing (near-zero importance). These are pure maintenance burden. Every feature in the model requires a data pipeline to compute it in production, documentation, a schema contract, and a drift monitor. A feature with 0.01% importance costs as much to maintain as a feature with 20% importance.
Legacy features: features that were included in the original model for a reason that no longer exists. Often impossible to remove because nobody knows if their removal would cause a regression (see CACE).
# Data dependency audit: find underutilized and unstable features
import pandas as pd
from sklearn.inspection import permutation_importance


def audit_feature_dependencies(
    model,
    X_val: pd.DataFrame,
    y_val: pd.Series,
    feature_metadata: dict,
    importance_threshold: float = 0.001,
) -> pd.DataFrame:
    """
    Audit all model features for dependency health.
    Returns a DataFrame with per-feature risk assessment.
    """
    # Get permutation importances (more reliable than impurity-based)
    perm_imp = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=42)
    importances = perm_imp.importances_mean

    audit_results = []
    for i, feature_name in enumerate(X_val.columns):
        meta = feature_metadata.get(feature_name, {})
        importance = importances[i]
        risks = []

        if importance < importance_threshold:
            risks.append(f"UNDERUTILIZED: importance {importance:.5f} < threshold {importance_threshold}")
        if meta.get("upstream_owner") is None:
            risks.append("NO_OWNER: no declared upstream owner for this feature")
        if meta.get("schema_version") is None:
            risks.append("NO_CONTRACT: feature has no schema version contract")
        if meta.get("last_pipeline_review_days", 999) > 180:
            risks.append("STALE: feature pipeline not reviewed in 180+ days")
        if meta.get("is_stable", True) is False:
            risks.append("UNSTABLE: upstream marks this feature as subject to changes")

        audit_results.append({
            "feature": feature_name,
            "permutation_importance": round(importance, 5),
            "upstream_owner": meta.get("upstream_owner", "UNKNOWN"),
            "schema_version": meta.get("schema_version", "NONE"),
            "risk_count": len(risks),
            "risks": "; ".join(risks) if risks else "OK",
        })

    df = pd.DataFrame(audit_results).sort_values("risk_count", ascending=False)
    return df
Category 5: Pipeline Jungles and Dead Experimental Code
Over time, ML repositories accumulate two species of structural debt:
Pipeline jungles: ad-hoc data preparation scripts that grow organically. Each new feature or data source adds another script, another intermediate file, another join. Nobody has ever refactored the whole thing. The result is a tangled mess of interdependent scripts where nobody is confident what runs before what, which files are inputs and which are outputs, and what happens if one script fails.
# A pipeline jungle, as it typically grows:
scripts/
    prepare_data.py                     # original, runs first
    prepare_data_v2.py                  # "fixed the timezone bug," runs after v1
    prepare_data_new.py                 # "faster version," runs instead of v2 in prod
    prep_features.py                    # runs after prepare_data_new.py
    prep_features_fixed.py              # patches a bug in prep_features.py
    merge_all.py                        # joins all intermediate files (order matters!)
    merge_all_with_demographics.py      # fork of merge_all for a different segment
    fix_missing_values.py               # runs after merge, handles NaNs
    fix_missing_values_v2.py            # "updated to handle new data schema"
    final_features.py                   # produces the actual feature matrix
    final_features_with_extra_cols.py   # added 3 features for experiment that was never cleaned up
Nobody knows which scripts are canonical. Nobody knows what final_features_with_extra_cols.py was for. Deleting it feels risky because it might be used somewhere. The pipeline is untestable, unreproducible, and unmaintainable.
Dead experimental code: every model development cycle creates experimental branches, ablation variants, and "let me try this quickly" code. In healthy engineering organizations, this code is either merged (if it worked) or deleted (if it didn't). In ML teams, it accumulates because "we might need it later." The codebase fills with model variants, unused feature computation paths, and commented-out hyperparameter sets that nobody remembers the context for.
# Dead experimental code, as it actually appears:
from sklearn.ensemble import GradientBoostingClassifier


def train_model(X_train, y_train, config):
    if config.get("use_v2_features"):
        # TODO: test this more - seems to help on the fraud dataset but not sure
        X_train = apply_v2_features(X_train)  # what does this do? who wrote this?

    model = GradientBoostingClassifier(
        n_estimators=config.get("n_estimators", 200),
        # max_depth=8,  # tried this - was worse on 2024-01 data
        # max_depth=4,  # also worse
        max_depth=config.get("max_depth", 6),  # current best
        learning_rate=config.get("learning_rate", 0.1),
        # subsample=0.8,  # experiment from feb, didn't help
    )

    # Old approach - keeping for reference
    # model = RandomForestClassifier(n_estimators=500, max_depth=10)

    model.fit(X_train, y_train)
    return model
Debt mitigation: define your data pipeline as a DAG using Airflow, Prefect, or DVC pipelines. Every script becomes a named task with explicit inputs and outputs, so there are no undeclared intermediate files and no ambiguous execution order. Enforce a "no commented-out code" rule in ML repositories - if it's not running, delete it; version control will preserve it.
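The core idea behind those tools, that execution order is derived from declared inputs and outputs rather than from file names or tribal knowledge, fits in a few lines. This is a toy sketch using the standard library's `graphlib` (Python 3.9+), not how Airflow, Prefect, or DVC are configured; the task names and file paths are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: name -> (declared inputs, declared outputs).
TASKS = {
    "train": ({"interim/features.parquet"}, {"models/model.pkl"}),
    "build_features": ({"interim/clean.parquet"}, {"interim/features.parquet"}),
    "prepare_data": ({"raw/transactions.parquet"}, {"interim/clean.parquet"}),
}
RAW_SOURCES = {"raw/transactions.parquet"}  # files that no task produces


def execution_order(tasks: dict, raw_sources: set) -> list[str]:
    """Derive a valid run order from declared inputs and outputs."""
    # Each output must have exactly one producer.
    producer = {}
    for name, (_, outputs) in tasks.items():
        for out in outputs:
            if out in producer:
                raise ValueError(f"{out} is written by both {producer[out]} and {name}")
            producer[out] = name

    # A task depends on whichever task produces each of its inputs.
    graph = {}
    for name, (inputs, _) in tasks.items():
        deps = set()
        for inp in inputs:
            if inp in producer:
                deps.add(producer[inp])
            elif inp not in raw_sources:
                raise ValueError(f"{name} reads undeclared input {inp}")
        graph[name] = deps

    # static_order() raises graphlib.CycleError if the declarations are circular.
    return list(TopologicalSorter(graph).static_order())


print(execution_order(TASKS, RAW_SOURCES))
# ['prepare_data', 'build_features', 'train']
```

The useful property is the failure mode: a script that reads a file nobody declares, or two scripts that write the same file, is rejected before anything runs.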
Category 6: Configuration Debt
Configuration debt is the accumulation of magic numbers, undocumented hyperparameters, environment-specific hacks, and flags that nobody understands the purpose of anymore.
ML systems are particularly prone to this because hyperparameters are tuned empirically - and the conditions under which they were tuned (which dataset version, which feature set, which evaluation metric) are rarely documented alongside the values themselves.
# Configuration debt in practice:
class ModelConfig:
    # Hyperparameters (no documentation on how these were chosen)
    LEARNING_RATE = 0.0078   # why 0.0078? nobody knows
    N_ESTIMATORS = 347       # 347? that's a suspicious number
    MAX_DEPTH = 7
    SUBSAMPLE = 0.83

    # Feature thresholds (origin unknown)
    HIGH_RISK_THRESHOLD = 0.73    # calibrated on 2022 data?
    MEDIUM_RISK_THRESHOLD = 0.41
    LOW_VOLUME_FLAG = 150         # 150 transactions per day? per week? per month?

    # Environment hacks (reasons forgotten)
    BATCH_SIZE_PROD = 512
    BATCH_SIZE_STAGING = 128          # why smaller in staging? hardware? memory?
    SKIP_VALIDATION_FOR_DEMO = True   # this is in production. it was never removed.

    # Mysterious flags
    USE_LEGACY_SCORER = True               # if set to False, everything breaks (tested once in 2023)
    ENABLE_EXPERIMENTAL_FEATURES = False   # turned off after an incident, can it be re-enabled?
Every undocumented value in this config is debt. When the person who tuned LEARNING_RATE = 0.0078 leaves the company, the number becomes a mystery that nobody dares change. When SKIP_VALIDATION_FOR_DEMO = True gets left in production (as it inevitably does), it is a silent correctness bug that could go undetected for months.
# The right way: documented, typed, validated configuration
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ModelHyperparameters:
    """
    Hyperparameters for fraud_detector_v3.
    All values tuned via Optuna on transactions_v5.1 dataset (2024-01-15).
    Evaluation metric: AUC-ROC on validation split (2024-01-08 to 2024-01-14).
    """
    learning_rate: float = 0.0078
    # ^ Best of 200 Optuna trials. Search range: [0.001, 0.3].
    #   Lower rates with more estimators consistently outperformed on this dataset.
    n_estimators: int = 347
    # ^ Chosen by the same Optuna study, where validation loss had plateaued.
    #   Increasing beyond ~300 showed diminishing returns on this dataset size.
    max_depth: int = 7
    # ^ Balanced variance/bias on high-dimensional fraud features.
    #   Depths 5-8 performed similarly; 7 chosen for consistency with v2.
    tuning_date: date = field(default_factory=lambda: date(2024, 1, 15))
    tuning_dataset_hash: str = "sha256:8f3a21b4c9d7e5f2a1b8c3d4e5f6a7b8"
    tuning_metric: str = "auc_roc"
    tuning_tool: str = "optuna_v3.5.0"
    n_trials: int = 200


@dataclass
class DecisionThresholds:
    """
    Decision thresholds for fraud_detector_v3.
    Calibrated on validation data 2024-01-08 to 2024-01-14.
    Business requirement: precision >= 0.90 at threshold of 0.73 for high-risk queue.
    Review owner: Risk Operations ([email protected]).
    Last reviewed: 2024-01-17 by [email protected].
    """
    high_risk: float = 0.73
    medium_risk: float = 0.41
    recalibration_required_after: date = field(default_factory=lambda: date(2024, 7, 15))
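The classes above document the configuration; the "validated" part can be added with a `__post_init__` check and a guard against demo flags reaching production. This is a sketch under assumptions: `ValidatedThresholds` and `check_production_flags` are hypothetical names, and the rule that risky flags start with `SKIP_` or `DEBUG_` is a naming convention invented for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ValidatedThresholds:
    high_risk: float
    medium_risk: float
    recalibration_required_after: date

    def __post_init__(self) -> None:
        # Fail at construction time instead of silently serving bad thresholds.
        for name in ("high_risk", "medium_risk"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name}={value} is outside [0, 1]")
        if self.medium_risk >= self.high_risk:
            raise ValueError("medium_risk must be strictly below high_risk")

    def is_stale(self, today: date) -> bool:
        """True once the recalibration deadline has passed."""
        return today > self.recalibration_required_after


def check_production_flags(env: str, flags: dict[str, bool]) -> list[str]:
    """Return enabled flags that should never be on in production.

    Assumes a naming convention: risky flags start with SKIP_ or DEBUG_.
    """
    if env != "prod":
        return []
    return sorted(
        name for name, enabled in flags.items()
        if enabled and name.startswith(("SKIP_", "DEBUG_"))
    )
```

Calling `check_production_flags` at service startup and refusing to boot on a non-empty result is one way to keep something like `SKIP_VALIDATION_FOR_DEMO = True` from surviving unnoticed.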
Category 7: Changes in the External World
The final category is the least actionable but the most important to acknowledge: the world changes, and ML systems built on historical data become outdated.
Economic conditions change. User behavior evolves. Regulatory requirements shift. Competitors enter or exit markets. Seasonal patterns break. External APIs that provide features change their data. Third-party data providers update their methodologies.
Every one of these changes creates debt in ML systems because the systems were built on assumptions about the world that are no longer valid. Unlike software technical debt (which exists only in code), this debt cannot be repaid by refactoring - it can only be managed by building systems that detect and adapt to change.
The only long-term mitigation for external world debt is robust drift detection combined with automated retraining - building the assumption of change into the system's architecture rather than treating it as an exception.
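As one concrete example of drift detection, the Population Stability Index (PSI) compares how a feature is distributed in a reference sample (for instance, training data) against a recent production sample. The sketch below is a minimal pure-Python version with equal-width bins taken from the reference sample; the binning scheme, the `eps` floor for empty bins, and the function name are choices made for this illustration.

```python
import math
from typing import Sequence


def population_stability_index(
    expected: Sequence[float],
    actual: Sequence[float],
    n_bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI = sum over bins of (a - e) * ln(a / e), using bin fractions e and a."""
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("reference sample has no spread; cannot build bins")
    width = (hi - lo) / n_bins

    def bin_fractions(values: Sequence[float]) -> list[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range fall into the edge bins.
            idx = min(max(idx, 0), n_bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Identical samples give a PSI of 0. A commonly cited rule of thumb treats values above roughly 0.25 as a significant shift, but that cutoff is a convention rather than a statistical test, so it is worth calibrating per feature before wiring it to a retraining trigger.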
The ML System Iceberg
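The Google paper's most memorable insight: ML code is a small minority of the real system. Around it sit data collection, feature extraction, data verification, process management tools, analysis tools, serving infrastructure, and configuration.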
The "glue code" anti-pattern is when the submerged infrastructure is built through ad-hoc scripts, copy-pasted code, and one-off hacks rather than principled engineering. The ML code is clean; everything around it is chaos. This is the most common pattern in teams that hire strong data scientists but underinvest in ML engineering.
Measuring ML Technical Debt: A Scoring Rubric
Technical debt is real but hard to quantify. Here is a concrete scoring rubric with measurable criteria:
| Dimension | Score 0 (high debt) | Score 1 (medium) | Score 2 (low debt) |
|---|---|---|---|
| Reproducibility | Cannot reproduce any training run | Can reproduce recent runs only | Any run reproducible via git + dvc pull |
| Data versioning | No data versioning at all | Manual checksums or ad-hoc scripts | Automated DVC or Delta Lake versioning |
| Feature documentation | No feature docs | Some features documented | All features have owner, schema, importance |
| Pipeline structure | Pipeline jungle (many scripts, unclear order) | Partial DAG, some undocumented steps | Full DAG with inputs/outputs declared |
| Configuration | Magic numbers throughout | Some config files, minimal comments | All config documented with tuning context |
| Test coverage | No data or model tests | Unit tests on some pipeline code | Data quality tests + model evaluation gates |
| Consumer registry | Unknown consumers | Known consumers, no formal registry | Registered consumers with schema contracts |
| Monitoring | No production monitoring | Infrastructure monitoring only | Data + prediction + business metric monitoring |
| Dead code | Extensive dead experiments, commented code | Some dead code, mostly contained | No dead code in main pipeline |
| Feedback loop audit | Never done | Informal awareness | Documented audit with mitigation |
Total score out of 20:
- 16–20: Low debt - maintainable, safe to scale
- 10–15: Medium debt - address the specific low-scoring dimensions
- 0–9: High debt - systemic issues, significant risk of operational failure
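The rubric is simple enough to compute in code, which makes it easy to track per model over time. A small sketch follows; the dimension keys are snake_case versions of the table rows, and the function name and return shape are assumptions made for this example.

```python
RUBRIC_DIMENSIONS = (
    "reproducibility", "data_versioning", "feature_documentation",
    "pipeline_structure", "configuration", "test_coverage",
    "consumer_registry", "monitoring", "dead_code", "feedback_loop_audit",
)


def score_debt(scores: dict[str, int]) -> dict:
    """Total the ten rubric dimensions (each 0, 1, or 2) and assign a debt band."""
    missing = [d for d in RUBRIC_DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for dim in RUBRIC_DIMENSIONS:
        if scores[dim] not in (0, 1, 2):
            raise ValueError(f"{dim} must be 0, 1, or 2, got {scores[dim]}")

    total = sum(scores[d] for d in RUBRIC_DIMENSIONS)
    if total >= 16:
        band = "low"
    elif total >= 10:
        band = "medium"
    else:
        band = "high"

    # The lowest-scoring dimensions are where the debt is concentrated.
    worst = min(scores[d] for d in RUBRIC_DIMENSIONS)
    weakest = [d for d in RUBRIC_DIMENSIONS if scores[d] == worst]
    return {"total": total, "debt_band": band, "weakest_dimensions": weakest}
```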
A Technical Debt Scanner for ML Repos
"""
ml_debt_scanner.py
A static analysis tool that checks an ML repository for common
technical debt anti-patterns. Run this as a CI check or periodic audit.
"""
import ast
import os
import re
from pathlib import Path
from dataclasses import dataclass, field
from typing import Iterator
@dataclass
class DebtIssue:
severity: str # "CRITICAL" | "HIGH" | "MEDIUM" | "LOW"
category: str
file: str
line: int
description: str
suggestion: str
class MLDebtScanner:
def __init__(self, repo_root: str):
self.repo_root = Path(repo_root)
self.issues: list[DebtIssue] = []
def scan(self) -> list[DebtIssue]:
"""Run all checks across the repository."""
python_files = list(self.repo_root.rglob("*.py"))
for py_file in python_files:
if ".git" in str(py_file) or "__pycache__" in str(py_file):
continue
try:
source = py_file.read_text(encoding="utf-8")
self._check_hardcoded_paths(source, str(py_file))
self._check_magic_numbers(source, str(py_file))
self._check_no_data_validation(source, str(py_file))
self._check_train_test_leakage_risk(source, str(py_file))
self._check_untracked_experiments(source, str(py_file))
except (UnicodeDecodeError, SyntaxError):
pass
self._check_missing_dvc()
self._check_missing_requirements()
self._check_pipeline_jungle()
return sorted(self.issues, key=lambda x: ["CRITICAL","HIGH","MEDIUM","LOW"].index(x.severity))
def _check_hardcoded_paths(self, source: str, filepath: str) -> None:
"""Flag hardcoded absolute paths - non-portable, breaks reproducibility."""
patterns = [
r'/home/\w+/',
r'/Users/\w+/',
r'C:\\Users\\',
r'/mnt/data/',
]
for i, line in enumerate(source.split("\n"), 1):
for pattern in patterns:
if re.search(pattern, line) and not line.strip().startswith("#"):
self.issues.append(DebtIssue(
severity="HIGH",
category="Configuration Debt",
file=filepath,
line=i,
description=f"Hardcoded absolute path: {line.strip()[:80]}",
suggestion="Use pathlib.Path(__file__).parent or environment variables for paths."
))
def _check_magic_numbers(self, source: str, filepath: str) -> None:
"""Flag suspicious magic numbers that should be named constants."""
# Only flag in ML-related files (heuristic: file contains 'model', 'train', 'feature')
ml_keywords = ["model", "train", "feature", "predict", "fit", "pipeline"]
if not any(kw in filepath.lower() for kw in ml_keywords):
return
# Look for literal floats that look like thresholds (0.XX) not in comments
suspicious_float_pattern = r'(?<![#"\'])(?<!\w)0\.\d{2,4}(?!\d)(?!["\'])'
for i, line in enumerate(source.split("\n"), 1):
stripped = line.strip()
if stripped.startswith("#"):
continue
matches = re.findall(suspicious_float_pattern, line)
# Only flag if the float appears to be a hardcoded threshold (not in a list/dict literal)
if matches and "threshold" not in line.lower() and len(matches) >= 2:
self.issues.append(DebtIssue(
severity="MEDIUM",
category="Configuration Debt",
file=filepath,
line=i,
description=f"Possible magic numbers {matches}: {stripped[:80]}",
suggestion="Move thresholds and hyperparameters to a documented config class."
))
def _check_no_data_validation(self, source: str, filepath: str) -> None:
"""Flag training scripts that load data without any validation."""
train_indicators = ["fit(", "model.train", "train_model", "X_train"]
validation_indicators = [
"great_expectations", "pandera", "expect_", "assert",
"validate", "schema", "check_"
]
has_training = any(ind in source for ind in train_indicators)
has_validation = any(ind in source for ind in validation_indicators)
if has_training and not has_validation:
self.issues.append(DebtIssue(
severity="HIGH",
category="Data Dependencies",
file=filepath,
line=0,
description="Training script has no data validation checks.",
suggestion="Add schema validation (pandera, Great Expectations) before fitting the model."
))
def _check_train_test_leakage_risk(self, source: str, filepath: str) -> None:
"""Flag fitting transformers before train/test split."""
# Pattern: fit_transform or fit( appears before train_test_split
lines = source.split("\n")
fit_lines = [i for i, l in enumerate(lines) if "fit_transform(" in l or ".fit(" in l]
split_lines = [i for i, l in enumerate(lines) if "train_test_split" in l]
if fit_lines and split_lines:
earliest_fit = min(fit_lines)
earliest_split = min(split_lines)
if earliest_fit < earliest_split:
self.issues.append(DebtIssue(
severity="CRITICAL",
category="Entanglement",
file=filepath,
line=earliest_fit + 1,
description="fit() or fit_transform() appears BEFORE train_test_split - possible data leakage.",
suggestion="Always split data first, then fit transformers only on training split."
))
def _check_untracked_experiments(self, source: str, filepath: str) -> None:
"""Flag training scripts that don't use any experiment tracking."""
has_training = "model.fit(" in source or "train_model(" in source
has_tracking = any(t in source for t in ["mlflow", "wandb", "neptune", "comet"])
if has_training and not has_tracking:
self.issues.append(DebtIssue(
severity="MEDIUM",
category="Pipeline Jungle",
file=filepath,
line=0,
description="Training script has no experiment tracking.",
suggestion="Add MLflow, W&B, or Neptune tracking so results are reproducible and comparable."
))
def _check_missing_dvc(self) -> None:
"""Flag repositories with no DVC setup."""
dvc_dir = self.repo_root / ".dvc"
if not dvc_dir.exists():
self.issues.append(DebtIssue(
severity="HIGH",
category="Data Dependencies",
file=str(self.repo_root),
line=0,
description="No DVC configuration found. Datasets are not versioned.",
suggestion="Run `dvc init` and version all data files with `dvc add`."
))

    def _check_missing_requirements(self) -> None:
        """Flag repos with no dependency file at all.

        Note: this only checks that a file exists; it does not verify that
        the versions inside it are actually pinned.
        """
        has_requirements = (
            (self.repo_root / "requirements.txt").exists() or
            (self.repo_root / "pyproject.toml").exists() or
            (self.repo_root / "environment.yml").exists()
        )
        if not has_requirements:
            self.issues.append(DebtIssue(
                severity="HIGH",
                category="Configuration Debt",
                file=str(self.repo_root),
                line=0,
                description="No dependency file found (requirements.txt / pyproject.toml / environment.yml).",
                suggestion="Pin all dependencies with exact versions for reproducibility."
            ))

    def _check_pipeline_jungle(self) -> None:
        """Flag directories with suspicious numbers of similarly-named scripts."""
        py_files = list(self.repo_root.rglob("*.py"))
        # Count per (directory, base name). Counting by name alone would flag
        # files such as __init__.py or utils.py that legitimately repeat
        # across packages.
        name_counts: dict[tuple[str, str], int] = {}
        for f in py_files:
            # Strip version suffixes: _v2, _new, _fixed, _old, _backup, _final
            clean = re.sub(r'(_v\d+|_new|_fixed|_old|_backup|_final)$', '', f.stem)
            key = (str(f.parent), clean)
            name_counts[key] = name_counts.get(key, 0) + 1
        for (directory, base_name), count in name_counts.items():
            if count >= 3:
                self.issues.append(DebtIssue(
                    severity="MEDIUM",
                    category="Pipeline Jungle",
                    file=directory,  # the directory that actually holds the variants
                    line=0,
                    description=f"'{base_name}' has {count} versioned variants. Possible pipeline jungle.",
                    suggestion="Consolidate into a single script with proper version control via git."
                ))

    def print_report(self) -> None:
        issues = self.scan()
        print("\n=== ML Technical Debt Report ===")
        print(f"Repository: {self.repo_root}")
        print(f"Issues found: {len(issues)}")
        print(f" CRITICAL: {sum(1 for i in issues if i.severity == 'CRITICAL')}")
        print(f" HIGH: {sum(1 for i in issues if i.severity == 'HIGH')}")
        print(f" MEDIUM: {sum(1 for i in issues if i.severity == 'MEDIUM')}")
        print(f" LOW: {sum(1 for i in issues if i.severity == 'LOW')}")
        print()
        for issue in issues:
            print(f"[{issue.severity}] {issue.category}")
            print(f" File: {issue.file}:{issue.line}")
            print(f" Issue: {issue.description}")
            print(f" Fix: {issue.suggestion}")
            print()


# Usage:
# scanner = MLDebtScanner("/path/to/ml_repo")
# scanner.print_report()
Remediation Strategies
Priority 1: Reproducibility First
If you cannot reproduce a training run, you cannot debug failures, cannot audit model behavior, and cannot safely retrain. Reproducibility is the prerequisite for everything else.
Minimum viable reproducibility:
# Every training run should be reproducible with exactly these steps:
git checkout <commit_hash> # exact code version
dvc pull # exact data version
conda env create -f environment.yml # exact library versions
python train.py --config configs/run_847.yaml # exact hyperparameters
# Output: model artifact with the same weights (within float precision), provided random seeds are fixed
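One way to make those four inputs auditable is to write them into a small manifest stored next to each model artifact. The sketch below is illustrative only: `config_fingerprint`, `build_run_manifest`, `seeded_rng`, and the manifest field names are hypothetical, and the commit hash and data hash are passed in as plain strings rather than read from git or DVC.

```python
import hashlib
import json
import random


def config_fingerprint(config: dict) -> str:
    """Hash a config dict so that key order does not affect the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_run_manifest(commit: str, data_hash: str, config: dict, seed: int) -> dict:
    """Collect everything needed to re-run a training job (hypothetical layout)."""
    return {
        "code_commit": commit,
        "data_hash": data_hash,
        "config_sha256": config_fingerprint(config),
        "seed": seed,
    }


def seeded_rng(seed: int) -> random.Random:
    """Return a dedicated RNG so that shuffles and splits repeat exactly."""
    return random.Random(seed)


if __name__ == "__main__":
    # Placeholder values; a real job would fill these in from git and DVC.
    manifest = build_run_manifest(
        commit="<commit_hash>",
        data_hash="<data_hash>",
        config={"learning_rate": 0.05, "max_depth": 6},
        seed=42,
    )
    print(json.dumps(manifest, indent=2))
```

Recording the seed alongside the code, data, and config versions matters because identical inputs can still yield different weights when shuffling or initialization is unseeded.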
Priority 2: Modular, Tested Pipelines
Replace pipeline jungles with proper pipeline frameworks. Every step is a named node in a DAG with explicit inputs and outputs:
# DVC pipeline definition: dvc.yaml
# This replaces a folder of ambiguously named scripts
stages:
  ingest:
    cmd: python src/data/ingest.py
    deps:
      - src/data/ingest.py
      - data/raw/transactions_2024.csv
    outs:
      - data/processed/transactions_clean.parquet
  featurize:
    cmd: python src/features/build_features.py
    deps:
      - src/features/build_features.py
      - data/processed/transactions_clean.parquet
    outs:
      - data/features/feature_matrix.parquet
      - models/feature_pipeline.pkl
  train:
    cmd: python src/models/train.py --config configs/train.yaml
    deps:
      - src/models/train.py
      - data/features/feature_matrix.parquet
      - configs/train.yaml
    outs:
      - models/fraud_detector.pkl
    metrics:
      - reports/metrics.json
Priority 3: Schema Contracts
Every data dependency must have an explicit schema contract that is validated at runtime:
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema, Check

# Declare the schema as code - checked in to version control
transaction_schema = DataFrameSchema({
    "transaction_id": Column(str, nullable=False, unique=True),
    "user_id": Column(str, nullable=False),
    "amount": Column(float, Check.greater_than(0), nullable=False),
    "merchant_id": Column(str, nullable=False),
    "timestamp": Column(pa.DateTime, nullable=False),
    # nullable - labels may arrive later. A plain NumPy bool column cannot hold
    # nulls, so a nullable boolean dtype is needed if missing labels occur.
    "is_fraud": Column(bool, nullable=True),
}, name="transactions_v2")


def load_and_validate_transactions(path: str) -> pd.DataFrame:
    df = pd.read_parquet(path)
    validated_df = transaction_schema.validate(df)  # raises SchemaError on violation
    return validated_df
Priority 4: Feature Stores for Consistency
The most effective solution to training-serving skew and data dependency chaos is a feature store - a centralized system that computes features once and serves them consistently to both training and inference:
# Feast feature store: define features once, use everywhere
from datetime import timedelta

from feast import FeatureStore, Entity, FeatureView, Field
from feast.types import Float64, Int64, String

user = Entity(name="user", join_keys=["user_id"])

transaction_features = FeatureView(
    name="user_transaction_features",
    entities=[user],
    ttl=timedelta(days=7),
    schema=[
        Field(name="tx_count_7d", dtype=Int64),
        Field(name="tx_amount_mean_7d", dtype=Float64),
        Field(name="tx_amount_max_7d", dtype=Float64),
        Field(name="distinct_merchants_7d", dtype=Int64),
    ],
    # NOTE: a real FeatureView also needs a `source=` argument pointing at a
    # batch data source; it is omitted here for brevity.
)

# Training: retrieve historical features for any point in time
store = FeatureStore(".")
training_df = store.get_historical_features(
    entity_df=entity_df,  # user_id + event_timestamp
    features=["user_transaction_features:tx_count_7d",
              "user_transaction_features:tx_amount_mean_7d"],
).to_df()

# Inference: retrieve the exact same features in real time
online_features = store.get_online_features(
    features=["user_transaction_features:tx_count_7d",
              "user_transaction_features:tx_amount_mean_7d"],
    entity_rows=[{"user_id": "usr_12345"}],
).to_dict()

# Training and serving use identical feature definitions.
# Skew caused by divergent feature logic is designed out, as long as both
# paths read from the store.
Production Engineering Notes
On the "just get it working" trap: ML teams under deadline pressure regularly make the choices that create six-month backlogs: hardcoding a path "just for now," skipping data validation because "we know the data is clean," running experiments without tracking because "I'll remember which one was best." Every one of these shortcuts is a debt you will repay eventually - with interest. As a rough rule of thumb, the time you save today costs 3-5x on the other side.
On the glue code ratio: measure it. If 80%+ of your codebase is data wrangling scripts, pipeline glue, and format conversion, that is not a sign of ML sophistication - it is a warning sign. Good MLOps infrastructure should reduce glue code by providing reusable, principled abstractions for the most common tasks.
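A rough way to measure the ratio is to count code lines inside and outside the directories that hold core ML code. The sketch below assumes a repository layout in which model and training code live in directories named `models` and `training`; the `ML_DIRS` set and both function names are hypothetical and need adjusting per repository.

```python
from pathlib import Path

# Assumption: directories whose code counts as "core ML" (model definition,
# training loop, loss). Everything else is treated as glue.
ML_DIRS = {"models", "training"}


def count_code_lines(path: Path) -> int:
    """Count non-blank, non-comment lines in one Python file."""
    total = 0
    for raw in path.read_text(encoding="utf-8", errors="ignore").splitlines():
        stripped = raw.strip()
        if stripped and not stripped.startswith("#"):
            total += 1
    return total


def glue_code_ratio(repo_root: str, ml_dirs: set[str] = ML_DIRS) -> float:
    """Return the fraction of Python code lines that live outside ml_dirs."""
    root = Path(repo_root)
    ml_lines = 0
    glue_lines = 0
    for py_file in root.rglob("*.py"):
        # Directory components of the path relative to the repo root.
        parent_parts = py_file.relative_to(root).parts[:-1]
        n = count_code_lines(py_file)
        if any(part in ml_dirs for part in parent_parts):
            ml_lines += n
        else:
            glue_lines += n
    total = ml_lines + glue_lines
    return glue_lines / total if total else 0.0
```

Line counts are a crude proxy, so the trend over time is more informative than any single reading.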
On dead code in ML: the "I might need it later" instinct is wrong. Git preserves everything. Delete dead experimental code aggressively. A codebase that's 20% dead code is a codebase where nobody can confidently distinguish what's running from what's historical. Branches exist for exploration. The main pipeline should contain only code that is actively in use.
On configuration documentation: every non-trivial constant in your ML codebase should have a comment that answers: (1) what is this value, (2) why was it set to this specific number, (3) when was it last validated, and (4) who to ask if it needs to change.
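The four questions can also be carried by the constant itself rather than by a comment, which makes staleness checkable in CI. This is a minimal sketch: `DocumentedConstant`, its field names, the 180-day default, and the example values (threshold rationale, date, owner) are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DocumentedConstant:
    """A configuration value that carries its own provenance."""
    name: str
    value: float
    what: str             # (1) what the value is
    why: str              # (2) why this specific number was chosen
    last_validated: date  # (3) when it was last checked
    owner: str            # (4) who to ask before changing it

    def is_stale(self, today: date, max_age_days: int = 180) -> bool:
        """True if the value has not been re-validated within max_age_days."""
        return (today - self.last_validated).days > max_age_days


# Illustrative values only; the rationale, date, and owner are made up.
FRAUD_THRESHOLD = DocumentedConstant(
    name="fraud_threshold",
    value=0.73,
    what="Score above which a transaction is sent to manual review",
    why="Chosen on a validation set to balance reviewer load against recall",
    last_validated=date(2024, 1, 15),
    owner="fraud-ml-team",
)
```

A scheduled job could call `is_stale` on every registered constant and open a ticket for the owner when one expires.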
Common Mistakes
:::danger Fitting Transformers Before Splitting Data
One of the most common causes of optimistic model evaluation. If you fit a StandardScaler on your full dataset and then split into train/test, the scaler has learned statistics from the test set. Your model has indirectly seen the test distribution, so your reported performance is better than its actual generalization performance. Always split first, then fit only on the training portion.
:::
:::danger Never Auditing for Hidden Feedback Loops
Teams that have never done a feedback loop audit almost always have at least one. The question to ask for every model in production: "Can our model's predictions change the data that will appear in our next training set?" If the answer is yes, you need a mitigation strategy (holdout groups, randomized evaluation, counterfactual logging). Most teams skip this audit entirely.
:::
:::warning Treating Magic Numbers as Permanent Truths
A threshold of 0.73, a batch size of 347, a minimum sample count of 1500 - these were derived under specific conditions that no longer exist. When the model is retrained on new data, when the feature distribution shifts, when the business requirements change, these numbers need re-evaluation. But because they're buried in code without documentation of their origin, nobody knows that. Treat all ML configuration values as hypotheses that require periodic re-validation, not permanent facts.
:::
:::warning Ignoring Underutilized Features
Every feature in your model has a maintenance cost: a pipeline step to compute it, a schema contract to maintain, a drift monitor to watch it, documentation to keep current. A feature with permutation importance near zero has full maintenance cost and near-zero benefit. Audit feature importance quarterly and remove features that haven't contributed in multiple evaluation cycles. Fewer features means a simpler, more maintainable, more interpretable system.
:::
:::warning Not Deleting Dead Experimental Code
"We might need it later" is the ML equivalent of hoarding. Version control preserves every deleted file forever - you can always get it back. Dead experimental code clutters the codebase, confuses new team members, and makes it harder to understand what's actually running in production. Establish a rule: every experimental branch is either merged to main or deleted within 30 days. No exceptions.
:::
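The first mistake in this list, fitting before splitting, is easiest to see side by side. The sketch below uses only the standard library: `SimpleScaler` and `split` are hypothetical stand-ins for a real transformer and a real splitting utility, and the data is synthetic.

```python
import random
import statistics


class SimpleScaler:
    """Minimal stand-in for a fitted transformer such as a standard scaler."""

    def fit(self, values: list[float]) -> "SimpleScaler":
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values) or 1.0  # avoid division by zero
        return self

    def transform(self, values: list[float]) -> list[float]:
        return [(v - self.mean_) / self.std_ for v in values]


def split(values: list[float], test_fraction: float, seed: int) -> tuple[list[float], list[float]]:
    """Shuffle a copy of the data and cut it into train and test parts."""
    shuffled = values[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


data = [float(x) for x in range(1, 101)]

# Correct order: split first, then fit on the training portion only.
train, test = split(data, test_fraction=0.2, seed=0)
scaler = SimpleScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)  # test data is transformed, never fitted on

# Leaky order, shown only for contrast: the statistics come from all rows,
# including the ones that will later be used for evaluation.
leaky_scaler = SimpleScaler().fit(data)
```

The two scalers hold different means and standard deviations; the gap between them is exactly the information that leaks from the test rows in the second case.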
Interview Q&A
Q1: What is the CACE principle, and why does it matter in ML engineering?
Strong answer: CACE stands for "Changing Anything Changes Everything." It describes the entanglement problem in ML systems: because all features contribute to a trained model's learned representation, no change to any input, preprocessing step, or feature definition is truly isolated.
In software engineering, you can change function A and be confident that function B is unaffected (assuming clean interfaces). In ML, if you improve feature #23 by switching from a mean to a median computation, you change the correlation structure of all features that correlate with feature #23. The model's internal weights - calibrated to the old correlations - are now slightly miscalibrated. You may improve precision at the cost of recall. You may shift the optimal decision threshold. Feature importance rankings may change.
This matters in practice because it means ML changes must be treated differently from software changes. You cannot "patch" a feature like you patch a function. Every change to an ML system's features or preprocessing requires a full retraining cycle with complete evaluation - not a code review and a quick test run.
Q2: What is the glue code anti-pattern in ML systems?
Strong answer: The glue code anti-pattern describes the common situation where only 5–10% of an ML system's codebase is actual ML code (model definition, loss function, training loop), while 90–95% is glue: data loading scripts, format conversion utilities, API clients for data sources, preprocessing pipelines, serving wrappers, and infrastructure configuration.
The problem isn't that this glue exists - it's necessary. The problem is that it's written as ad-hoc scripts rather than principled, reusable, tested infrastructure. The glue in an immature ML system is typically: not version-controlled (or version-controlled but in personal branches), not tested, not documented, not modular (the same preprocessing logic exists in three different places), and deeply brittle (breaks when data schemas change, breaks on new environments, breaks when someone changes a hardcoded path).
Good ML infrastructure reduces glue code by providing principled abstractions: feature stores for feature computation, pipeline frameworks for DAG definition, model registries for artifact management. But in most teams, these are absent, and the glue accumulates until it dominates the system's maintenance burden.
Q3: How would you identify and address hidden feedback loops in an existing ML system?
Strong answer: Start with the label generation audit. For every production model, map the path from prediction to label: when the model makes a prediction, does that prediction affect any subsequent human or system action? Does that action determine what label is eventually assigned to the case?
Common patterns to look for: labels that only exist when the model raised an alert (fraud, content moderation), outcome data that's collected only for cases the model approved (loan decisions, job applications), or recommendation models where the model determines what users see, and engagement metrics for what they saw become the training signal.
Once identified, the standard mitigations are: holdout groups (a random sample of cases that bypass the model's decision to provide unbiased ground truth), counterfactual logging (log the model's decision and the alternative decision, then evaluate both), and offline evaluation on randomly-sampled historical slices that predate the model's deployment.
Long-term, the design principle is: never use a label-generation process that depends on the model's current predictions, unless you explicitly account for this dependence in your training and evaluation methodology.
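The holdout-group mitigation can be sketched as a deterministic assignment on the case ID, with the model's decision logged for every case whether or not it is applied. The function names, the salt, the 2% fraction, and the 0.5 threshold below are illustrative assumptions, not recommendations.

```python
import hashlib


def in_holdout(case_id: str, holdout_fraction: float = 0.02, salt: str = "holdout-v1") -> bool:
    """Deterministically assign a case to the holdout group.

    Hashing the ID (rather than calling random()) keeps the assignment stable
    across services and restarts.
    """
    digest = hashlib.sha256(f"{salt}:{case_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return bucket < holdout_fraction


def decide(case_id: str, model_score: float, threshold: float = 0.5) -> dict:
    """Build the log record for one case."""
    holdout = in_holdout(case_id)
    return {
        "case_id": case_id,
        "model_flagged": model_score >= threshold,  # logged for every case
        "model_decision_applied": not holdout,      # holdout cases bypass the model
        "holdout": holdout,
    }
```

Because holdout cases receive labels through a path the model did not influence, they can serve as the unbiased ground truth for evaluation and retraining.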
Q4: What are the three categories of data dependencies in ML systems, and how do you manage them?
Strong answer: Unstable dependencies, underutilized dependencies, and legacy dependencies.
Unstable dependencies are features computed by upstream pipelines you don't control. If the upstream team changes their computation logic - even to fix a bug - your feature distribution silently changes. Mitigation: declare a schema contract for every feature, including the exact computation logic, and require the upstream team to notify you of any changes to a versioned interface.
Underutilized dependencies are features with near-zero importance that you continue to maintain. Mitigation: run a quarterly feature importance audit using permutation importance (generally more reliable than impurity-based tree importance, which tends to favor high-cardinality features), identify features below a threshold (e.g., < 0.001), and remove them after confirming removal doesn't degrade performance.
Legacy dependencies are features that were included for reasons that no longer apply. These are the hardest to remove because nobody knows what they were for. Mitigation: require documentation of purpose and business context for every feature at the time of addition. Without this documentation, future engineers cannot distinguish "legacy that can be removed" from "essential that looks optional."
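The quarterly audit for underutilized features can be sketched with a hand-rolled permutation importance. This is a simplified illustration using accuracy on a toy classifier; `toy_model`, the synthetic rows, and the feature names are hypothetical, and a production audit would more likely use a library implementation on held-out data.

```python
import random
from typing import Callable, Sequence

Row = Sequence[float]


def accuracy(predict: Callable[[Row], int], rows: list[Row], labels: list[int]) -> float:
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(
    predict: Callable[[Row], int],
    rows: list[Row],
    labels: list[int],
    n_repeats: int = 5,
    seed: int = 0,
) -> list[float]:
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[col] for r in rows]
            rng.shuffle(column)
            permuted = [list(r[:col]) + [v] + list(r[col + 1:]) for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def removal_candidates(importances: list[float], names: list[str], threshold: float = 0.001) -> list[str]:
    """Features whose importance falls below the audit threshold."""
    return [n for n, imp in zip(names, importances) if imp < threshold]


# Toy model: uses only feature 0 and ignores feature 1 entirely.
def toy_model(row: Row) -> int:
    return int(row[0] > 0.5)


rows = [[i / 20, (i * 7 % 20) / 20] for i in range(20)]
labels = [toy_model(r) for r in rows]
scores = permutation_importance(toy_model, rows, labels)
candidates = removal_candidates(scores, ["amount", "noise"])
```

A feature on the candidate list should still be removed in a controlled retraining run, since the audit only says the current model does not rely on it.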
Q5: How do you prevent ML technical debt from accumulating in the first place?
Strong answer: Technical debt is easier to prevent than to pay down. The practices that prevent it:
Reproducibility from day one: require data versioning (DVC), experiment tracking (MLflow), and pinned dependencies before the first training run. These are cheap to add at the start and extremely expensive to retrofit.
Pipeline as code: use a pipeline framework (DVC pipelines, Prefect, Airflow) from the first prototype. Enforce the rule that all data transformation is defined as a named DAG stage with explicit inputs and outputs - never as an ad-hoc script.
Feature documentation requirements: every new feature requires a data dictionary entry before the PR is merged. Include: what the feature measures, who computes it, what pipeline produces it, what the expected distribution is.
Evaluation gates as blockers: build model quality evaluation as a blocking CI step. A pipeline that doesn't enforce quality gates accumulates models that "sort of work" in production.
Dead code policy: every experimental branch is merged or deleted within 30 days. No commented-out code in main pipeline files.
Consumer registry: before serving a model externally, require consumers to register. Treat the output schema as a versioned API contract.
The key insight from the Sculley et al. paper: the cost of ML technical debt is not linear. Each additional shortcut increases the complexity of paying down future shortcuts. Start clean and enforce cleanliness from the beginning - the alternative is the six-month "two-week" project.
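As one concrete instance of these practices, an evaluation gate can be a small function that compares candidate metrics against the current production baseline and returns a non-zero exit code for CI. The sketch assumes higher is better for every metric; the function names and the 0.01 tolerance are illustrative choices.

```python
import sys


def evaluation_gate(candidate: dict, baseline: dict, max_regression: float = 0.01) -> list[str]:
    """Return a list of failures; an empty list means the gate passes.

    Assumes higher is better for every metric listed in the baseline.
    """
    failures = []
    for metric, base_value in baseline.items():
        if metric not in candidate:
            failures.append(f"{metric}: missing from candidate metrics")
            continue
        if candidate[metric] < base_value - max_regression:
            failures.append(
                f"{metric}: {candidate[metric]:.3f} is below baseline {base_value:.3f}"
            )
    return failures


def main(candidate: dict, baseline: dict) -> int:
    """Exit code for CI: 0 when the gate passes, 1 when it blocks the merge."""
    failures = evaluation_gate(candidate, baseline)
    for failure in failures:
        print(f"GATE FAILED - {failure}", file=sys.stderr)
    return 1 if failures else 0
```

In a pipeline, the two dicts would typically be loaded from metrics files written by the training stage and by the last released model.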
Q6: What is the difference between data drift and concept drift, and which is harder to detect?
Strong answer: Data drift (also called covariate shift) is when the distribution of input features P(X) changes. The relationship between features and labels P(Y|X) remains the same - but the inputs you're seeing in production are different from what you trained on. For example, a sudden increase in high-value transactions changes the distribution of your transaction_amount feature.
Concept drift is when the relationship between inputs and labels P(Y|X) changes. The same feature values now mean something different. In January, high_amount AND new_merchant predicts fraud with 72% probability. By June, that pattern predicts fraud with only 31% probability because fraud patterns have evolved.
Concept drift is significantly harder to detect because it requires ground truth labels to measure. You can detect data drift immediately by comparing the production input distribution against a reference using statistical tests (KS test, PSI). But to detect concept drift, you need to compare the model's predictions to actual outcomes - which requires waiting for labels to arrive (which may take days, weeks, or months depending on your domain).
This lag is the dangerous gap. You can have severe concept drift for weeks before enough labels arrive to confirm it statistically. The defense is to monitor leading indicators: prediction score distribution (if the model's confidence distribution shifts, concept drift may be coming), disagreement between the model and a simple rule-based baseline, and upstream feature drift as an early warning signal even before labels are available.
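For the data drift side, the PSI mentioned above can be computed without labels from two samples of a single feature. The sketch below is a simplified version: it assumes equal-width bins over the reference range and a small `eps` floor for empty bins, both of which are implementation choices rather than part of a fixed standard.

```python
import math


def psi(expected: list[float], actual: list[float], n_bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference and a production sample.

    PSI = sum over bins of (a - e) * ln(a / e), where e and a are the share
    of reference and production values falling in each bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature

    def bin_shares(values: list[float]) -> list[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values
        return [max(c / len(values), eps) for c in counts]

    e_shares = bin_shares(expected)
    a_shares = bin_shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_shares, a_shares))
```

Identical samples give a PSI of zero and the value grows as the production distribution moves away from the reference. Alerting thresholds are a per-feature judgment call and are best calibrated against historical week-to-week variation.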
