
Model Cards and Documentation

The Regulator Arrives. Nobody Has Answers.

The email arrived on a Thursday morning. It was from the national financial regulator - a routine inquiry, they said. They wanted documentation for the credit scoring model that had been deciding loan eligibility for 840,000 retail bank customers over the past fourteen months. Specifically, they wanted to know: what data was used to train it, what its error rate was across different demographic groups, who had reviewed and approved it before deployment, what its known failure modes were, and whether the organization had evaluated it for discriminatory outcomes.

The head of data science forwarded the email to the model's primary author, who had left the company eight months earlier. The secondary contact, a senior ML engineer, began digging through the codebase. She found the training script. She found the model artifact in S3. She found a Jupyter notebook from eighteen months ago with some exploratory charts. What she could not find: the version of the training dataset that had actually been used (there were four candidates in the data lake with similar names), the evaluation results broken down by age group or geographic region, any documentation of who had reviewed the model before deployment, or any record of what the model was and was not designed to do.

The legal team spent three weeks reconstructing the answers from git history, database logs, and interviews with former employees. The regulator accepted the retrospective documentation but issued a formal finding: the organization had deployed a high-risk automated decision system without adequate documentation, in potential violation of their anti-discrimination obligations. The model was suspended while a proper evaluation was conducted. The evaluation took eleven weeks and cost significantly more than the original model development.

The cruel irony was that the model itself was probably fine - subsequent analysis found no significant discriminatory outcomes. But "probably fine" is not a legal standard. "We can prove it was evaluated and found to be within acceptable parameters, by this person, on this date, using this methodology" is a legal standard. The gap between those two statements is what model cards exist to close.



Why This Exists

For most of the history of machine learning, models were evaluated in notebooks and the evaluation results lived in those notebooks - if they were recorded at all. The model's behavior was known to the people who built it, and that knowledge was considered sufficient. This worked when ML systems were internal tools used by small teams. It stopped working when ML systems began making consequential decisions about real people at scale.

The problem is multi-dimensional. First, the original model author inevitably moves on - to another team, another project, or another company. Their institutional knowledge leaves with them. Second, models that behave well on aggregate metrics can behave very differently for specific subpopulations, and aggregate metrics don't surface this. Third, as organizations deploy more models, there is no systematic way to understand what each model does, what it's allowed to do, and what it's known to get wrong. Fourth - and increasingly - regulators, auditors, and enterprise procurement processes now require documentation before a model can be used in production.

Model cards address all of these problems by providing a standardized, machine-readable, human-readable artifact that travels with the model and answers the most important questions about it.


Historical Context: The 2019 Paper That Started a Movement

The concept of model cards was formally introduced in a 2019 paper by Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru - most of them researchers at Google at the time. The paper, "Model Cards for Model Reporting," was published at the ACM Conference on Fairness, Accountability, and Transparency (then abbreviated FAT*, now FAccT) in January 2019.

The paper's central observation was that while the ML community had developed rigorous standards for reporting the performance of models in research papers - including training setup, evaluation methodology, and benchmark comparisons - deployed models rarely received the same treatment. A model used in production affecting thousands or millions of people might have less documentation than a model submitted to a research conference affecting only the reviewers.

The analogy the paper drew - and which has since become the standard framing - is to the FDA nutrition label. Before nutrition labels were mandated in 1994, consumers had essentially no way to understand what was in their food. The nutrition label didn't make food healthier. It made the properties of food transparent, so that consumers and regulators could make informed decisions. Model cards are the nutrition label for AI systems.

The paper proposed a structured format for model documentation that included: model details, intended use, factors (relevant demographics, conditions), metrics, evaluation data, training data, quantitative analyses (including disaggregated evaluation), ethical considerations, and caveats and recommendations.

Since 2019, model cards have been adopted by Hugging Face (where a model repository's README.md serves as its model card and writing one is strongly encouraged), Google (which publishes model cards for a number of its models), and numerous other organizations. The EU AI Act (enacted 2024) effectively mandates model card-equivalent documentation for high-risk AI systems. The framework has moved from an academic proposal to a regulatory requirement in five years.


What a Model Card Contains

A complete model card answers ten categories of questions:

| Section | Questions Answered |
|---------|--------------------|
| Model Details | Who built it? When? What version? What type? |
| Intended Use | What is it for? What is it NOT for? Who are the intended users? |
| Training Data | What data? What version? What transformations? What were the data sources? |
| Evaluation Data | What test set? How was it collected? Is it representative? |
| Performance Metrics | What metrics? What are the values? On what evaluation set? |
| Disaggregated Evaluation | How does performance vary across subgroups? Are there disparities? |
| Ethical Considerations | What are the risks? What could go wrong? Who could be harmed? |
| Limitations | What does this model NOT do well? What inputs will it fail on? |
| Caveats | What should users be aware of before using this model? |
| Lineage | What training data version, feature version, code commit, and registry version produced this model? |

The lineage section is worth emphasizing separately - it is the section that is most commonly omitted and most often critical in post-incident analysis. Without lineage, you cannot reproduce the model, understand what changed between versions, or answer the question "was this model trained on the data that included the mislabeled records from March?"
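To make the "mislabeled records from March" question concrete, here is a minimal sketch of how lineage records turn it into a lookup. The `LineageRecord` class, the `models_trained_on` helper, and all dataset names, versions, and commit hashes are hypothetical stand-ins for illustration; a real registry would supply these records.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class LineageRecord:
    """Hypothetical, trimmed-down lineage entry for one registered model version."""
    model_name: str
    model_version: str
    training_data_name: str
    training_data_version: str
    training_code_commit: str


def models_trained_on(
    records: Iterable[LineageRecord], data_name: str, bad_versions: Iterable[str]
) -> List[LineageRecord]:
    """Return the records whose training data is one of the affected dataset versions."""
    bad = set(bad_versions)
    return [
        r for r in records
        if r.training_data_name == data_name and r.training_data_version in bad
    ]


# Illustrative registry contents (all values invented for this example).
records = [
    LineageRecord("credit-scorer", "7", "loan_applications", "2024-02_v1", "a3f91bc"),
    LineageRecord("credit-scorer", "8", "loan_applications", "2024-03_v2", "9d2e4f0"),
    LineageRecord("churn-model", "3", "user_events", "2024-03_v2", "51c7aa2"),
]

# Suppose the March snapshot of loan_applications contained the mislabeled records.
affected = models_trained_on(records, "loan_applications", ["2024-03_v2"])
for r in affected:
    print(f"{r.model_name} v{r.model_version} (commit {r.training_code_commit})")
```

Without the dataset name and version recorded per model version, the same question requires the kind of forensic reconstruction described in the opening story.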


The Model Card Schema in Code

Treating a model card as a formal data structure (rather than an informal document) enables automation - generation, validation, and registry integration.

from dataclasses import dataclass, field, asdict
from typing import Optional, List, Dict
import json
import datetime

@dataclass
class PerformanceMetric:
    name: str                      # e.g., "AUC-ROC"
    value: float
    threshold: float               # What value is considered acceptable
    evaluation_set: str            # e.g., "holdout_2024q4_v3"
    passed: bool


@dataclass
class DisaggregatedResult:
    group_name: str                # e.g., "age_group_18_25"
    group_attribute: str           # e.g., "age_group"
    metric_name: str
    metric_value: float
    group_size: int                # sample size for this group
    disparity_from_overall: float  # signed difference from aggregate metric


@dataclass
class ModelLineage:
    training_data_name: str
    training_data_version: str
    training_data_uri: str
    feature_pipeline_version: str
    training_code_commit: str
    training_code_repo: str
    mlflow_run_id: str
    mlflow_experiment_id: str
    training_date: str
    trainer_identity: str          # e.g., CI job ID, or person who triggered training

# kw_only=True (Python 3.10+) allows required fields to follow fields that have
# defaults, so the sections below can stay grouped by topic. With a plain
# @dataclass this class fails at import time with
# "non-default argument ... follows default argument".
@dataclass(kw_only=True)
class ModelCard:
    # --- Identity ---
    model_name: str
    model_version: str
    model_type: str  # e.g., "gradient_boosted_trees", "transformer", "logistic_regression"
    model_description: str

    # --- Intended Use ---
    intended_use: str
    intended_users: List[str]  # e.g., ["loan_officers", "automated_approval_system"]
    out_of_scope_uses: List[str]

    # --- Data ---
    training_data_description: str
    evaluation_data_description: str
    lineage: ModelLineage

    # --- Performance ---
    primary_metrics: List[PerformanceMetric]
    secondary_metrics: List[PerformanceMetric] = field(default_factory=list)
    disaggregated_results: List[DisaggregatedResult] = field(default_factory=list)

    # --- Risk and Ethics ---
    ethical_considerations: str
    known_limitations: List[str]
    known_failure_modes: List[str]
    caveats: List[str]
    bias_assessment: Optional[str] = None
    fairness_conclusion: Optional[str] = None

    # --- Approval ---
    reviewer_name: Optional[str] = None
    reviewer_role: Optional[str] = None
    review_date: Optional[str] = None
    approved_for_production: bool = False

    # --- Metadata ---
    # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12.
    created_at: str = field(
        default_factory=lambda: datetime.datetime.now(datetime.timezone.utc).isoformat()
    )
    card_version: str = "2.0"

    def to_dict(self) -> dict:
        # asdict() recurses into nested dataclasses (lineage, metrics, results).
        return asdict(self)

    def to_json(self) -> str:
        return json.dumps(self.to_dict(), indent=2)

    def to_markdown(self) -> str:
        # render_model_card_markdown is defined later in this lesson.
        return render_model_card_markdown(self)

    def validate(self) -> tuple[bool, List[str]]:
        """
        Validate that all required fields are populated.
        Returns (valid, list_of_missing_fields).
        """
        missing = []

        if not self.intended_use.strip():
            missing.append("intended_use")
        if not self.out_of_scope_uses:
            missing.append("out_of_scope_uses (required: at least one)")
        if not self.primary_metrics:
            missing.append("primary_metrics (required: at least one)")
        if not self.disaggregated_results:
            missing.append("disaggregated_results (required for production models)")
        if not self.known_limitations:
            missing.append("known_limitations (required: at least one)")
        if not self.ethical_considerations.strip():
            missing.append("ethical_considerations")
        if not self.lineage.training_data_version:
            missing.append("lineage.training_data_version")
        if not self.lineage.training_code_commit:
            missing.append("lineage.training_code_commit")

        return len(missing) == 0, missing
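One consequence of serializing with `asdict` is worth spelling out: nested dataclasses come out as plain dictionaries, so reading a stored card back requires rebuilding the nested types explicitly. The sketch below illustrates that round trip with deliberately small stand-in classes (`MiniMetric`, `MiniLineage`, `MiniCard` and the `from_dict` helper are hypothetical, not part of the schema above), so that it runs on its own.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class MiniMetric:
    name: str
    value: float


@dataclass
class MiniLineage:
    training_data_version: str
    training_code_commit: str


@dataclass
class MiniCard:
    model_name: str
    lineage: MiniLineage
    primary_metrics: List[MiniMetric] = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: dict) -> "MiniCard":
        # asdict() flattened the nested dataclasses into dicts; rebuild them here.
        return cls(
            model_name=data["model_name"],
            lineage=MiniLineage(**data["lineage"]),
            primary_metrics=[MiniMetric(**m) for m in data.get("primary_metrics", [])],
        )


original = MiniCard(
    model_name="credit-scorer",
    lineage=MiniLineage("loans_2024q4_v3", "a3f91bc"),
    primary_metrics=[MiniMetric("AUC-ROC", 0.847)],
)

# Serialize to JSON text, then parse and rebuild the typed object.
as_json = json.dumps(asdict(original), indent=2)
restored = MiniCard.from_dict(json.loads(as_json))

print(restored == original)
```

The same pattern would apply to the full schema: one `from_dict` that reconstructs the lineage, metric, and disaggregated-result objects before calling the card constructor.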

Automated Card Generation from MLflow

The most valuable model cards are the ones that are generated automatically as part of the training pipeline, with minimal manual input required from the data scientist. Manual model cards get written once and then forgotten. Automated model cards get updated with every training run.

import datetime
from typing import List

import mlflow
import pandas as pd
from mlflow.tracking import MlflowClient
from sklearn.metrics import roc_auc_score


def generate_model_card_from_mlflow(
    model_name: str,
    model_version: str,
    mlflow_tracking_uri: str,
    eval_df: pd.DataFrame,
    demographic_column: str,
    label_column: str,
    intended_use: str,
    out_of_scope_uses: List[str],
    known_limitations: List[str],
    known_failure_modes: List[str],
    ethical_considerations: str,
) -> ModelCard:
    """
    Automatically generate a model card by pulling training metadata from MLflow.
    The data scientist provides context (intended use, limitations);
    the system provides facts (metrics, lineage, training data info).
    """
    mlflow.set_tracking_uri(mlflow_tracking_uri)
    client = MlflowClient()

    # Fetch version and run metadata from MLflow
    version_info = client.get_model_version(model_name, model_version)
    run = client.get_run(version_info.run_id)

    params = run.data.params
    metrics = run.data.metrics
    tags = run.data.tags

    # Load the model for evaluation
    model = mlflow.pyfunc.load_model(f"models:/{model_name}/{model_version}")

    # --- Compute primary metrics on eval dataset ---
    features = eval_df.drop(columns=[label_column, demographic_column])
    labels = eval_df[label_column]
    predictions = model.predict(features)

    overall_auc = roc_auc_score(labels, predictions)

    # MLflow params are stored as strings, so convert the threshold once.
    min_auc = float(params.get("min_auc_threshold", 0.82))
    eval_set = params.get("eval_dataset_version", "unknown")

    primary_metrics = [
        PerformanceMetric(
            name="AUC-ROC",
            value=overall_auc,
            threshold=min_auc,
            evaluation_set=eval_set,
            passed=overall_auc >= min_auc,
        )
    ]

    # Pull additional metrics logged during training
    secondary_metrics = []
    for metric_name in ["precision_at_k", "recall_at_k", "f1_score", "log_loss"]:
        if metric_name in metrics:
            secondary_metrics.append(
                PerformanceMetric(
                    name=metric_name,
                    value=metrics[metric_name],
                    threshold=float(params.get(f"threshold_{metric_name}", 0.0)),
                    evaluation_set=eval_set,
                    passed=True,  # We'll let gates handle threshold enforcement
                )
            )

    # --- Disaggregated evaluation ---
    disaggregated_results = []
    for group_value in eval_df[demographic_column].unique():
        mask = eval_df[demographic_column] == group_value
        group_df = eval_df[mask]
        if len(group_df) < 30:
            continue  # Skip groups too small for reliable AUC computation

        group_features = group_df.drop(columns=[label_column, demographic_column])
        group_labels = group_df[label_column]
        group_predictions = model.predict(group_features)

        try:
            group_auc = roc_auc_score(group_labels, group_predictions)
        except ValueError:
            continue  # AUC undefined for single-class groups

        disaggregated_results.append(DisaggregatedResult(
            group_name=f"{demographic_column}={group_value}",
            group_attribute=demographic_column,
            metric_name="AUC-ROC",
            metric_value=group_auc,
            group_size=len(group_df),
            disparity_from_overall=group_auc - overall_auc,
        ))

    # Build fairness conclusion
    if disaggregated_results:
        max_disparity = max(abs(r.disparity_from_overall) for r in disaggregated_results)
        worst_group = max(disaggregated_results, key=lambda r: abs(r.disparity_from_overall))
        if max_disparity > 0.05:
            fairness_conclusion = (
                f"ATTENTION: Maximum demographic disparity is {max_disparity:.4f} "
                f"(threshold: 0.05). Worst-performing group: {worst_group.group_name} "
                f"(AUC={worst_group.metric_value:.4f}). Review before production promotion."
            )
        else:
            fairness_conclusion = (
                f"Fairness criteria met. Maximum demographic disparity: {max_disparity:.4f} "
                f"(threshold: 0.05)."
            )
    else:
        fairness_conclusion = "Disaggregated evaluation could not be completed - insufficient group sizes."

    # --- Build lineage ---
    lineage = ModelLineage(
        training_data_name=params.get("training_dataset_name", "unknown"),
        training_data_version=params.get("training_dataset_version", "unknown"),
        training_data_uri=params.get("training_dataset_uri", "unknown"),
        feature_pipeline_version=params.get("feature_pipeline_version", "unknown"),
        training_code_commit=tags.get("mlflow.source.git.commit", "unknown"),
        training_code_repo=tags.get("mlflow.source.git.repoURL", "unknown"),
        mlflow_run_id=version_info.run_id,
        mlflow_experiment_id=run.info.experiment_id,
        # MLflow reports start_time as epoch milliseconds; store it as ISO 8601 UTC.
        training_date=(
            datetime.datetime.fromtimestamp(
                run.info.start_time / 1000, tz=datetime.timezone.utc
            ).isoformat()
            if run.info.start_time else "unknown"
        ),
        trainer_identity=tags.get("mlflow.user", "unknown"),
    )

    # --- Assemble card ---
    card = ModelCard(
        model_name=model_name,
        model_version=model_version,
        model_type=params.get("model_type", "unspecified"),
        model_description=params.get("model_description", "No description provided."),
        intended_use=intended_use,
        intended_users=params.get("intended_users", "unspecified").split(","),
        out_of_scope_uses=out_of_scope_uses,
        training_data_description=params.get("training_data_description", "See lineage."),
        evaluation_data_description=params.get("eval_data_description", "See lineage."),
        lineage=lineage,
        primary_metrics=primary_metrics,
        secondary_metrics=secondary_metrics,
        disaggregated_results=disaggregated_results,
        ethical_considerations=ethical_considerations,
        known_limitations=known_limitations,
        known_failure_modes=known_failure_modes,
        caveats=params.get("caveats", "").split(";") if params.get("caveats") else [],
        fairness_conclusion=fairness_conclusion,
        bias_assessment=(
            f"Maximum demographic AUC disparity: {max_disparity:.4f}"
            if disaggregated_results else None
        ),
    )

    return card

Rendering a Model Card to Markdown

def render_model_card_markdown(card: ModelCard) -> str:
    """
    Render a ModelCard dataclass to a human-readable Markdown document.
    Suitable for embedding in MLflow, Hugging Face, or storing in a docs repo.
    """
    lines = []

    # Two trailing spaces force a Markdown line break between consecutive fields.
    lines.append(f"# Model Card: {card.model_name} v{card.model_version}")
    lines.append(f"\n**Generated:** {card.created_at}  ")
    lines.append(f"**Card Version:** {card.card_version}  ")
    lines.append(f"**Model Type:** {card.model_type}")

    lines.append("\n---\n")
    lines.append("## Model Description")
    lines.append(card.model_description)

    lines.append("\n## Intended Use")
    lines.append(f"**Primary Use:** {card.intended_use}")
    lines.append(f"\n**Intended Users:** {', '.join(card.intended_users)}")
    lines.append("\n**Out-of-Scope Uses:**")
    for use in card.out_of_scope_uses:
        lines.append(f"- {use}")

    lines.append("\n## Training Data")
    lines.append(card.training_data_description)
    lines.append(f"\n**Dataset Name:** `{card.lineage.training_data_name}`  ")
    lines.append(f"**Dataset Version:** `{card.lineage.training_data_version}`  ")
    lines.append(f"**Dataset URI:** `{card.lineage.training_data_uri}`")

    lines.append("\n## Evaluation Data")
    lines.append(card.evaluation_data_description)

    lines.append("\n## Performance Metrics")
    lines.append("\n### Primary Metrics")
    lines.append("| Metric | Value | Threshold | Evaluation Set | Status |")
    lines.append("|--------|-------|-----------|----------------|--------|")
    for m in card.primary_metrics:
        status = "PASS" if m.passed else "FAIL"
        lines.append(f"| {m.name} | {m.value:.4f} | {m.threshold:.4f} | {m.evaluation_set} | {status} |")

    if card.secondary_metrics:
        lines.append("\n### Secondary Metrics")
        lines.append("| Metric | Value | Evaluation Set |")
        lines.append("|--------|-------|----------------|")
        for m in card.secondary_metrics:
            lines.append(f"| {m.name} | {m.value:.4f} | {m.evaluation_set} |")

    if card.disaggregated_results:
        lines.append("\n## Disaggregated Evaluation")
        lines.append("| Group | Metric | Value | Group Size | Disparity from Overall |")
        lines.append("|-------|--------|-------|------------|------------------------|")
        for r in card.disaggregated_results:
            sign = "+" if r.disparity_from_overall >= 0 else ""
            lines.append(
                f"| {r.group_name} | {r.metric_name} | {r.metric_value:.4f} "
                f"| {r.group_size} | {sign}{r.disparity_from_overall:.4f} |"
            )

    if card.fairness_conclusion:
        lines.append(f"\n**Fairness Assessment:** {card.fairness_conclusion}")

    lines.append("\n## Ethical Considerations")
    lines.append(card.ethical_considerations)

    lines.append("\n## Known Limitations")
    for lim in card.known_limitations:
        lines.append(f"- {lim}")

    lines.append("\n## Known Failure Modes")
    for mode in card.known_failure_modes:
        lines.append(f"- {mode}")

    if card.caveats:
        lines.append("\n## Caveats")
        for c in card.caveats:
            lines.append(f"- {c}")

    lines.append("\n## Model Lineage")
    lines.append("| Field | Value |")
    lines.append("|-------|-------|")
    lines.append(f"| Training Data | `{card.lineage.training_data_name}` v`{card.lineage.training_data_version}` |")
    lines.append(f"| Feature Pipeline | v`{card.lineage.feature_pipeline_version}` |")
    lines.append(f"| Code Commit | `{card.lineage.training_code_commit}` |")
    lines.append(f"| Repository | `{card.lineage.training_code_repo}` |")
    lines.append(f"| MLflow Run | `{card.lineage.mlflow_run_id}` |")
    lines.append(f"| Trained By | `{card.lineage.trainer_identity}` |")
    lines.append(f"| Training Date | `{card.lineage.training_date}` |")

    if card.reviewer_name:
        lines.append("\n## Approval")
        lines.append(f"**Reviewer:** {card.reviewer_name} ({card.reviewer_role})  ")
        lines.append(f"**Review Date:** {card.review_date}  ")
        status = "APPROVED" if card.approved_for_production else "PENDING / REJECTED"
        lines.append(f"**Status:** {status}")

    return "\n".join(lines)

The Architecture: Model Card as Documentation Hub

The model card is not an isolated document - it is the hub that connects all the artifacts produced during the ML lifecycle. Understanding this relationship makes it clear why automation is possible and why it is valuable.


Storing Model Cards in MLflow

Once generated, the model card should be stored as an artifact attached to the model version in the registry. This keeps it co-located with the model and queryable through the same interface.

import os
import tempfile


def attach_model_card_to_registry(
    card: ModelCard,
    model_name: str,
    model_version: str,
    mlflow_tracking_uri: str,
):
    """
    Attach a generated model card to an MLflow model version.
    Stores as both JSON (machine-readable) and Markdown (human-readable).
    Also logs the card's validation status as model version tags.
    """
    mlflow.set_tracking_uri(mlflow_tracking_uri)
    client = MlflowClient()

    # Validate the card first
    valid, missing_fields = card.validate()

    # Write card files to a temp directory and log as artifacts
    with tempfile.TemporaryDirectory() as tmpdir:
        # JSON version
        json_path = os.path.join(tmpdir, "model_card.json")
        with open(json_path, "w") as f:
            f.write(card.to_json())

        # Markdown version
        md_path = os.path.join(tmpdir, "model_card.md")
        with open(md_path, "w") as f:
            f.write(card.to_markdown())

        # Log artifacts to the run that produced this model version. Going through
        # the client (instead of re-opening the run with mlflow.start_run) also
        # works when that run is still active, e.g. inside the training pipeline.
        version_info = client.get_model_version(model_name, model_version)
        client.log_artifact(version_info.run_id, json_path, artifact_path="model_card")
        client.log_artifact(version_info.run_id, md_path, artifact_path="model_card")

    # Tag the model version with card status
    client.set_model_version_tag(model_name, model_version, "model_card.present", "true")
    client.set_model_version_tag(model_name, model_version, "model_card.valid", str(valid))
    client.set_model_version_tag(model_name, model_version, "model_card.generated_at", card.created_at)

    if not valid:
        client.set_model_version_tag(
            model_name, model_version,
            "model_card.missing_fields", ", ".join(missing_fields)
        )

    return valid, missing_fields


def fail_promotion_if_card_incomplete(model_name: str, model_version: str, client: MlflowClient):
    """
    Used as a promotion gate: refuse to promote if model card is absent or incomplete.
    """
    tags = client.get_model_version(model_name, model_version).tags

    if tags.get("model_card.present") != "true":
        raise ValueError(f"Model {model_name} v{model_version} has no model card. Promotion blocked.")

    # str(True) was stored above, so the tag value is "True" (capitalized).
    if tags.get("model_card.valid") != "True":
        missing = tags.get("model_card.missing_fields", "unknown")
        raise ValueError(
            f"Model card is incomplete. Missing fields: {missing}. Promotion blocked."
        )

CI/CD Integration: Generating Cards in the Training Pipeline

The model card should be generated and validated as part of every training run, not as a post-hoc step. Here is how to integrate card generation into a training pipeline:

# In your training pipeline (e.g., a GitLab CI job or Airflow task)

def training_pipeline(config: dict):
    """
    Full training pipeline with automatic model card generation.
    The card is generated, validated, and attached to the registry entry
    before the pipeline exits. If the card is invalid, the pipeline fails.
    """
    with mlflow.start_run(run_name=config["run_name"]) as run:
        # Log all configuration
        mlflow.log_params(config)

        # --- Train ---
        model, train_metrics = train_model(config)
        mlflow.log_metrics(train_metrics)

        # --- Evaluate ---
        eval_df = load_eval_dataset(config["eval_dataset_version"])
        eval_metrics = evaluate_model(model, eval_df)
        mlflow.log_metrics(eval_metrics)

        # Log evaluation dataset version for model card lineage
        mlflow.log_param("eval_dataset_version", config["eval_dataset_version"])
        mlflow.log_param("training_dataset_version", config["training_dataset_version"])

        # --- Register ---
        model_uri = mlflow.sklearn.log_model(model, "model").model_uri
        model_version = mlflow.register_model(model_uri, config["model_name"]).version

        # --- Generate model card ---
        card = generate_model_card_from_mlflow(
            model_name=config["model_name"],
            model_version=model_version,
            mlflow_tracking_uri=mlflow.get_tracking_uri(),
            eval_df=eval_df,
            demographic_column=config["demographic_column"],
            label_column=config["label_column"],
            intended_use=config["intended_use"],  # Must be in config
            out_of_scope_uses=config["out_of_scope_uses"],
            known_limitations=config["known_limitations"],
            known_failure_modes=config["known_failure_modes"],
            ethical_considerations=config["ethical_considerations"],
        )

        # --- Attach and validate ---
        valid, missing = attach_model_card_to_registry(
            card,
            config["model_name"],
            model_version,
            mlflow.get_tracking_uri(),
        )

        if not valid:
            # Fail the pipeline if card is incomplete
            # This forces the data scientist to provide missing context
            raise ValueError(
                f"Training pipeline failed: model card incomplete.\n"
                f"Missing required fields: {', '.join(missing)}\n"
                f"Add these fields to the training config before re-running."
            )

        print(f"Model card generated and validated. Model version {model_version} registered.")
        return model_version
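Because the pipeline reads the human-supplied context (`config["intended_use"]` and friends) only after training has finished, a missing key would surface late as a `KeyError`. One option is a cheap pre-flight check before any compute is spent. The sketch below is an assumption about how such a check might look: `CARD_CONTEXT_KEYS`, `missing_card_context`, and every config value are hypothetical, chosen only to mirror the keys the pipeline above reads.

```python
from typing import Dict, List

# Keys the pipeline reads from config when building the card (illustrative list).
CARD_CONTEXT_KEYS = (
    "intended_use",
    "out_of_scope_uses",
    "known_limitations",
    "known_failure_modes",
    "ethical_considerations",
    "demographic_column",
    "label_column",
)


def _is_blank(value: object) -> bool:
    """Treat None, whitespace-only strings, and empty lists/tuples as missing."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, (list, tuple)):
        return len(value) == 0
    return False


def missing_card_context(config: Dict[str, object]) -> List[str]:
    """Return the card-context keys that are absent or empty in the config."""
    return [key for key in CARD_CONTEXT_KEYS if _is_blank(config.get(key))]


# Hypothetical training config; all values are invented for illustration.
config = {
    "run_name": "credit-scorer-2024q4",
    "model_name": "credit-scorer",
    "training_dataset_version": "loans_2024q4_v3",
    "eval_dataset_version": "holdout_2024q4_v3",
    "demographic_column": "age_group",
    "label_column": "defaulted",
    "intended_use": "Score retail loan applications for default risk.",
    "out_of_scope_uses": ["Business lending decisions"],
    "known_limitations": ["Thin-file applicants are under-represented"],
    "known_failure_modes": ["Degrades when income fields are missing"],
    "ethical_considerations": "Decisions affect access to credit; see fairness review.",
}

incomplete = dict(config, known_limitations=[])
del incomplete["intended_use"]

print(missing_card_context(config))
print(missing_card_context(incomplete))
```

Calling a check like this at the top of `training_pipeline` would turn a late failure into an immediate one, at the cost of keeping the key list in sync with the card schema.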

Model Card Templates by Context

Different organizational contexts require different levels of detail. A startup building an internal recommendation tool has different documentation needs than a bank's credit scoring system.

Minimal Template (Startup / Internal Tool)

The minimum viable model card captures lineage and basic intent. It is better than nothing, even if it is not comprehensive.

# model_card_minimal.yaml
model_name: "product-recommender"
model_version: "12"
model_type: "collaborative_filtering"
intended_use: "Recommend products to logged-in users on the home feed"
out_of_scope:
  - "Do not use for new user cold start (model has no training signal for users < 5 sessions)"
primary_metric:
  name: "NDCG@10"
  value: 0.423
known_limitations:
  - "Popularity bias: popular items are systematically over-recommended"
lineage:
  training_data_version: "user_events_2024q4_v2"
  code_commit: "a3f91bc"

Comprehensive Template (Enterprise)

Adds disaggregated evaluation, bias assessment, approval workflow, and full lineage.

The Python ModelCard dataclass defined earlier covers this template fully. Generate it programmatically and render to Markdown for storage.

Regulatory Template (Fintech / Healthcare)

For high-risk systems under the EU AI Act or financial services regulation, the card must additionally include:

  • The legal basis for using automated decision-making
  • Evidence of human oversight mechanisms
  • The appeal/redress process for affected individuals
  • Results of the conformity assessment (for EU AI Act high-risk systems)
  • Name and contact of the responsible person / DPO

@dataclass
class RegulatoryModelCard(ModelCard):
    """
    Extended model card for high-risk AI systems under EU AI Act / financial services regulation.
    """
    # EU AI Act fields
    risk_category: str = ""                    # e.g., "HIGH_RISK" under EU AI Act Annex III
    legal_basis: str = ""                      # e.g., "Legitimate interest under GDPR Art. 6(1)(f)"
    human_oversight_mechanism: str = ""        # e.g., "All rejections reviewed by human officer"
    appeal_process: str = ""                   # e.g., "Customer can appeal via branch or online portal"
    responsible_person: str = ""               # Name and contact of accountable officer
    dpo_contact: str = ""
    conformity_assessment_reference: str = ""  # Reference to formal conformity assessment

    # Financial services specific
    model_risk_tier: str = ""                  # e.g., "TIER_1" per internal MRM framework
    mrm_review_date: str = ""
    mrm_reviewer: str = ""
    regulatory_submission_reference: str = ""  # Reference number if submitted to regulator
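Since every regulatory field above defaults to an empty string, a card can be constructed with none of them filled in. A small completeness check on the serialized card is one way to catch that. This is only a sketch: which fields are actually mandatory depends on the regulator and on legal advice, so the `REQUIRED_REGULATORY_FIELDS` tuple and the `missing_regulatory_fields` helper are illustrative assumptions, not a statement of what any regulation requires.

```python
from typing import Dict, List

# Assumption: an illustrative choice of mandatory fields, not legal guidance.
REQUIRED_REGULATORY_FIELDS = (
    "risk_category",
    "legal_basis",
    "human_oversight_mechanism",
    "appeal_process",
    "responsible_person",
    "conformity_assessment_reference",
)


def missing_regulatory_fields(card: Dict[str, object]) -> List[str]:
    """Return the regulatory fields that are absent or blank in a serialized card."""
    return [
        name for name in REQUIRED_REGULATORY_FIELDS
        if not str(card.get(name) or "").strip()
    ]


# A partially completed card, as it might look after to_dict().
draft = {
    "risk_category": "HIGH_RISK",
    "legal_basis": "Legitimate interest under GDPR Art. 6(1)(f)",
    "human_oversight_mechanism": "All rejections reviewed by human officer",
    "appeal_process": "   ",       # whitespace only, treated as missing
    "responsible_person": "",      # left at the dataclass default
}

print(missing_regulatory_fields(draft))
```

The returned list can feed the same promotion gate that checks the base card, so a regulatory card with blank fields is blocked in the same way.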

Datasheets for Datasets

Model cards document the model. Datasheets for Datasets - introduced by Gebru et al. in 2018 - document the training and evaluation data. They are complementary. A model card without a datasheet for its training dataset is incomplete: you cannot understand the model's behavior without understanding the data it learned from.

A dataset datasheet answers:

@dataclass
class DatasetDatasheet:
    dataset_name: str
    dataset_version: str
    motivation: str            # Why was this dataset created?
    composition: str           # What does it consist of? Instances, features, labels?
    collection_process: str    # How was data collected?
    preprocessing: str         # What transformations were applied?
    uses: str                  # What is this dataset for?
    distribution: str          # Under what terms is it distributed?
    maintenance: str           # Who maintains it and how is it updated?
    known_issues: str          # Known biases, errors, or limitations?

    # Statistics
    num_instances: int
    num_features: int
    label_distribution: Dict[str, float]   # e.g., {"positive": 0.12, "negative": 0.88}
    demographic_coverage: Dict[str, List]  # Which demographic groups are represented?
    time_range: str                        # What time period does the data cover?
    geographic_coverage: str               # What geographies?

Hugging Face Model Cards

For models shared publicly or within an organization using Hugging Face Hub, model cards are written in a specific YAML front-matter format in a README.md file. The format is standardized and machine-readable.

---
language: en
license: apache-2.0
tags:
- text-classification
- credit-risk
datasets:
- my-org/credit-applications-2024
metrics:
- auc
model-index:
- name: CreditRiskClassifier
  results:
  - task:
      type: text-classification
    dataset:
      name: credit-applications-holdout-q4-2024
      type: my-org/credit-applications-holdout
    metrics:
    - type: auc
      value: 0.847
---

# CreditRiskClassifier

## Model Description
[Description here]

## Intended Uses & Limitations
[Intended use, out-of-scope uses, limitations]

## Training and Evaluation Data
[Data description + version]

## Ethical Considerations
[Fairness, bias, risks]

Production Engineering Notes

Versioned model cards: Model cards must be versioned alongside model versions. When you retrain a model, generate a new model card for the new version. Never update an existing model card for an older version - it is an immutable record of what was known at the time of that version's deployment.
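One simple way to enforce the "never update an existing card" rule at the storage layer is to write each card with exclusive-create semantics, so a second write for the same version fails loudly. The sketch below assumes a plain directory layout of `<root>/<model_name>/<model_version>/model_card.json`; the `write_card_once` helper and that layout are hypothetical, and an object store or registry would need its own equivalent (for example, a conditional put).

```python
import json
import tempfile
from pathlib import Path


def write_card_once(root: Path, model_name: str, model_version: str, card: dict) -> Path:
    """Write a card for one model version, refusing to overwrite an existing card."""
    target = root / model_name / model_version / "model_card.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    # Mode "x" is exclusive creation: it raises FileExistsError if the file exists.
    with open(target, "x", encoding="utf-8") as f:
        json.dump(card, f, indent=2)
    return target


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_card_once(root, "credit-scorer", "7", {"model_version": "7"})

    # Attempting to edit the card for an already-documented version is rejected.
    try:
        write_card_once(root, "credit-scorer", "7", {"model_version": "7", "edited": True})
        overwrite_blocked = False
    except FileExistsError:
        overwrite_blocked = True

    # A retrained model gets a new card under its own version instead.
    write_card_once(root, "credit-scorer", "8", {"model_version": "8"})
    versions = sorted(p.name for p in (root / "credit-scorer").iterdir())

print(overwrite_blocked, versions)
```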

Machine-readable format is essential: Storing the card as a pure Markdown document is better than nothing, but storing it as structured JSON enables querying - "which of our production models have not had a fairness evaluation in the last six months?" - and integration with compliance tooling.
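As a sketch of the kind of query structured cards make possible, the function below flags production models whose card either has no disaggregated results or is older than a cutoff. It uses the card's `created_at` as a proxy for the evaluation date, which is an assumption (the schema in this lesson has no separate evaluation timestamp); the `needs_fairness_review` helper and the sample cards are hypothetical.

```python
import datetime
from typing import Dict, List


def needs_fairness_review(
    cards: List[Dict], now: datetime.datetime, max_age_days: int = 180
) -> List[str]:
    """Return 'name vVersion' for production cards with a missing or stale fairness evaluation."""
    cutoff = now - datetime.timedelta(days=max_age_days)
    flagged = []
    for card in cards:
        if not card.get("approved_for_production"):
            continue  # Only production models are in scope for this query.
        created = datetime.datetime.fromisoformat(card["created_at"])
        if created.tzinfo is None:
            # Assumption: timestamps without an offset were recorded in UTC.
            created = created.replace(tzinfo=datetime.timezone.utc)
        has_eval = bool(card.get("disaggregated_results"))
        if not has_eval or created < cutoff:
            flagged.append(f'{card["model_name"]} v{card["model_version"]}')
    return flagged


# Invented sample cards, shaped like the JSON produced by ModelCard.to_dict().
cards = [
    {"model_name": "credit-scorer", "model_version": "8", "approved_for_production": True,
     "created_at": "2024-11-01T00:00:00+00:00",
     "disaggregated_results": [{"group_name": "age_group=18_25"}]},
    {"model_name": "churn-model", "model_version": "3", "approved_for_production": True,
     "created_at": "2024-01-15T00:00:00",
     "disaggregated_results": [{"group_name": "region=north"}]},
    {"model_name": "recsys", "model_version": "12", "approved_for_production": True,
     "created_at": "2024-10-20T00:00:00+00:00", "disaggregated_results": []},
    {"model_name": "sandbox", "model_version": "1", "approved_for_production": False,
     "created_at": "2023-01-01T00:00:00+00:00", "disaggregated_results": []},
]

now = datetime.datetime(2024, 12, 1, tzinfo=datetime.timezone.utc)
flagged = needs_fairness_review(cards, now)
print(flagged)
```

The same question asked of Markdown-only cards would require parsing free text, which is why the JSON form is worth storing alongside the rendered document.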

The limitations section is the hardest to write: Data scientists often resist writing known limitations because it feels like an admission of failure. Frame it differently: documenting limitations is how you prevent misuse. A model with no documented limitations is a model that gets used for everything, including things it is terrible at.

Card completeness as a promotion gate: The most effective mechanism for ensuring model cards exist is to make the promotion pipeline refuse to promote a model without a valid card. If you make the card optional, it will be skipped under time pressure. If you make it required, it will get written.


Common Mistakes

:::danger Never Treat Model Cards as a Post-Hoc Documentation Task

Writing the model card weeks or months after the model is in production - reconstructed from memory, git history, and guesswork - is much better than not having one, but it is a fundamentally worse artifact than a card written during the training run when all information is immediately available. The automation approach described here eliminates this problem: the card is generated by the training pipeline, not written by hand afterward.

:::

:::danger Do Not Use Aggregate Metrics Alone

Reporting "accuracy = 92%" without disaggregated evaluation is actively misleading for decision-making systems. A model can achieve 92% overall accuracy while performing at 70% accuracy for a specific demographic group that is 8% of the population. Aggregate metrics hide this. Disaggregated evaluation is not optional for any model making consequential decisions about people.

:::
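The arithmetic behind that warning can be checked directly. The sketch below works through the 92% / 70% / 8% figures for an illustrative population of 10,000 decisions; the population size is an assumption chosen only to keep the counts whole.

```python
# Illustrative population; the shares and accuracies come from the text above.
population = 10_000
minority_n = 800                      # 8% of the population
majority_n = population - minority_n  # 9,200

total_correct = 9_200                 # 92% overall accuracy
minority_correct = 560                # 70% accuracy within the 8% group
majority_correct = total_correct - minority_correct  # 8,640

overall_accuracy = total_correct / population
minority_accuracy = minority_correct / minority_n
majority_accuracy = majority_correct / majority_n

# Error rates show how differently the two groups experience the model.
minority_error = 1 - minority_accuracy
majority_error = 1 - majority_accuracy

print(f"overall:  {overall_accuracy:.1%}")
print(f"minority: {minority_accuracy:.1%} (error {minority_error:.1%})")
print(f"majority: {majority_accuracy:.1%} (error {majority_error:.1%})")
print(f"error ratio: {minority_error / majority_error:.1f}x")
```

Under these numbers the majority group sees roughly 94% accuracy, so the headline 92% sits close to the majority's experience while the smaller group's error rate is about five times higher.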

:::warning Lineage Information Goes Stale

Copying a model card from a previous version and updating only the performance numbers - without updating the lineage fields - is a common mistake. The lineage fields (training data version, code commit, feature pipeline version) are the most important fields for reproducibility and post-incident analysis. They must be populated automatically from the training run, not manually maintained.

:::

:::warning Intended Use Must Be Specific

"Classify text" is not an intended use statement. "Classify customer support tickets into one of 12 predefined categories for routing to the correct support queue, for logged-in enterprise customers in English-language markets" is an intended use statement. Specificity is what makes the out-of-scope uses section meaningful and what enables reviewers to identify misuse.

:::


Interview Q&A

Q: What is a model card and why does it matter?

A: A model card is a structured document that accompanies a machine learning model and describes its intended use, training data, evaluation results, performance across demographic groups, known limitations, and ethical considerations. The concept was introduced by Mitchell et al. at Google in a 2019 paper and has since become a de facto industry standard: Hugging Face builds model cards into every model repository and strongly encourages authors to fill them in, and the technical documentation that the EU AI Act requires for high-risk AI systems covers much of the same ground. Model cards matter for three reasons. First, they preserve institutional knowledge: when the model author leaves, the card preserves what they knew about the model's behavior. Second, they prevent misuse: a card with explicit out-of-scope use statements makes it harder to repurpose a model for tasks it was not designed for. Third, they enable accountability: when a model produces a harmful outcome, the card provides the documented evidence of what the model was supposed to do, what it was evaluated on, and what its known failure modes were.

Q: How do you automate model documentation?

A: Most of the information in a model card is already being generated by the training pipeline - metrics, parameters, training data references, code commit hashes - it just needs to be collected and structured. The approach is to build a model card generator that reads from MLflow (or your tracking system of choice) to pull objective information automatically: training run ID, parameters, logged metrics, dataset versions from logged parameters, code commit from run tags. The data scientist provides the context that cannot be automated: intended use, known limitations, ethical considerations. These are supplied as parameters to the training config, so they are logged to the tracking system and available at card generation time. The card is then generated at the end of every training run and attached to the model version as an artifact. A promotion gate checks for card presence and completeness before allowing the model to move to Staging. This removes model cards from the category of "things we should do but don't" and puts them in the category of "things that happen automatically."
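
The collection step can be sketched as a pure function over the three dictionaries a tracking system typically exposes for a run. The `card.*` parameter prefix, the `dataset_version` key, and the function name are assumptions made for this example. `mlflow.source.git.commit` is the tag MLflow normally sets when a run starts from a git checkout, but confirm that your own runs carry it.

```python
import json

# Human-supplied context, assumed to be logged as params under a "card." prefix.
HUMAN_KEYS = ("intended_use", "limitations", "ethical_considerations")


def build_model_card(run_id: str, params: dict, metrics: dict, tags: dict) -> dict:
    """Assemble a model card from values the tracking system already holds."""
    missing = [k for k in HUMAN_KEYS if not params.get(f"card.{k}")]
    if missing:
        raise KeyError(f"training config did not supply: {missing}")
    hyperparameters = {
        k: v
        for k, v in params.items()
        if not k.startswith("card.") and k != "dataset_version"
    }
    return {
        "run_id": run_id,
        "code_commit": tags.get("mlflow.source.git.commit", "unknown"),
        "training_data_version": params.get("dataset_version", "unknown"),
        "metrics": dict(metrics),
        "hyperparameters": hyperparameters,
        "intended_use": params["card.intended_use"],
        # Params are stored as strings, so the list is assumed to be JSON-encoded.
        "limitations": json.loads(params["card.limitations"]),
        "ethical_considerations": params["card.ethical_considerations"],
    }
```

With MLflow, the three dictionaries would come from `MlflowClient().get_run(run_id).data` (its `params`, `metrics`, and `tags` attributes), and the resulting card could be attached to the run with `mlflow.log_dict(card, "model_card.json")`. Keeping the builder free of tracking-system calls makes it straightforward to unit test.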

Q: What is disaggregated evaluation and why is it required?

A: Disaggregated evaluation means computing your performance metrics separately for each relevant demographic subgroup - age groups, geographic regions, gender categories, income brackets - rather than only reporting an aggregate metric across the full evaluation set. It is required because ML models can achieve excellent aggregate performance while performing significantly worse for specific subgroups, particularly underrepresented groups. This is not a hypothetical risk - it is a documented pattern across deployed systems in facial recognition, credit scoring, recidivism prediction, and medical imaging. From a regulatory standpoint, many jurisdictions require evidence of non-discriminatory outcomes for automated decision systems. From a product standpoint, a model that works poorly for a subgroup will generate disproportionate complaints and churn from that subgroup. Disaggregated evaluation is the only way to detect these disparities before they cause harm.

Q: What is the difference between a model card and a datasheet for datasets?

A: A model card documents the model: its intended use, training data, performance metrics, fairness evaluation, limitations, and lineage. A datasheet for datasets documents the training and evaluation data: how it was collected, what it contains, what preprocessing was applied, what known biases it has, and who maintains it. They are complementary: a model card references the dataset versions it was trained on; datasheets provide the detailed documentation for those datasets. Gebru et al. introduced datasheets for datasets in 2018, a year before Mitchell et al. introduced model cards. Together, they form the complete documentation stack for a trained ML system. You cannot fully understand a model's behavior without understanding the data it was trained on.

Q: How does the EU AI Act affect model documentation requirements?

A: The EU AI Act (adopted in 2024) establishes a risk-based framework for AI regulation. Systems classified as "high-risk" - which includes AI used in credit scoring, hiring, educational assessment, biometric identification, and critical infrastructure - must meet documentation requirements that closely mirror and extend model cards. Specifically, high-risk systems require: technical documentation describing the system's purpose, capabilities, and limitations; information about training, validation, and testing data; performance metrics disaggregated by relevant groups; a description of human oversight mechanisms; details of the risk management processes applied during development; and documentation of the conformity assessment performed before deployment. The key difference from voluntary model card practice is that the EU AI Act makes non-compliance a legal matter: in the final text, breaches of the high-risk obligations carry fines of up to 15 million euros or 3% of global annual turnover, and prohibited practices carry fines of up to 35 million euros or 7%. For practitioners in regulated industries, the practical implication is that the model card generation and validation process must be part of the compliance program, not just an engineering best practice.

Q: How do you handle the limitations section when data scientists resist writing it?

A: The resistance usually comes from a framing problem: data scientists often interpret "document the limitations" as "admit the model is bad," which creates a defensive reaction. Reframe it: documenting limitations is how you prevent the model from being used for things it will fail at. A well-documented limitation ("this model's performance degrades significantly for users with fewer than 5 historical transactions") prevents the model from being deployed in a context where 30% of users have no transaction history and then failing publicly. The organizational mechanism is to make the limitations section required by the promotion gate - the pipeline will not allow promotion without at least one documented limitation. This shifts the conversation from "should we document this?" to "what do we document?" You can also provide a template with prompts: "Under what input distribution does this model perform worst?" "What inputs could cause it to return incorrect results with high confidence?" These are concrete questions that are easier to answer than "what are the limitations?"


Summary

Model cards are the accountability layer for machine learning systems. They answer the questions that matter when something goes wrong, when a regulator asks, when a new team member needs to understand a deployed model, and when a product team wants to know if an existing model is appropriate for a new use case.

The operational principle is that model cards should be generated automatically, not written manually. Most of the information already exists in the training pipeline's logging output. The data scientist provides the human context - intended use, known limitations, ethical considerations - as structured configuration inputs, not as after-the-fact prose.

The final enforcement mechanism is making model card completeness a promotion gate. If a model cannot reach production without a valid card, cards will get written. If they are optional, they will not.

© 2026 EngineersOfAI. All rights reserved.