
AI Regulation and FDA Compliance

The Submission That Failed in Week Three

The AI company had done everything right - technically. Their chest X-ray AI for pneumonia detection had been trained on 450,000 images from four academic medical centers. The test set AUC was 0.94. Reader studies with 12 radiologists showed that radiologists using the AI caught 11% more pneumonias than radiologists reading without it. The company had raised $40 million. They had a sales pipeline. They had a hospital system ready to deploy as their first customer.

They filed a 510(k) submission with the FDA. The reviewer came back in three weeks with a list of deficiencies. The intended use statement was too broad: "detection of pneumonia in chest X-rays" covered pediatric patients, but their training data was 98% adult. Their performance testing did not include demographic breakdowns by age, sex, or race/ethnicity. Their predicate device claim referenced a 2017 CAD system that FDA said was not substantially equivalent because the algorithm architecture and clinical workflow integration were materially different. And critically: their training data included patients from the hospitals that would be their first clinical sites, which FDA flagged as a potential validation data integrity issue.

The company went back and spent eight months fixing the deficiencies. They collected 12,000 pediatric chest X-rays, ran new reader studies, reclassified to a different predicate, and prepared a 4,000-page submission package. Total delay: eleven months. Total additional cost: approximately $2 million. First patient at risk from an inadequately validated AI: zero - because the regulatory process worked.

This is the story that repeats across the industry. The technical team builds something that genuinely works and is genuinely useful. Then they encounter the regulatory framework for the first time and discover that "works" in the engineering sense and "cleared for clinical use" in the regulatory sense are different standards with different evidence requirements. The gap is not because the FDA is obstructionist. It is because patients are different from benchmark datasets, and the history of medical device failures is a chronicle of what happens when that gap is ignored.

Understanding the regulatory landscape is not compliance overhead for ML engineers working in healthcare. It is engineering knowledge. Regulatory requirements shape model design decisions, training data requirements, validation methodology, deployment architecture, and post-market monitoring. A team that learns this at the end - after building - pays a much higher price than a team that designs for regulatory compliance from the beginning.

Why This Exists - The History of Medical Device Failures

The FDA's authority over medical devices was established by the Medical Device Amendments of 1976, enacted in response to the Dalkon Shield IUD disaster of the early 1970s. Before 1976, medical devices could be marketed without any federal approval. The Dalkon Shield caused pelvic inflammatory disease, septic abortions, and deaths in thousands of women before it was withdrawn. Congress responded by giving FDA authority to require evidence of safety and effectiveness before devices could reach patients.

The 1976 framework established the risk-based classification system still in use today: Class I (low risk, general controls), Class II (moderate risk, special controls plus substantial equivalence), Class III (high risk, premarket approval with clinical evidence). Software was not a significant factor in 1976 - the primary devices were implants, diagnostic equipment, and surgical tools.

Software as a Medical Device (SaMD) became a regulatory category in the 2010s as standalone software became clinically consequential. The FDA's 2013 guidance on Mobile Medical Applications was the first clear statement that medical software was subject to FDA oversight. A 2019 discussion paper on AI/ML-based SaMD and a 2021 action plan formally established the regulatory framework for adaptive AI systems.

The European parallel developed through the Medical Device Regulation (MDR 2017/745) and In Vitro Diagnostic Regulation (IVDR 2017/746), with the EU AI Act (2024) layering additional requirements specific to AI systems. The EU framework is more prescriptive on risk management and transparency than the FDA framework, reflecting different regulatory philosophies.

The consequences of regulatory failure in medical AI are not hypothetical. IBM Watson for Oncology was deployed at hospitals worldwide and recommended treatments that oncologists at some sites considered unsafe - it was later revealed the system had been trained on hypothetical cases rather than real patient outcomes. Sepsis prediction algorithms deployed without proper validation showed poor performance and racial disparities in independent evaluations. These failures accelerated regulatory attention.

Historical Context - From 510(k) to SaMD

The 510(k) premarket notification process was designed in 1976 for physical devices: if a new device is substantially equivalent to a legally marketed predicate device (originally, one marketed before May 28, 1976; in practice today, usually a device previously cleared through 510(k)), it can be cleared without full clinical trials. This worked for incremental device improvements - a new model of pacemaker lead that works like the previous generation.

For software, substantial equivalence is harder to establish because the "technology" (the algorithm) changes fundamentally between generations. A BERT-based clinical NLP system is not substantially equivalent to a rule-based clinical decision support system from 2010, even if both are intended to extract diagnoses from clinical notes. FDA has navigated this by establishing algorithm families and performance benchmarks rather than architectural equivalence.

The De Novo pathway (introduced 1997, expanded 2012) created a route for novel low-to-moderate risk devices without predicates. Many first-in-class AI medical devices use De Novo, which results in a new device classification and creates a new predicate that subsequent products can use for 510(k) clearance. This is why several first-in-class AI products - Viz.ai's stroke triage software and IDx-DR's autonomous diabetic retinopathy screening among them - went through De Novo for their initial clearances even though it is slower and more expensive.

The FDA's 2021 AI/ML Action Plan addressed a fundamental mismatch: the 510(k) framework assumes a static, locked device. An AI model that retrains on new patient data is never the same device twice. The Predetermined Change Control Plan (PCCP) was introduced to allow AI manufacturers to pre-specify types of model updates that can be deployed without a new 510(k) submission, subject to FDA agreement at the time of initial clearance.

Core Concepts

FDA Device Classification

FDA classifies medical devices into three classes based on risk:

Class I: Low risk. Most Class I devices are exempt from premarket review. Example: bandages, examination gloves, tongue depressors. For software, decision support tools that are advisory and where clinicians can independently verify the AI's recommendation (the clinician can look at the image themselves) typically fall here.

Class II: Moderate risk. Requires 510(k) clearance demonstrating substantial equivalence to a predicate. Most AI medical imaging tools pursue Class II. Special controls (specific testing requirements, labeling requirements, performance standards) are defined for each device type. Example: CT-based osteoporosis screening AI, diabetic retinopathy screening AI (when a doctor can verify the output), ECG analysis software.

Class III: High risk. Requires Premarket Approval (PMA) with clinical trial data demonstrating safety and effectiveness. Class III includes devices that sustain or support life, prevent impairment, or present unreasonable risk of illness or injury. AI systems that operate autonomously without clinician review of the output are more likely to be classified Class III.

The classification depends critically on the intended use and how the AI output is used in clinical practice. The exact same algorithm deployed in two different workflows can have different classifications:

  • Diabetic retinopathy AI that presents results to an ophthalmologist who makes the final diagnosis: Class II, 510(k) clearance
  • Diabetic retinopathy AI that produces autonomous diagnoses and sends results directly to patients without physician review: Class III, PMA required

This is why intended use statement engineering is the most consequential technical decision in the regulatory process.

The 510(k) Submission Package

A typical 510(k) submission for a radiology AI product contains:

Device Description: Detailed description of the AI system: what it takes as input (DICOM images, clinical metadata), what it outputs (probability scores, bounding box coordinates, structured findings), how it integrates with clinical workflow, and what version of software is being cleared.

Intended Use / Indications for Use: The document that defines the cleared scope. Must specify: intended patient population (adult, pediatric, age range), imaging modality and acquisition parameters the system was designed for, clinical indication, and the role of the clinician in the workflow. Example: "XYZ AI Nodule Detection is intended to detect pulmonary nodules greater than 6mm in diameter on chest CT images acquired in adult patients (18 years and older) with suspected malignancy, for use as a decision support tool by trained radiologists. The system is not intended as a standalone diagnostic device."

Predicate Device Comparison: A table comparing the proposed device to the predicate device across: intended use, technology, performance testing. Must argue that differences in technology do not raise new questions of safety and effectiveness.

Performance Testing Data: The largest section. Must include: training data description (how many images, from how many sites, patient demographics, image acquisition parameters), test dataset description (independent from training, statistically powered, demographically representative), performance metrics (sensitivity, specificity, AUC, PPV, NPV, with 95% confidence intervals), performance by subgroup (age, sex, race/ethnicity, scanner type), and comparison to reference standard (how ground truth was established - expert consensus reads, pathology confirmation, clinical outcome).

Cybersecurity: FDA requires cybersecurity documentation for software medical devices covering: threat modeling, security controls, software bill of materials (SBOM), patch management plan.

Labeling: The Instructions for Use document that will accompany the cleared device. Must accurately represent the intended use, performance characteristics, limitations, and how to interpret outputs.
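To make the Performance Testing Data section concrete, here is a minimal sketch of computing the headline metrics with bootstrap 95% confidence intervals from a held-out test set. The function name and the bootstrap approach are illustrative choices, not an FDA-prescribed method; the same helper can be called on each demographic subset to produce the stratified tables.

import numpy as np
from sklearn.metrics import roc_auc_score


def performance_summary(y_true, y_score, threshold=0.5, n_boot=2000, seed=0):
    """Sensitivity, specificity, and AUC with bootstrap 95% CIs (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)

    def metrics(t, s):
        p = (s >= threshold).astype(int)
        tp = np.sum((p == 1) & (t == 1))
        fn = np.sum((p == 0) & (t == 1))
        tn = np.sum((p == 0) & (t == 0))
        fp = np.sum((p == 1) & (t == 0))
        sens = tp / (tp + fn) if (tp + fn) else float("nan")
        spec = tn / (tn + fp) if (tn + fp) else float("nan")
        auc = roc_auc_score(t, s) if len(np.unique(t)) == 2 else float("nan")
        return np.array([sens, spec, auc])

    point = metrics(y_true, y_score)
    # Nonparametric bootstrap over cases to get percentile confidence intervals
    boot = np.array([
        metrics(y_true[idx], y_score[idx])
        for idx in (rng.integers(0, len(y_true), len(y_true)) for _ in range(n_boot))
    ])
    lo, hi = np.nanpercentile(boot, [2.5, 97.5], axis=0)
    names = ["sensitivity", "specificity", "auc"]
    return {n: {"value": float(point[i]), "ci95": (float(lo[i]), float(hi[i]))} for i, n in enumerate(names)}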

HIPAA Compliance for AI

The Health Insurance Portability and Accountability Act (HIPAA) regulates how Protected Health Information (PHI) is handled. For healthcare AI, HIPAA compliance is not optional - it is a condition of doing business with any covered entity (hospital, health system, insurance company, healthcare provider).

What counts as PHI: 18 categories of identifiers: patient name, geographic data more specific than state, dates (other than year) related to health care, phone numbers, fax numbers, email addresses, Social Security numbers, medical record numbers, health plan beneficiary numbers, account numbers, certificate/license numbers, vehicle identifiers, device identifiers, URLs, IP addresses, biometric identifiers (fingerprints, retinal scans), full-face photographs, and any other unique identifier. Note that medical images are PHI even after removing DICOM header tags - a face visible in a head MRI is an identifier.

Business Associate Agreements (BAAs): If you are providing AI services to a covered entity (hospital) and your system processes or stores PHI, you are a Business Associate and must have a BAA with the covered entity. The BAA specifies: what PHI you receive, how you are permitted to use it, security safeguards you will implement, breach notification procedures, and your obligations when the relationship ends (returning or destroying PHI). Signing a BAA is not just paperwork - it makes you legally responsible for PHI breaches under HIPAA's Breach Notification Rule and subject to civil monetary penalties (up to $1.9 million per violation category per year).

De-identification methods: Two standards under 45 CFR 164.514. Safe Harbor requires removing all 18 identifier categories and having no actual knowledge that remaining information could identify an individual. Expert Determination requires a statistical expert to certify that the risk of identification is very small. De-identified data is not PHI and is not subject to HIPAA. For AI training datasets, Safe Harbor is standard practice.

The minimum necessary standard: When using or disclosing PHI for AI purposes (training, validation, quality improvement), you are only permitted to use the minimum amount of PHI necessary for the purpose. A model trained to detect pneumonia on chest X-rays does not need to access the patient's psychiatric history. Accessing more PHI than necessary is a HIPAA violation even if the data is otherwise properly protected.

EU AI Act - High-Risk AI

The EU AI Act (entered into force August 2024, with obligations phasing in through 2026-2027) establishes a risk-based framework for AI systems in the EU market. Most healthcare AI is high-risk under the Act: AI that is a safety component of a product already covered by EU medical device legislation (MDR/IVDR) is high-risk under Article 6(1), and Annex III separately lists use cases such as AI that determines access to essential services, including healthcare services, as high-risk.

For high-risk AI systems, the EU AI Act requires:

Risk management system: An ongoing process of identifying, analyzing, and mitigating risks throughout the AI system's lifecycle. Must be documented and updated as the system changes.

Data governance: Training, validation, and testing datasets must meet quality criteria including relevance, representativeness, and freedom from errors. Must document data collection sources, annotation procedures, and data preprocessing steps.

Technical documentation: Comprehensive documentation covering: system purpose, design, development methodology, performance evaluation, known limitations, and interactions with other systems.

Human oversight: High-risk AI systems must be designed to allow human oversight. Specifically: (1) humans must be able to understand the system's capabilities and limitations, (2) humans must be able to detect and address anomalies, (3) humans must be able to disregard, override, or interrupt the system's output. This last requirement directly affects autonomous AI designs.

Transparency to users: Persons using a high-risk AI system must be provided information to use it appropriately: the system's purpose, performance characteristics under which it has been tested, known limitations, and circumstances where performance may be diminished.

Accuracy, robustness, and cybersecurity: Performance must be documented across relevant use conditions. Systems must be resilient to adversarial manipulation and errors.

For companies that already have FDA 510(k) clearance: FDA clearance does not equal EU AI Act compliance. The EU Act requires documentation and processes that go beyond what FDA currently requires. Companies selling in both markets need a compliance strategy that satisfies both regulatory regimes simultaneously.

Bias Auditing

FDA increasingly expects AI medical devices to demonstrate adequate performance across demographic subgroups. The 2021 AI/ML-Based SaMD Action Plan and the Good Machine Learning Practice guiding principles (issued in 2021 with Health Canada and the UK's MHRA) both emphasize that datasets should be representative of the intended patient population, and reviewers routinely request performance stratified by sex, age, and race/ethnicity.

A bias audit for a radiology AI submission includes:

Subgroup performance analysis: Report sensitivity, specificity, and AUC separately for each major demographic group. The sample size for each subgroup must be sufficient for statistical power - typically 100+ positive cases per subgroup for a detection task.

Disparate impact testing: Test whether the model's false negative rate (missed findings) differs statistically across subgroups. A model that misses 8% of findings in white patients but 18% in Black patients has a disparate impact problem. Use statistical tests (chi-squared, Cochran-Mantel-Haenszel) to establish whether observed differences are statistically significant.

Confounding factor analysis: Demographic variables often correlate with acquisition parameters (older scanners in under-resourced hospitals serving majority-minority populations). Separate the effect of demographics from the effect of acquisition parameters.

Training data representativeness: Document the demographic composition of the training dataset and compare to the intended patient population. If the training data is 92% white patients and the target market includes hospitals serving diverse populations, document this limitation explicitly.
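Before the fuller examples below, here is a minimal sketch of the disparate impact test described above: compare false negative rates across subgroups among the positive cases using a chi-squared test on the missed-versus-caught contingency table. The function name and the choice of test are illustrative; Cochran-Mantel-Haenszel or exact tests may be more appropriate for small or confounded subgroups.

import numpy as np
from scipy.stats import chi2_contingency


def false_negative_disparity(y_true, y_pred, group):
    """Chi-squared test of whether the false negative rate differs across subgroups."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    positives = y_true == 1
    table, fnr_by_group = [], {}
    for g in np.unique(group[positives]):
        mask = positives & (group == g)
        missed = int(np.sum(y_pred[mask] == 0))   # false negatives in this subgroup
        caught = int(np.sum(y_pred[mask] == 1))   # true positives in this subgroup
        table.append([missed, caught])
        fnr_by_group[str(g)] = missed / max(missed + caught, 1)
    chi2, p_value, _, _ = chi2_contingency(np.array(table))
    return {"fnr_by_group": fnr_by_group, "chi2": float(chi2), "p_value": float(p_value)}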

Code Examples

HIPAA De-identification Pipeline

import pydicom
import hashlib
import json
import re
from pathlib import Path
from typing import Optional
from datetime import datetime, date


# DICOM tags removed or replaced under Safe Harbor (representative, not exhaustive)
# Informed by DICOM PS 3.15 Annex E Basic Application Level Confidentiality Profile
SAFE_HARBOR_REMOVE_TAGS = [
    (0x0008, 0x0014),  # Instance Creator UID
    (0x0008, 0x0022),  # Acquisition Date
    (0x0008, 0x0023),  # Content Date
    (0x0008, 0x0025),  # Curve Date
    (0x0008, 0x002A),  # Acquisition DateTime
    (0x0008, 0x0032),  # Acquisition Time
    (0x0008, 0x0033),  # Content Time
    (0x0008, 0x0081),  # Institution Address
    (0x0008, 0x0082),  # Institution Code Sequence
    (0x0008, 0x0090),  # Referring Physician Name
    (0x0008, 0x0092),  # Referring Physician Address
    (0x0008, 0x0096),  # Referring Physician Identification
    (0x0008, 0x1048),  # Physician(s) of Record
    (0x0008, 0x1049),  # Physician(s) of Record Identification
    (0x0008, 0x1050),  # Performing Physician Name
    (0x0008, 0x1052),  # Performing Physician Identification
    (0x0008, 0x1060),  # Name of Physician Reading Study
    (0x0008, 0x1062),  # Physician Reading Study Identification
    (0x0008, 0x1070),  # Operators Name
    (0x0008, 0x1072),  # Operators Identification
    (0x0008, 0x1080),  # Admitting Diagnoses Description
    (0x0008, 0x1084),  # Admitting Diagnoses Code Sequence
    (0x0008, 0x1195),  # Transaction UID
    (0x0010, 0x0010),  # Patient Name (re-set to a pseudonym below)
    (0x0010, 0x0020),  # Patient ID (re-set to a pseudonym below)
    (0x0010, 0x0021),  # Issuer of Patient ID
    # Patient Birth Date (0x0010, 0x0030) is intentionally NOT in this list;
    # it is handled separately so the birth year can be retained.
    (0x0010, 0x0032),  # Patient Birth Time
    # Patient Sex (0x0010, 0x0040) is kept - not an identifier under Safe Harbor
    (0x0010, 0x0050),  # Patient Insurance Plan Code Sequence
    (0x0010, 0x1000),  # Other Patient IDs
    (0x0010, 0x1001),  # Other Patient Names
    (0x0010, 0x1010),  # Patient Age (removed conservatively; ages 90+ are identifiers)
    (0x0010, 0x1020),  # Patient Size
    (0x0010, 0x1030),  # Patient Weight
    (0x0010, 0x1090),  # Medical Record Locator
    (0x0010, 0x2160),  # Ethnic Group
    (0x0010, 0x2180),  # Occupation
    (0x0010, 0x21B0),  # Additional Patient History
    (0x0010, 0x4000),  # Patient Comments
    (0x0032, 0x1032),  # Requesting Physician
    (0x0032, 0x1033),  # Requesting Service
    (0x0038, 0x0010),  # Admission ID
    (0x0038, 0x001E),  # Scheduled Admission Date
    (0x0038, 0x0020),  # Admitting Date
    (0x0040, 0x0006),  # Scheduled Performing Physician Name
    (0x0040, 0x0244),  # Performed Procedure Step Start Date
    (0x0040, 0x0253),  # Performed Procedure Step ID
    (0x0040, 0xA124),  # UID
    (0x0040, 0xA730),  # Content Sequence - may contain PHI in SR
]

# Tags to replace with pseudonymized values (not remove, because other tags reference them)
SAFE_HARBOR_REPLACE_TAGS = [
    (0x0020, 0x000D),  # Study Instance UID - replace with pseudonymized UID
    (0x0020, 0x000E),  # Series Instance UID
    (0x0008, 0x0018),  # SOP Instance UID
    (0x0008, 0x0080),  # Institution Name
]


def safe_harbor_deidentify(
    input_path: str,
    output_path: str,
    keep_year_of_birth: bool = True,
    salt: str = "change_this_salt_to_a_secret_value",
) -> dict:
    """
    Apply HIPAA Safe Harbor de-identification to a DICOM file.
    Returns a mapping of original UIDs to pseudonymized UIDs for linking.

    IMPORTANT: Use a consistent salt across a project so the same patient's
    studies always get the same pseudonym (for longitudinal analysis).
    Keep the salt secret and secure.
    """
    ds = pydicom.dcmread(input_path)
    uid_mapping = {}

    # Capture the original patient ID before any tags are removed so the
    # pseudonym is derived from the real ID, not a placeholder
    original_patient_id = str(getattr(ds, "PatientID", "UNKNOWN"))

    # Check for burned-in PHI in pixel data
    burned_in = str(getattr(ds, "BurnedInAnnotation", "NO")).upper()
    if burned_in == "YES":
        raise ValueError(
            f"File {input_path} has BurnedInAnnotation=YES. "
            "Manual review required before automated de-identification."
        )

    # Remove PHI tags
    for tag in SAFE_HARBOR_REMOVE_TAGS:
        if tag in ds:
            del ds[tag]

    # Handle patient birth date specially - keep year if requested
    birth_date_tag = (0x0010, 0x0030)
    if birth_date_tag in ds and ds[birth_date_tag].value:
        birth_date_str = str(ds[birth_date_tag].value)
        if keep_year_of_birth and len(birth_date_str) >= 4:
            # Keep birth year, zero out month and day
            ds[birth_date_tag].value = birth_date_str[:4] + "0101"
        else:
            del ds[birth_date_tag]

    # Handle study date - replace with year only
    study_date_tag = (0x0008, 0x0020)
    if study_date_tag in ds and ds[study_date_tag].value:
        study_date_str = str(ds[study_date_tag].value)
        if len(study_date_str) >= 4:
            ds[study_date_tag].value = study_date_str[:4] + "0101"

    # Pseudonymize UIDs consistently using a salted SHA-256 hash
    for tag in SAFE_HARBOR_REPLACE_TAGS:
        if tag in ds:
            original_uid = str(ds[tag].value)
            # Create deterministic pseudonym from original UID + salt
            pseudo_hash = hashlib.sha256(f"{salt}{original_uid}".encode()).hexdigest()[:16]
            # DICOM UIDs must be numeric with dots; 2.25 is the standard root for derived UIDs
            pseudo_uid = f"2.25.{int(pseudo_hash, 16)}"[:64]  # Max 64 chars
            uid_mapping[original_uid] = pseudo_uid
            ds[tag].value = pseudo_uid

    # Set patient ID and name to a pseudonym derived from the original patient ID
    patient_pseudo = "ANON_" + hashlib.sha256(f"{salt}{original_patient_id}".encode()).hexdigest()[:8].upper()
    ds.PatientName = patient_pseudo
    ds.PatientID = patient_pseudo

    # Add de-identification method indicator (best practice)
    ds.add_new((0x0012, 0x0063), "LO", "HIPAA Safe Harbor")

    output_path_obj = Path(output_path)
    output_path_obj.parent.mkdir(parents=True, exist_ok=True)
    ds.save_as(str(output_path_obj), write_like_original=False)

    return uid_mapping


def audit_deidentification(
    original_path: str,
    deidentified_path: str,
    phi_patterns: Optional[list[str]] = None,
) -> dict:
    """
    Audit a de-identified DICOM file to verify PHI removal.
    Checks for known PHI patterns in remaining tag values.
    """
    original_ds = pydicom.dcmread(original_path)
    deidentified_ds = pydicom.dcmread(deidentified_path)

    # Build set of known PHI values from original
    phi_values = set()
    if hasattr(original_ds, "PatientName"):
        name_parts = str(original_ds.PatientName).lower().split("^")
        phi_values.update(p for p in name_parts if len(p) > 2)
    if hasattr(original_ds, "PatientID"):
        phi_values.add(str(original_ds.PatientID).lower())
    if hasattr(original_ds, "PatientBirthDate") and len(str(original_ds.PatientBirthDate)) > 4:
        phi_values.add(str(original_ds.PatientBirthDate)[4:])  # Month/day portion

    if phi_patterns:
        phi_values.update(p.lower() for p in phi_patterns)

    # Scan all remaining tag values in de-identified file
    phi_found = []
    for elem in deidentified_ds:
        if elem.VR in ("LO", "LT", "PN", "SH", "ST", "UT", "CS"):
            tag_value = str(elem.value).lower()
            for phi in phi_values:
                if phi in tag_value:
                    phi_found.append({
                        "tag": str(elem.tag),
                        "vr": elem.VR,
                        "keyword": elem.keyword,
                        "matched_phi": phi,
                    })

    return {
        "passed": len(phi_found) == 0,
        "phi_found": phi_found,
        "remaining_tag_count": len(list(deidentified_ds)),
    }
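A usage sketch for the two functions above, assuming they are in scope and you have a directory of DICOM files; the directory names and the decision to abort on a failed audit are illustrative.

from pathlib import Path

uid_map = {}
for src in Path("raw_dicom").rglob("*.dcm"):
    dst = Path("deid_dicom") / src.relative_to("raw_dicom")
    uid_map.update(safe_harbor_deidentify(str(src), str(dst), salt="project_secret"))
    report = audit_deidentification(str(src), str(dst))
    if not report["passed"]:
        # Stop the pipeline rather than risk shipping PHI downstream
        raise RuntimeError(f"PHI survived de-identification in {dst}: {report['phi_found']}")
# Store the UID mapping separately from the de-identified data, under access control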

Model Card Generator

from dataclasses import dataclass, field
from typing import Optional
import json
from datetime import date


@dataclass
class SubgroupPerformance:
    group_name: str
    n_positive: int
    n_negative: int
    sensitivity: float
    specificity: float
    auc: float
    sensitivity_ci: tuple[float, float]  # 95% CI
    auc_ci: tuple[float, float]


@dataclass
class ModelCard:
"""
FDA-aligned model card for a healthcare AI system.
Captures the information required for regulatory submissions
and ongoing post-market monitoring.
"""
# Identity
model_name: str
model_version: str
model_date: str
intended_use: str
indications_for_use: str
contraindications: str

# Training data
training_dataset_name: str
training_n_total: int
training_n_positive: int
training_institutions: list[str]
training_date_range: tuple[str, str] # start, end
training_demographics: dict # age, sex, race/ethnicity breakdown

# Test data (must be independent from training)
test_dataset_name: str
test_n_total: int
test_n_positive: int
test_institutions: list[str]
test_demographics: dict

# Overall performance (on independent test set)
sensitivity: float
specificity: float
auc: float
ppv: float
npv: float
sensitivity_ci: tuple[float, float]
auc_ci: tuple[float, float]

# Subgroup performance
subgroup_performance: list[SubgroupPerformance] = field(default_factory=list)

# Known limitations
known_limitations: list[str] = field(default_factory=list)
out_of_scope_uses: list[str] = field(default_factory=list)

# Regulatory
regulatory_status: str = "Not cleared"
predicate_device: Optional[str] = None
fda_submission_number: Optional[str] = None

    def to_markdown(self) -> str:
        """Generate a Markdown model card suitable for publication."""
        lines = [
            f"# Model Card: {self.model_name} v{self.model_version}",
            "",
            f"**Date:** {self.model_date}",
            f"**Regulatory Status:** {self.regulatory_status}",
            "",
            "## Intended Use",
            "",
            f"{self.intended_use}",
            "",
            f"**Indications for Use:** {self.indications_for_use}",
            "",
            f"**Contraindications:** {self.contraindications}",
            "",
            "## Training Data",
            "",
            f"- Dataset: {self.training_dataset_name}",
            f"- Total studies: {self.training_n_total:,}",
            f"- Positive cases: {self.training_n_positive:,} ({100*self.training_n_positive/self.training_n_total:.1f}%)",
            f"- Institutions: {', '.join(self.training_institutions)}",
            f"- Date range: {self.training_date_range[0]} to {self.training_date_range[1]}",
            "",
            "## Performance (Independent Test Set)",
            "",
            f"- Test dataset: {self.test_dataset_name}",
            f"- Test set size: {self.test_n_total:,} ({self.test_n_positive:,} positive)",
            "",
            "| Metric | Value | 95% CI |",
            "|--------|-------|--------|",
            f"| Sensitivity | {self.sensitivity:.3f} | ({self.sensitivity_ci[0]:.3f}, {self.sensitivity_ci[1]:.3f}) |",
            f"| Specificity | {self.specificity:.3f} | - |",
            f"| AUC | {self.auc:.3f} | ({self.auc_ci[0]:.3f}, {self.auc_ci[1]:.3f}) |",
            f"| PPV | {self.ppv:.3f} | - |",
            f"| NPV | {self.npv:.3f} | - |",
            "",
        ]

        if self.subgroup_performance:
            lines.extend([
                "## Subgroup Performance",
                "",
                "| Group | N Pos | N Neg | Sensitivity | AUC |",
                "|-------|-------|-------|-------------|-----|",
            ])
            for sg in self.subgroup_performance:
                lines.append(
                    f"| {sg.group_name} | {sg.n_positive} | {sg.n_negative} | "
                    f"{sg.sensitivity:.3f} ({sg.sensitivity_ci[0]:.3f}-{sg.sensitivity_ci[1]:.3f}) | "
                    f"{sg.auc:.3f} |"
                )
            lines.append("")

        if self.known_limitations:
            lines.extend([
                "## Known Limitations",
                "",
            ])
            for lim in self.known_limitations:
                lines.append(f"- {lim}")
            lines.append("")

        if self.out_of_scope_uses:
            lines.extend([
                "## Out of Scope Uses",
                "",
            ])
            for oos in self.out_of_scope_uses:
                lines.append(f"- {oos}")
            lines.append("")

        return "\n".join(lines)

    def check_fda_readiness(self) -> list[str]:
        """
        Check whether the model card has the information required for an FDA 510(k) submission.
        Returns list of missing or potentially problematic items.
        """
        issues = []

        # Subgroup analysis requirements
        if not self.subgroup_performance:
            issues.append("REQUIRED: No subgroup performance data. FDA expects performance by sex, age group, and race/ethnicity.")

        subgroup_names = {sg.group_name.lower() for sg in self.subgroup_performance}
        if not any("female" in n or "sex" in n for n in subgroup_names):
            issues.append("REQUIRED: No sex-stratified performance data.")
        if not any("age" in n for n in subgroup_names):
            issues.append("REQUIRED: No age-stratified performance data.")
        if not any("race" in n or "ethnic" in n or "black" in n or "white" in n for n in subgroup_names):
            issues.append("REQUIRED: No race/ethnicity-stratified performance data.")

        # Sample size check
        for sg in self.subgroup_performance:
            if sg.n_positive < 100:
                issues.append(
                    f"WARNING: Subgroup '{sg.group_name}' has only {sg.n_positive} positive cases. "
                    "FDA typically expects 100+ positive cases per subgroup for adequate statistical power."
                )

        # Training/test overlap risk
        if set(self.training_institutions) & set(self.test_institutions):
            issues.append(
                "WARNING: Training and test set share institutions. "
                "FDA will scrutinize whether patient-level independence is maintained. "
                "Consider test set from entirely separate institutions."
            )

        # Regulatory status
        if self.regulatory_status == "Not cleared" and not self.fda_submission_number:
            issues.append("INFO: No FDA submission number. Device is not cleared for clinical use in the US.")

        return issues
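A brief usage sketch: populate the card from your evaluation pipeline, render the Markdown, and run the readiness check before drafting the 510(k). Every value below is a placeholder; with no subgroup data supplied, check_fda_readiness will flag the missing sex, age, and race/ethnicity stratifications.

card = ModelCard(
    model_name="XYZ Nodule Detection", model_version="1.2.0", model_date="2025-06-01",
    intended_use="Decision support for pulmonary nodule detection on adult chest CT.",
    indications_for_use="Detection of nodules >6mm in adults (18+) with suspected malignancy.",
    contraindications="Not intended for pediatric patients.",
    training_dataset_name="TRAIN-v3", training_n_total=450_000, training_n_positive=61_000,
    training_institutions=["Site A", "Site B", "Site C", "Site D"],
    training_date_range=("2015-01-01", "2022-12-31"),
    training_demographics={"female": 0.47, "median_age": 63},
    test_dataset_name="TEST-v1", test_n_total=8_200, test_n_positive=1_150,
    test_institutions=["Site E", "Site F"], test_demographics={"female": 0.49},
    sensitivity=0.91, specificity=0.88, auc=0.94, ppv=0.62, npv=0.98,
    sensitivity_ci=(0.89, 0.93), auc_ci=(0.93, 0.95),
)

print(card.to_markdown())
for issue in card.check_fda_readiness():
    print(issue)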

Production Engineering Notes

Build Regulatory Traceability from Day One: Every dataset used in training or validation needs an immutable record: where it came from, who collected it, under what IRB protocol, when it was collected, and what de-identification was applied. Version-control your datasets with checksums (SHA-256 hash of the full dataset). Version-control your model artifacts. When an FDA reviewer asks "what was in your training data and how was it processed," you need a complete and auditable answer. Trying to reconstruct this retroactively from scattered notes and Slack messages is a nightmare. Tools: DVC for dataset versioning, MLflow for model versioning and experiment tracking, a structured data provenance database.
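One way to implement the dataset checksumming mentioned above is a manifest of per-file SHA-256 hashes plus a dataset-level hash over the manifest; this sketch is illustrative and sits alongside, not instead of, tools like DVC.

import hashlib
import json
from pathlib import Path


def dataset_manifest(root: str, manifest_path: str) -> str:
    """Write per-file SHA-256 hashes and return a single dataset-level hash."""
    entries = {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(Path(root).rglob("*")) if path.is_file()
    }
    # Hashing the canonicalized manifest gives one identifier for the whole dataset version
    dataset_hash = hashlib.sha256(json.dumps(entries, sort_keys=True).encode()).hexdigest()
    Path(manifest_path).write_text(json.dumps(
        {"dataset_sha256": dataset_hash, "files": entries}, indent=2))
    return dataset_hash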

Intended Use Design is Architecture Design: The intended use statement you file with FDA constrains what the AI can do in production. If you clear a device to "assist radiologists in detecting nodules in adult chest CT" and then deploy it on pediatric patients or use it for something other than decision support, you are operating outside the cleared intended use - which is a regulatory violation. Design the clinical workflow, the user interface, and the inference API to enforce the intended use boundaries. Log every inference with patient demographics so you can verify the system is being used within its cleared scope.
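A minimal sketch of enforcing intended use boundaries at the inference API, rejecting requests outside the cleared scope before the model runs; the field names and the cleared values are hypothetical.

from dataclasses import dataclass


@dataclass(frozen=True)
class ClearedScope:
    min_age_years: int = 18
    modalities: tuple = ("CT",)
    body_parts: tuple = ("CHEST",)


def check_intended_use(scope: ClearedScope, age_years: int, modality: str, body_part: str) -> None:
    """Raise before running inference if the request falls outside the cleared intended use."""
    if age_years < scope.min_age_years:
        raise ValueError(f"Patient age {age_years} is below the cleared minimum of {scope.min_age_years}")
    if modality.upper() not in scope.modalities:
        raise ValueError(f"Modality {modality} is outside the cleared scope {scope.modalities}")
    if body_part.upper() not in scope.body_parts:
        raise ValueError(f"Body part {body_part} is outside the cleared scope {scope.body_parts}")
    # In production, also log every request (age, modality, site) to support scope audits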

Post-Market Surveillance is an Engineering System: FDA requires post-market surveillance for medical AI devices. For AI systems, this means: (1) tracking model performance on production data over time (which requires getting feedback on whether AI findings are confirmed or refuted, which requires integration with the radiology reporting system); (2) logging and reviewing adverse events (cases where following an AI recommendation led to patient harm or delayed treatment); (3) detecting distribution shift that might indicate model performance has degraded. Build this surveillance infrastructure before go-live, not after. It is harder to retrofit surveillance into a deployed system than to build it in from the start.
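A sketch of the rolling-window performance check described above, assuming each production case eventually receives a radiologist-confirmed label; the alert thresholds mirror the ones discussed in the Interview Q&A below but should be set clinically for your device, not copied.

import numpy as np
from sklearn.metrics import roc_auc_score


def rolling_performance_alerts(labels, scores, baseline_auc, baseline_sensitivity,
                               threshold=0.5, max_auc_drop=0.05, max_sens_drop=0.10):
    """Compare a recent window of confirmed cases against validation baselines."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    preds = (scores >= threshold).astype(int)
    auc = roc_auc_score(labels, scores)
    positives = labels == 1
    sensitivity = float(preds[positives].mean()) if positives.any() else float("nan")
    alerts = []
    if baseline_auc - auc > max_auc_drop:
        alerts.append(f"AUC dropped {baseline_auc - auc:.3f} below baseline")
    if baseline_sensitivity - sensitivity > max_sens_drop:
        alerts.append(f"Sensitivity dropped {baseline_sensitivity - sensitivity:.3f} below baseline")
    return {"auc": float(auc), "sensitivity": sensitivity, "alerts": alerts}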

IRB Approval for Retrospective Studies: Using patient data to train an AI model is research. Research using patient data requires Institutional Review Board (IRB) approval unless specifically exempted. The Common Rule's exemption 4 covers secondary research with identifiable data only under specific conditions. Most hospital systems require at minimum a waiver of authorization from their IRB before allowing data access for AI training. Get IRB approval before collecting data, not after. Retroactive IRB approval is rarely granted and creates legal risk.

Common Mistakes

:::danger Changing Model After Performance Testing
A 510(k) submission binds the performance data to a specific model version. If you run your performance testing, identify a bug, fix it, and retrain the model, the performance data no longer matches the model you are submitting. Either rerun performance testing on the updated model (required if you changed anything that affects outputs), or design your pipeline so that model freezing happens before performance testing and no changes are made after. This seems obvious but is regularly violated: teams find that performance looks slightly worse than expected, make a small hyperparameter tweak, and then need to decide whether to rerun the full test. Document your model freeze date and do not allow any changes after that date except for legitimate bug fixes with documented impact analysis.
:::

:::danger Treating De-identified Data as Automatically HIPAA-Compliant
De-identification reduces re-identification risk but does not eliminate it. A CT scan of a patient with a distinctive bone abnormality, combined with publicly available data, can potentially re-identify the patient even after removing all 18 Safe Harbor identifiers. Genomic data is inherently re-identifiable - 50 SNPs are sufficient to uniquely identify most individuals. Before publishing de-identified medical datasets or making them available to external collaborators, conduct a formal re-identification risk assessment. The fact that you followed Safe Harbor removes HIPAA liability but does not mean re-identification is impossible. For genomic data specifically, do not rely on Safe Harbor - use controlled access with data use agreements and consider additional technical protections (k-anonymity, synthetic data generation).
:::

:::warning 510(k) Clearance is Site-Specific in Practice
FDA 510(k) clearance applies to the device as described in the submission. If your submission says the device was validated on 3T MRI scanners from three specific vendors, deploying the same cleared software on 7T MRI or on a scanner from a vendor not in your validation set is operating outside the cleared intended use. Hospitals often have different equipment than your validation sites. Build an onboarding validation process: before deploying at a new site, run 100-200 locally labeled studies through the model and confirm performance is within acceptable bounds. If a new site has equipment or patient population significantly different from your validation set, that may require a new 510(k) amendment.
:::

:::warning Failing to Separate Model Card from Marketing
Model cards are technical documents describing actual performance on actual test sets with confidence intervals and subgroup breakdowns. Marketing materials describe the best-case performance with selective quoting. When FDA reviews your 510(k), they will compare your marketing claims against your model card data. "AI that detects pneumonia with 94% accuracy" in marketing but your model card shows sensitivity of 87% with specificity of 91% (not "accuracy") is a discrepancy that reviewers will flag. Use precise technical language consistently across all product communications. Never quote AUC in marketing without also disclosing the operating point (sensitivity/specificity) at which the device is intended to be used clinically.
:::

Interview Q&A

Q: What is the difference between 510(k) clearance and PMA approval, and what factors determine which pathway a radiology AI product should pursue?

A: 510(k) clearance (k-number) is a premarket notification demonstrating that a new device is substantially equivalent to a legally marketed predicate device. It does not require clinical trial data proving safety and effectiveness - it requires demonstrating that the device is as safe and effective as the predicate. 510(k) takes roughly 6-12 months and is appropriate for moderate-risk (Class II) devices. Premarket Approval (PMA, p-number) is required for high-risk (Class III) devices and requires clinical trial data demonstrating safety and effectiveness independently, not just equivalence to a predicate. PMA takes 1-3 years minimum. For radiology AI, the key factors driving which pathway to use: (1) whether a substantial equivalent predicate exists - if you are the first AI of your kind for a given clinical indication, you may need De Novo; (2) the role of the clinician - AI that a trained radiologist reviews before acting is typically Class II and 510(k)-eligible; AI that operates autonomously without clinician review of outputs is more likely Class III requiring PMA; (3) the clinical consequence of errors - an AI that misses a finding that a radiologist would catch with 99% probability is lower risk than an AI making autonomous treatment decisions. In practice, virtually all radiology AI companies pursue 510(k) by carefully designing the intended use statement to position the AI as decision support that a radiologist reviews.

Q: Explain what a Predetermined Change Control Plan (PCCP) is and why it matters for AI medical devices.

A: A PCCP is a document, submitted as part of a 510(k) or PMA, that describes the types of changes a manufacturer anticipates making to their AI/ML-based medical device and the validation activities they will conduct for those changes. FDA agrees at the time of initial clearance that changes falling within the PCCP's scope can be deployed after the manufacturer completes internal validation, without requiring a new regulatory submission. This matters enormously for AI products because traditional medical device regulation assumed a static device - a pacemaker software update that fixes a bug has a well-defined scope. An AI model that retrains on new patient data is genuinely a different model. Without a PCCP, every retraining round technically requires a new 510(k), which takes 6-12 months and costs hundreds of thousands of dollars. With a PCCP, if the retraining methodology and performance evaluation criteria are pre-agreed with FDA, the manufacturer can redeploy an updated model within days or weeks. Designing your model development pipeline - retraining frequency, validation methodology, performance thresholds for deployment - is therefore a regulatory strategy decision as much as an engineering decision.

Q: What constitutes PHI under HIPAA, and what specific steps are required to de-identify a dataset of medical images for AI training?

A: PHI is individually identifiable health information held by a covered entity or business associate that relates to an individual's past, present, or future physical or mental health, the provision of health care, or payment for health care. The 18 Safe Harbor identifier categories include names, geographic subdivisions smaller than state, dates other than year, phone/fax numbers, email, SSN, MRN, health plan numbers, account numbers, license numbers, vehicle identifiers, device identifiers, URLs, IP addresses, biometric identifiers, full-face photographs, and any other unique identifying number. For medical images: (1) strip all DICOM header tags containing the 18 categories - use pydicom with a verified removal list, not just a subset; (2) pseudonymize UIDs consistently using keyed hashing so studies from the same patient can be linked for longitudinal analysis without exposing the original ID; (3) check for burned-in annotations - ultrasound, fluoroscopy, and some secondary captures embed PHI in pixel data; these require either manual cropping/masking or reconstruction of the image with annotations removed; (4) handle dates carefully - retain year of birth and year of study for age calculation and temporal analysis, but remove month and day; (5) audit the output using automated scanning for known PHI values in remaining tags and visual inspection for burned-in annotations; (6) document the de-identification methodology and the software version used, since you may need to demonstrate your approach is safe harbor compliant in a BAA or IRB protocol.

Q: How would you design a post-market surveillance system for a deployed chest X-ray AI? What metrics would you track and what thresholds would trigger action?

A: The surveillance system has three components. Performance monitoring: for each AI output (positive finding, negative finding, priority score), track whether the radiologist's subsequent report confirms or contradicts the AI. This requires extracting structured data from radiology reports using NLP or structured reporting templates. Compute rolling AUC, sensitivity, and specificity over a 30-day window and compare against the validation dataset performance. Trigger an alert if AUC drops more than 0.05 from baseline or if sensitivity drops more than 0.10 - these are clinically meaningful degradations. Distribution monitoring: track pixel statistics (mean HU, image noise, slice thickness) from processed studies daily and compare to training distribution statistics using t-tests or KL-divergence. Track scanner model metadata to detect new equipment not in the training set. Adverse event monitoring: implement a structured feedback mechanism for radiologists to flag cases where they believe the AI was misleading or harmful. Review every flagged case within 24 hours. Track false negative rate on critical findings (missed PE, missed intracranial hemorrhage, missed pneumothorax) separately from overall false negative rate. Action thresholds: information tier (log and monitor) for minor drift; warning tier (notify ML and clinical teams, schedule re-validation) for moderate performance decline; critical tier (disable AI output pending investigation) for severe performance decline, new scanner type with unknown performance characteristics, or any confirmed patient harm event attributable to AI output.

Q: A colleague argues that your AI model does not need FDA clearance because it is "clinical decision support" and not a medical device. How do you evaluate this claim?

A: This is a frequently made and frequently incorrect claim that requires careful analysis. FDA's 2016 guidance and subsequent 21st Century Cures Act provisions do exempt certain clinical decision support (CDS) software from FDA oversight, but the exemption is narrow. CDS software is NOT a medical device and does not require FDA clearance only if it meets all of: (1) it does not acquire, process, or analyze medical images or other physiological signals; (2) it displays, analyzes, or prints medical information about a patient; (3) it supports or provides recommendations to a healthcare professional about the prevention, diagnosis, or treatment of a disease; AND (4) a healthcare professional can independently review the basis for the recommendations so that they do not primarily rely on the software. The moment an AI analyzes DICOM images, vital sign waveforms, or genomic sequencing data, condition (1) is violated and it is no longer exempt. Any AI that processes medical images or signals is almost certainly a medical device requiring regulatory oversight. The colleague's claim is commonly used to avoid regulatory compliance but creates significant legal risk - deploying an uncleared AI medical device is a violation of the FD&C Act and can result in enforcement action including injunctions, recalls, and civil penalties.

Q: Explain the EU AI Act's requirements for healthcare AI and how they differ from FDA requirements.

A: The EU AI Act (2024) takes a more prescriptive, rights-based approach to AI regulation compared to FDA's evidence-based, safety-and-effectiveness framework. Key differences: FDA asks "is this device safe and effective for its intended use?" - the EU AI Act asks "does this system meet broad requirements around transparency, human oversight, data governance, and non-discrimination?" On prohibited uses: the EU AI Act prohibits AI that exploits vulnerabilities or uses subliminal manipulation - these prohibitions apply even if the AI is safe and effective. On transparency: the EU Act requires that people be informed when they are interacting with an AI system and requires labeling of AI-generated content. On human oversight: FDA typically expects that clinicians can review AI outputs before acting; the EU Act goes further in requiring that the AI system itself be designed to enable human oversight, with specific design requirements (ability to override, ability to pause). On bias: FDA expects subgroup performance testing; the EU Act requires ongoing bias monitoring and equal access to high-quality healthcare regardless of demographic characteristics, with specific documentation requirements. On enforcement: FDA enforces through device clearance; the EU AI Act creates a new enforcement structure with national market surveillance authorities and potential fines up to 35 million euros or 7% of global annual turnover for the most serious violations. Companies selling in both markets need compliance programs that satisfy both: build to the more demanding standard in each area, which means FDA-level performance evidence plus EU-level documentation and process requirements.
