The Provenance Gap
History of ML data provenance tooling and EU AI Act enforcement deadlines
EU AI Act Enforcement Deadlines - Countdown
High-risk AI systems (credit scoring, HR decisions, educational AI, critical infrastructure) must maintain full audit trails.
August 2026
Article 26 logging obligations for new high-risk AI systems
2 December 2027
Backstop enforcement for all deployed high-risk AI systems
Already in force
GDPR Article 17: right to erasure of personal data used in training, with erasure that can be verified
The gap
No single existing tool answers audit questions end to end, from source file to prediction
Stage 1: Data Ingestion
MLflow
DVC
DSLog
Stops at artifact boundary - no row-level lineage
Stage 2: Model Training
TF MLMD
W&B
Neptune
No gradient-level attribution, no source traceability
Stage 3: Attribution
TRAK
dattri
LogIX
Operates on tensors - no knowledge of source files or preprocessing
Figure: Tool Coverage Across the ML Pipeline. Which tools cover which pipeline stages (1 = full coverage, 0.5 = partial, 0 = none).
2018
MLflow Released
Databricks open-sources MLflow. Artifact-level experiment tracking becomes standard. Records which dataset file and code commit produced which model. Does not record which rows were processed or how.
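The difference between artifact-level and row-level records can be illustrated with a minimal sketch. It does not use MLflow's API; the function names and the sample rows are made up for illustration. One hash per file tells you that a dataset changed, while one hash per row tells you which rows changed.

```python
import hashlib


def artifact_fingerprint(rows):
    """One hash for the whole dataset, as an artifact-level tracker might log."""
    h = hashlib.sha256()
    for row in rows:
        h.update(row.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


def row_fingerprints(rows):
    """One hash per row, the granularity that row-level lineage needs."""
    return [hashlib.sha256(row.encode("utf-8")).hexdigest() for row in rows]


# Hypothetical dataset versions; the second is missing one row.
v1 = ["alice,34,approved", "bob,51,denied", "carol,29,approved"]
v2 = ["alice,34,approved", "carol,29,approved"]

# Artifact level: we learn only that the file is different.
artifact_changed = artifact_fingerprint(v1) != artifact_fingerprint(v2)

# Row level: we can name exactly which rows disappeared.
v2_hashes = set(row_fingerprints(v2))
removed = [row for row, fp in zip(v1, row_fingerprints(v1)) if fp not in v2_hashes]

print(artifact_changed)
print(removed)
```

An erasure audit needs the second kind of answer: not just "the dataset changed" but "this specific record is gone".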
2020
DVC + DSLog
DVC adds Git-like versioning for datasets. DSLog tracks operator-level provenance for NumPy workloads. Both excellent at their stage - neither crosses the boundary into training or attribution.
2023
TRAK: Attribution at Scale
Park et al. (MIT, ICML 2023) introduce TRAK - training data attribution via sparse Johnson-Lindenstrauss (JL) projection. Linear datamodeling score (LDS) of 0.0290 on CIFAR-2 in 691 seconds on GPU. Excellent attribution quality. Operates entirely on post-preprocessing tensors with no knowledge of source files.
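The projection idea can be sketched in a few lines of NumPy. This is not TRAK's implementation: it covers only the random-projection step, swaps in a dense Gaussian matrix for the sparse projection, and uses random vectors as stand-ins for per-example gradients. All sizes and variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_train, d, k = 50, 4096, 256  # examples, gradient dimension, projected dimension

# Stand-in per-example gradients; in practice these would come from backprop.
train_grads = rng.standard_normal((n_train, d))

# A test gradient built to resemble training example 7.
test_grad = train_grads[7] + 0.1 * rng.standard_normal(d)

# Dense Gaussian JL matrix, scaled so inner products are preserved in expectation.
P = rng.standard_normal((d, k)) / np.sqrt(k)

train_proj = train_grads @ P  # shape (n_train, k): the compressed gradients to store
test_proj = test_grad @ P     # shape (k,)

# Attribution score here is simply gradient similarity (inner product).
exact_scores = train_grads @ test_grad
approx_scores = train_proj @ test_proj

top_exact = int(np.argmax(exact_scores))
top_approx = int(np.argmax(approx_scores))
print(top_exact, top_approx)
```

Storing 256 numbers per example instead of 4096 still ranks the same training example first. Note that the scores refer to tensor indices only; nothing here knows which source file or row produced example 7.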
2024
EU AI Act Published
Regulation (EU) 2024/1689 is published. Article 26 requires audit trails for high-risk AI. Article 13 requires transparency. GDPR Article 17 erasure obligations are already in force. The compliance clock starts.
May 2026
Traceprop Released (Zenodo Preprint)
First unified system connecting source files through preprocessing and training to predictions. Three layers: ProvenanceTensor lineage, JL-projected GradientStore attribution, and provenance-guided approximate unlearning. Reported overhead is under 1% at 1M elements. Install with: pip install traceprop
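To make the lineage idea concrete, here is a minimal sketch of values that carry their source rows through preprocessing. It is not Traceprop's API: the `TrackedRows` class, its methods, and the file names are hypothetical and exist only to show the general technique of propagating origins alongside data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackedRows:
    """Values paired with the (source_file, row_index) each one came from."""

    values: list
    origins: list  # same length as values

    @classmethod
    def from_source(cls, source_file, values):
        values = list(values)
        return cls(values, [(source_file, i) for i in range(len(values))])

    def concat(self, other):
        """Join two tracked collections, keeping origins aligned."""
        return TrackedRows(self.values + other.values, self.origins + other.origins)

    def filter(self, keep):
        """Drop rows; surviving rows keep their original origins."""
        pairs = [(v, o) for v, o in zip(self.values, self.origins) if keep(v)]
        return TrackedRows([v for v, _ in pairs], [o for _, o in pairs])

    def map(self, fn):
        """Transform values element-wise; origins are unchanged."""
        return TrackedRows([fn(v) for v in self.values], list(self.origins))


# Hypothetical source files and values.
a = TrackedRows.from_source("loans_2023.csv", [12.0, -1.0, 40.0])
b = TrackedRows.from_source("loans_2024.csv", [7.0, 55.0])

# Typical preprocessing: merge, drop invalid rows, rescale.
batch = a.concat(b).filter(lambda v: v >= 0).map(lambda v: v / 100)

print(batch.origins)
```

After filtering, position 2 in the batch no longer matches row 2 of any file, which is why the origin has to travel with the value instead of being inferred from its index.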
August 2026
EU AI Act Article 26 - New Systems
Logging obligations enter into force for new high-risk AI systems. Systems used for credit scoring, HR decisions, education, and critical infrastructure must maintain audit trails. Systems deployed after this date need audit trail infrastructure from day one.
2 December 2027
EU AI Act Backstop Enforcement
All deployed high-risk AI systems must be compliant. Systems deployed before August 2026 have until this date to retrofit audit trail infrastructure. The roughly 16 months between the two deadlines is shorter than many ML infrastructure projects.
www.engineersofai.com - AI Letters #33