
AI Letters #33 - We Built Traceprop: Finally, an ML Audit Trail That Answers the Regulator's Question

12 min read
EngineersOfAI
AI Engineering Education

We spent months auditing ML pipelines across regulated industries. Every single one had the same gap: source files on one side, model predictions on the other, and nothing connecting them. MLflow knew which file. DVC knew which commit. Influence libraries knew which tensor. Nobody knew which source row drove which decision. We built Traceprop to fix this. Today it's open source.


A credit-scoring model declines an application. The regulator invokes Article 26 of the EU AI Act (Regulation (EU) 2024/1689). They want three things: which training records drove that decision, whether those records were processed correctly, and whether the institution can reduce their influence without full retraining.

We watched a well-resourced ML team try to answer this question. They had MLflow for experiment tracking, DVC for dataset versioning, and a state-of-the-art influence function library. It took them eleven days and they still couldn't produce a defensible answer. MLflow knew which file was used - not which rows. DVC knew which commit - not which preprocessing steps were applied to specific rows. The influence library operated on already-processed tensors with no knowledge of which source row produced each one.

That team is not an outlier. That gap is the default state of every ML pipeline that hasn't explicitly engineered a lineage layer. We built Traceprop to close it permanently.

Why We Built This

We didn't set out to build a compliance tool. We set out to answer a question that kept coming up in every production ML system we worked on: if a model makes a bad decision, can you trace it back to the training data that caused it?

The answer was always no. Not because engineers were being lazy. Because the tools were architecturally incapable of answering it. Each tool stopped at its own boundary and handed off to nothing.

THE PROVENANCE GAP - what each tool actually covers

MLflow/DVC     [experiment metadata] -> [dataset file]           <- stops here
Preprocessing  [data loaded] -> [transform 1] -> [transform 2]   <- stops here
Attribution    [tensor indices] -> [influence scores]            <- starts here
Source rows    [credit_scores.csv row 4821]                      <- nobody connects this to anything above

We needed a system that treats the entire pipeline - from raw file row to final prediction - as a single traceable object. That system didn't exist. So we built it.

What Traceprop Is

Traceprop is a Python library that introduces one new concept: the ProvenanceTensor. Every array in your pipeline becomes a ProvenanceTensor when loaded through Traceprop. It wraps the underlying NumPy or PyTorch array and records a directed acyclic graph of every operation applied to it, with source-file row annotations at the leaves.

You change two lines of code. Everything else stays the same.

import traceprop as tp

# Change: tp.load_csv instead of pd.read_csv
X = tp.load_csv("credit_scores.csv") # now a ProvenanceTensor

# Everything else is identical to your existing code
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
X_filt = X_norm[X_norm[:, 3] > 0]

# New capability: query provenance instantly
X_filt.sources()    # -> {"credit_scores.csv": rows 0-4998}
X_filt.ops()        # -> [normalize, row_filter]
X_filt.ancestors()  # -> full lineage DAG (depth 1000 resolved in 0.42 ms)
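
For intuition, here is a toy sketch of the general pattern behind a lineage-recording wrapper: each node stores the operation name, references to its parents, and source-file annotations at the leaves. This illustrates the idea only - it is not Traceprop's implementation, and the class and method names are made up:

import numpy as np

class LineageNode:
    """Toy operation-DAG node - illustrative, not Traceprop's internals."""
    def __init__(self, data, op, parents=(), source=None):
        self.data = data        # underlying NumPy array
        self.op = op            # operation that produced this node
        self.parents = parents  # upstream LineageNode references
        self.source = source    # e.g. ("credit_scores.csv", rows) at leaves

    def apply(self, op_name, fn):
        # Record the operation and link the result back to this node
        return LineageNode(fn(self.data), op_name, parents=(self,))

    def ops(self):
        # Walk back to the leaf, collecting the operation chain
        chain, node = [], self
        while node.parents:
            chain.append(node.op)
            node = node.parents[0]
        return list(reversed(chain))

leaf = LineageNode(np.arange(12.0).reshape(4, 3), op="load",
                   source=("credit_scores.csv", range(4)))
norm = leaf.apply("normalize", lambda a: (a - a.mean(0)) / a.std(0))
filt = norm.apply("row_filter", lambda a: a[a[:, 0] > 0])
print(filt.ops())  # ['normalize', 'row_filter']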

The overhead is under 1%. At 10^6 array elements the slowdown factor is 1.007x on macOS and 0.979x on Linux. The sub-unity factor on Linux is real: Traceprop's batch-aware memory layout improves cache locality enough that the lineage-tracked pipeline runs slightly faster than raw NumPy at that scale.
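
If you want to sanity-check those numbers on your own hardware, a rough harness like this works. The methodology here is assumed, not the paper's exact benchmark setup:

import timeit
import numpy as np
import pandas as pd
import traceprop as tp

# 100,000 rows x 10 columns = 10^6 array elements, matching the claim above
pd.DataFrame(np.random.rand(100_000, 10)).to_csv("bench.csv", index=False)
X_np = pd.read_csv("bench.csv").to_numpy()
X_tp = tp.load_csv("bench.csv")  # ProvenanceTensor over the same data

normalize = lambda X: (X - X.mean(axis=0)) / X.std(axis=0)
t_raw = timeit.timeit(lambda: normalize(X_np), number=100)
t_tp  = timeit.timeit(lambda: normalize(X_tp), number=100)
print(f"overhead factor: {t_tp / t_raw:.3f}x")  # paper reports 1.007x (macOS) / 0.979x (Linux)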

We are not asking you to rewrite your pipeline. We are asking you to change two lines and get an audit trail.

The Attribution Layer: Connecting Predictions to Source Rows

Lineage tells you which source rows a tensor came from. Attribution tells you which training samples most influenced a specific prediction. Connecting the two - so you can go from a declined application all the way back to the exact CSV row that drove it - is the core engineering contribution of Traceprop.

The naive approach fails immediately: storing one full-parameter gradient per training sample costs roughly 24 TB for a ResNet-9 at 1M samples. We use sparse Johnson-Lindenstrauss projection to compress gradients to k dimensions. At k=4096 the GradientStore costs 15.3 GB for 1M samples, which fits on a standard cloud instance. The JL distortion bound (epsilon ~= 0.18 at k=4096) is proven, not empirical: the top-k attribution set is correct with high probability.
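
To make the projection concrete, here is a minimal sketch of a sparse JL-style compression of a flat gradient vector. This is an illustrative CountSketch-style stand-in, not Traceprop's internals:

import numpy as np

def sparse_jl_project(grad, k=4096, seed=0):
    """CountSketch-style sparse sign projection - an illustrative
    stand-in for the sparse JL step, not Traceprop's internals."""
    rng = np.random.default_rng(seed)        # fixed seed: every sample shares one projection
    d = grad.shape[0]
    buckets = rng.integers(0, k, size=d)     # destination dimension per gradient coordinate
    signs = rng.choice((-1.0, 1.0), size=d)  # random sign per coordinate
    out = np.zeros(k)
    np.add.at(out, buckets, signs * grad)    # scatter-add each coordinate into its bucket
    return out

# Storage arithmetic from the paragraph above:
#   full gradients: 1e6 samples x ~6e6 params x 4 B ~ 24 TB
#   projected:      1e6 samples x 4096 x 4 B        ~ 15.3 GiB
g = np.random.randn(6_000_000).astype(np.float32)
print(sparse_jl_project(g).shape)  # (4096,)

In the library itself, capturing the projected per-sample gradients is a one-line change inside the training loop: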

from traceprop.attribution import TrainingContext, GradientStore, compute_influence_scores

store = GradientStore(k=4096, path="./grad_store/")

# Wrap your training loop - that's all
with TrainingContext(model, store) as ctx:
    for epoch in range(num_epochs):
        for batch_idx, (X_batch, y_batch) in enumerate(loader):
            optimizer.zero_grad()
            loss = criterion(model(X_batch), y_batch)
            ctx.backward(loss, batch_idx=batch_idx)  # one change
            optimizer.step()

# Now answer the audit question
scores = compute_influence_scores(model, store, declined_application, top_k=20)
for sample_idx, score in scores[:5]:
    provenance = store.get_provenance(sample_idx)
    print(provenance.trace_to_file())
    # -> credit_scores.csv, row 4821, influence score: 0.921
    # -> credit_scores.csv, row 2103, influence score: 0.887

The benchmark numbers are honest about where Traceprop wins and where it doesn't.

For tabular models - which dominate regulated industries - Traceprop is the right tool with no caveats: a linear datamodeling score (LDS) of 0.622 at 0.22 seconds on CPU, no GPU required, full source-file traceability. This is the setup that matters for credit scoring, insurance underwriting, and HR decisions.

For deep vision with BatchNorm, TRAK (Park et al., 2023) achieves better attribution quality (LDS 0.0290 in 691 seconds on GPU). Traceprop-LL, the last-layer variant, achieves LDS 0.0168 in 2.6 seconds on CPU - 266x faster, at lower quality. The degradation comes from BatchNorm encoding batch statistics into last-layer features, corrupting the per-sample gradient signal. For image models, use Traceprop for lineage and unlearning, and TRAK for attribution quality when you have GPU budget.

We are telling you exactly where we beat existing tools and where we don't. If a library doesn't do this, treat its benchmark numbers as marketing.

The Unlearning Layer: GDPR Erasure That Actually Works

GDPR Article 17 gives individuals the right to have their personal data erased from trained models. No existing tool connected "which CSV rows belong to this data subject" to "which training tensor indices to unlearn" automatically. You had to do it by hand, with no consistency guarantees. We automated the entire chain.

from traceprop.unlearn import approximate_unlearn, export_compliance

# GDPR erasure request - source rows map automatically to tensor indices
forget_set = store.samples_from_source("credit_scores.csv", rows=[4821, 7203, 9100])

# Gradient correction targets exactly the highest-influence samples
theta_prime = approximate_unlearn(model, forget_set, eta=0.01, steps=10)

# Export Article 26 compliance certificate
report = export_compliance(
    model_before=model, model_after=theta_prime,
    forget_set=forget_set, store=store,
    regulation="EU_AI_ACT_ART26",
)
report.save("unlearning_certificate.json")

The results against the standard benchmark (binary classification, n=1000, forget set of 50 highest-influence samples):

METHOD                     FORGET-SET LOSS   TEST ACC   GAP CLOSED
Original (no unlearning)   0.379             0.920      0%
Gold (retrain-scratch)     0.401             0.918      100%
Traceprop unlearning       0.425             0.915      >100%
Random unlearning          0.382             0.915      14%

Traceprop exceeds the retrain-from-scratch gold standard. Random unlearning closes 14% of the gap. That 7x difference is entirely because we know which samples are highest-influence and target them specifically. Without attribution, you are unlearning the wrong samples.
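
A note on the GAP CLOSED column: it is recoverable from the forget-set losses in the table. Here is the arithmetic under the natural reading (our reconstruction, not a formula quoted from the paper):

original, gold = 0.379, 0.401  # forget-set loss: no unlearning vs retrain-from-scratch

def gap_closed(loss):
    # Fraction of the original-to-gold loss gap recovered by a method
    return (loss - original) / (gold - original)

print(f"Traceprop: {gap_closed(0.425):.0%}")  # 209% -> reported as >100%
print(f"Random:    {gap_closed(0.382):.0%}")  # 14%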

The gradient correction is first-order approximate - we document this clearly. There is no formal differential privacy guarantee. What there is: a verifiable, measurable effect on model behavior, traceable to specific source rows, exported in a format regulators can inspect.
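
For concreteness, this is what a first-order unlearning step of this family generally looks like: a few gradient-ascent steps on the forget-set loss. Treat it as a generic sketch of the technique, not approximate_unlearn's exact update rule:

import torch

def first_order_unlearn(model, forget_loader, criterion, eta=0.01, steps=10):
    """Generic first-order unlearning sketch: ascend the loss on the
    forget set so the model fits those samples worse. Illustrative only."""
    for _ in range(steps):
        for X, y in forget_loader:
            model.zero_grad()
            loss = criterion(model(X), y)
            loss.backward()
            with torch.no_grad():
                for p in model.parameters():
                    if p.grad is not None:
                        p.add_(eta * p.grad)  # gradient *ascent* on the forget set
    return model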

The Multi-Source Case

Real pipelines are not single-CSV pipelines. We tested Traceprop on a 3-table credit risk pipeline: application data, credit bureau data, previous application history. 180,000 source rows total. 20,000 applicants.

SOURCE TABLE               ROWS     ATTRIBUTION WEIGHT
application.csv            20,000   0.424
bureau.csv                 80,000   0.426
previous_application.csv   80,000   0.434

ETL overhead: 2.93x (paid once at ingestion)
Query latency: 2.36ms (full attribution + source resolution across all 3 tables)

2.36 milliseconds to answer "which rows in which table drove this decision, through which preprocessing steps." The ETL overhead is paid once at ingestion. Query time has no pipeline complexity penalty.
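
For reference, the multi-source query could look like the sketch below, reusing the load/sources API shown earlier. The tp.merge join helper and the applicant_id key are hypothetical stand-ins for whatever join step your pipeline actually uses:

import traceprop as tp

app  = tp.load_csv("application.csv")
bur  = tp.load_csv("bureau.csv")
prev = tp.load_csv("previous_application.csv")

# Hypothetical join helper - substitute your real join/feature-engineering step
X = tp.merge([app, bur, prev], on="applicant_id")

X.sources()
# -> {"application.csv": ..., "bureau.csv": ..., "previous_application.csv": ...}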

The Enforcement Dates

EU AI Act Article 26 logging obligations apply from 2 August 2026 for new high-risk AI systems. The backstop enforcement date for all deployed high-risk systems is 2 December 2027. GDPR Article 17 erasure obligations are already in force.

High-risk AI systems under the Act include: credit scoring, employment decisions, educational assessment, critical infrastructure management, biometric identification. If you are building any of these, the compliance question is not whether you need this infrastructure. It is how much of the gap you have already closed.

Most teams we've talked to have closed zero percent of it. They are planning to "deal with compliance later." Later is August 2026. That is under four months away.

Why We're Open-Sourcing It

We built this for our own work. Then we realized the gap was universal - every ML team in a regulated domain was hitting the same wall. Keeping a proprietary solution while the industry ships non-compliant models would be the wrong call.

Traceprop is Apache 2.0. The preprint is on Zenodo (DOI: 10.5281/zenodo.20036000). The implementation is designed for incremental adoption - you can use only the lineage layer, only attribution, or the full stack. Start with the two-line change and expand from there.

pip install traceprop

That's the starting point. The preprint has full architectural documentation, benchmark methodology, and implementation notes for production deployment.

What to Do Right Now

1. Install Traceprop and run the lineage layer on your next pipeline. Two lines of code change. Sub-1% overhead. You get a full audit trail from source file rows through every preprocessing operation. This is the minimum viable compliance step and costs you almost nothing.

2. If you're in a regulated industry, benchmark attribution on your tabular models today. LDS 0.622 at 0.22 seconds on CPU. No GPU. No infrastructure changes. If your pipeline is tabular (credit, insurance, HR), Traceprop-LL is the right attribution tool right now, not a future option.

3. Map your GDPR erasure workflow to the unlearning layer. The automatic source-row-to-tensor-index mapping is the piece that takes a manual 11-day process and makes it a 10-second operation. That alone justifies the integration.

4. Read the enforcement deadlines again. August 2026 for new high-risk systems. Four months. The architectural decisions you make this quarter will determine whether your system can answer a regulatory audit question when the clock runs out.

5. Share this with the compliance and legal team. The compliance certificate export (export_compliance(..., regulation="EU_AI_ACT_ART26")) produces a JSON document auditors can inspect directly. This is documentation your legal team needs to see before your next system deployment.

The 2 December 2027 backstop deadline looks distant. August 2026 does not. We built Traceprop so teams don't have to spend eleven days manually stitching together three tool outputs and still come up empty. Install it. Use it. The gap is closed.



Preprint: DOI 10.5281/zenodo.20036000 - Apache 2.0 - pip install traceprop

Engineers of AI

Read more: www.engineersofai.com

If this was useful, forward it to one engineer who should be reading it.

Want to Think Like an AI Architect?

Join engineers receiving weekly breakdowns of AI systems, production failures, and architectural decisions.