Traceprop: All Benchmarks

Preprint: Zenodo DOI 10.5281/zenodo.20036000 - May 2026

- 1.007x — lineage overhead at 1M elements (macOS op mode)
- 0.622 — LDS on tabular Adult Income (logistic regression, CPU)
- 266x — faster than TRAK on CIFAR-2 (CPU vs. GPU)
- >100% — gap closed in approximate unlearning vs. the gold standard
Lineage Overhead Ratio (tracked / baseline)
Lower is better. Ratios below 1.05x are acceptable for production; the sub-unity ratios on Linux come from a cache-locality gain.
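The overhead ratio is simply tracked runtime divided by baseline runtime. A minimal measurement sketch in pure Python; the op-mode bookkeeping below is an invented stand-in, not Traceprop's actual implementation:

```python
import timeit

def baseline_op(data):
    # Plain elementwise transform: no lineage bookkeeping.
    return [x * 2.0 for x in data]

def tracked_op(data, lineage):
    # Hypothetical tracked variant: op mode records one lineage entry
    # per operation, not one per element, so the overhead stays tiny.
    out = [x * 2.0 for x in data]
    lineage.append(("mul", len(data)))
    return out

data = list(range(100_000))
lineage = []
t_base = timeit.timeit(lambda: baseline_op(data), number=20)
t_tracked = timeit.timeit(lambda: tracked_op(data, lineage), number=20)
ratio = t_tracked / t_base  # values near 1.0 mean negligible lineage cost
print(f"overhead ratio: {ratio:.3f}x")
```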
Attribution Quality: LDS (Spearman rho)
Higher is better. On tabular data, Traceprop-LL matches TRAK's quality at 266x lower cost.
Unlearning Effectiveness
Forget-set cross-entropy loss (higher = better unlearning). The accompanying test-accuracy delta is a 0.5 pp drop, an acceptable cost.
Attribution Quality Table
CIFAR-2/ResNet-9 (500 subsets). Traceprop-LL on CPU is 266x faster than TRAK on GPU.
| Method | LDS | Time | Hardware |
| --- | --- | --- | --- |
| TRAK (5 ckpts) | 0.0290 | 691 s | GPU T4 |
| Traceprop-LL | 0.0168 | 2.6 s | CPU |
| Traceprop-BM | 0.0033 | 14.2 s | CPU |
| Random | 0.0205 | <0.001 s | - |
| Tabular: TP-LL | 0.622 | 0.22 s | CPU |
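LDS here is the Spearman rank correlation between predicted subset influence (the sum of per-sample attributions) and the measured loss change when retraining on that subset. A self-contained sketch of the correlation itself, on made-up numbers:

```python
def rank(xs):
    # Average ranks (1-based) with tie handling, as Spearman requires.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(a, b):
    # Spearman rho = Pearson correlation of the rank vectors.
    ra, rb = rank(a), rank(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

# Toy LDS-style check: predicted attribution sums vs. measured loss changes
predicted = [0.9, 0.1, 0.4, 0.7, 0.2]   # per held-out subset
measured  = [1.2, 0.3, 0.5, 1.0, 0.2]
rho = spearman_rho(predicted, measured)  # ≈ 0.9: strong rank agreement
print(rho)
```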
Query Latency (macOS, single thread)
Sub-millisecond for all practical pipeline depths.
| Query | Depth | Latency |
| --- | --- | --- |
| sources() | 1 | <0.001 ms |
| ops() | 1 | <0.001 ms |
| ancestors() | 10 | 0.004 ms |
| ancestors() | 100 | 0.041 ms |
| ancestors() | 1000 | 0.420 ms |
| trace_to_file() | multi-source | 2.36 ms |
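The depth scaling in the latencies above is what you would expect from a breadth-first walk up a lineage DAG. A toy illustration; the graph and the `ancestors()` signature here are invented for exposition, not Traceprop's API:

```python
from collections import deque

# Toy lineage DAG: node -> immediate parents (source nodes have none).
parents = {
    "pred": ["feat_a", "feat_b"],
    "feat_a": ["raw_1"],
    "feat_b": ["raw_1", "raw_2"],
    "raw_1": [],
    "raw_2": [],
}

def ancestors(node, max_depth=None):
    # BFS up the DAG; cost grows with depth, matching the table above.
    seen, frontier, depth = set(), deque([node]), 0
    while frontier and (max_depth is None or depth < max_depth):
        next_frontier = deque()
        for n in frontier:
            for p in parents.get(n, []):
                if p not in seen:
                    seen.add(p)
                    next_frontier.append(p)
        frontier, depth = next_frontier, depth + 1
    return seen

print(sorted(ancestors("pred")))  # ['feat_a', 'feat_b', 'raw_1', 'raw_2']
```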
Multi-Source Case Study: 3-Table Credit Risk Pipeline
20,000 applicants, 180,000 total source rows across 3 tables. One-time ETL overhead enables sub-millisecond query-time provenance.
| Metric | Value | Notes |
| --- | --- | --- |
| Source tables | 3 | application, bureau, previous_application |
| Total source rows | 180,000 | across all 3 tables |
| Avg source rows / training sample | 8.5 | after join aggregation |
| ETL baseline time | 0.100 s | raw pandas |
| Traceprop ETL time | 0.293 s | 2.93x overhead, paid once at ingestion |
| ancestors() latency | 0.003 ms | query time |
| trace_to_file() latency | 2.36 ms | full attribution + source resolution |
| Attribution contribution | 0.424 / 0.426 / 0.434 | per table, comparably distributed |
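The per-sample source-row lineage in a pipeline like this can be pictured as carrying source row ids through the join/aggregation step. A hedged pandas sketch with invented table and column names:

```python
import pandas as pd

# Toy 2-of-3-table ETL mirroring the case study; names are illustrative.
application = pd.DataFrame({"app_id": [1, 2], "income": [50, 80]})
bureau = pd.DataFrame({"app_id": [1, 1, 2], "debt": [5, 3, 9]})
bureau["bureau_row"] = bureau.index  # remember source row before aggregation

# Aggregate bureau per applicant, keeping contributing row ids as lineage.
agg = bureau.groupby("app_id").agg(
    total_debt=("debt", "sum"),
    bureau_rows=("bureau_row", lambda s: [int(v) for v in s]),
).reset_index()

train = application.merge(agg, on="app_id", how="left")
print(train.loc[train.app_id == 1, "bureau_rows"].iloc[0])  # [0, 1]
```

The lineage column is what makes a later `trace_to_file()`-style lookup a dictionary read instead of a re-join.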
Tabular Models: Traceprop is the Right Choice
Traceprop-LL reaches LDS 0.622 in 0.22 s on CPU, with full source-file traceability. Tabular models dominate regulated industries (credit, insurance, HR); for these use cases, Traceprop-LL provides TRAK-quality attribution without GPU infrastructure or source-file blindness.
Deep Vision: Use TRAK When GPU is Available
On ResNet-9 with BatchNorm, TRAK achieves LDS 0.0290 (GPU) vs Traceprop-LL 0.0168 (CPU). BatchNorm encodes batch statistics in last-layer features, degrading per-sample gradient signal. For image models, Traceprop provides lineage and unlearning; use TRAK for attribution quality.
Provenance-Guided Unlearning Outperforms Random by 6x
Random unlearning closes 14% of the gap; provenance-guided unlearning closes >100%. The difference is attribution: knowing which samples are highest-influence and targeting them, rather than randomly sampling a forget set. The gradient correction is a first-order approximation and carries no formal differential-privacy guarantee.
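As a rough picture of a first-order gradient correction: removing a sample approximately undoes its descent step, i.e. one gradient-ascent step on the forget samples' loss. A toy logistic-regression sketch; the names and step rule are assumptions for illustration, not the paper's exact update:

```python
import math

def grad(w, x, y):
    # Cross-entropy gradient for one sample (y in {0, 1}).
    p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
    return [(p - y) * xi for xi in x]

def forget_loss(w, samples):
    # Cross-entropy on the forget set; higher = better unlearning.
    total = 0.0
    for x, y in samples:
        p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
        total += -math.log(p if y == 1 else 1.0 - p)
    return total

def unlearn_first_order(w, forget, lr=0.1):
    # w' = w + lr * grad(forget sample): ascend the forget-set loss,
    # approximately cancelling the descent steps those samples caused.
    for x, y in forget:
        g = grad(w, x, y)
        w = [wi + lr * gi for wi, gi in zip(w, g)]
    return w

w = [0.5, -0.2]
forget = [([1.0, 2.0], 1), ([0.5, -1.0], 0)]
before = forget_loss(w, forget)
w_after = unlearn_first_order(w, forget)
after = forget_loss(w_after, forget)  # should rise after the correction
```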
GradientStore Memory: Practical at Scale
k=4096, 1M samples: 15.3 GB, which fits on a standard cloud instance. For larger datasets, numpy.memmap enables disk-backed storage. At k=512, memory drops to 1.91 GB at some cost in attribution quality. The JL distortion bound (epsilon ~= 0.18 at k=4096) is proven, not empirical.
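The memory figures follow directly from n_samples x k float32 entries at 4 bytes each; a quick arithmetic check:

```python
def store_gib(n_samples, k, dtype_bytes=4):
    # GradientStore footprint: one k-dim float32 projection per sample.
    return n_samples * k * dtype_bytes / 2**30

print(f"{store_gib(1_000_000, 4096):.2f} GiB")  # 15.26 GiB at k=4096
print(f"{store_gib(1_000_000, 512):.2f} GiB")   # 1.91 GiB at k=512
# Disk-backed alternative for larger datasets:
# np.memmap(path, dtype="float32", mode="w+", shape=(n_samples, k))
```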
www.engineersofai.com - AI Letters #33 - Traceprop preprint: DOI 10.5281/zenodo.20036000