Null-Calibrated Evaluation of Sparse Autoencoder Decoder Reproducibility
Abstract
Sparse autoencoders (SAEs) are often evaluated by reconstruction loss, but interpretability workflows also require that learned dictionaries be reproducible across random seeds and robust to evaluation artifacts. We study SAE decoder reproducibility as a benchmark-design problem: every stability score is reported against a metric-specific random-dictionary null, pairwise seed statistics are treated as dependent, and decoder geometry is audited with assignment-based, activation-level, firing-overlap, causal, streaming, and synthetic-ground-truth controls. In compute-limited cached-activation regimes, reconstruction can appear converged while decoder-column similarity remains within 1.5% of the geometric null; longer training raises decoder agreement, but activation and functional diagnostics lag. These results argue that SAE benchmarks should report reconstruction, null-calibrated decoder matching, held-out activation agreement, and ground-truth or downstream checks together rather than treating reconstruction or a single stability metric as sufficient.
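As a rough illustration of what reporting a stability score against a random-dictionary null can look like, the sketch below compares two decoder matrices and calibrates the result against independently drawn random dictionaries of the same shape. It is a minimal sketch under stated assumptions, not the evaluation protocol used in this work: the function names are hypothetical, the null is assumed to be i.i.d. Gaussian columns, the score is a simple mean-max absolute cosine similarity rather than an assignment-based match, and the two "seeds" are synthetic toy matrices.

```python
import numpy as np

def unit_columns(W):
    """Normalize each decoder column (one feature direction) to unit L2 norm."""
    return W / np.linalg.norm(W, axis=0, keepdims=True)

def mean_max_cosine(A, B):
    """Mean over columns of A of the best |cosine| match among columns of B."""
    sims = np.abs(unit_columns(A).T @ unit_columns(B))
    return float(sims.max(axis=1).mean())

def random_dictionary_null(d_model, n_features, n_draws, rng):
    """Scores for pairs of independent Gaussian dictionaries (assumed null)."""
    scores = []
    for _ in range(n_draws):
        A = rng.standard_normal((d_model, n_features))
        B = rng.standard_normal((d_model, n_features))
        scores.append(mean_max_cosine(A, B))
    return np.array(scores)

def null_calibrated(score, null_scores):
    """Return the null mean and the relative excess of `score` over it."""
    null_mean = float(null_scores.mean())
    return null_mean, (score - null_mean) / null_mean

rng = np.random.default_rng(0)
d_model, n_features = 64, 256  # toy sizes, chosen only for illustration

# Synthetic stand-ins for decoders from two seeds: a shared dictionary
# plus independent per-seed noise (not real SAE weights).
shared = rng.standard_normal((d_model, n_features))
seed_a = shared + 0.1 * rng.standard_normal((d_model, n_features))
seed_b = shared + 0.1 * rng.standard_normal((d_model, n_features))

null_scores = random_dictionary_null(d_model, n_features, n_draws=20, rng=rng)
raw = mean_max_cosine(seed_a, seed_b)
null_mean, excess = null_calibrated(raw, null_scores)
print(f"raw={raw:.3f} null_mean={null_mean:.3f} relative_excess={excess:.3f}")
```

The point of the calibration step is that the null depends on the metric and on the dictionary shape, so a raw similarity is only interpretable relative to the null computed for the same metric, width, and feature count.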