Breaking the Block: Preserving Data Continuity to Train Superior SAEs for Instruct Models
Jiaming Li ⋅ Haoran Ye ⋅ Yukun Chen ⋅ Xinyue Li ⋅ Lei Zhang ⋅ Hamid Alinejad-Rokny ⋅ Jimmy Chih-Hsien Peng ⋅ Min Yang
Abstract
Sparse Autoencoders (SAEs) have become a cornerstone of mechanistic interpretability. However, current training methods inherit the Block Training paradigm from LLM pre-training, which we identify as a critical methodological oversight when applied to instruct models. Theoretically, using gradient signal-to-noise ratio (GSNR) analysis, we prove that attention leakage from unrelated contexts introduces destructive gradient noise. To rectify this paradigm, we propose $\underline{\textbf{F}}$inetuning-$\underline{\textbf{a}}$ligned $\underline{\textbf{S}}$equential $\underline{\textbf{T}}$raining ($\textit{FAST}$), a novel training method specifically tailored to instruct models. $\textit{FAST}$ aligns the training process with the data distribution and activation patterns of instruct models, achieving substantial improvements in both reconstruction and feature interpretability. Experimental results validate the efficacy of $\textit{FAST}$. GSNR analysis confirms the improved training dynamics, with $\textit{FAST}$ exhibiting a higher GSNR than block training. This translates into superior reconstruction fidelity: $\textit{FAST}$ achieves an MSE of 0.6468, significantly outperforming the baseline's 5.1985, while maintaining a near-zero Delta Loss (-0.51% to 0.37%). Feature quality is also markedly enhanced: on Llama-3.2-3B-it, $\textit{FAST}$ yields 21.1% high-quality features, surpassing the 7.0% and 10.2% achieved by the baselines. Surprisingly, we discover that intervening on special-token activations via SAEs improves output quality, suggesting new opportunities for fine-grained control. Code, data, and all 240 trained SAEs will be publicly released.
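For reference, the gradient signal-to-noise ratio invoked above has a standard per-parameter definition in the generalization literature; we assume the paper adopts this convention, where $g_j(x)$ denotes the gradient of the loss with respect to parameter $\theta_j$ on a single sample $x$ drawn from the training distribution $\mathcal{D}$:

$$
\mathrm{GSNR}(\theta_j) \;=\; \frac{\left(\mathbb{E}_{x \sim \mathcal{D}}\!\left[g_j(x)\right]\right)^{2}}{\mathrm{Var}_{x \sim \mathcal{D}}\!\left[g_j(x)\right]}
$$

Under this definition, gradient noise injected by attention leakage from unrelated contexts inflates the variance in the denominator without adding to the mean signal in the numerator, which is why such leakage directly lowers the GSNR.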