SVD as a Fast Interpretability Method for Transformers
Abstract
Mechanistic interpretability of Transformer models commonly relies on training auxiliary proxy models, such as Sparse Autoencoders or Cross-Layer Transcoders. While effective, these post-hoc approaches introduce approximation bias and incur substantial computational overhead. We propose an alternative, training-free interpretability framework that directly exploits the Singular Value Decomposition (SVD) of weight matrices in Transformer MLP sublayers. By operating natively on model parameters, our method improves scalability while preserving fidelity to the original weights. We show that the projection matrices of MLP sublayers admit a natural decomposition into orthogonal, interpretable rank-1 subspaces, which we term Detector-Effector Units (DEUs). Within each unit, one singular vector acts as a detector of input patterns, and its detection strength modulates a coupled effector vector that encodes output semantics. Building on this structure, we introduce Subspace Contribution Analysis (SCA), a diagnostic method that quantifies the direct causal contribution of individual native subspaces to model predictions. Experiments across the GPT-2 family demonstrate that our framework, Native Network Anatomy (NaNA), identifies dominant functional pathways with orders-of-magnitude efficiency gains over training-based interpretability baselines, while maintaining weight fidelity. Our results suggest that SVD-based analyses provide a scalable and faithful alternative to learned proxy approaches for mechanistic interpretability.
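To make the core idea concrete, the sketch below shows one plausible reading of the DEU decomposition described above: taking the SVD of a GPT-2 MLP projection matrix and treating each singular triplet as a rank-1 detector/effector pair. This is an illustrative example only, not the authors' implementation; the specific layer, projection matrix (here the input projection `c_fc`), and helper name `deu_components` are assumptions for the sake of the demonstration.

```python
# Minimal sketch (assumptions noted above): rank-1 detector/effector pairs from
# the SVD of a GPT-2 MLP input projection, using PyTorch and Hugging Face weights.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
# GPT-2 stores the MLP input projection as a Conv1D with weight shape (d_model, d_ff).
W_in = model.transformer.h[0].mlp.c_fc.weight.detach()  # (768, 3072)

# SVD: W_in = U diag(S) V^T. Each singular triplet (u_k, s_k, v_k) spans an
# orthogonal rank-1 subspace: u_k reads ("detects") a direction in the residual
# stream, s_k is its gain, and v_k writes ("effects") a direction in MLP space.
U, S, Vh = torch.linalg.svd(W_in, full_matrices=False)

def deu_components(U, S, Vh, k):
    """Return the k-th detector direction, gain, and effector direction."""
    return U[:, k], S[k], Vh[k, :]

detector, gain, effector = deu_components(U, S, Vh, 0)

# For a residual-stream vector x, the k-th unit contributes
# (x . u_k) * s_k * v_k to the MLP pre-activation: a scalar detection score
# modulating a fixed output pattern.
x = torch.randn(768)
contribution = (x @ detector) * gain * effector
print(contribution.shape)  # torch.Size([3072])
```

Because the singular vectors are orthonormal, these rank-1 contributions sum exactly to the original matrix-vector product, which is what allows a per-subspace contribution analysis without training any proxy model.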