Interpretability Transfer from Language to Vision via Sparse Autoencoders
Abstract
Recent advances in language model interpretability using sparse autoencoders (SAEs) have yet to translate effectively to the visual domain, mainly due to the difficulty and ambiguity of labeling visual concepts. In this paper, we introduce Visual Interpretability via SAE Transfer Alignment (VISTA), a framework that transfers interpretability from language to vision by constraining a visual projector to map visual tokens into an LLM's pre-existing, labeled textual SAE space. This approach enables visual interpretability without training dedicated vision SAEs. By regularizing the projector with the LLM's SAE reconstruction loss, VISTA achieves a \textbf{threefold} increase in the matching rate, which measures how accurately the most-activating textual concepts in the SAE space correspond to semantic elements in the image. Using this framework, we further analyze the spatial localization properties of different vision encoders and show that DINOv2 features exhibit significantly stronger localization than those of other encoders. Leveraging this precision, VISTA enables fine-grained, localized concept steering, allowing specific objects to be removed or replaced while preserving the surrounding scene. This yields improvements of \textbf{38\%} on object removal and \textbf{58\%} on object replacement over vision-only baselines. These contributions are validated across multiple LLM architectures.
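To make the projector constraint described above concrete, the following is a minimal PyTorch-style sketch of how a visual projector could be regularized with a frozen textual SAE's reconstruction loss. The names (`projector`, `sae_encode`, `sae_decode`, `lambda_sae`) and the exact form of the objective are assumptions for illustration; the abstract does not specify VISTA's precise loss or weighting.

```python
import torch.nn.functional as F


def vista_style_loss(visual_feats, projector, sae_encode, sae_decode,
                     task_loss, lambda_sae=1.0):
    """Hypothetical sketch: project visual tokens into the LLM embedding
    space and penalize how poorly the frozen textual SAE reconstructs them.

    visual_feats : (batch, num_tokens, d_vision) vision-encoder outputs
    projector    : trainable module mapping d_vision -> d_llm
    sae_encode / sae_decode : the LLM's frozen, pre-labeled textual SAE
    task_loss    : the usual multimodal training loss (e.g. next-token CE)
    """
    # Map visual tokens into the LLM's token-embedding space.
    projected = projector(visual_feats)          # (B, N, d_llm)

    # Pass the projected tokens through the frozen textual SAE.
    latents = sae_encode(projected)              # sparse concept activations
    reconstructed = sae_decode(latents)          # back to d_llm

    # Regularizer: projected visual tokens should lie where the SAE can
    # reconstruct them, i.e. inside its labeled textual concept space.
    sae_recon_loss = F.mse_loss(reconstructed, projected)

    return task_loss + lambda_sae * sae_recon_loss
```

Under this reading, the SAE and the LLM stay frozen and only the projector receives gradients from the reconstruction term, so the labeled textual concepts remain unchanged while the visual tokens are pulled into their span.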