Less is More: Neuroscience-Motivated Probing for Efficient Concept Circuit Tracing
Abstract
Despite their high accuracy, Vision Transformer (ViT) predictions can be driven by spurious cues. Interpreting their inner workings is therefore essential for safe deployment. Sparse autoencoders (SAEs) have proven effective at decomposing language-model representations into concepts. However, adapting SAE-based interpretation to ViTs remains challenging due to limited control over concept coverage and subjective, non-scalable feature interpretation. To address these gaps, motivated by stimulus-efficient probing in systems neuroscience, we propose ViSAE, a compact diagnostic toolbox for interpreting the internal mechanisms of ViTs. Specifically, we introduce a probing suite with 64K images and a 16K visually grounded concept vocabulary, improving concept coverage efficiency by 20× over ImageNet and interpretation accuracy by 28.7% over existing concept sets. We further develop an algorithm that automatically interprets SAE features at scale and causally traces cross-layer interactions to recover ViT inner workings as concept circuits. Our method supports auditing spurious correlations and failure modes, and boosts worst-group accuracy on WaterBirds by 48.2% through concept steering.