SAEs-BrainMap: Unveiling the Emergence of Specialized Concepts in Deep Models via Brain Alignment
Abstract
Understanding the internal mechanisms of deep neural networks remains a significant challenge, particularly in elucidating how generic visual concepts emerge within their latent spaces. In this work, we propose SAEs-BrainMap, a novel framework that uses human brain activation patterns from the ventral visual pathway as objective probes to guide the identification of features decomposed by Sparse Autoencoders (SAEs). Our quantitative and qualitative empirical results demonstrate a robust representational alignment between sparse model features and biological Regions of Interest (ROIs), confirming the feasibility of using brain signals to characterize model functionality. Leveraging this alignment, we trace the hierarchical trajectory of generic concepts across layers and exploit the brain's hierarchical structure to visualize the model's global processing flow, offering novel insights into model interpretability.