Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms
Abstract
As large language models are increasingly deployed in high-stakes settings, there is a growing need for tools that audit not only model outputs but also the internal computations that produce them. Circuit analysis is a central approach in mechanistic interpretability, but it is typically target-conditioned: it explains a single prompt paired with a chosen completion. This setup obscures mechanistic heterogeneity and hinders scalable discovery. We introduce distribution-level unsupervised feature discovery, which finds interpretable clusters across a prompt’s continuation distribution and provides a tunable knob for trading off semantic granularity against mechanistic specificity, without manual target selection. Our method samples continuations, represents each with (i) a semantic embedding and (ii) a mechanistic signature derived from sparse feature attributions, and clusters them via a rate–distortion objective that balances semantic coherence against mechanistic consistency. We further verify cluster-level causality, validating that the discovered clusters correspond to genuine mechanistic representations rather than semantic artifacts. Overall, our approach complements circuit analysis and behavioral evaluation by providing a scalable, unsupervised audit of the mechanisms underlying a model’s continuation distribution.
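To make the clustering step concrete, the following is a minimal, illustrative sketch of joint semantic–mechanistic clustering: each continuation carries a semantic embedding and a mechanistic signature, and a weight `lam` (a stand-in for the paper's tradeoff knob) interpolates between purely semantic clustering (`lam=0`) and purely mechanistic clustering (`lam=1`). The function name, the simple k-means procedure, and the weighting scheme are assumptions for illustration, not the paper's actual rate–distortion objective or API.

```python
import numpy as np

def cluster_continuations(sem, mech, lam=0.5, k=2, iters=50):
    """Toy sketch: cluster sampled continuations by a weighted mix of
    semantic embeddings `sem` (n x d_s) and mechanistic signatures
    `mech` (n x d_m). `lam` trades off mechanistic consistency (lam=1)
    against semantic coherence (lam=0). Illustrative only."""
    def norm(x):
        # Standardize each view so neither dominates the joint distance.
        return (x - x.mean(0)) / (x.std(0) + 1e-8)

    joint = np.hstack([(1.0 - lam) * norm(sem), lam * norm(mech)])

    # Greedy farthest-point initialization keeps the sketch deterministic.
    idx = [0]
    for _ in range(k - 1):
        d = ((joint[:, None] - joint[idx][None]) ** 2).sum(-1).min(1)
        idx.append(int(d.argmax()))
    centers = joint[idx].copy()

    # Standard Lloyd iterations on the weighted joint representation.
    for _ in range(iters):
        d = ((joint[:, None] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = joint[labels == j].mean(0)
    return labels
```

Setting `lam=0` recovers clusters driven purely by continuation semantics, while increasing `lam` lets continuations that read similarly but are produced by divergent mechanisms split into separate clusters, which is the heterogeneity the abstract argues target-conditioned analysis misses.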