Query Lens: Interpreting Sparse Key-Value Features with Indirect Effects
Abstract
While sparse autoencoders yield features that are easier to study than individual neurons, interpreting these features reliably remains challenging. We propose Query Lens, which extends Logit Lens to provide more comprehensive and faithful interpretations of sparse features. By jointly considering encoder-side key features and decoder-side value features, we characterize both the inputs that activate a feature and the outputs it promotes. We also account for indirect, module-mediated effects that arise when a feature is processed by downstream modules, going beyond the direct effect captured by Logit Lens. In experiments, we find that Query Lens yields coherent token signatures for features that were uninterpretable under Logit Lens. Finally, we propose the Subspace Channel Hypothesis, which suggests that downstream modules read features through layer-specific subspaces.
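
To make the distinction between key-side, value-side, and indirect readings concrete, the following is a minimal toy sketch, not the method as implemented in the paper. All names here (`top_tokens`, `W_E`, `W_U`, `M`, the six-token vocabulary) are hypothetical, the matrices are hand-built so that the result is easy to check, and the downstream module is replaced by a single linear map purely for illustration; real attention and MLP modules are nonlinear.

```python
import numpy as np

def top_tokens(direction, token_matrix, vocab, k=1):
    """Return the k tokens whose rows of token_matrix score highest against direction."""
    scores = token_matrix @ direction
    top = np.argsort(-scores, kind="stable")[:k]
    return [vocab[i] for i in top]

vocab = ["cat", "dog", "car", "bus", "red", "blue"]
d_model = 8

# Toy, hand-built matrices (assumptions for illustration, not trained weights).
W_E = np.eye(len(vocab), d_model)              # row i is the embedding of token i
W_U = np.eye(d_model)[::-1][: len(vocab)]      # row i is the unembedding of token i

# Hypothetical sparse feature: the encoder-side key reads mostly "cat", then "dog";
# the decoder-side value writes directly toward "red".
key = 2.0 * W_E[0] + 1.0 * W_E[1]
value = W_U[4].copy()

# Toy linear stand-in for a downstream module: swaps residual axes 2 and 3.
perm = list(range(d_model))
perm[2], perm[3] = perm[3], perm[2]
M = np.eye(d_model)[:, perm]

key_signature = top_tokens(key, W_E, vocab, k=2)        # inputs that activate the feature
value_signature = top_tokens(value, W_U, vocab, k=1)    # direct (Logit Lens style) output
indirect_signature = top_tokens(M @ value, W_U, vocab)  # output after the toy module

print(key_signature, value_signature, indirect_signature)
```

In this toy setup the direct reading of the value vector and its reading after the stand-in module point at different tokens, which is the kind of discrepancy that a direct-effect-only analysis would miss.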