On the Relationship Between Activation Outliers and Feature Death in Sparse Autoencoders
Elana Simon ⋅ Etowah Adams ⋅ James Zou
Abstract
Sparse autoencoders (SAEs) decompose neural network activations into interpretable features, but many features never activate, a problem called feature death. Death rates vary dramatically across models: near zero on GPT-2, but over 70\% on AlphaFold3 under identical SAE configurations. Why? We find that dimension-level activation outliers (dimensions whose mean magnitude is large relative to per-token variation) shift pre-activations at initialization, making a feature's fate depend on weight-outlier alignment rather than input content. From this mechanism we derive $\gamma = \|\boldsymbol{\mu}\|/\|\boldsymbol{\sigma}\|$, which predicts initial death rates (Spearman $\rho > 0.9$) across 275 model-layer combinations spanning language, vision, and protein models. Outliers create two death pathways; we trace their recovery mechanisms and find that one resolves naturally while the other bottlenecks on the SAE slowly learning to mean-center. Initializing the SAE to mean-center from the start eliminates outlier-induced death, confirming the mechanism.
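The statistic $\gamma = \|\boldsymbol{\mu}\|/\|\boldsymbol{\sigma}\|$ can be computed directly from a batch of activations: $\boldsymbol{\mu}$ is the per-dimension mean vector and $\boldsymbol{\sigma}$ the per-dimension standard deviation across tokens. The sketch below is illustrative (the function name and toy data are not from the paper), assuming activations arrive as a `(n_tokens, d_model)` array:

```python
import numpy as np

def outlier_ratio(acts: np.ndarray) -> float:
    """Compute gamma = ||mu|| / ||sigma|| for activations of shape (n_tokens, d_model).

    mu is the per-dimension mean vector; sigma the per-dimension standard
    deviation across tokens. A large gamma means activation magnitude is
    dominated by a fixed per-dimension offset rather than per-token variation.
    """
    mu = acts.mean(axis=0)      # per-dimension mean
    sigma = acts.std(axis=0)    # per-dimension std across tokens
    return float(np.linalg.norm(mu) / np.linalg.norm(sigma))

# Toy example: one dimension carries a large constant offset (an "outlier
# dimension"), which drives gamma well above 1.
rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 64))
acts[:, 0] += 50.0
print(outlier_ratio(acts))  # much larger than for pure zero-mean noise
```

For zero-mean Gaussian activations, $\|\boldsymbol{\mu}\|$ shrinks with the number of tokens while $\|\boldsymbol{\sigma}\|$ does not, so $\gamma$ stays near zero; the constant offset in one dimension is enough to make $\gamma$ large.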