Learning a Generative Meta-Model of LLM Activations
Abstract
Existing approaches for manipulating neural network activations, such as PCA and sparse autoencoders (SAEs), rely on strong assumptions about activation structure. We develop a generative approach that models activations with diffusion, makes minimal assumptions, and improves with data and model scale. We use this activation diffusion model to improve downstream tasks: for instance, by post-processing interventions with its learned generative prior, enabling more effective steering without sacrificing fluency. Furthermore, the activation diffusion model can be used as an encoder whose units cover a broad range of human-interpretable concepts, as measured by scalar probing. We also characterize the scaling properties of our approach, training models with 0.5B to 3.3B parameters on one billion residual-stream activations from the Llama model family. We find that the diffusion loss decreases smoothly and reliably as a function of compute, and serves as a good proxy for downstream steering and probing performance. Our method provides a scalable approach to interpretability without requiring strong structural assumptions.
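The following is a minimal sketch, not the paper's implementation, of the core idea of post-processing an intervention with a learned generative prior: partially noise a steered activation and denoise it back with a diffusion model trained on residual-stream activations, pulling it toward the learned activation manifold (an SDEdit-style refinement, assumed here). The `Denoiser`, noise schedule, `refine` function, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

D = 4096          # residual-stream width (e.g., a Llama hidden size); assumed
T = 1000          # number of diffusion timesteps; assumed
betas = torch.linspace(1e-4, 2e-2, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class Denoiser(nn.Module):
    """Stand-in epsilon-predictor; the paper trains far larger models."""
    def __init__(self, d: int = D):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d + 1, d), nn.SiLU(),
                                 nn.Linear(d, d))

    def forward(self, x_t, t):
        # Condition on the (normalized) timestep via a scalar feature.
        t_feat = (t.float() / T).unsqueeze(-1)
        return self.net(torch.cat([x_t, t_feat], dim=-1))

@torch.no_grad()
def refine(model, x_steered, t_start: int = 300):
    """Noise a steered activation to t_start, then denoise it back,
    projecting it toward the model's learned activation distribution."""
    ab = alphas_bar[t_start]
    x = ab.sqrt() * x_steered + (1 - ab).sqrt() * torch.randn_like(x_steered)
    for t in range(t_start, 0, -1):
        eps = model(x, torch.full((x.shape[0],), t))
        ab_t, ab_prev = alphas_bar[t], alphas_bar[t - 1]
        x0_hat = (x - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()
        x = ab_prev.sqrt() * x0_hat + (1 - ab_prev).sqrt() * eps  # DDIM step
    return x

model = Denoiser()                     # untrained stand-in for the sketch
h = torch.randn(8, D)                  # batch of residual-stream activations
h_steered = h + 4.0 * torch.randn(D)   # crude steering intervention
h_refined = refine(model, h_steered)   # prior-consistent post-processing
```

In this sketch, `t_start` controls the trade-off: larger values let the prior correct more of the intervention but also erase more of the steering signal.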