CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features
Seonglae Cho ⋅ Zekun Wu ⋅ Adriano Koshiyama
Abstract
Sparse Autoencoders (SAEs) decompose LLM activations into interpretable features, yet existing SAE-based steering methods require contrastive datasets or large activation stores. We introduce CorrSteer, which selects steering features by correlating task outcomes with SAE activations computed during generation, then validates these selections through intervention. This two-stage approach treats correlation as a selection heuristic and intervention as the causal test: features that both correlate with success and improve performance when amplified are retained. Coefficients are derived from mean activations on correct samples, yielding a fully automated pipeline without task-specific tuning. On Gemma-2 2B and LLaMA-3.1 8B, CorrSteer achieves +3.3% on MMLU (4k samples) and +27.2% on HarmBench (108 samples), with lower side-effect ratios than fine-tuning despite comparable accuracy. Selected features cluster into interpretable categories: structured-output features for multiple-choice tasks, refusal features for safety, and domain-specific semantics for specialized benchmarks. The method scales to $10^5$ SAE features via streaming correlation ($O(1)$ in dataset size), requiring no backward passes or activation storage.
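The streaming correlation mentioned above can be realized with a Welford-style online update, which keeps only running moments per feature and therefore uses $O(1)$ memory in the number of samples. The sketch below is illustrative only; the class and variable names are not from the paper, which does not specify its implementation:

```python
import math

class StreamingCorrelation:
    """Online Pearson correlation between a feature activation x
    and a task-outcome score y, updated one sample at a time.
    Keeps only running moments: O(1) memory in dataset size."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0   # running mean of activations
        self.mean_y = 0.0   # running mean of outcomes
        self.c_xy = 0.0     # running co-moment sum
        self.m2_x = 0.0     # running sum of squared deviations of x
        self.m2_y = 0.0     # running sum of squared deviations of y

    def update(self, x: float, y: float) -> None:
        self.n += 1
        dx = x - self.mean_x
        self.mean_x += dx / self.n
        dy = y - self.mean_y
        self.mean_y += dy / self.n
        # Co-moment uses the pre-update dx and post-update mean_y.
        self.c_xy += dx * (y - self.mean_y)
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)

    def correlation(self) -> float:
        denom = math.sqrt(self.m2_x * self.m2_y)
        return self.c_xy / denom if denom > 0 else 0.0
```

In a CorrSteer-like pipeline, one such accumulator per SAE feature would be updated as generations complete, and the highest-correlation features passed on to the intervention stage for causal validation.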