Steer Like the LLM: Activation Steering that Mimics Prompting
Abstract
Large language models can be steered at inference time through prompting or through activation interventions, but activation steering methods often underperform prompt-based approaches. We investigate whether activation steering can be improved by learning to mimic the interventions that prompt steering induces within the model. To this end, we introduce Prompt Steering Replacement (PSR) models, a new family of activation steering methods that distill prompt steering behavior into interpretable interventions on model activations. A PSR model estimates position-specific steering coefficients and is trained to imitate prompt-based interventions. Experiments on persona steering and instruction following across multiple language models show that PSR models consistently outperform the constant-coefficient interventions frequently used in the literature and achieve performance close to or exceeding that of prompt steering, while remaining interpretable.
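To make the contrast between constant-coefficient and position-specific steering concrete, the following is a minimal NumPy sketch, not the PSR method itself. It assumes additive steering along a single unit-norm direction. The function names, array shapes, and coefficient values are hypothetical and chosen only for illustration. A PSR model would estimate the per-position coefficients by training against prompt-steered activations, a step this sketch omits.

```python
import numpy as np

def constant_steer(hidden, direction, alpha):
    """Add the same multiple of `direction` at every token position.

    hidden: (T, d) activations, direction: (d,) steering vector.
    """
    return hidden + alpha * direction

def position_specific_steer(hidden, direction, coeffs):
    """Add a different multiple of `direction` at each token position.

    coeffs: (T,) one steering coefficient per position.
    """
    return hidden + coeffs[:, None] * direction

# Toy activations: 4 token positions, hidden size 8 (illustrative only).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))
direction = rng.normal(size=8)
direction /= np.linalg.norm(direction)  # unit-norm steering direction

# Hypothetical hand-picked coefficients; a PSR model would estimate these.
coeffs = np.array([0.0, 0.5, 2.0, 1.0])

const_out = constant_steer(hidden, direction, alpha=1.0)
pos_out = position_specific_steer(hidden, direction, coeffs)

# Shift of each position along the steering direction.
const_shift = (const_out - hidden) @ direction
pos_shift = (pos_out - hidden) @ direction
```

Because the direction has unit norm, `const_shift` is the same value at every position, while `pos_shift` recovers the per-position coefficients. That extra degree of freedom per token is what a position-specific method can fit to prompt-induced activation changes.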