Spotlight Poster
AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
Zhengxuan Wu · Aryaman Arora · Atticus Geiger · Zheng Wang · Jing Huang · Dan Jurafsky · Christopher Manning · Christopher Potts
East Exhibition Hall A-B #E-1108
Fine-grained steering of language model outputs is essential for safety and reliability. Prompting and finetuning are widely used to achieve these goals, but interpretability researchers have proposed a variety of representation-based techniques as well, including sparse autoencoders (SAEs), linear artificial tomography, supervised steering vectors, linear probes, and representation finetuning. At present, there is no benchmark for making direct comparisons between these proposals. Therefore, we introduce AxBench, a large-scale benchmark for steering and concept detection, and report experiments on Gemma-2-2B and 9B. For steering, we find that prompting outperforms all existing methods, followed by finetuning. For concept detection, representation-based methods such as difference-in-means perform the best. On both evaluations, SAEs are not competitive. We introduce a novel weakly-supervised representational method (Rank-1 Representation Finetuning; ReFT-r1), which is competitive on both tasks while providing the interpretability advantages that prompting lacks. Along with AxBench, we train and publicly release SAE-scale feature dictionaries for ReFT-r1 and DiffMean.
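As a concrete illustration of the difference-in-means (DiffMean) baseline named above, here is a minimal sketch of the core computation in PyTorch. The tensor names, layer choice, and steering strength are illustrative assumptions, not the AxBench implementation:

```python
# Minimal DiffMean sketch: a steering/detection direction from the mean
# activation difference between concept-positive and concept-negative texts.
# `pos_acts` and `neg_acts` are assumed to hold residual-stream activations
# collected at one layer, shape [num_examples, hidden_dim].
import torch

def diffmean_direction(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    direction = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
    return direction / direction.norm()  # unit-normalize for a comparable scale

def detect_concept(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Concept detection as a projection: a higher score suggests the
    # concept is present in this hidden state.
    return hidden @ direction

def steer(hidden: torch.Tensor, direction: torch.Tensor, strength: float = 8.0) -> torch.Tensor:
    # Steering: add the scaled direction back into the residual stream
    # at the chosen layer; `strength` is a hypothetical hyperparameter.
    return hidden + strength * direction
```

The same learned direction serves both tasks: its dot product with a hidden state scores concept presence, and adding it to the residual stream pushes generations toward the concept.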
Imagine trying to teach a voice assistant to avoid spoilers, speak politely, or explain ideas at a child-friendly level every single time it answers. Researchers use two main tricks to guide these systems today: (1) writing clever prompts and (2) re-training the model on lots of new examples. A third family of methods, tweaking what happens inside the model's hidden layers, has attracted growing interest because it promises faster, more targeted control. Yet there has been no single testbed for judging which approach actually works best.

Our work introduces AxBench, the first large-scale benchmark designed to compare all three strategies on two everyday challenges:

1. Steering: getting the model to talk about or avoid a chosen topic.
2. Concept detection: quickly spotting whether a passage already contains that topic.

Running AxBench on open-source Gemma models (2-billion and 9-billion parameters), we found:

* Well-crafted prompts still give the most reliable steering, with full model retraining close behind.
* For detecting concepts, simple statistical checks inside the model outperform everything else.
* A popular interpretability tool, sparse autoencoders, surprisingly lags on both tasks.

Finally, we present ReFT-r1, a lightweight way to nudge the model's internal representations (sketched below). It competes with the best methods on both steering and detection while remaining transparent about why it works. To help others build on this, we are releasing AxBench, our evaluation code, and ready-to-use feature dictionaries for the community.
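To make the "nudge the internal representations" idea concrete, here is a hedged sketch of a rank-1 intervention module in the spirit of ReFT-r1. The gating mechanism, initialization, and module structure below are illustrative assumptions, not the authors' released code:

```python
# Sketch of a rank-1 representation intervention: a learned read-out
# scores how strongly the concept is active at each token, and a learned
# write direction is added back in proportion to that score.
import torch
import torch.nn as nn

class Rank1Intervention(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.read = nn.Linear(hidden_dim, 1, bias=True)             # concept detector
        self.write = nn.Parameter(torch.randn(hidden_dim) * 0.02)   # steering direction

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: [batch, seq, hidden_dim]; gate in (0, 1) per token.
        gate = torch.sigmoid(self.read(hidden))                     # [batch, seq, 1]
        return hidden + gate * self.write                           # rank-1 additive edit
```

Because only the `read` and `write` parameters are trained while the language model itself stays frozen, an intervention like this is far cheaper than full finetuning, and the learned direction doubles as an interpretable concept detector.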