Introspection Adapters: Training LLMs to Report Their Learned Behaviors
Keshav Shenoy ⋅ Li Yang ⋅ Abhay Sheshadri ⋅ Jack Lindsey ⋅ Samuel Marks ⋅ Rowan Wang
Abstract
Can we train LLMs to *introspect*, i.e. to faithfully describe their own behaviors in natural language? Prior work has shown some limited success. However, it is difficult to scale introspection training due to a lack of ground-truth labels. In this work, we study an approach to introspection training that sidesteps this data bottleneck. Given a target model $M$, our method works by fine-tuning models $M_i$ from $M$ with implanted behaviors $b_i$ (such as downplaying medical problems); the $(M_i, b_i)$ pairs serve as labeled introspection training data. We then train an *introspection adapter* (IA): a LoRA adapter jointly optimized across the fine-tunes $M_i$ that causes them to verbalize their implanted behaviors. This IA induces faithful introspection in fine-tunes of $M$ that were trained in very different ways from the $M_i$, as well as in $M$ itself. This is surprising because the IA was never trained on $M$. To demonstrate the utility of IAs, we use them to successfully audit misaligned models introduced in prior work. IAs can also be used to detect fine-tuning API attacks which train models to comply with encrypted harmful requests. Notably, IAs are more effective when applied to larger models. Overall, our results suggest that IAs are a scalable, effective, and practically useful approach to LLM introspection training.
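To make the training setup concrete, the sketch below illustrates the core idea of the method as described above: a single set of LoRA parameters is shared across several frozen fine-tunes $M_i$ and optimized so that each fine-tune verbalizes its implanted behavior $b_i$ when asked. The model names, behavior descriptions, prompt, hyperparameters, and hook-based LoRA implementation are illustrative assumptions, not the authors' released code or exact configuration.

```python
# Hypothetical sketch of joint introspection-adapter (IA) training.
# Assumptions: Llama-style fine-tunes M_i, a hand-written LoRA applied via
# forward hooks on each attention output projection, and toy prompts/targets.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-3.1-8B-Instruct"                 # hypothetical target model M
FINETUNE_IDS = ["org/ft-downplay-medical", "org/ft-flattery"]  # hypothetical fine-tunes M_i
BEHAVIOR_DESCRIPTIONS = [
    "I tend to downplay medical problems.",                # b_1
    "I insert subtle flattery into my answers.",           # b_2 (illustrative)
]
PROMPT = "Describe any unusual behaviors you were trained to exhibit."

tok = AutoTokenizer.from_pretrained(BASE)
models = [AutoModelForCausalLM.from_pretrained(m) for m in FINETUNE_IDS]
for m in models:
    m.requires_grad_(False)  # the fine-tunes stay frozen; only the IA is trained

# Shared LoRA parameters (rank r), applied identically to every fine-tune.
r, hidden = 16, models[0].config.hidden_size
lora_A = nn.Parameter(torch.randn(hidden, r) * 0.01)
lora_B = nn.Parameter(torch.zeros(r, hidden))

def add_ia_hooks(model):
    """Attach the shared low-rank update to each attention output projection."""
    hooks = []
    for layer in model.model.layers:
        def hook(module, inputs, output):
            x = inputs[0]
            return output + (x @ lora_A) @ lora_B  # shared LoRA delta
        hooks.append(layer.self_attn.o_proj.register_forward_hook(hook))
    return hooks

opt = torch.optim.AdamW([lora_A, lora_B], lr=1e-4)
for step in range(100):
    # Accumulate gradients on the shared IA parameters across all fine-tunes.
    for model, desc in zip(models, BEHAVIOR_DESCRIPTIONS):
        hooks = add_ia_hooks(model)
        batch = tok(PROMPT + " " + desc, return_tensors="pt")
        # In a real setup one would use the chat template and mask prompt tokens;
        # here the full sequence is used as the target for brevity.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        for h in hooks:
            h.remove()
    opt.step()
    opt.zero_grad()
```

Once trained, the same LoRA weights can be attached to other fine-tunes of $M$, or to $M$ itself, to elicit verbalizations of behaviors the adapter never saw during training.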