Removing Sandbagging in LLMs by Training with Weak Supervision
Abstract
As AI systems begin to automate complex tasks, supervision increasingly relies on weaker models or limited human oversight that cannot fully verify output quality. A model more capable than its supervisors could exploit this gap through sandbagging, producing work that appears acceptable but falls short of its true abilities. Can training force a model to produce its best work, even when we cannot reliably verify whether it has? We study this using model organisms trained to deliberately sandbag, testing supervised fine-tuning (SFT) and reinforcement learning (RL) as elicitation techniques on Olympiad math, graduate-level science (Super GPQA), and competitive coding (Code Contests). SFT on weaker supervisor outputs reliably reduces sandbagging and elicits performance beyond the supervisor’s own capabilities, though not always to the model’s full ability. RL alone is unreliable: consistent sandbagging limits exploration of correct answers, allowing the model to reward hack the supervisor instead. SFT followed by RL works most reliably: SFT reduces sandbagging enough for RL to obtain useful signal and fully elicit the sandbagging model’s capabilities. When training and evaluation distributions differ, however, the model can exploit this gap by producing correct answers during training while continuing to sandbag at evaluation.