Exploration Hacking: LLMs Can Learn to Resist RL Training
Abstract
Reinforcement learning (RL) has become essential to post-training large language models (LLMs) for reasoning and alignment. However, successful RL requires the model to sufficiently explore diverse actions during training. We study whether RL is robust to exploration hacking, in which a model strategically shapes its exploration during training to influence the training outcome. First, we create model organisms of exploration hacking using fine-tuning-based "locking" techniques; we show that these models can successfully resist RL-based capability elicitation in AI R&D and agentic biosecurity environments while maintaining performance on closely related tasks. Next, we use these model organisms to evaluate monitoring techniques as methods for detecting exploration hacking. Finally, we show that current frontier models can reason effectively about suppressing their exploration when presented with simulated RL environments and encouraged to act strategically. Together, our results empirically establish exploration hacking as a failure mode of RL on sufficiently capable LLMs.