Active Attacks: Red-teaming LLMs via Adaptive Environments
Abstract
We address the challenge of automatically generating diverse attack prompts for large language models (LLMs) that elicit harmful behaviors (e.g., insults, sexual content) and can then be used for safety fine-tuning. While several prior approaches train LLMs with reinforcement learning (RL) to generate such prompts using only a toxicity classifier as a reward, existing diversity-seeking RL methods often collapse to a few modes: once high-reward prompts are found, exploration of new regions is discouraged. Inspired by the active learning paradigm, which encourages adaptive exploration, we introduce \textbf{Active Attacks}, a novel RL-based red-teaming algorithm that adapts its attacks as the victim evolves. By periodically safety fine-tuning the victim LLM on the attack prompts collected so far, we naturally induce an \emph{easy-to-hard exploration curriculum}: the attacker must progress beyond easy, already-patched modes toward increasingly difficult ones. We observe that this simple plug-and-play module, which integrates seamlessly into existing RL objectives, unexpectedly outperforms prior RL-based methods, improving the cross-attack success rate against GFlowNets, the previous state of the art, from 0.07\% to 31.28\% (a relative gain of more than 400×) with only a 6\% increase in computation.
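To make the alternating procedure concrete, the following is a minimal sketch of the loop the abstract describes, assuming the caller supplies the RL red-teaming step and the safety fine-tuning step; every name here (\texttt{active\_attacks}, \texttt{train\_attacker\_rl}, \texttt{safety\_finetune}, \texttt{rounds}) is an illustrative assumption, not the paper's actual API.

\begin{verbatim}
# Hypothetical sketch of the Active Attacks loop; all names and
# hyperparameters below are illustrative, not the paper's code.

def active_attacks(attacker, victim, toxicity_reward,
                   train_attacker_rl, safety_finetune,
                   rounds=10):
    """Alternate RL red-teaming with periodic victim fine-tuning."""
    collected = []  # all attack prompts discovered so far
    for _ in range(rounds):
        # (1) Plug-and-play: any RL objective (e.g., a GFlowNet)
        #     that rewards prompts whose victim responses the
        #     toxicity classifier scores as harmful.
        prompts = train_attacker_rl(attacker, victim,
                                    toxicity_reward)
        collected.extend(prompts)
        # (2) Periodically safety fine-tune the victim on the
        #     collected prompts; patching easy modes forces the
        #     attacker toward harder ones, inducing the
        #     easy-to-hard exploration curriculum.
        safety_finetune(victim, collected)
    return collected
\end{verbatim}

Because the victim's fine-tuning step is injected as a callable, the same loop wraps around any existing RL red-teaming objective, which is what makes the module plug-and-play in the sense used above.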