Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models
Abstract
Conventional large language model (LLM) safety alignment relies on a reactive, disjoint loop: attackers exploit a static model, then defenders patch the exposed vulnerabilities. This sequential setup leads attackers to overfit to obsolete exploits while defenders perpetually lag behind emerging threats. To address this, we introduce Self-RedTeam, the first fully online self-play multi-agent reinforcement learning (MARL) algorithm that continuously co-evolves attacker and defender for robust safety alignment. A single policy self-plays as both attacker and defender, generating adversarial prompts and defending against them, with a reward model adjudicating outcomes. Each role uses hidden chain-of-thought for strategic planning. Grounding the attacker-defender interaction in two-player zero-sum game theory, we establish a theoretical safety guarantee: if the game converges to a Nash equilibrium, the defender produces safe responses to any adversarial input. Empirically, Self-RedTeam generalizes across five models from the Llama and Qwen families, uncovering more diverse attacks (+17.80% SBERT diversity) and improving the safety of RLHF-trained models by up to 95% across 14 benchmarks. Our work motivates a shift from reactive patching to proactive co-evolution, enabling LLM safety self-improvement via online self-play MARL.
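To make the game-theoretic claim concrete, the following is a minimal sketch of the zero-sum framing; the notation ($\pi_{\mathrm{atk}}$, $\pi_{\mathrm{def}}$, $R_{\mathrm{safe}}$) is illustrative and not taken from the abstract. Let $\pi_{\mathrm{atk}}$ generate adversarial prompts $x$, let $\pi_{\mathrm{def}}(\cdot \mid x)$ generate responses $y$, and let $R_{\mathrm{safe}}(x, y) \in [0, 1]$ denote the reward model's safety score of the exchange. The co-evolution can then be read as the minimax problem
\[
\pi_{\mathrm{def}}^{*} \in \arg\max_{\pi_{\mathrm{def}}} \; \min_{\pi_{\mathrm{atk}}} \; \mathbb{E}_{x \sim \pi_{\mathrm{atk}},\; y \sim \pi_{\mathrm{def}}(\cdot \mid x)} \big[ R_{\mathrm{safe}}(x, y) \big],
\]
where the attacker receives the negated payoff $-R_{\mathrm{safe}}(x, y)$. Under this reading, if training converges to a Nash equilibrium, the defender's expected safety is lower-bounded by the value of the game against every attacker strategy, which corresponds to the worst-case safety guarantee stated above.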