Poster
in
Workshop: Multi-Agent Systems in the Era of Foundation Models: Opportunities, Challenges and Futures

Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models

Mickel Liu ⋅ Liwei Jiang ⋅ Yancheng Liang ⋅ Simon Du ⋅ Yejin Choi ⋅ Tim Althoff ⋅ Natasha Jaques

Project Page [ OpenReview]

Abstract

Conventional language model (LM) safety alignment follows a reactive approach: attackers exploit static models, then defenders patch vulnerabilities through fine-tuning. This sequential process creates mismatches where attackers overfit to obsolete defenses while defenders lag behind emerging threats. We propose Self-RedTeam, an online self-play reinforcement learning algorithm where attacker and defender agents co-evolve through continuous interaction. We cast safety alignment as a two-player zero-sum game where a single model alternates between generating adversarial prompts and defending against them, while a reward LM adjudicates outcomes. Grounded in game theory, we establish a $\textit{theoretical safety guarantee}$: if self-play converges to Nash Equilibrium, the defender reliably produces safe responses to any adversarial input. $\textit{Empirically}$, Self-RedTeam discovers more diverse attacks (+21.8\% SBERT) than static-defender training and achieves higher robustness on safety benchmarks (+65.5\% on WildJailBreak) than static-attacker training. We further propose $\textit{hidden Chain-of-Thought}$ for private action planning, boosting adversarial diversity while reducing over-refusals. Our results motivate shifting from reactive patching to proactive co-evolution in LM safety training, enabling scalable and robust self-improvement via multi-agent reinforcement learning.

Chat is not available.