RSPO: Regularized Self-Play Alignment of Large Language Models
Xiaohang Tang ⋅ Sangwoong Yoon ⋅ Seongho Son ⋅ Huizhuo Yuan ⋅ Quanquan Gu ⋅ Ilija Bogunovic
Abstract
Self-play-based policy optimization has emerged as an effective approach for fine-tuning large language models (LLMs), formulating preference optimization as a two-player game. However, regularization with respect to the reference policy, which is crucial for mitigating over-optimization, has been insufficiently investigated in self-play alignment. To study the impact of different regularization strategies, we propose \textbf{Regularized Self-Play Policy Optimization (RSPO)}, a novel framework that unifies prior methods and enables simple plug-and-play regularizers, while preserving convergence to the Nash equilibrium of the corresponding regularized game. We empirically show that RSPO with appropriate regularizers can substantially improve the length-controlled win rate (LCWR) on AlpacaEval-2 across a range of base models, while also achieving consistently superior performance on Arena-Hard, MT-Bench, ArmoRM, and response diversity. In particular, RSPO improves the unregularized self-play baseline (SPPO) on AlpacaEval-2 LCWR from $28.5\%$ to $35.4\%$ with the Mistral-7B base model, from $38.77\%$ to $43.66\%$ with LLaMA-8B, and from $50.54\%$ to $51.83\%$ with Gemma-2B. Combining simplicity, convergence guarantees, and significant empirical gains, RSPO offers a strong foundation for exploring regularized self-play in alignment.
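As an illustrative sketch (the notation here is ours, not taken from the abstract), the two-player preference game with a plug-and-play reference-policy regularizer can be written as a max--min objective over policies $\pi$ and $\pi'$, for example with a KL penalty toward $\pi_{\mathrm{ref}}$:

\[
\max_{\pi}\;\min_{\pi'}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x),\, y' \sim \pi'(\cdot \mid x)}
\big[\mathbb{P}(y \succ y' \mid x)\big]
\;-\; \eta\, \mathrm{KL}\!\big(\pi(\cdot \mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big),
\]

where $\mathbb{P}(y \succ y' \mid x)$ is a preference model, $\eta \ge 0$ is a regularization strength (setting $\eta = 0$ recovers the unregularized self-play game), and other regularizers could replace the KL term in the same slot. This is a hypothetical instantiation meant only to make the "regularized game" notion concrete; the paper's exact objective may differ.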