$\textit{S}$-SPPO: Semantic-Calibrated Self-Play Preference Optimization
Xiwen Chen ⋅ Wenhui Zhu ⋅ Jingjing Wang ⋅ Peijie Qiu ⋅ Zhipeng Wang ⋅ Huayu Li ⋅ ZhengXiao He ⋅ Xuanzhao Dong ⋅ Prayag Tiwari ⋅ Mingkun Xu ⋅ Yujian Xiong ⋅ Feng Luo ⋅ Abolfazl Razi ⋅ Brendan Rappazzo ⋅ Anderson Schneider ⋅ Yuriy Nevmyvaka
Abstract
Aligning Large Language Models (LLMs) with human preferences is often formulated via Direct Preference Optimization (DPO). However, the standard Bradley-Terry instantiation of DPO struggles to model common departures from transitivity in human preferences. To address this, recent work introduced Self-Play Preference Optimization (SPPO), which iteratively refines the policy by training on self-generated win-lose pairs. Our investigation, however, reveals a critical instability in SPPO: the optimization is prone to $\textit{policy degeneration}$ when the preference oracle assigns overly confident wins to semantically indistinguishable responses. To mitigate this, we propose $\textit{S}$-SPPO, a dual-space semantic calibration framework comprising: i) $\textit{Supervision Calibration}$ via semantic gating, which anneals win-rate targets toward the maximum-entropy baseline as the semantic overlap between responses increases; and ii) $\textit{Representation Calibration}$ via latent repulsion, which enforces geometric diversity between the representations of chosen and rejected responses, preventing manifold collapse. Theoretically, we show that the calibration preserves the constant-sum game structure, facilitating convergence to a Nash equilibrium. Empirically, $\textit{S}$-SPPO avoids the performance degradation seen in prior methods, achieving a 52.19\% win rate and a 47.46\% length-controlled win rate on AlpacaEval 2.0 with Llama-3-8B, without using additional human-annotated preferences during training.
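For intuition, the two calibrations described in the abstract might look roughly like the minimal sketch below. The function names, the similarity-based gate, and the margin hyperparameter are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two calibrations; all names and formulas are assumptions.
import torch
import torch.nn.functional as F

def gated_win_target(oracle_win_rate: torch.Tensor,
                     semantic_overlap: torch.Tensor) -> torch.Tensor:
    """Supervision calibration (assumed form): interpolate the oracle's win-rate
    target toward the maximum-entropy baseline (0.5) as semantic overlap -> 1."""
    gate = semantic_overlap.clamp(0.0, 1.0)  # e.g., embedding similarity in [0, 1]
    return (1.0 - gate) * oracle_win_rate + gate * 0.5

def latent_repulsion_loss(h_chosen: torch.Tensor,
                          h_rejected: torch.Tensor,
                          margin: float = 0.2) -> torch.Tensor:
    """Representation calibration (assumed form): penalize chosen/rejected hidden
    states whose cosine similarity exceeds 1 - margin, discouraging collapse."""
    sim = F.cosine_similarity(h_chosen, h_rejected, dim=-1)
    return F.relu(sim - (1.0 - margin)).mean()
```

Under these assumptions, the gated target would replace the raw oracle win rate in the SPPO objective, while the repulsion term would be added as an auxiliary loss on the policy's hidden representations.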