Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment
Abstract
Standard RLHF relies on transitive scalar rewards and therefore cannot capture the cyclic (non-transitive) structure of human preferences. While approaches such as the General Preference Model (GPM) address this, we identify a theoretical limitation: their implicit formulation entangles the transitive hierarchy with cyclicity and fails to guarantee dominant solutions. To address this, we propose the Hybrid Reward-Cyclic (HRC) model, which uses a game-theoretic decomposition to explicitly disentangle preferences into orthogonal transitive (scalar) and cyclic (vector) components. Complementing this, we introduce Dynamic Self-Play Preference Optimization (DSPPO), which treats alignment as a time-varying game and progressively guides the policy toward the Nash equilibrium. Experiments on RewardBench 2 demonstrate that HRC consistently improves over both Bradley-Terry (BT) and GPM baselines (e.g., +1.23\% on Gemma-2B). In particular, its superior performance in the \textit{Ties} domain empirically validates the model's robustness in handling complex, non-strict preferences. Furthermore, extensive evaluations on downstream generation tasks (AlpacaEval 2.0 and MT-Bench) confirm the efficacy of our framework. Notably, on Gemma-2B-it, HRC+DSPPO achieves a peak length-controlled win rate of 44.75\% against GPT-4-Turbo on AlpacaEval 2.0 (evaluated by GPT-4o-mini), significantly outperforming standard SPPO baselines trained with BT or GPM.
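To make the claimed decomposition concrete, the following is a minimal illustrative sketch, not the paper's own formulation: it shows the standard Hodge-style split of an antisymmetric preference (payoff) matrix over a finite set of responses into a transitive part generated by a scalar reward and an orthogonal cyclic remainder. The symbols $A$, $r$, $C$, and $n$ are assumptions introduced here for illustration only.

% Illustrative sketch (assumed notation): A is an antisymmetric preference
% matrix with A_{ij} = -A_{ji}; r is a scalar reward vector; C is the cyclic
% residual with zero row sums; the two parts are Frobenius-orthogonal.
\[
  A \;=\; \underbrace{r\mathbf{1}^{\top} - \mathbf{1}r^{\top}}_{\text{transitive (scalar reward)}}
  \;+\; \underbrace{C}_{\text{cyclic}},
  \qquad
  r_i \;=\; \frac{1}{n}\sum_{j=1}^{n} A_{ij},
  \qquad
  C\mathbf{1} = \mathbf{0},
  \qquad
  \big\langle r\mathbf{1}^{\top} - \mathbf{1}r^{\top},\, C \big\rangle_{F} = 0 .
\]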