Towards Disentangled Preference Optimization Dynamics
Abstract
Preference optimization is widely used to align large language models (LLMs) with human preferences, yet margin-based objectives often suppress the chosen response together with the rejected one, and no general mechanism exists to prevent this across objectives. We bridge this gap with a unified \textbf{incentive-score decomposition} of preference optimization, which reveals that diverse objectives share identical local update directions and differ only in their scalar weighting coefficients. Building on this decomposition, we analyze the reward dynamics of the chosen and rejected responses and identify the \textbf{disentanglement band (DB)}, a simple, testable condition that characterizes when training can realize the ideal pathway: suppressing the rejected response while maintaining the chosen one, possibly after an initial transient. Guided by the DB, we propose a plug-and-play \textbf{reward calibration (RC)} that adaptively rebalances the chosen and rejected updates so that the condition is satisfied, without redesigning the base objective. Empirical results confirm that this calibration effectively disentangles the two updates and improves alignment performance across diverse objectives.
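As an illustrative sketch rather than the paper's own derivation, the kind of decomposition the abstract describes can be read off a standard margin-based objective such as DPO, whose gradient factors into a scalar weighting coefficient times a shared local update direction; the notation ($\pi_\theta$, $\pi_{\mathrm{ref}}$, $y_w$, $y_l$, $\beta$, $\sigma$) follows common DPO usage and is assumed here, not taken from the paper.

% Illustrative only: the standard DPO gradient written in the
% "scalar weight times shared update direction" form that the
% abstract's decomposition refers to. Notation is assumed, not the paper's.
\begin{align*}
\nabla_\theta \mathcal{L}_{\mathrm{DPO}}
  &= -\,\underbrace{\beta\,\sigma\!\bigl(-\beta\,\Delta_\theta(x, y_w, y_l)\bigr)}_{\text{scalar weighting coefficient}}
     \Bigl[\underbrace{\nabla_\theta \log \pi_\theta(y_w \mid x)
     - \nabla_\theta \log \pi_\theta(y_l \mid x)}_{\text{shared local update direction}}\Bigr],\\
\Delta_\theta(x, y_w, y_l)
  &= \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
   - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}.
\end{align*}

Under this reading, swapping in other margin-based objectives changes only the scalar prefactor, which is the quantity a calibration of the kind described above would rebalance between the chosen and rejected terms.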