Towards Disentangled Preference Optimization Dynamics
Abstract
Preference optimization is widely used to align large language models (LLMs) with human preferences, yet margin-based objectives often suppress the chosen response together with the rejected one, and no general mechanism exists to prevent this across objectives. We bridge this gap with a unified \textbf{incentive-score decomposition} of preference optimization, which reveals that diverse objectives share identical local update directions and differ only in their scalar weighting coefficients. Building on this decomposition, we analyze the reward dynamics of the chosen and rejected responses and identify the \textbf{disentanglement band (DB)}, a simple, testable condition that characterizes when training can realize the ideal pathway: suppressing the rejected response while maintaining the chosen one, possibly after an initial transient. Guided by the DB, we propose a plug-and-play \textbf{reward calibration (RC)} that adaptively rebalances the chosen and rejected updates so that the condition is satisfied, without redesigning the base objective. Empirical results confirm that this calibration effectively disentangles the two updates and improves alignment performance across diverse objectives.
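As an illustrative sketch rather than the paper's own derivation, the kind of decomposition the abstract describes can be read off a standard margin-based objective such as DPO, whose gradient factors into a scalar weighting coefficient times a shared local update direction; the notation ($\pi_\theta$, $\pi_{\mathrm{ref}}$, $y_w$, $y_l$, $\beta$, $\sigma$) follows common DPO usage and is assumed here, not taken from the paper.

% Illustrative only: the standard DPO gradient written in the
% "scalar weight times shared update direction" form that the
% abstract's decomposition refers to. Notation is assumed, not the paper's.
\begin{align*}
\nabla_\theta \mathcal{L}_{\mathrm{DPO}}
  &= -\,\underbrace{\beta\,\sigma\!\bigl(-\beta\,\Delta_\theta(x, y_w, y_l)\bigr)}_{\text{scalar weighting coefficient}}
     \Bigl[\underbrace{\nabla_\theta \log \pi_\theta(y_w \mid x)
     - \nabla_\theta \log \pi_\theta(y_l \mid x)}_{\text{shared local update direction}}\Bigr],\\
\Delta_\theta(x, y_w, y_l)
  &= \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
   - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}.
\end{align*}

Under this reading, swapping in other margin-based objectives changes only the scalar prefactor, which is the quantity a calibration of the kind described above would rebalance between the chosen and rejected terms.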