Layer-wise Gradient Disentanglement: Decoupling Semantics and Preferences in Direct Preference Optimization
Abstract
Direct Preference Optimization (DPO) has become a widely adopted approach for aligning large language models with human preferences. However, standard DPO treats all preference pairs uniformly, overlooking the heterogeneous nature of the learning problem: some samples demand sophisticated semantic understanding of the prompt, while others require nuanced discrimination between similar responses. We argue that these two objectives should be disentangled during training. Through gradient analysis, we identify a layer-wise localization phenomenon in which semantic complexity predominantly drives lower-layer updates while preference uncertainty predominantly modulates upper-layer updates. Building on this insight, we propose Gradient-Guided Disentangled DPO (GDO-DPO), a curriculum framework that independently regulates the learning pace along each of these two dimensions based on layer-specific gradient stability. Experiments on UltraFeedback and HH-RLHF demonstrate consistent improvements, with GDO-DPO outperforming standard DPO by 4.1\% on AlpacaEval 2.0 and showing particularly strong gains on reasoning-intensive tasks.
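For reference, the standard DPO objective that GDO-DPO builds on is the pairwise logistic loss below, written in the usual DPO notation: $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ a frozen reference model, $\beta$ an inverse-temperature hyperparameter, $\sigma$ the logistic sigmoid, and $(x, y_w, y_l) \sim \mathcal{D}$ a prompt with preferred and dispreferred responses. This restates the well-known baseline only; the layer-wise curriculum weighting that GDO-DPO adds is not specified in this abstract.
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
\]
Because the loss depends only on the margin between the two log-ratio terms, every preference pair contributes through the same scalar objective regardless of how semantically demanding the prompt is or how close the two responses are, which is the uniformity the abstract argues against.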