GapPO: Gradient-Adaptive Pairwise Preference Optimization
Michelle Chang ⋅ Xiaodi Sun ⋅ Ethan C Chau ⋅ Zhaoqiong Huang ⋅ Arpita Das ⋅ Izzie Lau ⋅ Liyuan Zheng ⋅ Huancheng Chen ⋅ Jingwen Lu
Abstract
Aligning large language models to human preferences requires training on pairwise comparisons between candidate responses. Existing preference optimization methods assign equal gradient weight to every pair, regardless of whether the quality difference is large or negligible: a pair scoring $4.8$ vs. $1.2$ receives the same gradient weight as one scoring $3.2$ vs. $2.9$, diluting clear signal with annotator noise. We introduce GapPO (Gradient-Adaptive Pairwise Preference Optimization), a preference optimization method designed to directly improve pairwise ranking accuracy in large language models. GapPO weights each pair by the absolute quality-score gap $|\delta| = |\texttt{score}_{\text{chosen}} - \texttt{score}_{\text{rejected}}|$, so that gradient mass concentrates on the most discriminative comparisons. Because the model is shaped more by reliable comparisons, its implicit reward function better separates high-quality from low-quality responses at test time. Beyond improving pairwise accuracy (PWA), score-gap weighting improves the Spearman rank correlation between model rewards and annotation scores, the calibration property required to scale from pairwise to listwise ranking. Evaluated on binarized UltraFeedback with Qwen2.5-0.5B, Gemma-2-2B, and Mistral-7B, GapPO consistently outperforms the SimPO, CPO, IPO, and AlphaPO baselines.
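
To make the weighting concrete, the following is a minimal sketch of how per-pair score-gap weights could enter a pairwise loss. The abstract does not specify the base objective or any normalization, so two choices here are assumptions made for illustration only: a logistic loss on the reward margin (in the style of SimPO-like objectives) and rescaling the weights to have mean 1. The function names `gap_weights` and `gap_weighted_loss` are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def gap_weights(chosen_scores, rejected_scores):
    """Weights proportional to |delta| = |score_chosen - score_rejected|.

    Assumption: weights are rescaled to mean 1 so the overall loss scale
    stays comparable to the unweighted case. The abstract does not state
    how (or whether) the weights are normalized.
    """
    gaps = [abs(c - r) for c, r in zip(chosen_scores, rejected_scores)]
    total = sum(gaps)
    if total == 0.0:
        # All pairs are ties: fall back to uniform weights.
        return [1.0] * len(gaps)
    n = len(gaps)
    return [g * n / total for g in gaps]


def gap_weighted_loss(reward_margins, chosen_scores, rejected_scores):
    """Mean of weight * -log(sigmoid(margin)) over the batch.

    reward_margins[i] is the model's implicit reward for the chosen
    response minus that for the rejected response (assumed to be
    computed elsewhere).
    """
    weights = gap_weights(chosen_scores, rejected_scores)
    # -log(sigmoid(m)) equals softplus(-m).
    losses = [softplus(-m) for m in reward_margins]
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)


# The two pairs from the abstract: 4.8 vs. 1.2 and 3.2 vs. 2.9.
weights = gap_weights([4.8, 3.2], [1.2, 2.9])
loss = gap_weighted_loss([0.0, 0.0], [4.8, 3.2], [1.2, 2.9])
```

Under these assumptions, the first pair (gap $3.6$) receives roughly twelve times the weight of the second (gap $0.3$), so its gradient contribution dominates the batch, whereas an unweighted loss would treat the two equally.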