Poster Tue, Jul 7, 2026 • 6:30 PM – 8:15 PM PDT HALL A #1502

Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO

Ruizhe Shi ⋅ Minhak Song ⋅ Runlong Zhou ⋅ Zihan Zhang ⋅ Maryam Fazel ⋅ Simon Du

Project Page

Abstract

We present a fine-grained theoretical analysis of the performance gap between two-stage reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). Our study decomposes this gap into two sources: the explicit representation gap under exact optimization and the implicit representation gap under finite samples. In the exact optimization setting, we characterize how the relative capacities of the reward and policy model classes influence the quality of the final policy. We show that RLHF, DPO, or online DPO can each outperform the others depending on the type of model mis-specification. Notably, online DPO can outperform both RLHF and standard DPO when the reward and policy model classes are isomorphic and both mis-specified. In the approximate optimization setting, we provide a concrete construction where the ground-truth reward is sparse and show that RLHF requires significantly fewer samples than DPO to recover an effective reward model, highlighting a statistical advantage of two-stage learning. Together, these results provide a comprehensive understanding of the performance gap between RLHF and DPO under various settings, and offer practical insights into when each method is preferred.
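
For readers less familiar with the two objectives being compared, the following is a minimal sketch of the standard Bradley-Terry reward-learning loss (the first stage of RLHF) and the standard DPO loss as they are commonly stated in the literature. It is not code from the paper: the function names, the array-based interface, and the default `beta=0.1` are all illustrative assumptions.

```python
import numpy as np


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)


def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood for an explicit reward model.

    r_chosen / r_rejected: reward scores for the preferred and
    dispreferred response in each preference pair (hypothetical inputs).
    """
    return -np.mean(log_sigmoid(r_chosen - r_rejected))


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss.

    The same Bradley-Terry loss, applied to the implicit (surrogate)
    reward beta * log(pi / pi_ref) instead of an explicit reward model.
    beta=0.1 is an illustrative default, not a value from the paper.
    """
    implicit_chosen = beta * (logp_chosen - ref_logp_chosen)
    implicit_rejected = beta * (logp_rejected - ref_logp_rejected)
    return reward_model_loss(implicit_chosen, implicit_rejected)
```

Written this way, the two losses differ only in the function class that produces the scores: an explicit reward model for RLHF, and the policy's log-ratio against a reference for DPO. That difference in representation is where the analysis above locates the gap between the methods.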

Lay Summary

Which is better for learning from preferences: two-stage RLHF or one-stage DPO? The two are equivalent under strong assumptions, but representation differences break the tie. Our paper reveals their fine-grained performance gaps under various conditions. We show that RLHF is superior to DPO under policy mis-specification, and that online DPO cannot close this gap. When the reward and policy models are both mis-specified, the result depends on the quality of the (surrogate) reward models, and online DPO can help improve that quality. For token-level parameterization, we construct a simple task where the ground-truth reward is a sparse dual-token linear function, and we demonstrate that the policy model is prone to mis-specification. Even without mis-specification, in the finite-sample regime we show a separation from the perspective of sparse recovery: the reward model typically outperforms the DPO model. In conclusion, we present a fine-grained analysis of the performance gap between RLHF and DPO that extends previous studies. These results offer practical insights into when each method is preferred, and explain why RLHF is empirically better than DPO.