
Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons
Banghua Zhu · Michael Jordan · Jiantao Jiao

Thu Jul 27 01:30 PM -- 03:00 PM (PDT) @ Exhibit Hall 1 #538
We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). We show that when the underlying true reward is linear, under both the Bradley-Terry-Luce (BTL) model (pairwise comparison) and the Plackett-Luce (PL) model ($K$-wise comparison), the MLE converges under a certain semi-norm for the family of linear rewards. On the other hand, when training a policy based on the learned reward model, we show that MLE fails, while a pessimistic MLE provides policies with good performance under a certain coverage assumption. We also show that under the PL model, both the true MLE and an alternative MLE that splits the $K$-wise comparison into pairwise comparisons converge, while the true MLE is asymptotically more efficient. Our results validate the empirical success of existing RLHF algorithms and provide new insights for algorithm design. Our analysis can also be applied to the problems of online RLHF and inverse reinforcement learning.
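The MLE for the pairwise (BTL) setting with a linear reward reduces to logistic regression on feature differences. The following is a minimal sketch of that reduction, assuming a hypothetical true parameter `theta_star` and synthetic Gaussian features; it is an illustration of the model, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 2000
theta_star = np.array([1.0, -0.5, 0.25])  # hypothetical true linear reward parameter

# Simulate pairwise comparisons: each query is a pair of feature vectors; under
# the BTL model, the first item is preferred with probability
# sigmoid(<theta_star, x_a - x_b>).
xa = rng.normal(size=(n, d))
xb = rng.normal(size=(n, d))
diff = xa - xb
p_win = 1.0 / (1.0 + np.exp(-diff @ theta_star))
y = (rng.random(n) < p_win).astype(float)  # label 1 if the first item is preferred

# MLE via gradient ascent on the BTL log-likelihood, i.e. logistic regression
# on the feature differences.
theta = np.zeros(d)
lr = 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-diff @ theta))
    theta += lr * diff.T @ (y - p) / n

print(theta)  # approaches theta_star as n grows
```

Identifiability is only up to the semi-norm mentioned in the abstract: shifting the reward of both items in a pair by a common constant leaves every comparison probability unchanged, so only reward differences are estimable.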

Author Information

Banghua Zhu (University of California, Berkeley)
Michael Jordan (UC Berkeley)
Jiantao Jiao (University of California, Berkeley)
