

Poster

Gibbs Sampling from Human Feedback: A Provable KL-constrained Framework for RLHF

Wei Xiong · Hanze Dong · Chenlu Ye · Ziqi Wang · Han Zhong · Heng Ji · Nan Jiang · Tong Zhang


Abstract:

This paper studies the theoretical foundations of aligning generative models with Reinforcement Learning from Human Feedback (RLHF). We consider a standard mathematical formulation, the reverse-KL regularized contextual bandit for RLHF. Despite its widespread practical use, a rigorous theoretical analysis of this formulation remains open. We investigate its behavior in three distinct settings (offline, online, and hybrid) and propose efficient algorithms with finite-sample theoretical guarantees. Our work also bridges the gap between theory and practice by connecting our theoretical insights to existing practical algorithms such as Direct Preference Optimization (DPO) and Rejection Sampling Optimization (RSO). These findings and connections offer both the theoretical and practical communities new tools and insights for the future design of alignment algorithms.
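For reference, the reverse-KL regularized contextual bandit objective mentioned in the abstract, together with its standard Gibbs-distribution solution, can be sketched as follows. This is a minimal sketch using common notation (reward r, reference policy \pi_0, regularization parameter \eta, prompt distribution d_0); the paper's own notation may differ.

\[
\max_{\pi}\;\; \mathbb{E}_{x \sim d_0,\, a \sim \pi(\cdot\mid x)}\big[ r(x,a) \big]
\;-\; \eta^{-1}\, \mathbb{E}_{x \sim d_0}\Big[ \mathrm{KL}\big( \pi(\cdot\mid x) \,\|\, \pi_0(\cdot\mid x) \big) \Big]
\]

The maximizer is the Gibbs distribution

\[
\pi^{*}(a \mid x) \;\propto\; \pi_0(a \mid x)\, \exp\big( \eta\, r(x,a) \big),
\]

and inverting this relation, \( r(x,a) = \eta^{-1} \log \tfrac{\pi^{*}(a\mid x)}{\pi_0(a\mid x)} + \mathrm{const}(x) \), is the reward reparameterization underlying DPO.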
