Poster

Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective

Jiawei Huang ⋅ Bingcong Li ⋅ Christoph Dann ⋅ Niao He

2025 Poster


Abstract

Sample efficiency is critical for online Reinforcement Learning from Human Feedback (RLHF). While existing works investigate sample-efficient online exploration strategies, the potential of utilizing misspecified yet relevant reward models to accelerate learning remains underexplored. This paper studies how to transfer knowledge from those imperfect reward models in online RLHF. We start by identifying a novel property due to KL-regularization in the RLHF objective: a policy's coverability of the optimal policy is captured by its sub-optimality. Building on this insight, we propose novel transfer learning principles and a theoretical algorithm, Transfer Policy Optimization (TPO), with provable benefits compared to standard online learning. Empirically, inspired by our theoretical findings, we develop a win-rate-based transfer policy selection strategy with improved computational efficiency. Moreover, our empirical transfer learning technique is modular and can be integrated with various policy optimization methods, such as DPO, IPO and XPO, to further enhance their performance. We validate the effectiveness of our method through experiments on summarization tasks.
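
The abstract does not spell out the win-rate-based selection rule, so the following is only a minimal sketch of what such a rule could look like, not the paper's procedure. It assumes each candidate transfer policy (one per imperfect reward model) has been compared against the current learner policy on a batch of preference queries, and that a candidate is only worth transferring from if its empirical win rate exceeds 0.5. The function names, the record format and the 0.5 threshold are all hypothetical.

```python
from typing import Dict, Optional, Tuple


def estimate_win_rate(wins: int, comparisons: int) -> float:
    """Empirical fraction of comparisons a candidate won against the learner."""
    if comparisons <= 0:
        raise ValueError("need at least one comparison")
    return wins / comparisons


def select_transfer_policy(
    records: Dict[str, Tuple[int, int]],
    threshold: float = 0.5,  # assumed cutoff: candidate must beat the learner
) -> Optional[str]:
    """Pick the candidate with the highest win rate above `threshold`.

    `records` maps a (hypothetical) candidate name to (wins, comparisons).
    Returns None when no candidate beats the threshold, in which case the
    caller would fall back to ordinary online learning without transfer.
    """
    best_name: Optional[str] = None
    best_rate = threshold
    for name, (wins, comparisons) in records.items():
        rate = estimate_win_rate(wins, comparisons)
        if rate > best_rate:
            best_name, best_rate = name, rate
    return best_name


# Illustrative numbers only, not results from the paper.
records = {"policy_from_rm_a": (62, 100), "policy_from_rm_b": (48, 100)}
chosen = select_transfer_policy(records)
```

With these made-up counts, the first candidate would be selected because it wins more than half of its comparisons; if every candidate fell at or below the threshold, the function would return `None` and no transfer would take place in that round.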

Lay Summary

Reinforcement Learning from Human Feedback (RLHF) is a key step in fine-tuning large language models (LLMs), but collecting human feedback is expensive. This makes improving sample efficiency (learning from fewer annotations) an essential goal. While most works focus on better exploration or modeling techniques, we take a different approach: can we speed up learning by transferring knowledge from any reward models available, even if they’re imperfect? We introduce Transfer Policy Optimization (TPO), an algorithm with novel transfer learning strategies and provable benefits. Inspired by our theoretical findings, we also propose an empirical version of TPO, a scalable algorithm template that can leverage even flawed reward models to reduce the need for human feedback. Our work highlights an under-explored direction in RLHF: extracting and making use of information from imperfect signals to improve learning efficiency. This opens new possibilities for faster, cheaper, and more flexible training of LLMs.
