

Poster in Workshop on Theoretical Foundations of Foundation Models (TF2M)

Active Preference Optimization for Sample Efficient RLHF

Nirjhar Das · Souradip Chakraborty · Aldo Pacchiano · Sayak Ray Chowdhury


Abstract: Reinforcement Learning from Human Feedback (RLHF) is pivotal in aligning Large Language Models (LLMs) with human preferences. Although aligned LLMs have shown remarkable abilities in numerous tasks, their reliance on high-quality human preference data creates a costly bottleneck. Current methods for RLHF collect human feedback by uniformly sampling prompt-generation pairs from a fixed dataset. We show that, for a limited number of human feedback samples, this leads to sub-optimal alignment. We then develop an active-learning algorithm, $\textit{Active Preference Optimization}$ ($\texttt{APO}$), which significantly enhances model alignment by querying preference feedback on the most important samples, thus achieving superior performance under a small sample budget. We analyze the theoretical guarantees of $\texttt{APO}$, showing that the suboptimality gap of the policy learned via $\texttt{APO}$ scales as $O(1/\sqrt{T})$ for a sample budget of $T$. We perform detailed experimental evaluations on practical preference datasets to validate $\texttt{APO}$'s efficacy over existing methods, establishing it as a sample-efficient, cost-effective, and scalable solution for alignment.
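The core idea above is to replace uniform sampling of prompt-generation pairs with an active query rule. As a minimal sketch only, and not the authors' exact criterion: a common uncertainty-based acquisition rule from the active-learning and generalized-linear-bandit literature scores each candidate pair by the Mahalanobis norm of its feature vector under the inverse design matrix of already-queried pairs, and requests human feedback on the highest-scoring one. All names and the feature construction here are illustrative assumptions.

```python
import numpy as np

def select_query(pool_features, queried_features, reg=1.0):
    """Return the index of the candidate pair with the largest
    uncertainty score x^T V^{-1} x, where V is the regularized
    design matrix built from already-queried pairs.
    Illustrative acquisition rule, not necessarily the APO criterion."""
    d = pool_features.shape[1]
    V = reg * np.eye(d)
    for x in queried_features:
        V += np.outer(x, x)
    V_inv = np.linalg.inv(V)
    # Score every candidate pair at once: scores[i] = x_i^T V^{-1} x_i.
    scores = np.einsum('ij,jk,ik->i', pool_features, V_inv, pool_features)
    return int(np.argmax(scores))

# Toy usage: features stand in for (hypothetical) embedding differences
# between the two generations of each prompt-generation pair.
rng = np.random.default_rng(0)
pool = rng.normal(size=(500, 16))                       # candidate pairs
queried = [pool[rng.integers(500)] for _ in range(5)]   # already-labeled pairs
next_idx = select_query(pool, queried)
print("Query human feedback on pair", next_idx)
```

Under such a rule, feedback is concentrated on pairs the current model is least certain about, which is the kind of sample-budget-aware selection the abstract attributes to $\texttt{APO}$.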
