

Poster

RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

Harrison Lee · Samrat Phatale · Hassan Mansoor · Thomas Mesnard · Johan Ferret · Kellie Lu · Colton Bishop · Ethan Hall · Victor Carbune · Abhinav Rastogi · Sushant Prakash


Abstract:

Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels can be expensive. RL from AI Feedback (RLAIF), introduced in Bai et al. (2022b), offers a promising alternative to RLHF by training the reward model (RM) on preferences generated by an off-the-shelf LLM in lieu of human annotators. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, we show that RLAIF achieves comparable performance to RLHF. Furthermore, we take a step towards 'self-improvement' by demonstrating that RLAIF can outperform the supervised fine-tuned baseline even when the AI labeler is the same size as the policy model. Finally, we introduce direct RLAIF (d-RLAIF), a simple technique that obtains rewards directly from an off-the-shelf LLM during RL and achieves superior performance to canonical RLAIF. Our results suggest that RLAIF can achieve performance on par with using human feedback, offering a potential solution to the scalability limitations of RLHF.
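
As a rough illustration of the d-RLAIF idea described in the abstract, the Python sketch below shows how a reward could be obtained directly from an off-the-shelf LLM during RL rather than from a trained reward model. The llm_complete callable, the prompt template, and the 1-10 rating scale are placeholders and assumptions for illustration, not the authors' exact setup.

import re
from typing import Callable

def d_rlaif_reward(prompt: str, response: str,
                   llm_complete: Callable[[str], str]) -> float:
    """Ask an off-the-shelf LLM to rate a policy response and use the
    normalized rating directly as the RL reward, skipping a learned RM.

    `llm_complete` is a placeholder for any text-in/text-out LLM call.
    """
    rating_prompt = (
        "Rate the quality of the following summary of the given text "
        "on a scale from 1 to 10. Reply with a single number.\n\n"
        f"Text: {prompt}\n\nSummary: {response}\n\nRating:"
    )
    raw = llm_complete(rating_prompt)
    # Extract the first integer in the reply; fall back to the lowest score.
    match = re.search(r"\d+", raw)
    score = float(match.group()) if match else 1.0
    # Clamp and normalize to [0, 1] so it can plug into a standard RL loop (e.g., PPO).
    return (min(max(score, 1.0), 10.0) - 1.0) / 9.0

In a training loop, this function would be called on each sampled (prompt, response) pair to produce the scalar reward used by the policy-gradient update, in place of a reward model trained on AI-generated preference labels.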
