Poster
Policy Filtration for RLHF to Mitigate Noise in Reward Models
Chuheng Zhang · Wei Shen · Li Zhao · Xuyun Zhang · Xiaolong Xu · Wanchun Dou · Jiang Bian
West Exhibition Hall B2-B3 #W-420
Recently, RLHF has become a key technique for LLMs, and PPO is the dominant RL algorithm used for industrial-scale LLMs. Since we optimize the LLM to follow human values, which cannot be captured by a precise mathematical formulation, researchers train a large reward model on a limited set of human annotations and use this model to guide the learning of the LLM. This introduces an issue for the PPO algorithm, which was originally designed to optimize under low-noise reward signals. Our motivation is therefore to explore techniques that alleviate the problems caused by noisy rewards.

Following this idea, we first monitor the fidelity of the reward signals given by the reward model. If we can find indicators that correlate with the fidelity of the reward signal, we can filter samples based on these indicators to increase the signal-to-noise ratio during training and thus improve performance. Fortunately, we find that the reward value itself is a good indicator: samples with the highest and lowest rewards are more reliable.

We then conduct experiments using this sample-filtering strategy and find that it works well across a wide range of tasks, from math reasoning and code generation to knowledge and instruction-following tasks. Our work is mostly based on empirical observations and experiments, and we plan to investigate the theoretical insights behind our method in future work.
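To make the filtering idea concrete, here is a minimal sketch of a reward-based sample filter that keeps only the highest- and lowest-reward rollouts before a PPO update. The function name, the `keep_ratio` value, and the commented training-loop calls are illustrative assumptions, not the exact implementation from the paper.

```python
import numpy as np

def filter_by_reward(samples, rewards, keep_ratio=0.5):
    """Keep only samples whose reward-model scores fall in the extreme
    (highest and lowest) tails, where the reward signal is assumed to be
    more reliable, and discard the middle of the distribution.

    samples    : list of rollout records (e.g. prompt/response pairs)
    rewards    : sequence of reward-model scores, one per sample
    keep_ratio : fraction of the batch to retain, split evenly between
                 the top and bottom tails (assumed value for illustration)
    """
    rewards = np.asarray(rewards)
    k = max(1, int(len(samples) * keep_ratio / 2))
    order = np.argsort(rewards)                          # ascending by reward
    keep_idx = np.concatenate([order[:k], order[-k:]])   # lowest k + highest k
    return [samples[i] for i in keep_idx], rewards[keep_idx]

# Hypothetical use inside an RLHF loop: score rollouts with the reward
# model, filter, then run the PPO update only on the retained samples.
# rollouts = policy.generate(prompts)
# scores = reward_model.score(rollouts)
# kept, kept_scores = filter_by_reward(rollouts, scores, keep_ratio=0.5)
# ppo_trainer.step(kept, kept_scores)
```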