Poster
Policy Filtration for RLHF to Mitigate Noise in Reward Models
Chuheng Zhang · Wei Shen · Li Zhao · Xuyun Zhang · Xiaolong Xu · Wanchun Dou · Jiang Bian
West Exhibition Hall B2-B3 #W-420
Recently, RLHF has become a key technique for LLMs, and PPO is the dominant RL algorithm used for industrial-scale LLMs. Since we optimize the LLM to follow human values, which cannot be captured by a precise mathematical formulation, researchers train a large reward model on a limited set of human annotations and use this model to guide the learning of the LLM. This introduces an issue for the PPO algorithm, which was originally designed to optimize under low-noise reward signals. Our motivation is therefore to explore techniques that alleviate the problems caused by noisy rewards.

Following this idea, we first monitor the fidelity of the reward signals given by the reward model. If we can find indicators that correlate with the fidelity of the reward signal, we can filter samples based on these indicators to increase the signal-to-noise ratio during training and thus improve performance. Fortunately, we find that the reward value itself is a good indicator: samples with the highest and lowest rewards are more reliable.

We then conduct experiments using this sample-filtering strategy and find that it works well across a wide range of tasks, from math reasoning and code generation to knowledge and instruction-following tasks. Our work is mostly based on empirical observations and experiments, and we plan to investigate the theoretical insights behind our method in future work.
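To make the filtering idea concrete, here is a minimal sketch of a reward-based sample filter that keeps only the highest- and lowest-reward rollouts before a PPO update. The function name, the `keep_ratio` value, and the commented training-loop calls are illustrative assumptions, not the exact implementation from the paper.

```python
import numpy as np

def filter_by_reward(samples, rewards, keep_ratio=0.5):
    """Keep only samples whose reward-model scores fall in the extreme
    (highest and lowest) tails, where the reward signal is assumed to be
    more reliable, and discard the middle of the distribution.

    samples    : list of rollout records (e.g. prompt/response pairs)
    rewards    : sequence of reward-model scores, one per sample
    keep_ratio : fraction of the batch to retain, split evenly between
                 the top and bottom tails (assumed value for illustration)
    """
    rewards = np.asarray(rewards)
    k = max(1, int(len(samples) * keep_ratio / 2))
    order = np.argsort(rewards)                          # ascending by reward
    keep_idx = np.concatenate([order[:k], order[-k:]])   # lowest k + highest k
    return [samples[i] for i in keep_idx], rewards[keep_idx]

# Hypothetical use inside an RLHF loop: score rollouts with the reward
# model, filter, then run the PPO update only on the retained samples.
# rollouts = policy.generate(prompts)
# scores = reward_model.score(rollouts)
# kept, kept_scores = filter_by_reward(rollouts, scores, keep_ratio=0.5)
# ppo_trainer.step(kept, kept_scores)
```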