Reward Modeling from Natural Language Human Feedback
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) on preference data has become the mainstream approach for training Generative Reward Models (GRMs). Typically, GRMs generate reasoning chains that end with critiques and preference labels, and RLVR uses label correctness as the training reward. However, we demonstrate that this binary classification task makes GRMs susceptible to guessing the correct outcome without producing sound critiques, introducing noise into the reward signal and impairing learning effectiveness. To address this, we propose Reward Modeling from Natural Language Human Feedback (RM-NLHF), which leverages natural language feedback to obtain process reward signals. Specifically, we compute the similarity between GRM-generated and human critiques as the process reward, providing more accurate signals than outcome-only supervision. Because human critiques are difficult to scale, we introduce MetaRM, which learns to predict the process reward from datasets that contain human critiques and generalizes to data without them. Experiments on multiple benchmarks demonstrate that RM-NLHF consistently outperforms state-of-the-art models trained with outcome rewards, confirming the superiority of natural language feedback over binary feedback.
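The abstract describes the process reward as a similarity score between the GRM-generated critique and the human critique, combined with the outcome (label-correctness) reward. The following is a minimal illustrative sketch of that idea, not the paper's implementation: it assumes an off-the-shelf sentence-embedding encoder, cosine similarity as the similarity measure, and an additive mixing weight `lam`, none of which are specified in the abstract.

```python
# Illustrative sketch only: process reward as critique similarity.
# Assumptions (not from the abstract): the encoder choice, cosine
# similarity as the metric, and the additive outcome + lam * process mix.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice


def process_reward(grm_critique: str, human_critique: str) -> float:
    """Similarity between the GRM-generated critique and the human critique."""
    emb = encoder.encode([grm_critique, human_critique], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()


def total_reward(label_correct: bool, grm_critique: str, human_critique: str,
                 lam: float = 0.5) -> float:
    """Outcome reward (label correctness) plus a weighted process reward."""
    outcome = 1.0 if label_correct else 0.0
    return outcome + lam * process_reward(grm_critique, human_critique)
```

In this sketch, MetaRM would replace the human critique with a learned predictor of the process reward on data that lacks human critiques; that component is only named, not specified, in the abstract.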