Unbiased Principles, Robust Rewards
Qingnan Ren ⋅ Zhen Fang ⋅ Shiting Huang ⋅ Yu Zeng ⋅ Lin Chen ⋅ Zehui Chen ⋅ Feng Zhao
Abstract
Reward models are central to Reinforcement Learning from Human Feedback (RLHF), especially for open-ended tasks where evaluation is inherently multi-dimensional. Recent Generative Reward Models (GRMs) improve interpretability by producing natural-language rationales and task-specific evaluation principles. However, most existing GRMs generate principles after reading the actor's response, i.e., $Q+R \rightarrow P$. We show that this coupling induces Principle Drift: when the actor performs reward hacking (e.g., verbosity, self-aggrandizement, or hallucinated self-justifications), the reward model may shift its criteria to rationalize the response, yielding inflated scores that in turn reinforce hacking during RL. We propose IP-GRM (Independent Principle GRM), a two-stage framework that first generates principles solely from the question ($Q \rightarrow P$) and then evaluates the response conditioned on $(Q, R, P)$. This decoupling keeps the criteria invariant to response content, producing more objective and stable reward signals. For efficient training, we further introduce a Principle Cache strategy that reuses principles within a group, improving GRPO throughput by 23.66\% while maintaining strict intra-group consistency. In GRPO training on creative writing, IP-GRM suppresses reward hacking and improves WritingBench and CreativeWriting-v3 scores by up to +4.6 and +7.1 points, respectively, with a Qwen3-8B backbone, achieving state-of-the-art performance among open-source models.
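A minimal sketch of the two-stage pipeline and the Principle Cache described above. All names here (`llm`, `generate_principles`, `score_response`, `grpo_group_rewards`) are illustrative placeholders, not the paper's actual implementation; the point is that principles are generated from the question alone and then reused across an entire GRPO group.

```python
# Hypothetical GRM backend call; plug in any generative-reward-model inference here.
def llm(prompt: str) -> str:
    raise NotImplementedError("replace with your GRM inference call")

def generate_principles(question: str) -> str:
    # Stage 1 (Q -> P): principles depend only on the question,
    # so they cannot drift toward the actor's response.
    return llm(f"List evaluation principles for this task:\n{question}")

def score_response(question: str, response: str, principles: str) -> float:
    # Stage 2: judge the response against the fixed principles, i.e. (Q, R, P).
    verdict = llm(
        f"Question:\n{question}\n\nPrinciples:\n{principles}\n\n"
        f"Response:\n{response}\n\nScore 0-10:"
    )
    return float(verdict.strip())

def grpo_group_rewards(question: str, responses: list[str]) -> list[float]:
    # Principle Cache: generate principles once per question and reuse them
    # for every response in the GRPO group, keeping intra-group criteria
    # identical and saving one generation call per additional response.
    principles = generate_principles(question)
    return [score_response(question, r, principles) for r in responses]
```

Because the principles are fixed before any response is read, a verbose or self-aggrandizing response cannot pull the criteria toward itself, which is the failure mode the abstract calls Principle Drift.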