One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models
Daniel Fein ⋅ Max Lamparth ⋅ Violet Xiang ⋅ Mykel Kochenderfer ⋅ Nick Haber
Abstract
Reward models (RMs) are crucial for the online alignment of language models (LMs) with human preferences. However, RM-based preference tuning is vulnerable to reward hacking, whereby LM policies learn undesirable behaviors from flawed RMs. By systematically measuring biases in five high-quality RMs, including the state of the art, we find that, despite prior mitigation efforts, biases toward length, sycophancy, and overconfidence persist. We also discover new biases toward model-specific “styles” and answer order. We categorize RM failures by complexity and propose a simple post-hoc intervention to mitigate low-complexity biases that arise from spurious correlations. Our proposed $\textbf{mechanistic reward shaping}$ reduces targeted biases without degrading reward quality while using minimal labeled data. The method is model-internal, extensible to new biases, and generalizes out-of-distribution.
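The abstract does not specify how the model-internal, post-hoc shaping is implemented. As a purely illustrative sketch, one common way to remove a low-complexity spurious correlation (such as length) with minimal labeled data is to fit a linear bias direction in the reward model's hidden states and project it out before the reward head; all names and shapes below are hypothetical and may differ from the paper's actual method.

```python
import numpy as np

# Hypothetical sketch (not taken from the paper): remove a spurious linear
# correlation, e.g. response length, from RM hidden states before scoring.

def fit_bias_direction(hidden_states: np.ndarray, spurious_attr: np.ndarray) -> np.ndarray:
    """Least-squares fit of a unit direction in hidden space that predicts
    the spurious attribute (e.g., response length)."""
    X = hidden_states - hidden_states.mean(axis=0)
    y = spurious_attr - spurious_attr.mean()
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w / np.linalg.norm(w)

def project_out(h: np.ndarray, bias_dir: np.ndarray) -> np.ndarray:
    """Remove the component of a hidden state along the bias direction."""
    return h - np.dot(h, bias_dir) * bias_dir

# Toy usage with synthetic data (shapes are illustrative):
rng = np.random.default_rng(0)
H = rng.normal(size=(256, 64))                              # RM hidden states
lengths = H @ rng.normal(size=64) + rng.normal(size=256)    # toy spurious signal
direction = fit_bias_direction(H, lengths)
H_shaped = np.stack([project_out(h, direction) for h in H]) # states fed to the reward head
```

Whether the paper's mechanistic reward shaping uses linear probing, projection, or another internal edit is not stated in this abstract; the sketch only illustrates the general idea of a post-hoc, model-internal correction learned from a small labeled set.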