Oral Tue, Jul 7, 2026 • 6:00 PM – 6:15 PM PDT HALL B2

Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling

Zhibin Duan ⋅ Guowei Rong ⋅ Zhuo Li ⋅ Bo Chen ⋅ Mingyuan Zhou ⋅ Dandan Guo

Abstract

Reward models learned from human preferences are central to aligning large language models (LLMs) via reinforcement learning from human feedback (RLHF), yet they are often vulnerable to reward hacking due to noisy annotations and systematic biases such as response length or style. We propose the Bayesian Non-Negative Reward Model (BNRM), a principled reward modeling framework that integrates non-negative factor analysis into the Bradley–Terry (BT) preference model. BNRM represents rewards through a sparse, non-negative latent factor generative process that operates at two complementary levels: instance-specific latent variables induce disentangled reward representations, while sparsity over global latent factors acts as an implicit debiasing mechanism that suppresses spurious correlations. Together, this disentanglement-then-debiasing structure enables robust, uncertainty-aware reward learning. To scale BNRM to modern LLMs, we develop an amortized variational inference network conditioned on deep model representations, allowing efficient end-to-end training. Extensive empirical results demonstrate that BNRM substantially mitigates reward over-optimization, improves robustness under distribution shifts, and yields more interpretable reward decompositions than strong baselines.
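
As a rough illustration of the ingredients named in the abstract, the sketch below pairs the standard Bradley–Terry negative log-likelihood with a reward computed from non-negative latent codes. It is not the authors' implementation: the names `enc`, `weights`, `nonneg_reward`, and `lam` are hypothetical, the softplus encoder is a deterministic point estimate standing in for the amortized variational posterior, and the L1 penalty is only an assumed stand-in for the sparsity prior over global factors.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)); the output is always non-negative.
    return np.logaddexp(0.0, x)

def bt_loss(r_chosen, r_rejected):
    # Bradley-Terry negative log-likelihood:
    # -log sigmoid(r_chosen - r_rejected) = softplus(-(r_chosen - r_rejected)).
    return float(np.mean(softplus(-(r_chosen - r_rejected))))

def nonneg_reward(h, enc, weights):
    # Assumption: instance-specific latent codes are made non-negative with a
    # softplus over a linear map of the LLM features h (a point estimate,
    # not a sampled variational posterior).
    theta = softplus(h @ enc)
    # Assumption: the scalar reward is a weighted sum of the latent codes.
    return theta @ weights

rng = np.random.default_rng(0)
d, k = 8, 4                          # feature and latent dimensions (illustrative)
enc = rng.normal(size=(d, k))        # hypothetical encoder matrix
weights = rng.normal(size=k)         # hypothetical global factor weights
h_chosen = rng.normal(size=(5, d))   # stand-in features for preferred responses
h_rejected = rng.normal(size=(5, d)) # stand-in features for rejected responses

lam = 0.01  # hypothetical sparsity strength
loss = bt_loss(nonneg_reward(h_chosen, enc, weights),
               nonneg_reward(h_rejected, enc, weights))
# Assumption: an L1 penalty approximates sparsity over the global factors.
objective = loss + lam * float(np.abs(weights).sum())
```

In this simplified form, the loss shrinks toward zero as the preferred response's reward exceeds the rejected one's, and equals log 2 when the two rewards are tied.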

Lay Summary

Reward models trained from human preferences play a key role in aligning large language models (LLMs), but they can be misled by noisy labels and superficial patterns such as response length or writing style. In this work, we ask how to build reward models that are more robust, more interpretable, and less likely to be “hacked” by these spurious cues. We introduce the Bayesian Non-Negative Reward Model (BNRM), a new framework that combines preference learning with sparse non-negative latent factor modeling. The model first separates the reward signal into instance-specific latent components, helping it represent different sources of preference in a disentangled way. It then uses sparsity over global latent factors to automatically suppress misleading correlations and reduce bias. This two-stage design, disentangling first and debiasing second, also makes reward learning uncertainty-aware. To make the method practical for modern LLMs, we develop an efficient amortized inference network that can be trained end-to-end. Experiments show that BNRM is more resistant to reward over-optimization, works better under distribution shifts, and produces more interpretable reward structures than strong baseline methods.