Mitigating Reward Hacking in LLM-based Recommendation: A Preference Optimization Approach
Heyu Chen ⋅ Junkang Wu ⋅ Guoqing Hu ⋅ Kexin Huang ⋅ Xiang Wang ⋅ Jiancan Wu
Abstract
Post-training adaptation has become the central paradigm for leveraging large language models (LLMs) in recommendation. While recent preference optimization methods, such as Direct Preference Optimization (DPO), enhance pairwise preference discrimination, they remain vulnerable to \emph{reward hacking}: models exploit imperfections in reward signals, leading to inflated training metrics without genuine recommendation gains. We analyze this issue from a gradient perspective and formalize the concept of the \emph{$\varepsilon$-insensitive region}, where pairwise updates exert little influence on the ordering between positives and unsampled negatives. Under the Bradley–Terry model, we further show that these regions can occupy a substantial fraction of the preference space, inevitably leading to misaligned rankings. To address this issue, we propose Simulated Preference Optimization for Reward-hacking mitigation using Pseudo-negatives (SIRIUS). Our framework introduces pseudo-negative samples to enrich contrastive signals and reduce the prevalence of $\varepsilon$-insensitive regions. Extensive experiments on three public benchmarks show that \our{} consistently improves ranking quality and effectively mitigates reward hacking, providing both theoretical and practical insights for advancing LLM-based recommendation. Our code is available at \url{https://anonymous.4open.science/r/C557-id}.
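For context, a minimal sketch of the standard Bradley–Terry preference probability and DPO objective that the abstract builds on, together with one illustrative way a pseudo-negative could enter the loss; the pseudo-negative $y_p$ and its weight $\lambda$ are assumptions for exposition, not the paper's exact formulation:
% Background: Bradley--Terry preference probability and the standard DPO objective.
\[
  P(y_w \succ y_l \mid x) = \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr),
  \qquad
  \mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}\!\left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)\right].
\]
% Illustrative (assumed) extension: a pseudo-negative y_p adds an extra contrastive term
% weighted by lambda, enriching the gradient signal beyond the single sampled negative y_l.
\[
  \mathcal{L} = \mathcal{L}_{\mathrm{DPO}}
  - \lambda\,\mathbb{E}\!\left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_p \mid x)}{\pi_{\mathrm{ref}}(y_p \mid x)}
  \right)\right].
\]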