UniRRM: Unified Reasoning Reward Models Across Languages and Evaluation Paradigms
Abstract
Reinforcement learning (RL) excels on tasks with verifiable rewards, but for open-ended tasks the reliability of reward models remains a key challenge. Existing solutions rely either on costly proprietary LLM-as-a-Judge systems or on opaque scalar reward models that lack interpretability. Recent work on generative reward models offers a promising alternative, but it remains constrained by static evaluation criteria, fragmented evaluation paradigms, and limited multilingual support. To address these challenges, we introduce MixReward, a large-scale multilingual dataset spanning six domains and 103 languages and containing both pairwise and listwise data, and propose UniRRM, a unified reasoning reward model supporting multiple languages and evaluation paradigms. UniRRM uses a staged reasoning chain to dynamically generate task-generic and instruction-specific criteria, enabling fine-grained, input-adaptive judgments while maintaining consistency across languages. Experiments demonstrate that UniRRM-8B and UniRRM-14B achieve performance close to the state of the art among models of comparable size across multiple benchmarks and generalize effectively to unseen evaluation paradigms. In addition, ablation studies validate the reliability and effectiveness of UniRRM.