ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking
Qiang Zhang ⋅ Boli Chen ⋅ Fanrui Zhang ⋅ Ruixue Ding ⋅ Shihang Wang ⋅ Qiuchen Wang ⋅ Yinfeng Huang ⋅ Haonan Zhang ⋅ Rongxiang Zhu ⋅ Xin Li ⋅ Houquan Zhou ⋅ Pengjun Xie ⋅ Kaipeng Zhang ⋅ Jingren Zhou ⋅ Jiawei Liu
Abstract
Reinforcement learning (RL) has advanced LLM agents on verifiable tasks but remains challenging for open-ended tasks with vast solution spaces (e.g., complex travel planning). Lacking objective ground truth, current RL algorithms rely on reward models that assign scalar scores to individual responses. We contend that such pointwise scoring induces discrimination collapse: the reward model fails to distinguish subtle advantages among trajectories, compressing intra-group rewards into a narrow range. This drowns effective reward signals in reward-model noise and stalls optimization. To address this, we propose ArenaRL, a reinforcement learning paradigm that shifts from pointwise scalar scoring to intra-group relative ranking. ArenaRL introduces a process-aware pairwise evaluation scheme with multi-level rubrics for fine-grained relative scoring. We further construct an intra-group adversarial arena and devise a tournament-based ranking scheme to obtain stable advantage signals. ArenaRL achieves high-precision advantage estimation with only $O(N)$ computational complexity, striking a favorable balance between efficiency and accuracy. Furthermore, to address the lack of full-cycle benchmarks for open-ended agents, we introduce two high-quality benchmarks, Open-Travel and Open-DeepResearch, encompassing full training and multi-dimensional evaluation pipelines. Extensive experiments across three open-ended tasks validate the effectiveness of ArenaRL.
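To make the $O(N)$ claim concrete, here is a minimal sketch of how a tournament-based relative ranking could turn pairwise judgments into group-relative advantages. This is an illustration only, not the paper's exact algorithm: the `judge` callable (standing in for an LLM evaluator applying the multi-level rubrics), the single-elimination bracket, and the rank-to-advantage normalization are all assumptions made for the example. A bracket over $N$ trajectories requires only $N-1$ pairwise comparisons, i.e. $O(N)$ judge calls, versus $O(N^2)$ for all-pairs comparison.

```python
import random
from typing import Callable, List

def tournament_rank(trajectories: List[str],
                    judge: Callable[[str, str], int]) -> List[int]:
    """Rank a group of trajectories via a single-elimination tournament.

    `judge(a, b)` is a hypothetical pairwise evaluator returning 0 if `a`
    wins and 1 if `b` wins. A bracket over N items uses N - 1 comparisons.
    Returns one rank per trajectory: the round it survived until (ties are
    shared by trajectories eliminated in the same round; higher = better).
    """
    alive = list(range(len(trajectories)))
    random.shuffle(alive)                 # random bracket seeding
    eliminated_at = {i: 0 for i in alive}
    rnd = 0
    while len(alive) > 1:
        rnd += 1
        nxt = []
        for j in range(0, len(alive) - 1, 2):
            a, b = alive[j], alive[j + 1]
            winner = a if judge(trajectories[a], trajectories[b]) == 0 else b
            loser = b if winner == a else a
            eliminated_at[loser] = rnd
            nxt.append(winner)
        if len(alive) % 2 == 1:           # odd one out gets a bye
            nxt.append(alive[-1])
        alive = nxt
    eliminated_at[alive[0]] = rnd + 1     # champion outlasts every round
    return [eliminated_at[i] for i in range(len(trajectories))]

def rank_advantages(ranks: List[int]) -> List[float]:
    """Zero-mean, unit-scale advantages from relative ranks (GRPO-style)."""
    mu = sum(ranks) / len(ranks)
    sd = (sum((r - mu) ** 2 for r in ranks) / len(ranks)) ** 0.5 or 1.0
    return [(r - mu) / sd for r in ranks]
```

Because the advantages come from ranks rather than raw scalar scores, two trajectories that a pointwise reward model would score nearly identically still receive well-separated advantage signals, which is the intuition behind avoiding discrimination collapse.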