Poster Tue, Jul 7, 2026 • 6:30 PM – 8:15 PM PDT HALL A #2205

Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration

Zhicheng Yang ⋅ Zhijiang Guo ⋅ Yinya Huang ⋅ Yongxin Wang ⋅ Dongchun Xie ⋅ Hanhui Li ⋅ Yiwei Wang ⋅ Xiaodan Liang ⋅ Jing Tang

Project Page

Abstract

Reinforcement Learning with Verifiable Reward (RLVR) is a powerful method for enhancing the reasoning abilities of Large Language Models, but its full potential is limited by a lack of exploration in two key areas: \textbf{Depth} (the difficulty of problems) and \textbf{Breadth} (the number of training instances). Our analysis of the popular GRPO algorithm reveals a bias that down-weights difficult, low-accuracy problems, which are crucial for improving reasoning skills. To address this, we introduce Difficulty Adaptive Rollout Sampling (DARS), a method that re-weights difficult problems by using targeted, multi-stage rollouts. This approach increases the number of rollout outcomes for these harder problems according to our proposed re-balancing schedules and leads to consistent gains in \textit{Pass@K}. We also found that simply enlarging the rollout size isn't effective and can even harm performance. We also investigated the role of breadth by scaling the batch size and using full-batch updates. This significantly improved \textit{Pass@1} performance by maintaining high token-level entropy, which indicates continued exploration and reduced gradient noise. Finally, we present DARS-Breadth, a combined approach that uses DARS with a large breadth of training data. This method demonstrates simultaneous gains in both \textit{Pass@K} and \textit{Pass@1}, confirming that depth (adaptive exploration) and breadth (scaling the training data) are orthogonal and essential dimensions for unlocking the full reasoning power of RLVR.