Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation
Ziniu Li ⋅ Congliang Chen ⋅ Tianyun Yang ⋅ Tian Ding ⋅ Ruoyu Sun ⋅ Ge Zhang ⋅ Wenhao Huang ⋅ Zhi-Quan Luo
Abstract
Large Language Models (LLMs) can improve via reinforcement learning, where they generate trajectories to explore and discover better solutions. This exploration is an investment of finite GPU compute in exchange for learning signals. However, current methods typically allocate a small, uniform exploration budget to every task, which is both inefficient and ineffective: easy tasks consistently succeed while difficult tasks consistently fail. For policy optimization algorithms such as Group Relative Policy Optimization (GRPO), both edge cases yield zero gradients, so the computation is wasted. We address this by reframing exploration budget allocation as a resource optimization problem. Viewing each task's exploration as an ``item'' with a distinct ``learning value'' and ``computational cost'', we establish a connection to the classical knapsack problem and derive an optimal assignment rule. Applied to GRPO, our method increases the ratio of effective (non-zero) gradients by 40\%. As a computational ``free lunch'', it enables substantially larger budgets (e.g., 93 rollouts) for challenging tasks, allocations that would be too expensive under a uniform allocation scheme. These efficiency gains translate into meaningful improvements on mathematical reasoning benchmarks, with average gains of 2--4 points and peak gains of 9 points. Notably, achieving comparable performance with traditional homogeneous allocation would require approximately $2\times$ the computational resources.
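To make the knapsack framing concrete, below is a minimal Python sketch of one way such an allocator could look. It is an illustration, not the paper's derived rule, and everything in it is an assumption: each task's learning value is taken to be the probability that its rollouts produce a non-zero GRPO gradient (i.e., contain both a success and a failure), every rollout has unit cost, and per-task success rates are treated as known. The names value, allocate_budget, success_rates, and min_per_task are hypothetical. Under these assumptions the per-task value 1 - p^n - (1-p)^n is concave and increasing in the budget n, so greedily spending one rollout at a time on the largest marginal gain solves the resulting separable knapsack exactly.

    # Illustrative sketch only (assumed value model, not the paper's exact rule).
    import heapq

    def value(p: float, n: int) -> float:
        """Probability that n i.i.d. rollouts with success rate p contain
        both a success and a failure (i.e., yield a non-zero GRPO gradient)."""
        if n == 0:
            return 0.0
        return 1.0 - p**n - (1.0 - p)**n

    def allocate_budget(success_rates, total_budget, min_per_task=1):
        """Split total_budget rollouts across tasks to maximize total value.
        Greedy on marginal gains is optimal here because value(p, n) is
        concave in n, so the allocation problem is a separable concave
        knapsack with unit-cost items."""
        assert total_budget >= min_per_task * len(success_rates)
        n = [min_per_task] * len(success_rates)
        # Max-heap (via negation) of each task's gain from one more rollout.
        heap = [(-(value(p, min_per_task + 1) - value(p, min_per_task)), i)
                for i, p in enumerate(success_rates)]
        heapq.heapify(heap)
        for _ in range(total_budget - min_per_task * len(success_rates)):
            _, i = heapq.heappop(heap)
            n[i] += 1
            p, k = success_rates[i], n[i]
            heapq.heappush(heap, (-(value(p, k + 1) - value(p, k)), i))
        return n

    # p = 0.95: nearly solved; p = 0.5: most informative; p = 0.05: very hard.
    print(allocate_budget([0.95, 0.5, 0.05], total_budget=48))

Note how this toy model already reproduces the abstract's intuition: tasks with success rates near 0 or 1 get few rollouts at first (one extra sample rarely flips an all-pass or all-fail batch into a mixed one), while very hard tasks only become worth funding once their budget is large enough for a success to become plausible.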