Poster Wed, Jul 8, 2026 • 1:00 AM – 2:45 AM PDT HALL A #1901

Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models

Yun Qu ⋅ Qi Wang ⋅ Yixiu Mao ⋅ Heming Zou ⋅ Yuhang Jiang ⋅ Weijie Liu ⋅ Clive Bai ⋅ Kai Yang ⋅ Yangkun Chen ⋅ Saiyong Yang ⋅ Xiangyang Ji

Abstract

Reinforcement learning enhances the reasoning capabilities of large language models but often incurs high computational costs due to rollout-intensive optimization. Online prompt selection offers a plausible solution by prioritizing informative prompts to improve training efficiency. However, current methods either depend on costly, exact evaluations or construct prompt-specific predictive models that do not generalize across prompts. This study introduces Generalizable Predictive Prompt Selection (GPS), which performs Bayesian inference of prompt difficulty using a lightweight generative model trained on the shared optimization history. Intermediate-difficulty prioritization and history-anchored diversity are incorporated into the batch acquisition principle to select informative prompt batches. The small predictive model also generalizes at test time, enabling efficient allocation of computation. Experiments across varied reasoning benchmarks show that GPS substantially improves training efficiency, final performance, and test-time efficiency over strong baseline methods. The code is available at https://github.com/thu-rllab/GPS.
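To make the batch acquisition idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a predicted success probability and an embedding vector are already available for each candidate prompt, and it stands in for the Bayesian generative model with plain input values. The names `intermediate_score`, `select_batch`, `diversity_weight`, and the reading of "history-anchored diversity" as a cosine-similarity penalty against recently used prompts are all illustrative assumptions.

```python
import math

def intermediate_score(p_success, target=0.5):
    """Score in [0, 1]; highest when predicted success is near `target`.

    Assumption: "intermediate difficulty" is modeled as closeness to 0.5.
    """
    return 1.0 - abs(p_success - target) / max(target, 1.0 - target)

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_batch(pred_success, embeddings, history, batch_size, diversity_weight=0.5):
    """Greedily pick prompt indices for the next training batch.

    Each candidate is scored by its intermediate-difficulty score minus a
    penalty for similarity to the closest anchor. Anchors start as the
    embeddings of previously used prompts (`history`, a hypothetical input)
    and grow with each selected prompt.
    """
    anchors = list(history)
    remaining = set(range(len(pred_success)))
    chosen = []
    while remaining and len(chosen) < batch_size:
        def gain(i):
            sim = max((cosine(embeddings[i], a) for a in anchors), default=0.0)
            return intermediate_score(pred_success[i]) - diversity_weight * sim
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=gain)
        chosen.append(best)
        remaining.remove(best)
        anchors.append(embeddings[best])
    return chosen

# Hypothetical toy data: four prompts, 2-D embeddings, one prompt in the history.
pred_success = [0.5, 0.5, 0.95, 0.05]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
history = [[1.0, 0.0]]
batch = select_batch(pred_success, embeddings, history, batch_size=2)
```

In this toy setup, prompts 0 and 1 share the same intermediate predicted difficulty, but prompt 1 is chosen first because prompt 0 duplicates an embedding already present in the history. A greedy loop is used here only for simplicity; the paper's actual acquisition rule may differ.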

Lay Summary

Training large language models for reasoning with reinforcement learning often requires substantial computational resources because models must repeatedly generate and evaluate responses for many prompts. However, not all prompts contribute equally to learning. Some are too easy to provide useful improvement signals, while others may be too difficult for effective learning. This work introduces Generalizable Predictive Prompt Selection (GPS), a lightweight method that predicts which prompts are likely to be the most informative for training. Instead of exhaustively evaluating every prompt, GPS learns from previous shared optimization history to estimate prompt difficulty and select diverse, useful training prompts. The method can also be used during inference to allocate computation more efficiently. Experiments on multiple reasoning benchmarks show that GPS improves training efficiency, final model performance, and test-time efficiency compared with existing prompt selection approaches.