Learning Useful Supervision for Reinforcement Learning in Reasoning Models
Abstract
Supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) are two widely used post-training paradigms for improving the reasoning ability of large language models (LLMs). Recent methods attempt to integrate SFT and RLVR in a single stage by reweighting or scheduling their objectives. However, such coupling can be counterproductive: supervised updates are not uniformly beneficial for reward optimization and can diminish reward gains. To address this, we propose \textsc{BRIDGE}, a scalable framework in which SFT learns to supervise RL by selectively transferring knowledge that improves reward optimization. Specifically, \textsc{BRIDGE} employs two nested optimization loops during meta-training: the inner loop updates the base model parameters with a fused SFT--RL gradient, while the outer loop updates a lightweight low-rank adapter (LoRA) that coordinates the two objectives by maximizing a reward-gap signal, defined as the reward advantage of joint SFT--RL training over an RL-only baseline. Across three model scales and five reasoning benchmarks, \textsc{BRIDGE} consistently outperforms two-stage cold-start, naive mixing, and representative single-stage integration baselines, yielding an average absolute improvement of more than three points and more stable training dynamics.
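To make the nested structure concrete, the following is a minimal sketch of the meta-training loop as described above, not the authors' implementation: the specific SFT/RL losses, the reward, the gating role of the adapter, and all function and variable names (e.g., \texttt{sft\_loss}, \texttt{rl\_loss}, \texttt{fused\_update}) are illustrative assumptions. The inner step applies a fused SFT--RL update to toy "base" parameters; the outer step adjusts the adapter to maximize the reward gap over an RL-only baseline update.
\begin{verbatim}
# Minimal sketch (assumptions throughout); a toy parameter vector stands in
# for the base model, and all loss/reward functions are placeholders.
import torch

def sft_loss(params, batch):   # assumed: supervised loss on reference traces
    return ((params - batch["sft_target"]) ** 2).mean()

def rl_loss(params, batch):    # assumed: policy-gradient surrogate on rollouts
    return -(params * batch["advantage"]).mean()

def reward(params, batch):     # assumed: verifiable reward on held-out prompts
    return -((params - batch["answer"]) ** 2).mean()

def fused_update(params, adapter, batch, lr=1e-2):
    """Inner step: the adapter gates how much SFT gradient is mixed into RL."""
    gate = torch.sigmoid(adapter).mean()                 # assumed gating form
    loss = rl_loss(params, batch) + gate * sft_loss(params, batch)
    grad, = torch.autograd.grad(loss, params, create_graph=True)
    return params - lr * grad        # differentiable w.r.t. the adapter

def rl_only_update(params, batch, lr=1e-2):
    """Baseline inner step without any SFT supervision."""
    grad, = torch.autograd.grad(rl_loss(params, batch), params)
    return params - lr * grad

params  = torch.zeros(8, requires_grad=True)             # toy base parameters
adapter = torch.zeros(8, requires_grad=True)             # lightweight LoRA stand-in
meta_opt = torch.optim.Adam([adapter], lr=1e-2)
batch = {"sft_target": torch.ones(8), "advantage": torch.randn(8),
         "answer": torch.full((8,), 0.5)}

for step in range(100):
    # Inner loop: fused SFT--RL update (kept differentiable so the outer
    # loop can backpropagate the reward gap into the adapter).
    fused_params = fused_update(params, adapter, batch)
    baseline_params = rl_only_update(
        params.detach().clone().requires_grad_(True), batch).detach()

    # Outer loop: update the adapter to maximize the reward gap between the
    # fused update and the RL-only baseline.
    reward_gap = reward(fused_params, batch) - reward(baseline_params, batch)
    meta_opt.zero_grad()
    (-reward_gap).backward()
    meta_opt.step()

    # Commit the fused update to the base parameters for the next iteration.
    params = fused_params.detach().requires_grad_(True)
\end{verbatim}
In this sketch the outer gradient flows through the inner update via \texttt{create\_graph=True}, which is one standard way to implement bi-level optimization; whether \textsc{BRIDGE} differentiates through the inner step or uses an approximation is not specified in the abstract.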