Unleashing Implicit Rewards: Prefix-Value Learning for Distribution-Level Optimization
Abstract
Process reward models (PRMs) provide fine-grained reward signals along the reasoning process, improving credit assignment beyond outcome-only rewards. Training reliable PRMs, however, often relies on step-level annotations or heavy verification pipelines, making them expensive to scale and to refresh during online RL. Implicit PRMs mitigate this cost by learning decomposable token/step rewards from trajectory-level outcome labels, but they suffer from a train-inference mismatch: training only constrains a sequence-level aggregate, while deployment queries token-level scores as measures of local step quality. As a result, token credits are weakly identified and can become predictive of final success without faithfully reflecting which steps are correct. This unreliability can even undermine a key promise of implicit PRMs, namely scoring many candidate tokens, because noisy per-token advantages may systematically reinforce incorrect continuations. We address this with a novel Implicit Prefix-Value Reward Model (IPVRM), which directly learns a prefix-conditioned value function estimating the probability of eventual correctness and derives step signals as temporal differences (TD) between consecutive prefix values. IPVRM substantially improves step-verification F1 on ProcessBench. Building on these calibrated signals, we further propose Distribution-Level RL (DistRL), which computes TD advantages not only for sampled tokens but also for high-probability candidate tokens across the vocabulary, enabling dense counterfactual updates without additional rollouts. DistRL brings limited benefits with miscalibrated implicit rewards, but consistently improves downstream reasoning once powered by IPVRM's reliable prefix values.
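The two mechanisms summarized above can be pictured with a short sketch. The code below is our own illustration, not the paper's implementation: `toy_prefix_value`, `td_advantages`, the top-k cutoff, and the toy vocabulary are hypothetical stand-ins, with `toy_prefix_value` playing the role of a learned prefix-conditioned value head that estimates the probability of eventual correctness. It shows how consecutive prefix values yield a TD step signal for the sampled token, and how the same values give counterfactual TD advantages for high-probability candidate tokens using only one-token prefix extensions, with no extra rollouts.

```python
# Illustrative sketch (assumptions, not the paper's code): prefix values
# V(prefix) ~= P(eventual correctness | prefix) give TD step signals for the
# sampled token and counterfactual advantages for top-k candidate tokens.
import torch


def toy_prefix_value(prefix_ids: list[int]) -> float:
    """Stand-in for a learned IPVRM value head: pretend even token ids help."""
    if not prefix_ids:
        return 0.5
    good = sum(1 for t in prefix_ids if t % 2 == 0)
    return good / len(prefix_ids)


def td_advantages(prefix_ids: list[int], sampled_token: int,
                  policy_logits: torch.Tensor, top_k: int = 4):
    """Sampled-token TD signal plus counterfactual advantages for top-k candidates."""
    v_prefix = toy_prefix_value(prefix_ids)

    # TD signal for the token actually sampled: V(prefix + a) - V(prefix).
    sampled_adv = toy_prefix_value(prefix_ids + [sampled_token]) - v_prefix

    # Counterfactual TD advantages for high-probability candidate tokens,
    # scored by extending the prefix one token at a time (no rollouts).
    probs = torch.softmax(policy_logits, dim=-1)
    top_probs, top_tokens = torch.topk(probs, k=top_k)
    candidate_advs = {
        int(tok): toy_prefix_value(prefix_ids + [int(tok)]) - v_prefix
        for tok in top_tokens
    }
    return sampled_adv, candidate_advs, top_probs


if __name__ == "__main__":
    logits = torch.randn(32)  # toy vocabulary of 32 tokens
    sampled = int(torch.multinomial(torch.softmax(logits, -1), 1))
    adv, cand_advs, cand_probs = td_advantages([2, 4, 7], sampled, logits)
    print("sampled-token TD advantage:", round(adv, 3))
    print("counterfactual advantages:", {k: round(v, 3) for k, v in cand_advs.items()})
```

In a DistRL-style update, the returned candidate advantages would weight gradient contributions for those high-probability tokens alongside the sampled token; here they are only printed to show that all the signals come from prefix-value differences.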