$V_0$: A Generalist Value Model for Any Policy at State Zero
Yi-Kai Zhang ⋅ Zhiyuan Yao ⋅ Hongyan Hao ⋅ Yueqing Sun ⋅ Qi GU ⋅ Hui Su ⋅ Xunliang Cai ⋅ De-Chuan Zhan ⋅ Han-Jia Ye
Abstract
Traditional value models $V^{\pi}$ in LLM reinforcement learning face a coupling dilemma: they must be trained synchronously alongside the evolving policy $\pi$, which is inefficient and prone to overfitting. In this paper, we propose $V_0$, a generalist value model that decouples value estimation from specific policy parameters by reframing the task as in-context learning: predicting performance for policies unseen during training. We use the policy's historical query-performance pairs as a capability representation, transforming the value function from $V^{\pi}(s_0)$ to $V(C_{\pi}, s_0)$, where $C_{\pi}$ serves as an in-context input. This architecture lets us scale the diversity of policies in the training set, so $V_0$ learns to rapidly identify the capability boundaries of any policy without updating its own parameters. Technically, we employ a Residual Query Adapter to compress the high-dimensional policy representation and the target query into structured features, which are then processed by a pre-trained TabPFN head. Empirical results show that $V_0$ outperforms coupled value models in tracking policy evolution during GRPO training, optimizes cold-start budget allocation, and approaches the performance-cost Pareto frontier in inference routing.
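As a rough illustration of the decoupled interface described above (not the authors' implementation), the sketch below treats the capability context $C_{\pi}$ as a set of (query embedding, observed performance) pairs, compresses it together with the target query $s_0$ through a residual adapter, and feeds the resulting structured features to a tabular value head. The class name `ResidualQueryAdapter`, all dimensions, and the linear head standing in for the pre-trained TabPFN are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualQueryAdapter(nn.Module):
    """Illustrative adapter: compresses the capability context C_pi
    (query embeddings paired with performance scores) and the target
    query s_0 into a small structured feature vector."""
    def __init__(self, query_dim: int, feat_dim: int = 32):
        super().__init__()
        self.ctx_proj = nn.Linear(query_dim + 1, feat_dim)   # +1 for the performance score
        self.query_proj = nn.Linear(query_dim, feat_dim)
        self.mlp = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim))

    def forward(self, ctx_queries, ctx_scores, target_query):
        # ctx_queries: (N, query_dim); ctx_scores: (N,); target_query: (query_dim,)
        ctx = torch.cat([ctx_queries, ctx_scores.unsqueeze(-1)], dim=-1)
        pooled = self.ctx_proj(ctx).mean(dim=0)   # permutation-invariant pooling over C_pi
        q = self.query_proj(target_query)
        h = pooled + q                            # combine policy context with the target query
        return h + self.mlp(h)                    # residual features for the tabular head

# Placeholder for the pre-trained TabPFN head: any tabular regressor over the
# adapter features fits this sketch.
value_head = nn.Linear(32, 1)

adapter = ResidualQueryAdapter(query_dim=768)
ctx_q = torch.randn(16, 768)   # embeddings of the policy's historical queries
ctx_s = torch.rand(16)         # observed per-query performance (e.g., pass rate)
s0 = torch.randn(768)          # embedding of the new target query / state s_0
v0_estimate = value_head(adapter(ctx_q, ctx_s, s0))   # V(C_pi, s_0), no policy parameters needed
```

Because the value estimate depends only on $C_{\pi}$ and $s_0$, swapping in a different policy amounts to swapping the in-context pairs rather than retraining the value model.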