DisPPO: Quantile-Based Distributional Reinforcement Learning for Large Language Models
Zhijian Zhou ⋅ Long Li ⋅ Xuan Zhang ⋅ Zongkai Liu ⋅ Yanting Miao ⋅ Yuchen Liu ⋅ Deshu Chen ⋅ Ke Li ⋅ Xing Sun ⋅ Ruoxi Jiang ⋅ Xiaoyu Tan ⋅ Chao Qu ⋅ Yuan Qi
Abstract
Reinforcement Learning (RL) has become a cornerstone for enhancing the reasoning capabilities of Large Language Models (LLMs). However, standard actor-critic methods, such as Proximal Policy Optimization (PPO), rely on scalar value functions that estimate only the expectation of cumulative returns. This reduction inherently discards higher-order statistical information (e.g., variance and multimodality), leading to inaccurate value estimation and suboptimal credit assignment in complex tasks. While Distributional RL offers a solution by modeling the full return distribution, its application to LLMs remains challenging due to the computational intractability of value-based operations over large vocabularies and the instability and memory burden of off-policy replay mechanisms. In this paper, we propose DisPPO, a novel on-policy framework that seamlessly integrates non-parametric quantile regression into PPO. Theoretically, we prove that our distributional update operator, the composition of the $\lambda$-return Bellman operator and the quantile projection, is a contraction mapping in the Wasserstein metric, guaranteeing convergence to a unique fixed point. Empirically, we evaluate DisPPO using Llama and Qwen models across diverse benchmarks, including mathematical reasoning and Text-to-SQL generation. DisPPO consistently outperforms standard PPO and recent group-based baselines on both Pass@1 and Pass@$k$, demonstrating that distributional critics provide a richer, more robust learning signal for large-scale reasoning models.
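To make the abstract's core mechanism concrete, the following is a minimal PyTorch sketch of the kind of non-parametric quantile critic it describes: a value head that predicts $N$ quantiles of the return at each token instead of a scalar, trained with the standard quantile-regression Huber loss (as in QR-DQN). All names here (`QuantileValueHead`, `quantile_huber_loss`, `num_quantiles`, `kappa`) and shapes are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class QuantileValueHead(nn.Module):
    """Maps the LLM's hidden state at each token to N quantile estimates
    of the return distribution, rather than a single scalar value."""
    def __init__(self, hidden_size: int, num_quantiles: int = 32):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_quantiles)
        # Fixed quantile midpoints tau_i = (2i + 1) / (2N).
        taus = (torch.arange(num_quantiles, dtype=torch.float32) + 0.5) / num_quantiles
        self.register_buffer("taus", taus)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, hidden_size) -> (batch, seq, num_quantiles)
        return self.proj(hidden)

def quantile_huber_loss(pred_quantiles: torch.Tensor,
                        target_samples: torch.Tensor,
                        taus: torch.Tensor,
                        kappa: float = 1.0) -> torch.Tensor:
    """Quantile-regression Huber loss.
    pred_quantiles: (..., N) predicted quantiles theta_i.
    target_samples: (..., M) samples from the target return distribution
                    (e.g., lambda-return bootstrap targets, held fixed).
    """
    # Pairwise TD errors u_ij = z_j - theta_i, shape (..., N, M).
    u = target_samples.unsqueeze(-2) - pred_quantiles.unsqueeze(-1)
    # Huber smoothing of the absolute error around zero.
    huber = torch.where(u.abs() <= kappa,
                        0.5 * u.pow(2),
                        kappa * (u.abs() - 0.5 * kappa))
    # Asymmetric weight |tau_i - 1{u < 0}| drives theta_i toward the
    # tau_i-quantile of the target distribution.
    weight = (taus.view(*([1] * (u.dim() - 2)), -1, 1)
              - (u.detach() < 0).float()).abs()
    # Sum over quantiles, average over target samples and batch.
    return (weight * huber / kappa).sum(dim=-2).mean()
```

In an on-policy setup of the kind the abstract sketches, `target_samples` would be per-token $\lambda$-return targets bootstrapped from the critic itself; the cited contraction result is what guarantees that this projected Bellman update converges to a unique fixed point in Wasserstein distance.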