Bootstrapping Language and Numerical Feedback for Reinforcement Learning in LLMs
Abstract
Reinforcement learning for large language models (LLMs) often relies on scalar rewards, a practice that discards the textual rationale contained in the rollouts and hampers training efficiency. Although retaining and reusing this textual information seems a natural way to improve sample efficiency, we show that naive attempts are often counterproductive: they risk either memorization from leaked solutions or behavior collapse due to irrelevant context. To address this, we propose Language-And-Numerical Policy Optimization (LANPO), a framework that explicitly separates the roles of feedback: language guides exploration, while numerical rewards drive optimization. Central to LANPO is a dynamically evolving experience pool accumulated from on-policy rollouts, which supports two principles that keep the feedback effective: Reward-Agnostic Reflection, for safe intra-sample self-correction, and Relevant Abstraction, which distills generalizable lessons from inter-sample experiences. Across the evaluated mathematical reasoning and coding benchmarks, LANPO consistently outperforms strong GRPO-trained baselines for 7B and 14B models in zero-shot accuracy, and it improves performance further when feedback-augmented inference is used at test time.
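The separation of roles described above can be illustrated with a minimal sketch. It is not the LANPO implementation: the `ExperiencePool` class, its `build_prompt` method, the prompt template, and the recency-based selection of lessons are all hypothetical stand-ins (in particular, picking the most recent lessons is only a placeholder for a relevance criterion). The advantage computation assumes the standard GRPO-style normalization of rewards within a group of rollouts for the same prompt. The point of the sketch is that textual feedback only changes what the policy is conditioned on, while the optimization signal is computed from scalar rewards alone.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class ExperiencePool:
    """Hypothetical store of textual lessons gathered from earlier rollouts."""
    lessons: list[str] = field(default_factory=list)

    def add(self, lesson: str) -> None:
        self.lessons.append(lesson)

    def build_prompt(self, question: str, k: int = 2) -> str:
        # Language feedback only shapes the context used for exploration.
        # Assumption: the k most recent lessons stand in for "relevant" ones.
        recent = self.lessons[-k:]
        if not recent:
            return question
        notes = "\n".join(f"- {lesson}" for lesson in recent)
        return f"Lessons from earlier attempts:\n{notes}\n\nProblem: {question}"


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    # GRPO-style signal: each reward is normalized against the other
    # rollouts in its group. No text enters this computation.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


pool = ExperiencePool()
pool.add("Check boundary cases before finalizing the answer.")
prompt = pool.build_prompt("Find all integers n with n^2 < 10.")

# Scalar rewards for four rollouts sampled from the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

In this sketch, correct rollouts receive positive advantages and incorrect ones negative advantages regardless of which lessons appeared in the prompt, which mirrors the stated division between language for exploration and numerical rewards for optimization.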