Beyond Majority Voting: Self-Reflective Test-Time Reinforcement Learning for LLM Reasoning
Sitong Wu ⋅ Haoru Tan ⋅ Xichen Zhang ⋅ Bin Xia ⋅ Shaofeng Zhang ⋅ Xiaojuan Qi ⋅ Bei Yu ⋅ Jiaya Jia
Abstract
The core challenge of Test-Time Reinforcement Learning (TTRL) lies in estimating rewards without access to ground-truth supervision. Existing TTRL methods predominantly rely on majority voting to generate pseudo-labels, under the assumption that the most frequent answer among sampled trajectories is correct. However, we observe that this assumption frequently breaks down in complex reasoning tasks, where correct solutions often constitute only a minority of sampled trajectories. As a result, rare yet correct trajectories are systematically undervalued by majority-voting-based approaches. To address this limitation, we propose Self-Reflective Test-Time Reinforcement Learning (SR-TTRL), a novel framework that leverages self-reflective verification to produce high-fidelity pseudo-labels. Specifically, given multiple sampled trajectories for a problem, SR-TTRL first groups trajectories according to their final answers and selects one representative from each group to form a candidate pool. Each candidate trajectory is then summarized to preserve its core reasoning steps while reducing verbosity. Finally, the model performs self-reflection over the candidate pool, critically evaluating the candidates and selecting the most plausible trajectory as the pseudo-label. Empirically, SR-TTRL achieves substantially higher pseudo-label fidelity and sample efficiency than prior majority-voting-based TTRL methods. Extensive experiments across diverse benchmarks and model families demonstrate that SR-TTRL consistently outperforms majority-voting baselines and significantly improves generalization to novel problems. For example, SR-TTRL improves the Pass@1 accuracy of Qwen3-8B on AIME24 from $29.1$ to $55.8$ (a gain of $+26.7$ points), exceeding standard TTRL by an additional $+9.1$ points.
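The abstract describes the pseudo-label construction as a group-summarize-reflect pipeline. The following is a minimal sketch of that flow, not the authors' implementation: the helpers `extract_answer`, `summarize`, and `reflect_and_select` are hypothetical placeholders for the model calls the paper's pipeline would supply.

```python
from collections import defaultdict

def srttrl_pseudo_label(problem, trajectories, extract_answer, summarize, reflect_and_select):
    """Sketch of SR-TTRL pseudo-label construction (illustrative only).

    `trajectories` are sampled reasoning chains for `problem`;
    `extract_answer`, `summarize`, and `reflect_and_select` are assumed
    wrappers around LLM calls, not functions defined by the paper.
    """
    # 1. Group trajectories by their final answers.
    groups = defaultdict(list)
    for traj in trajectories:
        groups[extract_answer(traj)].append(traj)

    # 2. Keep one representative trajectory per answer group to form the candidate pool.
    candidates = [group[0] for group in groups.values()]

    # 3. Summarize each candidate to retain core reasoning steps while reducing verbosity.
    summaries = [summarize(traj) for traj in candidates]

    # 4. Self-reflect over the candidate pool and pick the most plausible trajectory.
    best_idx = reflect_and_select(problem, summaries)
    chosen = candidates[best_idx]

    # The chosen trajectory's answer serves as the pseudo-label for the test-time RL reward.
    return extract_answer(chosen)
```

Note that, unlike majority voting, the pseudo-label here is decided by reflection over one candidate per distinct answer, so a correct but infrequent answer is not penalized for being rare.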