Where Signals Are Sparse, We Synthesize: Reinforcing Self-Corrective Reasoning in Vision–Language Models via Rollout Augmentation
Yi Ding ⋅ Ziliang Qiu ⋅ Bolian Li ⋅ Ruqi Zhang
Abstract
Self-correction is essential for solving complex reasoning problems in vision–language models (VLMs), yet existing reinforcement learning (RL) methods struggle to learn it. Effective self-correction behaviors emerge only rarely during RL, making the learning signal sparse. To address this challenge, we propose c**o**rre**ct**i**o**n-s**p**ecific rollo**u**t**s** (**Octopus**), a rollout-augmentation framework that synthesizes dense self-correction supervision by recombining existing rollouts, without additional computational overhead. This rollout augmentation simultaneously improves sample efficiency and stabilizes RL optimization. We further introduce a two-stage RL training strategy that disentangles self-correction from direct reasoning, avoiding conflicting signals and enabling both behaviors to be learned effectively. Building on these components, we present $\texttt{Octopus-8B}$, an advanced reasoning VLM with controllable self-correction capabilities. It achieves state-of-the-art performance among open-source VLMs across 7 benchmarks, outperforming the best RLVR baseline by 1.0 points while requiring only $0.72\times$ the training time per step.
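To make the rollout-recombination idea concrete, the sketch below illustrates one plausible reading of it: synthetic self-correction trajectories are assembled by splicing a failed rollout, a correction cue, and a successful rollout for the same prompt, so no new model generations are needed. All names here (`Rollout`, `CORRECTION_CUE`, `augment_with_corrections`) are illustrative assumptions, not the paper's released implementation.

```python
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    response: str   # model-generated reasoning trace for one prompt
    correct: bool   # verifiable reward, e.g. answer matches ground truth


# Hypothetical correction cue; the paper's actual template is not shown here.
CORRECTION_CUE = "\nWait, let me re-examine my previous answer.\n"


def augment_with_corrections(rollouts, max_pairs=4):
    """Recombine existing rollouts for a single prompt into synthetic
    self-correction trajectories: failed attempt -> correction cue ->
    successful attempt. Because only string concatenation is involved,
    the augmentation adds essentially no rollout cost."""
    wrong = [r for r in rollouts if not r.correct]
    right = [r for r in rollouts if r.correct]
    random.shuffle(wrong)
    random.shuffle(right)
    # Pair each failed attempt with a successful one, capped at max_pairs.
    return [
        Rollout(response=w.response + CORRECTION_CUE + c.response, correct=True)
        for w, c in list(zip(wrong, right))[:max_pairs]
    ]
```

Under this reading, a batch of 8 rollouts with 3 correct answers yields up to 3 dense self-correction examples, which would otherwise appear only when the model spontaneously corrects itself during sampling.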