iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning
Abstract
While visually grounded Chain-of-Thought (CoT) has emerged as a promising paradigm for enhancing fine-grained perception in multimodal large language models (MLLMs), its efficacy at inference time remains under-scrutinized. In this work, we empirically find that mandating explicit object bounding boxes in visually grounded CoT during inference often degrades performance compared to standard textual CoT, which reasons without explicit visual grounding. We hypothesize that visual localization capability can be internalized into the textual CoT, and that mandatory explicit grounding introduces unnecessary task interference, distracting the model from its primary objective of answer prediction. To address this problem, we propose Internalizing Visually Grounded Reasoning (iVGR), a novel reinforcement learning framework that transfers localization capabilities into the textual reasoning process. We employ a dual-stream training strategy, in which a textual stream is aligned with a high-quality visually grounded stream via a proposed consistency reward, enabling the model to localize accurately without explicit grounding during inference. Extensive experiments on Qwen2.5-VL and Qwen3-VL demonstrate that our method significantly outperforms existing baselines on fine-grained benchmarks, while maintaining the flexibility to support tool-assisted inference workflows.
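The dual-stream alignment can be pictured with a minimal sketch. The snippet below assumes the consistency reward compares pooled hidden representations of the textual and grounded reasoning streams and blends that similarity with a task-accuracy term; the function name, pooling scheme, and `alpha` weight are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F


def consistency_reward(textual_hidden: torch.Tensor,
                       grounded_hidden: torch.Tensor,
                       answer_correct: bool,
                       alpha: float = 0.5) -> torch.Tensor:
    """Hypothetical dual-stream reward (a sketch, not the paper's exact method).

    textual_hidden:  (seq_len_t, dim) hidden states of the textual-CoT rollout.
    grounded_hidden: (seq_len_g, dim) hidden states of the grounded-CoT rollout.
    """
    # Mean-pool token-level hidden states into one vector per stream,
    # then normalize so the dot product is a cosine similarity in [-1, 1].
    t = F.normalize(textual_hidden.mean(dim=0), dim=-1)
    g = F.normalize(grounded_hidden.mean(dim=0), dim=-1)
    consistency = torch.dot(t, g)

    # Blend stream-alignment with a binary answer-accuracy term.
    accuracy = torch.tensor(1.0 if answer_correct else 0.0)
    return alpha * consistency + (1.0 - alpha) * accuracy
```

In training, such a scalar could serve as the per-rollout reward for the textual stream under a standard policy-gradient RL method (e.g., PPO); the grounded stream supplies the alignment target, so no explicit boxes are required from the textual stream at inference.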