TVI-CoT: Text-Visual Interleaved Chain-of-Thought Reasoning for Multimodal Understanding
Lianyu Hu ⋅ Xiaoyu Ma ⋅ Zeqin Liao ⋅ Yang Liu
Abstract
Chain-of-thought (CoT) reasoning has proven effective for enhancing problem-solving in large language models. However, when applied to multimodal LLMs (MLLMs), existing CoT approaches suffer from a fundamental limitation: \textit{they perform reasoning entirely in text without accessing visual features during the reasoning process}. Once the initial visual encoding is complete, image information is no longer accessible, so models must reason solely from whatever the initial description captured. This ``vision-blind reasoning'' paradigm limits fine-grained visual extraction, error verification, and adaptive attention. We propose Text-Visual Interleaved Chain-of-Thought (TVI-CoT), a framework that explicitly interleaves textual reasoning and visual feature access through learnable control tokens ($\langle\text{Think}\rangle$, $\langle\text{Look}\rangle$, $\langle\text{Answer}\rangle$). These tokens enable dynamic switching between reasoning and visual grounding, attending to relevant image regions conditioned on the evolving reasoning state. Experiments on eight benchmarks demonstrate state-of-the-art results among MLLM-based CoT methods and notable gains over the baseline: +6.1\% on MMMU, +3.8\% on MathVerse, +3.4\% on MathVista, and +3.4\% on ScienceQA. Extensive visualizations further show that TVI-CoT performs stepwise reasoning with precise visual grounding.