Position: Your VLM May Not Be Thinking with Interleaved Images
Abstract
"Thinking with images" has emerged as a central research theme in Vision-Language Models (VLMs). This multimodal reasoning paradigm typically features interleaved images, generated via tool use or code execution, as part of the Chain-of-Thought (CoT). While reinforcement learning (RL) has driven impressive performance within this paradigm, we argue in this position paper that current VLMs seldom truly "think" with interleaved images. Through empirical evidence and analysis, we demonstrate that interleaved images do not contribute significantly to the success of recent "Thinking with images" methods. Instead, the primary source of performance gains is the improved language-generation distribution induced by fine-tuning. These findings challenge the prevailing belief that "Thinking with images" VLMs actively exploit visual information to solve visual tasks. To improve mechanistic transparency, we suggest that future "Thinking with images" work include lightweight ablation studies verifying the necessity of interleaved images. Furthermore, we call on the community to develop fundamentally novel benchmarks and advocate for more informative visual tools.