Reason with Thumbnails, Answer with Focus: An Efficient and Effective Paradigm for Multimodal Grounded Visual Reasoning
Abstract
To enhance the interpretability of multimodal large language models' outputs, recent efforts have explored Grounded Visual Reasoning (GVR), in which the model is trained to select relevant image regions before answering the question. However, the multi-round ``ground-then-answer'' reasoning of these methods imposes substantially higher computational cost than non-GVR methods. To attain efficient and effective GVR, in this paper we propose a novel paradigm called Reason with Thumbnails, Answer with Focus (RTAF), which feeds the model low-resolution images to reason about the relevant regions and high-resolution crops to produce the final answer. Our motivation arises from the observation that, in many cases, the key area required to answer a question can be inferred from a low-resolution thumbnail, without the need for the full-resolution image. Additionally, for extreme cases where thumbnails lack sufficient information (leading to undirected region guessing and increased computation), we equip the model with a tool to access higher-resolution images. For training efficiency, we adopt pure reinforcement learning (i.e., GRPO) and design a suite of reward functions to supervise the model's behavior, alongside a resolution-aware training data selection strategy. Finally, our model, based on Qwen2.5-VL, achieves significant improvements across a range of benchmarks with reduced computation, demonstrating the effectiveness and efficiency of our proposed RTAF; e.g., compared to the non-GVR model Qwen2.5-VL, our model achieves a performance gain of 5.8 points while using a comparable number of visual tokens (471 vs. 391). Against state-of-the-art GVR methods, RTAF halves visual token usage while delivering superior performance.
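The core mechanics of RTAF's two-stage inference can be illustrated as a coordinate mapping: a region predicted on the downscaled thumbnail must be rescaled back to full-resolution coordinates before cropping the focus region for answering. The following is a minimal sketch of that mapping under assumptions not stated in the abstract (a uniform longest-side downscaling policy; the function name `rtaf_crop_box` and its parameters are hypothetical, for illustration only):

```python
def rtaf_crop_box(full_size, thumb_max, box_on_thumb):
    """Map a region predicted on a low-res thumbnail back to full-res coords.

    full_size:    (W, H) of the original high-resolution image.
    thumb_max:    longest-side limit used when downscaling (assumed policy).
    box_on_thumb: (x0, y0, x1, y1) predicted by the reasoning stage
                  in thumbnail coordinates.
    """
    W, H = full_size
    scale = thumb_max / max(W, H)   # uniform downscale factor for the thumbnail
    inv = 1.0 / scale               # thumbnail -> full-resolution scale
    x0, y0, x1, y1 = box_on_thumb
    # The returned box is what the "Answer with Focus" stage would crop.
    return (round(x0 * inv), round(y0 * inv), round(x1 * inv), round(y1 * inv))


# Toy usage: a 1344x1008 image downscaled so its longest side is 336 px
# (scale 0.25); a box predicted on the thumbnail maps back by a factor of 4.
box = rtaf_crop_box((1344, 1008), 336, (10, 10, 100, 80))
print(box)  # -> (40, 40, 400, 320)
```

The high-resolution crop defined by this box, rather than the full image, is then fed to the answering stage, which is what keeps the visual-token budget low.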