Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents
Abstract
Visual agents employ tools such as zoom-in cropping within visual chains of thought to access fine-grained details. Prior work has primarily demonstrated the effectiveness of these tools on visual search tasks, leaving their applicability to more diverse and complex visual problems underexplored. In this paper, we move beyond visual search and study challenging visual tasks that require advanced spatial understanding and reasoning, such as 3D spatial reasoning, where agents must not only crop or zoom in on relevant regions but also understand how these local details relate to the global context. We identify a tool-use collapse phenomenon: over the course of training, models progressively stop using tools while still achieving higher task accuracy. Moreover, we observe a clear asymmetry: (i) completely eliminating tool use degrades performance, whereas (ii) incentivizing tool use yields only marginal gains despite substantially increasing usage. We find that both vanilla training and explicitly rewarding tool use reduce rollout diversity during training, which explains why higher tool use does not yield stronger reasoning performance. Motivated by these findings, we encourage diverse rollout exploration by adding an entropy-regularization term to the reinforcement learning objective, which yields the best performance even though tool usage still declines gradually during training. Overall, our findings suggest a training-time view of tools as scaffolding, where broader exploration in both text and vision shapes representations that improve despite tool-use collapse.
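To make the entropy-regularization idea concrete, the following is a minimal sketch of a generic REINFORCE-style loss with an entropy bonus. It is not the exact objective used in this paper, which the abstract does not specify. The function names, the per-rollout scalar `advantage`, the coefficient `beta`, and the choice to average entropy over steps are all assumptions made for illustration.

```python
import math
from typing import Sequence


def softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable softmax over one step's action logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_regularized_loss(
    step_logits: Sequence[Sequence[float]],
    actions: Sequence[int],
    advantage: float,
    beta: float = 0.01,  # hypothetical entropy coefficient
) -> float:
    """Loss for a single rollout (illustrative assumption, not the paper's form).

    loss = -advantage * sum_t log pi(a_t) - beta * mean_t H(pi_t)

    Minimizing the second term raises policy entropy, which pushes
    sampled rollouts to be more diverse.
    """
    log_prob = 0.0
    total_entropy = 0.0
    for logits, action in zip(step_logits, actions):
        probs = softmax(logits)
        log_prob += math.log(probs[action])
        total_entropy += entropy(probs)
    mean_entropy = total_entropy / len(actions)
    return -advantage * log_prob - beta * mean_entropy


# Usage: a two-step rollout over three hypothetical actions
# (for example: answer, crop, zoom).
loss = entropy_regularized_loss(
    step_logits=[[2.0, 0.5, 0.1], [0.3, 0.3, 0.3]],
    actions=[0, 2],
    advantage=1.0,
    beta=0.05,
)
```

With `beta = 0` this reduces to a plain policy-gradient loss. A positive `beta` lowers the loss for higher-entropy policies, which is one common way to counteract the kind of diversity loss described above.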