Does Reasoning Improve Seeing? Understanding When Vision-Language Models Benefit from Thinking
Abstract
Vision-language models (VLMs) now support both direct-answer Instruct and explicit-reasoning Thinking modes, but practitioners lack principled ways to decide when reasoning helps or how much computation to allocate at test time. We investigate whether VLMs encode meta-cognitive signals that could support adaptive inference. We derive oracle labels for two properties: (1) reasoning helpfulness, i.e., whether explicit reasoning improves accuracy on a given input, and (2) optimal generation length, i.e., the minimal token budget that still yields a correct answer. Probing final-layer representations of InternVL and Qwen3-VL models, we find that Thinking models encode these signals more linearly than Instruct models, suggesting that reasoning-oriented training enhances meta-cognitive structure. Head-wise attribution reveals two distinct circuits: length-control heads in lower layers and reasoning/difficulty heads in higher layers. Causal interventions confirm these roles: scaling length heads controls output length with little loss in accuracy, while scaling reasoning heads induces a perception–reasoning trade-off that improves accuracy by up to 5.3\%. These effects generalize across benchmarks. Our results show that reasoning-tuned VLMs develop localized, manipulable circuits for meta-cognitive control, enabling test-time steering of computation and reasoning without retraining.
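The abstract's central measurement is a linear probe trained on final-layer hidden states to predict an oracle label such as reasoning helpfulness. The sketch below illustrates that setup under stated assumptions: it is a minimal illustration, not the paper's released code, and the helper name `fit_helpfulness_probe`, the array shapes, and the stand-in data are all hypothetical.

```python
# Minimal sketch of a linear probe for reasoning helpfulness, assuming
# oracle labels and final-layer (last-token) hidden states were extracted
# offline. All names and shapes here are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def fit_helpfulness_probe(final_layer_states: np.ndarray,
                          helpful_labels: np.ndarray) -> float:
    """Fit a linear probe on final-layer representations.

    final_layer_states: (n_examples, hidden_dim) last-token hidden states.
    helpful_labels:     (n_examples,) oracle labels, 1 if explicit
                        reasoning improved accuracy on that example, else 0.
    Returns held-out probe accuracy; higher accuracy means the signal is
    more linearly decodable from the representation.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        final_layer_states, helpful_labels, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)
    return probe.score(X_test, y_test)

if __name__ == "__main__":
    # Random stand-in data, shown only to make the shapes concrete; the
    # paper's comparison would run this on Instruct vs. Thinking checkpoints.
    rng = np.random.default_rng(0)
    states = rng.normal(size=(512, 4096))   # stand-in hidden states
    labels = rng.integers(0, 2, size=512)   # stand-in oracle labels
    print(f"probe accuracy: {fit_helpfulness_probe(states, labels):.3f}")
```

Comparing held-out probe accuracy between an Instruct checkpoint and its Thinking counterpart operationalizes the claim that Thinking models encode these meta-cognitive signals more linearly.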