Poster Tue, Jul 7, 2026 • 10:30 PM – 12:15 AM PDT HALL A #2617

Position: The Systemic Lack of Agency in Visual Reasoning

Yizhao Huang ⋅ Haoyang Chen ⋅ Pohsun Huang ⋅ Jiayuan Li ⋅ Shiqin Wang ⋅ Haoyuan Du ⋅ Yandong Shi ⋅ Zheng Wang ⋅ Zhixiang Wang

Abstract

This paper argues that a systemic lack of Agency constrains the implicit reasoning capabilities of current Vision-Language Models (VLMs). Implicit reasoning is the ability to autonomously discover and use hidden visual evidence to bridge information gaps, rather than relying only on explicitly specified targets. This capacity underlies human visual understanding and everyday reasoning. We argue that the limitation arises from a tendency to equate visual reasoning with passive semantic retrieval, rather than with active, situated reasoning that depends on autonomous visual exploration. As a result, most existing benchmarks primarily assess Passive Capacity and leave active, implicit reasoning largely unmeasured. To address this gap, we introduce the Visual Implicit Reasoning Benchmark (V-IRD), which targets this missing quadrant by requiring models to derive answers strictly through autonomous visual analysis. Our results show that, despite strong retrieval abilities, prominent VLMs struggle to use reference objects and to attend to visual evidence that requires self-directed inquiry. In short, strong semantic recognition does not imply active visual exploration, which reveals a critical gap in current VLMs.