Position: Reasoning After Perception Means Reasoning Without Vision
Abstract
A common belief in multimodal research is that the perceptual weaknesses of vision--language models can be compensated for by stronger language reasoning (e.g., chain-of-thought, in-context learning, or external tools). We challenge this assumption. We argue that for a broad class of visual tasks that are hard to specify in language, failures stem from a structural coupling: the temporal decision of \textit{when} to reason strictly determines the spatial constraint of \textit{where} reasoning takes place. When visual reasoning is deferred to language generation, current architectures do not merely delay computation; they displace it from a continuous visual representation to a discrete textual space. Consequently, the sequential ``Perception-then-Reasoning'' paradigm reduces perception to a passive, one-off feature-encoding step, rendering it functionally equivalent to ``Reasoning-in-Text-Space'', in which task-critical spatial signals are collapsed before reasoning begins. We substantiate this claim with the Turing Eye Test (TET), a set of tasks that must be resolved in visual space and are hard to verbalize; our results show that text-only reasoning cannot remedy these perceptual failures. Our findings suggest rethinking the architectural divide: shifting from reasoning \textit{about} perception to reasoning \textit{within} perception. This enables reasoning-driven perception that operates directly on pixel-level visual representations rather than within a collapsed textual space.