Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination
Chufan Shi ⋅ Cheng Yang ⋅ Yaokang Wu ⋅ Linghao Jin ⋅ Bo Shui ⋅ Taylor Berg-Kirkpatrick ⋅ Xuezhe Ma
Abstract
Vision-Language Models (VLMs) frequently generate self-reflective statements during reasoning, such as ``let me check the figure again.'' Do such statements trigger genuine visual re-examination, or merely represent learned textual patterns? We investigate this question through VisualSwap, an image-swap probing framework: after a model generates reasoning for an image, we replace the image with a visually similar but semantically different one and test whether the model detects the change. We introduce VS-Bench, a benchmark of $800$ image pairs curated from MathVista, MathVerse, MathVision, and MMMU-Pro. Experiments across the Qwen3-VL, Kimi-VL, and ERNIE-VL families reveal a striking failure: models overwhelmingly fail to detect image changes, with accuracy dropping by up to 60\%. Counterintuitively, thinking models exhibit nearly 3$\times$ greater vulnerability than their instruct counterparts, and scaling provides no mitigation. However, multi-turn interaction with user instructions can restore visual grounding, while self-generated reflective statements during continuous generation cannot. Attention analysis reveals the underlying mechanism: self-reflection does not increase attention to visual tokens, whereas user instructions substantially elevate it. Our findings show that current VLMs tend to say rather than actually see when claiming visual re-examination.
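To make the probing setup concrete, the sketch below shows one way the image-swap probe described above could be run. The `vlm_generate` chat interface, the message format, and the keyword-based detector are illustrative assumptions, not the authors' released implementation; the benchmark would judge swap detection more carefully.

```python
# Minimal sketch of an image-swap probe in the spirit of VisualSwap.
# `vlm_generate(messages) -> str` is an assumed generic chat interface.

def visual_swap_probe(vlm_generate, question, original_image, swapped_image):
    """Return True if the model flags the silent image swap."""
    # Turn 1: the model reasons over the original image.
    messages = [
        {"role": "user", "content": [original_image, question]},
    ]
    reasoning = vlm_generate(messages)

    # Turn 2: ask the model to re-examine the figure, but silently swap in a
    # visually similar, semantically different image.
    messages += [
        {"role": "assistant", "content": reasoning},
        {"role": "user", "content": [
            swapped_image,
            "Re-examine the figure and verify your answer.",
        ]},
    ]
    response = vlm_generate(messages)

    # Crude keyword check for whether the change was noticed (illustrative only).
    markers = ("different image", "image has changed", "new figure")
    return any(m in response.lower() for m in markers)
```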