What You Think is What You See: Driving Exploration in VLM Agents via Visual-Linguistic Curiosity
Abstract
To navigate partially observable visual environments, recent VLM agents increasingly internalize world-modeling capabilities directly into their policies via explicit chain-of-thought (CoT) reasoning trained with reinforcement learning (RL). However, passively exploiting reasoning over already-visited states is insufficient for sparse-reward agentic tasks, as it lacks the epistemic drive to actively uncover the known unknowns required for robust generalization. We ask: can VLM agents actively seek signals that challenge and update their internal world model through curiosity-driven exploration? In this work, we propose GLANCE, a unified framework that bridges reasoning and exploration by grounding the agent's linguistic world model in the stable visual representations of an evolving target network. Crucially, GLANCE treats the discrepancy between the agent's linguistic prediction and visual reality as an intrinsic curiosity signal within RL, steering the agent to actively explore regions where its internal model is uncertain. Extensive experiments across a series of agentic tasks demonstrate the effectiveness of GLANCE and show that aligning what the agent thinks with what the agent sees is key to solving complex, sparse-reward agentic tasks.
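The core idea of turning a prediction-reality discrepancy into an intrinsic reward can be sketched as an embedding-mismatch bonus. This is a minimal illustrative sketch, not the paper's implementation: the function name, the squared-error metric, and the scaling coefficient `beta` are all assumptions.

```python
import numpy as np

def curiosity_reward(pred_embed: np.ndarray,
                     visual_embed: np.ndarray,
                     beta: float = 1.0) -> np.ndarray:
    """Intrinsic curiosity bonus (illustrative sketch).

    pred_embed:   embedding of the agent's linguistic prediction of the
                  next state (hypothetical name).
    visual_embed: stable visual embedding of the actually observed state,
                  produced by a frozen or slowly evolving target network.
    Returns a per-sample reward: large when the agent's internal model
    disagrees with visual reality, steering exploration toward those states.
    """
    err = pred_embed - visual_embed
    # Mean squared discrepancy over the embedding dimension.
    return beta * np.mean(err ** 2, axis=-1)
```

In an RL loop, this bonus would be added to the sparse extrinsic reward, e.g. `r_total = r_ext + curiosity_reward(p, v)`, so the agent is paid for visiting states its linguistic world model fails to anticipate.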