GRASP: Awakening Latent Spatial Reasoning in LVLMs via Training-free Geometric Rectification
Abstract
Large Vision-Language Models (LVLMs) exhibit remarkable general capabilities but struggle significantly with spatial reasoning tasks. In this paper, we uncover a critical representation-output misalignment via linear probing: LVLMs correctly encode spatial features in their internal representations, yet generate incorrect answers in the output text. To address this, we pioneer the Inference-time Geometric Manifold Adaptation paradigm and propose GRASP (Geometric Rectification for Active Spatial Perception), a training-free framework that awakens these latent capabilities. GRASP employs Manifold Differential Search to identify optimal geometric counterfactuals, which then drive a dual-level rectification mechanism: Implicit Trajectory Correction, which restores attenuated intrinsic geometric features in intermediate decoder layers, and Explicit Distribution Alignment, which breaks the dominance of language priors at the output layer. Extensive experiments spanning diverse architectures (LLaVA, Qwen2.5-VL, Qwen3-VL) and positional encoding paradigms (1D APE, 2D/3D RoPE) across image and video benchmarks (WhatsUp, VSR, VSI-Bench) demonstrate that GRASP significantly mitigates spatial hallucinations without any parameter updates, achieving accuracy gains of up to 26.1% on image benchmarks and 9.7% on video reasoning tasks, consistently outperforming baseline methods.
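As a minimal, hedged illustration of the linear-probing diagnostic mentioned above: a linear classifier is trained on intermediate hidden states to predict a spatial relation, and its accuracy is compared against the model's generated answers. The hidden states and labels below are synthetic stand-ins (in practice they would be extracted from an LVLM's decoder layers); the probe itself is a plain logistic regression fit by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "hidden states": 200 samples, 64-dim, with a spatial relation
# (e.g. "left" vs "right") linearly encoded along one latent direction
# plus isotropic noise. These are stand-ins, not real LVLM activations.
n, d = 200, 64
labels = rng.integers(0, 2, size=n)            # 0 = "left", 1 = "right"
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
hidden = rng.normal(scale=0.5, size=(n, d)) + np.outer(2 * labels - 1, direction)

# Train/test split.
split = 150
Xtr, ytr = hidden[:split], labels[:split]
Xte, yte = hidden[split:], labels[split:]

# Logistic-regression probe trained by plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(Xtr @ w + b)))   # sigmoid predictions
    w -= 0.5 * (Xtr.T @ (p - ytr) / split)     # gradient step on weights
    b -= 0.5 * np.mean(p - ytr)                # gradient step on bias

probe_acc = np.mean(((Xte @ w + b) > 0).astype(int) == yte)
print(f"probe accuracy: {probe_acc:.2f}")
```

A high probe accuracy on states whose corresponding generated answers are wrong is the signature of the representation-output misalignment described in the abstract.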