The Art of Interrogation: Consistency Amplifies Factuality in Spatial Reasoning
Abstract
Current Large Reasoning Models (LRMs) exhibit remarkable general capabilities yet significantly underperform on spatial reasoning tasks. Existing approaches treat this gap as a knowledge deficit, relying on supervised fine-tuning (SFT) to ingest labeled data from external vision sources or synthetic engines. In contrast, we argue that for many tasks, spatial capabilities are already present in pre-trained LRMs and merely require alignment through internal logical coherence. In this work, we propose a self-supervised reinforcement learning (RL) framework that targets the internal Chain-of-Thought (CoT) process without requiring ground-truth annotations. By formalizing consistency verifiers (reward functions that check for geometric and semantic consistency under transformations such as flipping or swapping the order of objects in the question) and optimizing them via OT-GRPO, our new minimal-consistency-matching variant of group relative policy optimization, we demonstrate that models can self-correct their spatial logic. Our results show that this label-free consistency training approaches the accuracy of models trained with ground-truth supervision and generalizes comparably across diverse tasks and domains.
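To make the consistency-verifier idea concrete, the following is a minimal Python sketch, not the paper's implementation: the helper names (`flip_question`, `consistency_reward`), the left/right token swap, and the binary reward are illustrative assumptions. The sketch rewards a model whose answer to a spatial question agrees, after mapping left to right and vice versa, with its answer to the transformed question, without consulting any ground-truth label.

```python
# Hypothetical sketch of a label-free consistency verifier.
# The reward is 1.0 when the model's answer to a left/right question agrees,
# after mapping left <-> right, with its answer to the flipped question.

FLIP = {"left": "right", "right": "left"}


def flip_question(question: str) -> str:
    """Swap 'left'/'right' tokens in the question to simulate a horizontal flip."""
    tokens = question.split()
    return " ".join(FLIP.get(t.lower(), t) for t in tokens)


def consistency_reward(model, question: str) -> float:
    """Return 1.0 if the model's answers cohere under the flip, else 0.0.

    `model` is any callable mapping a question string to an answer string;
    no ground-truth annotation is used anywhere.
    """
    original = model(question).strip().lower()
    flipped = model(flip_question(question)).strip().lower()
    # Under a horizontal flip, a 'left' answer should become 'right' and vice versa.
    expected = FLIP.get(original, original)
    return 1.0 if flipped == expected else 0.0


# Example with a toy model that always answers "left": such a model is
# inconsistent under flipping, so the verifier assigns zero reward.
reward = consistency_reward(lambda q: "left",
                            "Is the cup left or right of the plate?")
assert reward == 0.0
```

In an RL loop of the kind the abstract describes, a scalar signal like this could stand in for a labeled reward when scoring sampled CoT rollouts; the paper's actual verifiers and the OT-GRPO matching step are presumably richer than this binary check.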