ReaForest: Fostering Generative Video Reasoning for Spatial Planning
Abstract
Verbal logic and visual mental simulation are two essential components of human intelligence. Modern Large Language Models (LLMs) have demonstrated strong verbal reasoning capabilities through textual Chain-of-Thought (CoT) reasoning. In contrast, current Video Generation Models (VGMs) struggle with visual reasoning tasks such as spatial planning. We attribute this limitation to two fundamental gaps: (i) VGMs are predominantly trained on general-purpose video corpora that emphasize perceptual fidelity over visual reasoning, leaving their reasoning abilities underdeveloped; and (ii) most VGMs generate videos in a single pass, with no mechanism to explore alternative reasoning trajectories or revise intermediate errors. To address these gaps, we introduce ReaForest, a framework that fosters the reasoning capacity of VGMs in spatial planning through both training-time activation and inference-time scaling. ReaForest comprises three key components: (1) ReaGen-27k, a dataset covering diverse spatial planning tasks that require multi-step reasoning, which activates the basic reasoning capabilities of VGMs for spatial planning; (2) Reflective Entropy-Aware Test-Time Scaling (ReaTTS), an inference framework that evolves multiple reasoning branches while enabling recovery from failures; and (3) hierarchical constraint verification, which provides actionable feedback to ReaTTS based on decomposed constraints. Extensive experiments demonstrate that ReaForest substantially surpasses advanced textual reasoning models (e.g., Gemini-2.5-Pro) and video generation models (e.g., Sora-2) on spatial planning tasks. ReaForest further exhibits emergent properties, including self-correction, parallel thinking, and scalable reasoning, advancing VGMs toward human-like visual mental simulation.