ReaForest: Fostering Generative Video Reasoning for Spatial Planning
Abstract
Verbal logic and visual mental simulation are two essential components of human intelligence. Modern Large Language Models (LLMs) have demonstrated strong verbal reasoning capabilities through textual Chain-of-Thought (CoT) reasoning. In contrast, current Video Generation Models (VGMs) struggle with visual reasoning tasks such as spatial planning. We attribute this limitation to two fundamental gaps: (i) VGMs are predominantly trained on general-purpose video corpora that emphasize perceptual fidelity over visual reasoning, leaving their reasoning abilities underdeveloped; and (ii) most VGMs generate videos in a single pass, with no mechanism to explore alternative reasoning trajectories or revise intermediate errors. To address these gaps, we introduce ReaForest, a framework that fosters the reasoning capacity of VGMs in spatial planning through both training-time activation and inference-time scaling. ReaForest comprises three key components: (1) ReaGen-27k, a dataset covering diverse spatial planning tasks that require multi-step reasoning, which activates the basic reasoning capabilities of VGMs for spatial planning; (2) Reflective Entropy-Aware Test-Time Scaling (ReaTTS), an inference framework that evolves multiple reasoning branches while enabling recovery from failures; and (3) hierarchical constraint verification, which provides actionable feedback to ReaTTS based on decomposed constraints. Extensive experiments demonstrate that ReaForest substantially surpasses advanced textual reasoning models (e.g., Gemini-2.5-Pro) and video generation models (e.g., Sora-2) on spatial planning tasks. ReaForest further exhibits emergent properties, including self-correction, parallel thinking, and scalable reasoning, advancing VGMs toward human-like visual mental simulation.