From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning
Abstract
Modern vision-language models achieve strong performance in static perception but remain limited in the complex spatiotemporal reasoning required for embodied, egocentric tasks. A major source of failure is their reliance on temporal priors learned from passive video data, which often leads to spatiotemporal hallucinations and poor generalization in dynamic environments. To address this, we present EgoTSR, a curriculum-based framework for learning task-oriented spatiotemporal reasoning. EgoTSR is built on the premise that embodied reasoning should evolve from explicit spatial understanding, to internalized task-state assessment, and finally to long-horizon planning. To support this paradigm, we construct EgoTSR-Data, a large-scale dataset comprising 46 million samples organized into three curriculum stages: Chain-of-Thought supervision for explicit reasoning, weakly supervised tagging for internalized reasoning, and long-horizon sequences composed of orthogonal atomic subtasks. We further introduce a Reasoning-Enhanced Task Decomposition mechanism that explicitly models causal dependencies between actions and objects, enabling controllable planning from high-level task descriptions. To rigorously validate our approach, we establish a Dual-Level Evaluation Framework covering both atomic spatial perception and logical planning. Extensive experiments demonstrate that EgoTSR effectively eliminates chronological bias, achieving 92.4% accuracy on long-horizon logical reasoning tasks while maintaining 88.2% fine-grained perceptual precision, significantly outperforming existing open-source and closed-source state-of-the-art models.