SpaceVista: All-Scale Visual Spatial Reasoning from mm to km
Abstract
With the current surge in spatial reasoning, researchers have made significant progress in understanding indoor scenes, but they still struggle with more diverse applications. This paper aims to advance all-scale spatial reasoning by tackling two key challenges: 1) the heavy reliance on indoor 3D scans and labor-intensive annotations for dataset curation; 2) the absence of all-scale modeling, which often leads to overfitting to a single type of scene. In this paper, we introduce a holistic solution that integrates a structured spatial reasoning knowledge system, scale-aware modeling, and a progressive training paradigm, as the first attempt to broaden the scope of all-scale spatial intelligence. Using a task-specific, specialist-driven automated pipeline, we curate over 38K video scenes across 5 spatial scales to create SpaceVista-1M, a dataset comprising 1M spatial QAs spanning 19 diverse tasks. While specialist models offer valuable domain knowledge, they are often unreliable evaluators. We therefore build an all-scale benchmark with precise annotations by manually recording and retrieving videos. Nevertheless, naive training on SpaceVista-1M often yields suboptimal results due to potential knowledge conflicts. Accordingly, we introduce SpaceVista-7B, a spatial reasoning model that accepts inputs beyond semantics and uses scale as an anchor for scale-aware experts and progressive rewards. Finally, extensive evaluations across 5 benchmarks, including our SpaceVista-Bench, demonstrate competitive performance, showcasing generalization across all scales and scenarios. All materials will be released at https://mm2km.github.io/.