VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding
Jiapeng Shi ⋅ Junke Wang ⋅ Zuyao You ⋅ Bo He ⋅ Zuxuan Wu
Abstract
Recent advancements in Video Large Language Models (Video LLMs) have demonstrated impressive results, yet existing approaches handle either the temporal or the spatial dimension in isolation and struggle to analyze complex events that require spatial-temporal integration. To bridge this gap, we propose VideoLoom, a unified Video LLM for joint spatial-temporal understanding. To facilitate the development of fine-grained spatial and temporal localization capabilities, we curate LoomData-8.7k, a character-centric video dataset with temporally grounded and spatially localized captions. With this, VideoLoom achieves state-of-the-art performance across a variety of spatial and temporal benchmarks. In addition, we introduce LoomBench, a benchmark consisting of temporal, spatial, and compositional video–question pairs, together with a novel metric, $J\&F_{bi\text{-}fore}$, enabling a comprehensive evaluation of Video LLMs from diverse aspects. Collectively, these contributions offer a universal and effective suite for joint spatial-temporal video understanding, setting a new standard in multimodal intelligence.