Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding
Yuan Xie ⋅ Tianshui Chen ⋅ Zheng Ge ⋅ Lionel Ni
Abstract
Long-form video understanding remains a formidable challenge due to the complexity of modeling long-range temporal dependencies and multi-event narratives. Existing methods often rely on static reasoning or external Vision-Language Models (VLMs), resulting in high computational complexity and sub-optimal performance. In this paper, we propose Video-MTR, a reinforced multi-turn reasoning framework that operates solely through data-efficient, pure RL post-training. Video-MTR reformulates video understanding as a dynamic decision-making process, where the agent iteratively selects key segments conditioned on the evolving context of previously processed frames and the query. To ensure effective intermediate reasoning and training stability, we introduce a novel gated bi-level reward system, which synergizes trajectory-level rewards (answer correctness) with turn-level rewards (frame-query relevance). This mechanism eliminates the need for data-intensive supervised fine-tuning, thereby substantially reducing reliance on large-scale datasets. Remarkably, Video-MTR achieves competitive or superior performance using only $\sim$8K training samples, compared to existing approaches that require 257K to 4.4M examples. Extensive experiments on benchmarks including VideoMME, MLVU, LongVideoBench, LVBench, and EgoSchema demonstrate that Video-MTR surpasses state-of-the-art methods in both accuracy and efficiency.
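The gated bi-level reward described above can be illustrated with a minimal sketch. The exact gating rule, reward scales, and weights are not specified in the abstract, so the form below, where turn-level relevance rewards contribute only when the trajectory-level answer is correct, is an assumption chosen for illustration; the function name and parameters are hypothetical.

```python
# Hypothetical sketch of a gated bi-level reward: a trajectory-level
# reward (answer correctness) combined with turn-level rewards
# (per-turn frame-query relevance). The gating rule and the
# turn_weight value are illustrative assumptions, not the paper's spec.

def gated_bilevel_reward(answer_correct: bool,
                         turn_relevance: list[float],
                         turn_weight: float = 0.5) -> float:
    """Return a scalar reward for one reasoning trajectory.

    The turn-level term is *gated*: it contributes only when the final
    answer is correct, so intermediate frame selection cannot be
    rewarded independently of task success.
    """
    trajectory_reward = 1.0 if answer_correct else 0.0
    if not answer_correct or not turn_relevance:
        return trajectory_reward
    mean_relevance = sum(turn_relevance) / len(turn_relevance)
    return trajectory_reward + turn_weight * mean_relevance
```

Under this sketch, a wrong final answer yields zero reward regardless of how relevant the selected frames were, while a correct answer earns an additional bonus proportional to the average per-turn relevance.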