VideoTrace-R1: Long Video-based Retrieval-Augmented Generation via Temporal Path Graph Understanding
Abstract
Long-video temporal reasoning remains challenging for Large Video Language Models (LVLMs). Recent reasoning-enhanced models apply reinforcement learning with outcome supervision to improve temporal understanding. However, outcome-only rewards cannot distinguish whether a model arrived at the correct answer through valid temporal reasoning or through fabricated claims, a fundamental limitation that undermines trustworthiness. We observe a key structural correspondence: in videos, events form \emph{temporal traces}, ordered sequences of how entities interact over time; in model reasoning, \emph{reasoning traces} capture step-by-step temporal claims. Correct temporal reasoning requires the latter to mirror the former. This correspondence enables us to \emph{verify} reasoning traces against the video's temporal structure. We introduce \textbf{Temporal Reasoning Traces (TRT)}, a unified representation that indexes ordered event chains from videos and serves as a verification oracle for model reasoning. Building on TRT, we propose \textbf{trace-grounded process supervision}: during reinforcement learning, each temporal claim in the model's reasoning trace is programmatically verified against TRT, rewarding grounded reasoning and penalizing fabrications. Unlike neural reward models, which may themselves err, our verification is fully deterministic. Extensive experiments demonstrate the effectiveness of our approach, which achieves state-of-the-art performance.
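As a minimal, hypothetical sketch of the verification idea (not the paper's implementation), the snippet below models a TRT as an ordered event chain, deterministically checks each ``X happens before Y'' claim from a reasoning trace against that order, and scores grounded versus ungrounded claims; the names TemporalTrace, verify_before_claim, and process_reward are illustrative assumptions.
\begin{verbatim}
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class TemporalTrace:
    """Ordered event chain indexed from a video (hypothetical representation)."""
    events: List[str]  # events listed in their temporal order of occurrence

    def order_of(self, event: str) -> Optional[int]:
        # Position of the event in the chain, or None if it never occurs.
        try:
            return self.events.index(event)
        except ValueError:
            return None

def verify_before_claim(trace: TemporalTrace, earlier: str, later: str) -> bool:
    """Deterministically check a claim of the form 'earlier happens before later'."""
    i, j = trace.order_of(earlier), trace.order_of(later)
    return i is not None and j is not None and i < j

def process_reward(trace: TemporalTrace, claims: List[Tuple[str, str]]) -> float:
    """Toy process reward: +1 per grounded claim, -1 per ungrounded one."""
    return sum(1.0 if verify_before_claim(trace, a, b) else -1.0 for a, b in claims)

# Example: two claimed orderings, one grounded and one contradicting the trace.
trace = TemporalTrace(events=["person enters room", "opens fridge", "pours a drink"])
claims = [("person enters room", "pours a drink"),
          ("pours a drink", "opens fridge")]
print(process_reward(trace, claims))  # 0.0
\end{verbatim}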