Foresee-to-Ground: From Predictive Temporal Perception to Evidence-Driven Reasoning for Video Temporal Grounding
Abstract
Current Video-LLM approaches to Video Temporal Grounding (VTG) typically rely on direct timestamp generation from an unstructured visual-token stream, often resulting in brittle numeric predictions and inconsistent boundaries. To address this, we propose Foresee-to-Ground (F2G), a framework that enforces a verifiable Identify-then-Measure routine. F2G couples predictive temporal perception with evidence-driven reasoning: it learns boundary-sensitive temporal representations to construct a video-wide evidence pool of candidate event segments, then augments the LLM input with these citable evidence units and requires the model to identify the target moment by citing supporting evidence before measuring its final metric boundaries under the cited hypothesis. This design decouples event identification from precise measurement, effectively stabilizing the reasoning process. Extensive experiments demonstrate that F2G consistently improves grounding accuracy across diverse benchmarks, transfers robustly across different Video-LLM backbones, and preserves general video understanding capabilities.
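For intuition only, the sketch below shows one way the Identify-then-Measure routine summarized above could be wired around a generic Video-LLM: a proposal step builds a citable evidence pool, the model is prompted to cite one unit (identify), and only then to emit refined boundaries (measure). The helper names (propose_candidate_segments, video_llm_generate), the prompt format, and the example segments are hypothetical placeholders, not the paper's actual implementation.

```python
import re

def propose_candidate_segments(video_features):
    """Hypothetical stand-in for the boundary-sensitive proposal model:
    returns a video-wide evidence pool of (start_s, end_s, caption) tuples."""
    return [
        (0.0, 12.5, "a person enters the kitchen"),
        (12.5, 34.0, "the person chops vegetables on a board"),
        (34.0, 51.2, "the person washes dishes at the sink"),
    ]

def build_prompt(query, evidence_pool):
    """Serialize the evidence pool as citable units [E1], [E2], ... so the
    model must first identify the moment by citation, then measure it."""
    lines = [f"[E{i+1}] {s:.1f}s-{e:.1f}s: {cap}"
             for i, (s, e, cap) in enumerate(evidence_pool)]
    return (
        "Evidence pool:\n" + "\n".join(lines) + "\n"
        f"Query: {query}\n"
        "Step 1 (Identify): cite the single evidence unit containing the moment.\n"
        "Step 2 (Measure): refine its boundaries as <start>s - <end>s."
    )

def ground(query, video_features, video_llm_generate):
    """Identify-then-Measure loop around an arbitrary Video-LLM callable."""
    pool = propose_candidate_segments(video_features)
    reply = video_llm_generate(build_prompt(query, pool))
    cited = re.search(r"\[E(\d+)\]", reply)                 # Step 1: cited hypothesis
    span = re.search(r"([\d.]+)s\s*-\s*([\d.]+)s", reply)   # Step 2: measured boundaries
    if cited and span:
        return int(cited.group(1)), (float(span.group(1)), float(span.group(2)))
    if cited:
        # Fall back to the cited segment's coarse boundaries if measurement fails.
        s, e, _ = pool[int(cited.group(1)) - 1]
        return int(cited.group(1)), (s, e)
    return None
```

In this toy version, the decoupling claimed in the abstract shows up as two separately parsed outputs: the citation fixes the event hypothesis, and the measured span is interpreted only relative to that cited segment, so a failed measurement can still fall back to the segment's coarse boundaries.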