Poster Wed, Jul 8, 2026 • 1:00 AM – 2:45 AM PDT HALL A #409

Video-OPD: Efficient Post-Training of Multimodal Large Language Models for Temporal Video Grounding via On-Policy Distillation

Jiaze Li ⋅ Hao Yin ⋅ Haoran Xu ⋅ Boshen Xu ⋅ Wenhui Tan ⋅ Zewen He ⋅ Jianzhong Ju ⋅ Zhenbo Luo ⋅ Jian Luan

Abstract

Reinforcement learning has emerged as a principled post-training paradigm for Temporal Video Grounding (TVG) due to its on-policy optimization, yet existing methods based on Group Relative Policy Optimization (GRPO) remain fundamentally constrained by sparse reward signals and substantial computational overhead. We propose Video-OPD, an efficient post-training framework for TVG inspired by recent advances in on-policy distillation. Video-OPD optimizes trajectories sampled directly from the current policy, thereby preserving alignment between training and inference distributions, while a frontier teacher supplies dense, token-level supervision via a reverse KL divergence objective. This formulation retains the on-policy property critical for mitigating distributional shift while converting sparse, episode-level feedback into fine-grained, step-wise learning signals. Building on Video-OPD, we introduce Teacher-Validated Disagreement Focusing (TVDF), a lightweight training curriculum that iteratively prioritizes trajectories that are both teacher-reliable and maximally informative for the student, thereby improving training efficiency. Empirical results demonstrate that Video-OPD consistently outperforms GRPO while converging substantially faster and at lower computational cost, establishing on-policy distillation as an effective alternative to conventional reinforcement learning for TVG.
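To make the token-level reverse KL signal concrete, the following is a minimal sketch, not the authors' implementation. It assumes that student and teacher logits are available at every position of a trajectory sampled from the student, and that the divergence is taken over the full vocabulary at each position; the abstract does not specify these details. The function names and the random toy logits are illustrative only.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis (last axis).
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def reverse_kl_per_token(student_logits, teacher_logits):
    """Return KL(student || teacher) at every position, shape (T,).

    Both inputs have shape (T, V): T positions of a student-sampled
    trajectory, V vocabulary entries. This is an assumed formulation.
    """
    log_p_student = log_softmax(student_logits)
    log_p_teacher = log_softmax(teacher_logits)
    p_student = np.exp(log_p_student)
    # The expectation is taken under the student, which makes the KL "reverse".
    return (p_student * (log_p_student - log_p_teacher)).sum(axis=-1)

# Toy stand-ins for model outputs (hypothetical sizes and values).
rng = np.random.default_rng(0)
T, V = 5, 8
student_logits = rng.normal(size=(T, V))
teacher_logits = rng.normal(size=(T, V))

per_token = reverse_kl_per_token(student_logits, teacher_logits)
loss = per_token.mean()  # one dense value per token, averaged into a scalar
```

Every position contributes its own term to `loss`, in contrast to a single scalar reward per sampled trajectory. In a real training loop the same quantity would be computed in an autodiff framework so that gradients flow into the student only.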

Lay Summary

Understanding videos is important for many AI applications, from video search to robotics and virtual assistants. One particularly difficult problem is helping AI find the exact moment in a video that matches a text description, such as identifying when a person starts running or when two people begin interacting. This task requires the AI to connect language with complex events that unfold over time.

Recent AI systems have started using reinforcement learning, a training method where models improve through trial and error. While this approach can make video understanding more robust, existing methods are often slow, expensive to train, and inefficient because they provide only limited feedback about what the model did correctly or incorrectly during the reasoning process.

In this work, we introduce Video-OPD, a new training framework that helps AI models learn video understanding more effectively and efficiently. Instead of giving feedback only at the end of an entire video analysis, our method provides detailed guidance throughout the process, allowing the model to learn from each intermediate decision. This leads to faster and more stable learning while reducing computational cost. We further develop a lightweight training strategy called Teacher-Validated Disagreement Focusing (TVDF), which helps the model concentrate on the most informative and challenging training examples. This improves learning efficiency and helps the model make better use of labeled video data.

Experiments on several widely used video benchmarks show that our approach consistently outperforms existing reinforcement learning methods while requiring significantly less computation. Our method also generalizes well to broader video understanding tasks, demonstrating a more practical and scalable way to train future video AI systems.
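The selection idea behind TVDF can be sketched as a filter followed by a ranking. This is a loose illustration under stated assumptions, not the authors' procedure: the text does not define how teacher reliability or student informativeness is measured. Here `teacher_score` (for example, overlap between the teacher's predicted segment and the labeled segment) and `disagreement` (for example, mean per-token divergence between student and teacher) are hypothetical quantities, as are the threshold, the `keep` budget, and all names.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    sample_id: str
    teacher_score: float  # hypothetical reliability of the teacher on this sample
    disagreement: float   # hypothetical student-teacher gap on this sample

def select_focus_set(samples, min_teacher_score=0.7, keep=2):
    """Keep teacher-reliable samples, then prefer the largest disagreement."""
    # Step 1 (assumed): discard samples where the teacher itself is unreliable.
    reliable = [s for s in samples if s.teacher_score >= min_teacher_score]
    # Step 2 (assumed): rank the rest by how much the student differs from the teacher.
    ranked = sorted(reliable, key=lambda s: s.disagreement, reverse=True)
    return ranked[:keep]

# Toy data; all values are made up for illustration.
samples = [
    Sample("a", teacher_score=0.90, disagreement=0.10),
    Sample("b", teacher_score=0.30, disagreement=0.95),
    Sample("c", teacher_score=0.80, disagreement=0.60),
    Sample("d", teacher_score=0.75, disagreement=0.40),
]
focus_ids = [s.sample_id for s in select_focus_set(samples)]
print(focus_ids)  # ['c', 'd']
```

Sample "b" has the largest disagreement but is dropped because the teacher is unreliable on it, and "a" is reliable but ranks last because the student already agrees with the teacher. Re-running the selection as the student improves would give the iterative behavior the description implies.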