OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention
Abstract
Humans perceive the world through diverse modalities that operate synergistically to support a holistic understanding of their surroundings. However, existing omnimodal models still exhibit substantial performance degradation on visual tasks when the audio modality is incorporated. We identify this “modality interference” as a consequence of pre-training data imbalances, where the scarcity of mixed-modality supervision induces a bias towards isolated modalities, resulting in an inherent trade-off between them. To address this challenge, we propose OmniVideo-R1, a novel reinforced reasoning framework that leverages post-training to rectify modality bias. OmniVideo-R1 empowers models to “think with omnimodal cues” and integrate cross-modal information. The framework consists of two key strategies: (1) query-intention grounding based on self-supervised learning paradigms; and (2) modality-attentive fusion built upon contrastive learning paradigms. Extensive experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines, highlighting its effectiveness and robust generalization capabilities.
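The abstract does not specify the form of the contrastive objective underlying modality-attentive fusion; the following is only a minimal, hypothetical sketch of what a contrastive-learning-style fusion loss between audio and visual embeddings could look like (a symmetric InfoNCE loss), with all function names, tensor shapes, and the temperature value assumed for illustration rather than taken from the paper.

```python
# Illustrative sketch only: a generic symmetric InfoNCE-style contrastive loss
# between paired audio and visual embeddings, shown as one plausible way a
# contrastive "modality-attentive fusion" objective could be instantiated.
# Names, shapes, and hyperparameters here are hypothetical assumptions.
import torch
import torch.nn.functional as F


def contrastive_fusion_loss(audio_emb: torch.Tensor,
                            visual_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """Pull paired audio/visual embeddings together, push mismatched pairs apart."""
    audio_emb = F.normalize(audio_emb, dim=-1)    # (B, D)
    visual_emb = F.normalize(visual_emb, dim=-1)  # (B, D)
    logits = audio_emb @ visual_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    # Cross-entropy in both directions: audio-to-visual and visual-to-audio.
    loss_a2v = F.cross_entropy(logits, targets)
    loss_v2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2v + loss_v2a)


if __name__ == "__main__":
    # Toy usage with random embeddings for a batch of 8 clips.
    audio = torch.randn(8, 512)
    visual = torch.randn(8, 512)
    print(contrastive_fusion_loss(audio, visual).item())
```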