OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL
Abstract
Existing forgery detection methods are often limited to uni-modal or bi-modal settings and fail to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, we propose OmniVL-Guard, a unified framework for omni vision-language forgery detection and grounding. In this unified setting, the interplay among diverse modalities and the dual requirements of simultaneous detection and localization pose significant optimization challenges. Through extensive investigation, we identify a critical difficulty bias in this multi-task optimization: the simpler veracity classification task tends to dominate the gradient signal, leading to suboptimal performance on fine-grained grounding. To address this imbalance, we first develop a Self-Evolving CoT Generation pipeline that synthesizes high-quality reasoning paths, effectively overcoming the cold-start challenge. Building on this, we propose Adaptive Reward Scaling Policy Optimization (ARSPO), which dynamically modulates reward scales and task weights to ensure balanced joint optimization that prioritizes the more challenging grounding objectives. Extensive experiments demonstrate that OmniVL-Guard significantly outperforms state-of-the-art methods and exhibits robust zero-shot generalization to out-of-domain scenarios.
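The abstract states only that ARSPO modulates reward scales and task weights to counteract the easier classification task dominating optimization; the exact formulation is not given here. As a minimal illustrative sketch (all function names, the softmax-over-difficulty weighting, and the temperature parameter are our assumptions, not the paper's definition), difficulty-adaptive weighting could look like:

```python
import math

def adaptive_task_weights(task_accuracies, temperature=1.0):
    """Hypothetical sketch: softmax over per-task difficulty (1 - running
    accuracy), so harder tasks (e.g. fine-grained grounding) receive larger
    weight and the easy veracity task cannot dominate the gradient signal.
    This is an illustration of the general idea, not the paper's ARSPO."""
    difficulties = [1.0 - a for a in task_accuracies]
    exps = [math.exp(d / temperature) for d in difficulties]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_reward(task_rewards, task_accuracies, temperature=1.0):
    """Combine per-task RL rewards using the difficulty-adaptive weights."""
    weights = adaptive_task_weights(task_accuracies, temperature)
    return sum(w * r for w, r in zip(weights, task_rewards))

# Example: classification is easy (0.92 running accuracy), grounding is
# hard (0.41), so grounding gets the larger weight in the combined reward.
weights = adaptive_task_weights([0.92, 0.41])
```

Under this sketch, a reward on the easy task alone contributes less than half of the combined signal, which is the balancing behavior the abstract attributes to ARSPO.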