Native Active Perception as Reasoning for Omni-Modal Understanding
Zhenghao Xing ⋅ Ruiyang Xu ⋅ Yuxuan Wang ⋅ Jinzheng He ⋅ Ziyang Ma ⋅ Qize Yang ⋅ Yunfei Chu ⋅ Jin Xu ⋅ Junyang Lin ⋅ Chi-Wing Fu ⋅ Pheng-Ann Heng
Abstract
Passive models for long video understanding typically rely on a ``watch-it-all'' paradigm, processing data uniformly regardless of query difficulty, so input complexity scales linearly with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning and thus fail to strictly decouple perception cost from temporal length. We introduce OmniAgent, a POMDP-based active perception framework. OmniAgent executes on-demand actions to selectively distill audio-visual signals into a persistent textual memory, imposing an information bottleneck that fundamentally decouples reasoning complexity from raw video duration. To operationalize this, we bootstrap the agent via success-driven trajectory synthesis and optimize its policy using TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage). TAURA addresses credit-assignment ambiguity in long-horizon reasoning by leveraging token entropy as a proxy for decision criticality, explicitly steering gradients toward pivotal discovery steps. Crucially, OmniAgent demonstrates a positive test-time scaling property: performance improves as the number of reasoning turns increases, validating the efficacy of adaptive perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) demonstrate that OmniAgent establishes new state-of-the-art performance. Notably, on LVBench, our 7B agent outperforms the 10$\times$ larger Qwen2.5-VL-72B (50.5 vs. 47.3), validating the effectiveness of query-conditional active perception.
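To make the TAURA idea concrete, below is a minimal, hypothetical sketch of an entropy-rescaled, turn-aware advantage. The abstract does not give the exact formula, so the function name, the use of mean per-turn token entropy, and the mass-preserving normalization are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def taura_advantage(base_advantage, turn_token_entropies, eps=1e-8):
    """Hypothetical sketch of a turn-aware, entropy-rescaled advantage.

    Assumption: each turn's mean token entropy proxies how critical that
    turn's decision was; high-entropy (pivotal discovery) turns receive a
    larger share of the trajectory-level advantage, steering policy
    gradients toward them.
    """
    # Mean token entropy per turn (one scalar per reasoning turn).
    turn_scores = np.array([e.mean() for e in turn_token_entropies])
    # Normalize so rescaling preserves the total advantage mass.
    weights = turn_scores / (turn_scores.sum() + eps) * len(turn_scores)
    # Broadcast the trajectory-level advantage into per-turn advantages.
    return base_advantage * weights

# Example: a 3-turn trajectory sharing one trajectory-level advantage.
entropies = [np.array([0.2, 0.3]),   # routine turn
             np.array([1.1, 0.9]),   # pivotal, high-uncertainty turn
             np.array([0.4, 0.5])]   # routine turn
print(taura_advantage(0.8, entropies))  # middle turn gets the largest weight
```

Under these assumptions, a uniform per-turn split would dilute credit across routine turns, whereas the entropy weighting concentrates it on the high-uncertainty turn.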