ActiveScope: Actively Seeking and Correcting Perception for MLLMs
Yajing Wang ⋅ Chao Bi ⋅ Junshu Sun ⋅ Shufan Shen ⋅ Zhaobo Qi ⋅ Shuhui Wang ⋅ Qingming Huang
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in vision-language understanding, yet they still struggle with fine-grained perception in high-resolution images. Existing training-free methods typically rely on attention-based localization or coarse-to-fine search, but they are often misled by distractors and fail to locate multiple targets. To understand these limitations, our investigation reveals two causes of failed localization: (a) \textit{Contextual Dominance}, where salient distractors overwhelm target attention, leading to inaccurate localization, and (b) \textit{Semantic Bias}, where aggregated global semantics cause the model to fixate on the most salient concept, resulting in incomplete localization in multi-object scenarios. Building on these insights, we propose {ActiveScope}, a training-free framework that enhances MLLMs by actively seeking and correcting perception. ActiveScope features two modules. The \textit{Semantic Anchor Localization (SAL)} module uses fine-grained semantics as anchors to localize each key target independently, thereby mitigating semantic bias. The \textit{Interference-Suppressed Refinement (ISR)} module refines localization by suppressing attention on salient distractors, effectively overcoming contextual dominance. Extensive experiments on high-resolution image understanding benchmarks demonstrate that ActiveScope outperforms existing training-free methods (e.g., 96.34\% accuracy on $V^{*}$ Bench), validating the superiority of the active search and self-correction paradigm.
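To make the contextual-dominance failure mode and the suppress-then-refine idea concrete, the following is a minimal, self-contained sketch in Python. It is an illustration only, not the authors' implementation: the helper names (`localize_with_anchor`, `suppress_and_rescore`), the peak-relative thresholding rule, and the toy attention map are all assumptions introduced here for exposition.

```python
# Hypothetical sketch: a salient distractor dominates an attention map,
# so a single localization pass misses the true target; suppressing the
# distractor region and re-localizing recovers it.
import numpy as np

def localize_with_anchor(attn_map, threshold=0.6):
    """Return grid cells whose attention exceeds a fraction of the peak.

    Stands in for one per-anchor localization pass: each anchor gets its
    own pass, so multiple targets are not collapsed into one salient one.
    """
    peak = attn_map.max()
    ys, xs = np.where(attn_map >= threshold * peak)
    return list(zip(ys.tolist(), xs.tolist()))

def suppress_and_rescore(attn_map, distractor, radius=1):
    """Zero out a salient distractor neighborhood and renormalize.

    Stands in for the refinement step: damping a dominant distractor
    lets weaker attention on the true target re-emerge.
    """
    out = attn_map.copy()
    y, x = distractor
    out[max(0, y - radius):y + radius + 1,
        max(0, x - radius):x + radius + 1] = 0.0
    total = out.sum()
    return out / total if total > 0 else out

# Toy usage: a distractor at (1, 1) overwhelms the target at (4, 4).
attn = np.full((6, 6), 0.01)
attn[1, 1] = 0.50   # salient distractor
attn[4, 4] = 0.25   # true target
print(localize_with_anchor(attn))                 # [(1, 1)]: only the distractor
refined = suppress_and_rescore(attn, distractor=(1, 1))
print(localize_with_anchor(refined))              # [(4, 4)]: target re-emerges
```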