HiDe: Rethinking the Zoom-In Method in High-Resolution MLLMs via Hierarchical Decoupling
Abstract
Multimodal Large Language Models (MLLMs) have made substantial progress on visual understanding tasks, yet they still perform poorly on high-resolution images. Prior work often attributes this limitation to perceptual constraints, arguing that MLLMs fail to recognize small objects and must therefore rely on "zoom-in" strategies to recover fine details. In contrast, our analysis shows that the dominant failure mode is background interference rather than object size. We study the "zoom-in" operation through a hierarchical decoupling analysis and propose the Hierarchical Decoupling Framework (HiDe), a training-free method that turns implicit attention into explicit region selection. HiDe first performs Token-wise Attention Decoupling (TAD) to disentangle question semantics and identify the most informative tokens, then uses their attention patterns to pinpoint the corresponding visual regions. It subsequently applies Layout-Preserving Decoupling (LPD) to extract these regions from cluttered backgrounds and construct a compact representation that retains key spatial structure while filtering out irrelevant context. HiDe achieves state-of-the-art results on high-resolution benchmarks including V*Bench, HRBench4K, and HRBench8K, boosting Qwen2.5-VL 7B and InternVL3 8B to 92.1\% and 91.6\% on V*Bench, respectively, and even surpassing reinforcement-learning-based methods. After optimization, HiDe reduces memory usage by 75\% compared with the previous training-free approach.
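To make the two stages concrete, below is a minimal, illustrative Python sketch of the pipeline the abstract describes. The entropy-based token scoring, the saliency threshold, and all shapes are assumptions for illustration, not the authors' implementation; in particular, the paper's LPD builds a compact representation of the selected regions, whereas this sketch only masks the background in place for brevity.

```python
# Minimal sketch of the two-stage HiDe pipeline described in the abstract.
# All shapes, thresholds, and scoring rules here are illustrative assumptions.
import numpy as np

def token_wise_attention_decoupling(attn, top_k=3):
    """TAD (sketch): select the most informative question tokens and pool
    their attention over image patches into a single saliency map.

    attn: (num_text_tokens, num_patches) cross-attention weights.
    Informativeness is approximated by low attention entropy, i.e. tokens
    that focus sharply on a few patches -- an assumed scoring rule.
    """
    probs = attn / attn.sum(axis=1, keepdims=True)
    entropy = -(probs * np.log(probs + 1e-8)).sum(axis=1)
    informative = np.argsort(entropy)[:top_k]   # sharpest-focused tokens
    return attn[informative].mean(axis=0)       # pooled saliency per patch

def layout_preserving_decoupling(image, saliency, grid, patch, thresh=0.6):
    """LPD (sketch): keep only salient patches and blank out the background,
    leaving each kept patch at its original grid position so the spatial
    layout of the selected regions is preserved. (The actual method packs
    the regions into a compact representation; masking stands in for that.)
    """
    rows, cols = grid
    keep = saliency.reshape(rows, cols) >= thresh * saliency.max()
    out = np.zeros_like(image)                  # background filtered out
    for r in range(rows):
        for c in range(cols):
            if keep[r, c]:
                ys = slice(r * patch, (r + 1) * patch)
                xs = slice(c * patch, (c + 1) * patch)
                out[ys, xs] = image[ys, xs]
    return out, keep

# Toy usage: a 4x4 patch grid with random attention from 5 question tokens.
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
attn = rng.random((5, 16))
saliency = token_wise_attention_decoupling(attn)
compact, keep = layout_preserving_decoupling(image, saliency, (4, 4), 16)
print("patches kept:", int(keep.sum()), "of", keep.size)
```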