DEGAP: Dynamic Entropy-Guided Attention Perturbation for Contrastive Decoding in Large Vision-Language Models
Abstract
Large Vision–Language Models (LVLMs) have shown outstanding performance across various multimodal tasks, but they still suffer from hallucinations, generating incorrect content by relying on language priors rather than visual evidence. To alleviate this issue, prior work has explored contrastive decoding approaches that contrast the output of the original LVLM with that of a contrast branch. However, existing methods typically obtain the contrast logits by preprocessing the input image, and such input-level perturbations cannot fully reflect the model's internal degree of visual reliance during decoding. To address this limitation, we propose Dynamic Entropy-Guided Attention Perturbation (DEGAP) for contrastive decoding in LVLMs. DEGAP performs contrastive decoding by directly perturbing visual attention and leveraging the resulting logits, without requiring any additional image preprocessing. To this end, we analyze the layer-wise effects of visual attention perturbations and, based on these observations, dynamically select the layers at which attention perturbation is applied according to the model's confidence. Experimental results on seven benchmarks demonstrate that DEGAP effectively mitigates various types of hallucinations and consistently outperforms state-of-the-art methods in general VQA performance.
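The decoding scheme outlined above could be sketched roughly as follows. This is only an illustrative toy, not the paper's actual algorithm: the entropy threshold, the layer-selection rule, and all function names (`select_perturb_layers`, `contrastive_logits`) are assumptions; only the general shape, contrasting original logits against logits produced under visual-attention perturbation, with the perturbed layers chosen from the model's output entropy, follows the description.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def entropy(logits):
    # Shannon entropy of the next-token distribution (a proxy for confidence).
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum())

def select_perturb_layers(logits, num_layers=32, threshold=2.0):
    # Hypothetical dynamic rule: when the model is uncertain (high entropy),
    # perturb visual attention in more layers; when confident, in fewer.
    k = num_layers // 2 if entropy(logits) > threshold else num_layers // 8
    return list(range(num_layers - k, num_layers))

def contrastive_logits(orig, perturbed, alpha=1.0):
    # Standard contrastive-decoding combination: amplify the original logits
    # relative to the attention-perturbed (visually degraded) branch.
    return (1 + alpha) * orig - alpha * perturbed
```

In a real LVLM, `perturbed` would come from a second forward pass in which visual-attention weights in the selected layers are perturbed; here the functions only illustrate how entropy could gate which layers are touched.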