Beyond Blind Noising: Disentangled Visual Rectification for Hallucination Mitigation in MLLMs
Yujia Chen ⋅ Rui Sun ⋅ Zhaoyang Li ⋅ Wangkai Li ⋅ Huayu Mai ⋅ Bingzhou Wang ⋅ Aibing Li ⋅ Wenzhang Sun
Abstract
Visual Contrastive Decoding (VCD) mitigates hallucinations in Multimodal Large Language Models (MLLMs) by penalizing the output shift induced by noise-perturbed images, assuming this shift captures the hallucination direction. We prove this assumption is flawed: the noise-induced drift in Language-Image Pretrained (LIP) encoders is a \emph{coupled vector} that entangles (i) structural degradation from corrupted visual information with (ii) hallucination induction from linguistic prior activation. VCD's indiscriminate penalty therefore inevitably suppresses valid visual semantics. Our key insight is that Self-Supervised Learning (SSL) encoders exhibit \emph{only} structural degradation under noise, geometrically orthogonal to hallucination paths, which enables principled disentanglement via the LIP--SSL differential response. We propose \textbf{Disentangled Visual Rectification (DVR)}, a training-free dual-stream framework that performs visual-layer rectification and decoding-layer contrast on the purified representations. DVR achieves an approximately $5\times$ theoretical error reduction over VCD and establishes state-of-the-art performance on the POPE, MME, LLaVA-Bench, and CHAIR benchmarks.
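To make the baseline concrete, the contrastive decoding rule that VCD applies (and that DVR builds on) can be sketched as follows. This is a minimal illustration of the standard contrast between logits conditioned on the clean image and logits conditioned on the noise-perturbed image; the function name, toy logits, and the value of the contrast weight `alpha` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def vcd_contrastive_logits(logits_clean, logits_noised, alpha=1.0):
    """VCD-style contrastive decoding (illustrative sketch).

    Amplifies the logits conditioned on the original image and
    penalizes the distribution conditioned on the noise-perturbed
    image, on the assumption that the noised branch over-expresses
    hallucination-prone linguistic priors.
    """
    return (1.0 + alpha) * logits_clean - alpha * logits_noised

# Toy example over a 3-token vocabulary.
clean = np.array([2.0, 1.0, 0.5])    # logits from the original image
noised = np.array([0.5, 1.0, 2.0])   # logits from the noised image
out = vcd_contrastive_logits(clean, noised, alpha=1.0)
```

The abstract's critique is that this penalty term is applied indiscriminately: because the noised-branch shift couples structural degradation with prior activation, subtracting it also suppresses valid visual evidence.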