MM-Snowball: Evaluating and Mitigating Hallucination Snowballing in Multimodal Multi-turn Dialogue
Abstract
Multimodal Large Language Models (MLLMs) demonstrate remarkable visual understanding, yet their reliability in interactive settings is severely undermined by hallucination snowballing: a phenomenon in which initial errors amplify across conversational turns, leading to a collapse in coherence. This failure exposes a fundamental vulnerability: models progressively neglect visual grounding and instead over-rely on a polluted textual history. Existing benchmarks are predominantly confined to single-turn VQA or simplistic dialogues and thus fail to capture the dynamics of error propagation in realistic, long-horizon interactions. To address this gap, we introduce MM-Snowball, the first benchmark for fine-grained diagnosis of hallucination snowballing in 6-turn dialogues. Extensive evaluation shows that our benchmark poses a significant challenge even to advanced MLLMs, and that existing mitigation methods designed for single-turn tasks are largely ineffective against it. To counteract this degradation, we propose Conflict-Aware Visual Rectification (CAVR), a training-free framework that mitigates snowballing through a synergistic dual mechanism: it refreshes visual grounding at the representation level and rectifies output distributions at the logit level, re-anchoring the model to visual facts. Experiments demonstrate that CAVR achieves state-of-the-art performance, offering a promising path toward more reliable interactive AI.