Debate with Images: Detecting Deceptive Behaviors in Multimodal Large Language Models
Sitong Fang ⋅ Shiyi Hou ⋅ Kaile Wang ⋅ Boyuan Chen ⋅ Donghai Hong ⋅ Jiayi Zhou ⋅ Juntao Dai ⋅ Yaodong Yang ⋅ Jiaming Ji
Abstract
As frontier AI systems become increasingly capable, concerns about deceptive behaviors have intensified. Unlike hallucinations, which stem from capability limitations, deception involves strategically misleading responses despite correct internal representations. While prior work has primarily studied deception in text-only settings, little is known about how such behaviors manifest in multimodal large language models (MLLMs). In this work, we systematically investigate multimodal deception and introduce *MM-DeceptionBench*, the first benchmark designed to evaluate deceptive behaviors in vision–language models across six realistic categories. We find that existing text-centric monitoring approaches are insufficient in multimodal settings due to the complexity of cross-modal reasoning. To address this gap, we propose *debate with images*, a multi-agent evaluation framework that enforces visual grounding through adversarial debate. Experiments show that this approach achieves substantially higher agreement with human judgments than MLLM-as-a-judge baselines, improving Cohen's kappa by up to 1.5$\times$ and accuracy by up to 1.25$\times$ on GPT-4o.