Poster Wed, Jul 8, 2026 • 6:30 PM – 8:15 PM PDT HALL A #1508

Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models

Yuriel Ryan ⋅ Ip Man ⋅ Adriel Kuek ⋅ Paul Pu Liang ⋅ Roy Lee

Abstract

Current vision language models suffer from hallucination and poor robustness when a modality is ambiguous or corrupted. We hypothesize that these issues can be addressed by exploiting the information shared between modalities to compensate for the impaired one. To this end, we analyze multimodal interactions, namely the redundant (shared), unique (exclusive), and synergistic (emergent) task-relevant information provided by the modalities, to determine their impact on model reliability. Specifically, amplifying redundant interactions would increase this exploitable shared information and help resolve these issues; yet modern instruction datasets often eliminate redundancies to prioritize visual grounding. We bridge this gap through a self-captioning workflow featuring a Multimodal Interaction Gate: a mechanism that converts unique interactions into redundant interactions. Our findings suggest that increasing redundancy can reduce visually induced errors by 38.3% and improve consistency by 16.8%.

Lay Summary

Vision-language models are prone to mistakes when one input is unclear: they make up objects that do not exist or contradict themselves on similar prompts. We hypothesize that increasing the degree of shared information (redundancy) between the inputs could help address these mistakes, in the same way that humans exploit redundancy to cover for an ambiguous input (e.g., reading captions to decipher a blurry image). We test this by drawing on a framework from information theory that measures the degree of shared, exclusive, or emergent information provided by the input modalities, collectively called multimodal interactions. This lets us directly adjust the interactions within the training data rather than tweaking the model itself. We integrate the framework into our proposed mechanism, the Multimodal Interaction Gate, which has the model selectively caption its own training images to produce calibrated increases in redundancy. We show that models trained this way produce fewer visual hallucinations and are more consistent when the inputs are ambiguous or corrupted. More broadly, our work offers a more nuanced view of the training data for these models: the convention of concentrating task-relevant information within the image may not be optimal if reliability is the end goal.
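
To make the three interaction types concrete, the toy sketch below computes mutual information on small discrete examples and uses co-information, I(X1;Y) + I(X2;Y) - I(X1,X2;Y), as a rough proxy for whether redundancy or synergy dominates. This is an illustration only: the proxy, the function names (`mutual_information`, `interaction_profile`) and the toy "caption" step are assumptions made here, not the interaction measure or the Multimodal Interaction Gate used by the authors.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(pairs):
    """I(A;B) in bits, from a list of equally likely (a, b) outcomes."""
    n = len(pairs)
    joint = Counter(pairs)
    p_a = Counter(a for a, _ in pairs)
    p_b = Counter(b for _, b in pairs)
    return sum(
        (c / n) * log2((c / n) / ((p_a[a] / n) * (p_b[b] / n)))
        for (a, b), c in joint.items()
    )


def interaction_profile(samples):
    """samples: equally likely (image, text, label) outcomes.

    Co-information is only a coarse proxy (an assumption of this sketch):
    positive suggests redundancy dominates, negative suggests synergy.
    """
    i_img = mutual_information([(x1, y) for x1, _, y in samples])
    i_txt = mutual_information([(x2, y) for _, x2, y in samples])
    i_both = mutual_information([((x1, x2), y) for x1, x2, y in samples])
    return {"image": i_img, "text": i_txt, "joint": i_both,
            "co_information": i_img + i_txt - i_both}


bits = [0, 1]
# Redundant: image and text both carry the label.
redundant = [(y, y, y) for y in bits]
# Unique: only the image carries the label; the text is unrelated.
unique = [(x1, x2, x1) for x1, x2 in product(bits, bits)]
# Synergistic: the label is the XOR of both inputs.
synergistic = [(x1, x2, x1 ^ x2) for x1, x2 in product(bits, bits)]
# Toy "self-captioning": append a description of the image to the text,
# so the label information that was unique to the image becomes shared.
captioned = [(x1, (x2, x1), y) for x1, x2, y in unique]

for name, data in [("redundant", redundant), ("unique", unique),
                   ("synergistic", synergistic), ("captioned", captioned)]:
    print(name, interaction_profile(data))
```

In this toy setting, the text modality of the `unique` example says nothing about the label, while the `captioned` variant of the same example lets either modality recover it, which mirrors the idea of converting unique interactions into redundant ones.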