Poster Tue, Jul 7, 2026 • 6:30 PM – 8:15 PM PDT HALL A #3310

Probing Cross-modal Information Hubs in Audio-Visual LLMs

Jihoo Jung ⋅ Chaeyoung Jung ⋅ Ji-Hoon Kim ⋅ Joon Son Chung

Abstract

Audio-visual large language models (AVLLMs) have recently emerged as a powerful architecture capable of jointly reasoning over audio, visual, and textual modalities. In AVLLMs, the bidirectional interaction between audio and video modalities introduces intricate processing dynamics, necessitating a deeper understanding of their internal mechanisms. However, unlike extensively studied text-only LLMs and large vision-language models, the internal workings of AVLLMs remain largely unexplored. In this paper, we focus on cross-modal information flow between audio and visual modalities in AVLLMs, investigating where information derived from one modality is encoded within the token representations of the other modality. Through an analysis of multiple recent AVLLMs, we uncover two common findings. First, AVLLMs primarily encode integrated audio-visual information in sink tokens. Second, sink tokens do not uniformly hold cross-modal information. Instead, a distinct subset of sink tokens, which we term cross-modal sink tokens, specializes in storing such information. Based on these findings, we further propose a simple, training-free hallucination mitigation method that encourages reliance on the integrated cross-modal information held in cross-modal sink tokens. Our code is available at https://github.com/kaistmm/crossmodal-hub.
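
For readers unfamiliar with the term, the sketch below illustrates one common way of flagging sink tokens: positions that receive a disproportionate share of attention. It is an illustrative assumption, not the criterion used in the paper. The function name find_sink_tokens, the ratio threshold, and the toy attention tensor are all hypothetical, and the sketch ignores causal masking.

```python
def find_sink_tokens(attn, ratio=4.0):
    """Flag key positions that receive far more attention than a uniform share.

    attn: nested lists indexed as [head][query][key]; each query row sums to 1.
    ratio: hypothetical threshold, as a multiple of the uniform share 1/seq_len.
    NOTE: an assumed, simplified criterion; the paper may define sinks differently.
    """
    num_heads = len(attn)
    seq_len = len(attn[0])
    received = [0.0] * seq_len
    # Accumulate the attention mass each key position receives.
    for head in attn:
        for row in head:
            for key, weight in enumerate(row):
                received[key] += weight
    num_rows = num_heads * seq_len      # total number of query rows summed
    uniform = 1.0 / seq_len             # share per key under uniform attention
    return [k for k, total in enumerate(received)
            if total / num_rows > ratio * uniform]


# Toy example: 2 heads, 8 tokens, every query puts 0.65 of its mass on token 0.
row = [0.65] + [0.05] * 7
toy_attn = [[row[:] for _ in range(8)] for _ in range(2)]
print(find_sink_tokens(toy_attn))  # prints [0]
```

Identifying which of the flagged positions are cross-modal sink tokens would need an additional test on the information they carry from the other modality, which this sketch does not attempt.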

Lay Summary

Audio-visual LLMs (AVLLMs) can now describe videos by using both what is visible and what is audible. But when these systems get something right or make a mistake, we still know little about how they combine sound and visual information internally. In this paper, we look inside several AVLLMs to find where information from one sense is stored in the model’s representation of the other. We discover that shared audio-visual information is concentrated in special internal positions called sink tokens, rather than being spread evenly across all tokens or stored mainly in object-specific regions. More importantly, only some of these sink tokens serve this role: cross-modal sink tokens act like meeting points where sound and visual cues are integrated. We use this insight to build a training-free method that steers the model toward these cross-modal tokens, reducing object hallucinations when it generates captions. These findings help explain how AVLLMs combine different senses and offer a practical step toward more trustworthy video understanding systems.