Towards Understanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition
Abstract
Understanding \emph{modality interaction} in multimodal large language models (MLLMs) remains a central challenge for reliable and interpretable deployment. We introduce Partial Information Decomposition (PID) as a unified, decision-level framework that separates the \emph{unique}, \emph{redundant}, and \emph{synergistic} contributions of sensory and linguistic inputs, moving beyond representation alignment and outcome-based evaluation. Across vision–language benchmarks, PID reveals stable \emph{interaction regimes}: reasoning-oriented tasks consistently exhibit high cross-modal synergy, whereas knowledge-oriented tasks are dominated by language-unique information. These regimes generalize across architectures and scales and predict causal sensitivity to modality-level interventions. We extend this framework to tri-modal systems with Sensory PID, treating language as a control variable to decompose the information gain from video and audio. Applied to omni-modal models, this analysis uncovers a persistent \emph{sensory synergy bottleneck}, where decisions remain dominated by visual information even on fusion-dependent tasks. Layer-wise analysis further shows that sensory integration emerges late and is instruction-gated, following early visual saturation.
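For orientation only, the decomposition referenced above follows the general form of the standard bivariate PID identity (Williams and Beer); the exact estimator and term definitions used in this work may differ. Here $X_1, X_2$ denote two input modalities, $Y$ the model decision, and $U$, $R$, $S$ are generic placeholders for the unique, redundant, and synergistic terms:
\[
I(Y; X_1, X_2) \;=\; \underbrace{U(Y; X_1)}_{\text{unique to } X_1} \;+\; \underbrace{U(Y; X_2)}_{\text{unique to } X_2} \;+\; \underbrace{R(Y; X_1, X_2)}_{\text{redundant}} \;+\; \underbrace{S(Y; X_1, X_2)}_{\text{synergistic}},
\]
subject to the consistency constraints $I(Y; X_i) = U(Y; X_i) + R(Y; X_1, X_2)$ for $i \in \{1, 2\}$.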