Securing Multimodal AI through Internal Information Decomposition
Jehyeok Yeon ⋅ Hyeonjeong Ha ⋅ Qiusi Zhan ⋅ Heng Ji
Abstract
Multimodal large language models introduce attack surfaces absent in unimodal systems: adversaries can distribute malicious intent across modalities to evade unimodal safeguards. This motivates using cross-modal consistency as a detection signal rather than inspecting each modality in isolation. Our key observation is that benign inputs induce compatible predictive behavior from text-only and vision-only reasoning that stabilizes when fused, whereas adversarial manipulation disrupts this consistency, causing abnormal multimodal behavior. Existing defenses that examine raw inputs or outputs overlook this internal fusion process, rendering them brittle and computationally expensive. We propose FlowGuard, a lightweight inference-time framework that detects harmful inputs by monitoring internal multimodal consistency. Unlike approaches that rely on scalar confidence metrics, FlowGuard derives FlowVectors, inspired by Partial Information Decomposition, that quantify cross-modal redundancy, synergy, and modality-specific dominance between unimodal and fused multimodal output distributions, capturing whether multimodal fusion aligns with unimodal semantic evidence. Framed as a one-class classification problem trained solely on benign data, FlowGuard reduces Attack Success Rates from $>90\%$ to $<15\%$ on unseen attacks, with $<3\%$ utility loss and up to a $6\times$ latency reduction. Our results demonstrate that monitoring cross-modal consistency offers an efficient and effective defense for multimodal reasoning.
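The abstract does not specify how the PID-inspired FlowVector features are computed, so the following is only a minimal illustrative sketch under assumed definitions: it uses KL divergences between the text-only, vision-only, and fused output distributions as stand-in proxies for redundancy (do the modalities agree?), synergy (does fusion depart from both unimodal predictions?), and dominance (which modality does the fused output track?). The function name `flow_vector` and these particular divergence-based definitions are hypothetical, not the paper's actual formulation.

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions (epsilon-smoothed)."""
    eps = 1e-12
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def flow_vector(p_text, p_vision, p_fused):
    """Hypothetical PID-inspired features over a shared output vocabulary.

    redundancy : high when the two unimodal predictions agree
    synergy    : how far the fused prediction departs from BOTH unimodal ones
    dominance  : positive -> fused output tracks text more closely,
                 negative -> tracks vision more closely
    """
    redundancy = -kl(p_text, p_vision)
    synergy = min(kl(p_fused, p_text), kl(p_fused, p_vision))
    dominance = kl(p_fused, p_vision) - kl(p_fused, p_text)
    return redundancy, synergy, dominance

# Benign-like input: modalities agree and fusion stays consistent with them.
benign = flow_vector([0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.65, 0.25, 0.1])
# Adversarial-like input: fusion diverges sharply from both unimodal predictions.
attacked = flow_vector([0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.05, 0.05, 0.9])
print(benign)
print(attacked)
```

Under this toy definition, the attacked input yields a much larger synergy term than the benign one; a one-class detector fit on benign FlowVectors would flag such outliers.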