MADA-Attack: Transferable Multi-modal Attention Distraction Adversarial Attack against Vision Language Models
Abstract
Vision Language Models (VLMs) achieve strong performance across multi-modal tasks but remain vulnerable to universal adversarial perturbations (UAPs). Existing UAP methods operate mainly on the visual modality, overlooking structured textual semantics and cross-modal interactions, which limits their ability to disrupt alignment and to generalize across tasks and model architectures. To address these limitations, we propose the Multi-modal Attention Distraction Adversarial Attack (MADA-Attack) framework. We begin with several insight experiments and find that modality attention is distributed differently across layers and that the early phase of optimization is decisive. Building on these observations, we introduce Semantic Token Manipulation (STM) to steer text-guided attention, and Fused Embedding Training (FET) to jointly optimize textual and visual embedding losses for coordinated misalignment. We further incorporate an Adaptive Data Augmentation (ADA) strategy that dynamically balances attack strength, transferability, and training efficiency. Extensive experiments demonstrate that MADA-Attack consistently achieves state-of-the-art performance and strong transferability while remaining computationally lightweight, attaining average attack success rates (ASRs) of 82.60\% and 73.42\% on zero-shot classification and image captioning tasks, respectively. On the visual question answering (VQA) and image-text retrieval tasks, our method exceeds the SOTA baseline by 10\%. Our code will be available soon.