Localize and Neutralize: Gradient-Guided Token Suppression Against Visual Prompt Injection Attacks
Abstract
Adversarial images pose a severe security threat to multimodal large language models through prompt injection. Existing defenses largely lack a principled understanding of the underlying attack mechanism and struggle to balance efficiency and fidelity. In this work, we show that successful adversarial attacks do not rely on the entire image uniformly but instead depend on a small subset of critical image tokens. Based on this insight, we propose GTM, a defense that first localizes these critical tokens via gradient analysis and then neutralizes them through masking. We show that attribution based on output probabilities fails when an adversarial attack preserves the model's predicted token. To overcome this limitation, we introduce the Hidden-State Gradient Norm score for adversarial behavior attribution and prove that its ranking is consistent with that of the full adversarial loss gradient, providing a theoretical guarantee for accurate localization. GTM requires only a single forward-backward pass to identify and zero out a small number of high-scoring tokens, effectively disrupting the adversarial attack path. Extensive experiments on prompt injection and multimodal jailbreak attacks demonstrate that our approach reduces attack success rates (ASR) to near zero while preserving model utility with negligible computational overhead.
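To make the localize-and-neutralize pipeline concrete, the following is a minimal PyTorch sketch of the idea as described in the abstract. It is an illustrative interpretation, not the paper's implementation: the toy model, the scalar surrogate used in place of the full adversarial loss, and names such as `ToyMultimodalBlock`, `hidden_state_gradient_norm_scores`, and `suppress_top_k` are assumptions introduced here for clarity.

```python
# Illustrative sketch (assumed interpretation): score image tokens by the norm of
# a hidden-state gradient, then zero out the top-k highest-scoring tokens.
import torch
import torch.nn as nn

torch.manual_seed(0)


class ToyMultimodalBlock(nn.Module):
    """Stand-in for the image-token branch of a multimodal LLM (assumption)."""

    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.mixer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

    def forward(self, image_tokens):
        # image_tokens: (batch, num_tokens, dim) -> hidden states of the same shape
        return self.mixer(self.proj(image_tokens))


def hidden_state_gradient_norm_scores(model, image_tokens):
    """Score each image token by the gradient norm of a hidden-state surrogate.

    The squared norm of all hidden states is used here as an assumed surrogate
    scalar; the paper's exact Hidden-State Gradient Norm definition may differ.
    """
    tokens = image_tokens.clone().detach().requires_grad_(True)
    hidden = model(tokens)
    surrogate = hidden.pow(2).sum()
    surrogate.backward()
    # Per-token score: L2 norm of the gradient over the embedding dimension.
    return tokens.grad.norm(dim=-1)  # shape (batch, num_tokens)


def suppress_top_k(image_tokens, scores, k=5):
    """Neutralize the k highest-scoring tokens by zeroing their embeddings."""
    masked = image_tokens.clone()
    top_idx = scores.topk(k, dim=-1).indices               # (batch, k)
    batch_idx = torch.arange(masked.size(0)).unsqueeze(-1)  # (batch, 1)
    masked[batch_idx, top_idx] = 0.0
    return masked


# Single forward-backward pass, then masking of the highest-scoring tokens.
model = ToyMultimodalBlock()
image_tokens = torch.randn(2, 32, 64)  # e.g. 32 image tokens per example
scores = hidden_state_gradient_norm_scores(model, image_tokens)
clean_tokens = suppress_top_k(image_tokens, scores, k=5)
print(scores.shape, clean_tokens.shape)
```

In this sketch the scoring requires exactly one forward and one backward pass, and suppression is a cheap indexing operation, which is consistent with the negligible-overhead claim; only the choice of surrogate scalar would need to be replaced by the paper's actual attribution objective.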