Attacking Gray-Box Large Vision-Language Models with Adaptive SVD-Structured Adversarial Alignment
Abstract
Large vision-language models (LVLMs) have demonstrated remarkable capabilities across a wide range of multimodal reasoning tasks. However, recent research shows that they are susceptible to adversarial examples. Existing LVLM attack methods generally operate in the white- or black-box setting and rely heavily on full-model gradients or elaborate transfer strategies, incurring large resource costs. To this end, this paper focuses on a more efficient gray-box attack setting that accesses only the LVLM's vision encoder. Instead of using target images as the adversarial guidance, our main goal is to perturb the visual features to best match more natural attacker-chosen target texts. Specifically, we develop a global semantic alignment module that projects the visual features onto an SVD-structured subspace spanned by the textual semantics. We further align fine-grained visual features with multi-context semantic texts, extended by LLMs, by matching their discrete distributions via optimal transport. Extensive experiments demonstrate the superiority of the proposed method, and our attack further achieves strong transferability across various LVLMs through CLIP-aware transfer designs.
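To make the global semantic alignment step concrete, the sketch below gives one plausible PyTorch reading of the attack loop: embeddings of the attacker-chosen target texts are factored by SVD to obtain a textual semantic subspace, and a norm-bounded image perturbation is optimized so the vision encoder's feature concentrates in that subspace. All identifiers here (`vision_encoder`, `text_feats`, the PGD hyperparameters) are illustrative assumptions rather than the paper's actual implementation, and the optimal-transport term over LLM-extended texts is only indicated by a comment.

```python
import torch

# Hypothetical sketch of the gray-box attack idea, assuming a frozen
# CLIP-style `vision_encoder` and precomputed `text_feats` with one row
# per attacker-chosen target text. Names and hyperparameters are
# illustrative, not the paper's implementation.

def svd_subspace(text_feats: torch.Tensor, k: int) -> torch.Tensor:
    """Orthonormal basis of the top-k textual semantic subspace."""
    # text_feats: (num_texts, dim); Vh rows are right singular vectors.
    _, _, Vh = torch.linalg.svd(text_feats, full_matrices=False)
    return Vh[:k].T  # (dim, k)

def attack(vision_encoder, image, text_feats,
           k=8, eps=8 / 255, alpha=1 / 255, steps=100):
    basis = svd_subspace(text_feats, k)          # (dim, k)
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        feat = vision_encoder(image + delta)     # (1, dim) visual feature
        feat = feat / feat.norm(dim=-1, keepdim=True)
        proj = feat @ basis                      # coordinates in text subspace
        # Global semantic alignment: maximize the energy of the visual
        # feature inside the textual subspace. The paper's additional
        # optimal-transport loss over LLM-extended texts would be added
        # to `loss` here.
        loss = -(proj ** 2).sum()
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # PGD step
            delta.clamp_(-eps, eps)              # L_inf perturbation budget
            delta.grad.zero_()
    return (image + delta).detach()
```

Because only the vision encoder is queried, this loop matches the stated gray-box assumption: no full-model gradients through the LLM are needed, only backpropagation through the (frozen) encoder.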