DART: Distribution-Aware Adaptive Relational Transfer for Adversarial Attacks against Closed-Source MLLMs
Abstract
This paper studies the critical problem of targeted adversarial attacks against closed-source MLLMs, which aims to generate highly transferable adversarial samples using open-source MLLMs. Previous approaches typically focus on maximizing the similarity of latent representations between adversarial samples and target samples. However, these approaches tend to overfit to specific target samples, severely limiting their generalization to closed-source MLLMs. To address this, we propose a novel approach named Distribution-aware Adaptive Relational Transfer (DART) for adversarial attacks against closed-source MLLMs. The core of DART is to adopt a statistical lens that characterizes the intrinsic semantics of images for more generalized and robust alignment. In particular, each augmented image is treated as a sample from the intrinsic distribution of the original image. We then employ the non-parametric Energy Distance to measure the divergence between distributions, which naturally serves as the semantic alignment objective in the hidden space. To further enhance transferability to specific target models, we learn a graph neural network (GNN) to model the complex transferability relations between source and target MLLMs and to adaptively select surrogate models that maximize transferability across diverse targets. Extensive experiments on benchmark datasets validate the superior robustness and effectiveness of the proposed DART in comparison to various competing baselines.
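To make the distribution-alignment objective concrete, the following is a minimal sketch of the non-parametric Energy Distance between two sets of hidden-space embeddings (e.g., features of augmented adversarial images versus augmented target images). The function name and NumPy-based formulation are illustrative, not taken from the paper's implementation:

```python
import numpy as np

def energy_distance(x, y):
    """Non-parametric Energy Distance between two samples
    (rows = embeddings, columns = feature dimensions):
        ED(X, Y) = 2*E||X - Y|| - E||X - X'|| - E||Y - Y'||
    It is zero iff the two distributions coincide."""
    def mean_pairwise(a, b):
        # Mean Euclidean distance over all pairs (a_i, b_j).
        diff = a[:, None, :] - b[None, :, :]
        return np.linalg.norm(diff, axis=-1).mean()
    return 2 * mean_pairwise(x, y) - mean_pairwise(x, x) - mean_pairwise(y, y)
```

In an attack of this kind, such a divergence would be minimized over the adversarial perturbation so that the adversarial image's augmentation distribution matches that of the target image in the surrogate model's hidden space.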