Two Modalities Are Better Than One: Efficient Adversarial Purification via Multimodal Diffusion Models
Abstract
Adversarial purification uses generative models to project adversarially perturbed inputs back onto the clean data distribution, defending against unseen attacks without retraining classifiers. However, unimodal diffusion-based approaches struggle to preserve semantic consistency, while recent multimodal variants rely on computationally expensive adversarial training or distillation; both lines of work often lack theoretical guarantees. In this work, we propose MultiDAP, a novel framework that leverages multimodal diffusion models for efficient adversarial purification. MultiDAP first learns continuous, class-agnostic prompts from clean data to capture rich semantic priors, replacing rigid hand-crafted templates. Guided by these prompts, MultiDAP purifies adversarial inputs by minimizing a regularized DDPM loss for only a few steps (e.g., 5–20). We provide theoretical guarantees both for the likelihood improvement achieved via prompt learning and for the convergence of the purification process. Extensive experiments on CIFAR-10, CIFAR-100, and ImageNet-1K demonstrate that MultiDAP matches the robustness of state-of-the-art baselines while offering improved efficiency.