

Oral presentation at the 2nd ICML Workshop on New Frontiers in Adversarial Machine Learning

Visual Adversarial Examples Jailbreak Aligned Large Language Models

Xiangyu Qi · Kaixuan Huang · Ashwinee Panda · Mengdi Wang · Prateek Mittal

Keywords: [ Foundation Models ] [ Visual Language Models ] [ Adversarial Examples ] [ Large Language Models ]


Abstract:

The growing interest in integrating vision into Large Language Models (LLMs), exemplified by Visual Language Models (VLMs) like Flamingo and GPT-4, is steering a convergence of vision and language foundation models. Yet, risks associated with this integration are largely unexamined. This paper sheds light on the security and safety implications of this trend. First, we underscore that the continuous and high-dimensional nature of the additional visual input makes it a weak link against adversarial attacks, representing an expanded attack surface of vision-integrated LLMs. Second, we highlight that the versatility of LLMs also presents visual attackers with a wider array of achievable adversarial objectives, extending the implications of security failures beyond mere misclassification. As an illustration, we present a case study in which we exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision. To our surprise, we discover that a single visual adversarial example can universally jailbreak an aligned model, inducing it to heed a wide range of harmful instructions and generate harmful content far beyond merely imitating the derogatory corpus used to optimize the adversarial example. Our study underscores the escalating adversarial risks associated with the pursuit of multimodality. More broadly, our findings connect the long-studied fundamental adversarial vulnerabilities of neural networks to the nascent field of AI alignment. The presented attack suggests a fundamental adversarial challenge for AI alignment, especially in light of the emerging trend towards multimodality in frontier foundation models.
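The abstract describes optimizing a single adversarial image against a VLM's safety guardrail. The paper's exact optimization procedure is not given here, so the following is only a minimal sketch of the standard projected-gradient-descent (PGD) attack family such work typically builds on; the `grad_fn`, the toy quadratic loss, and the `eps`/`alpha` budgets are illustrative assumptions, not the authors' method. In the real setting, `grad_fn` would be the gradient of the negative log-likelihood of a target (e.g., harmful) corpus under the VLM, taken with respect to the input image pixels.

```python
import numpy as np

def pgd_attack(x0, grad_fn, eps=8/255, alpha=1/255, steps=50):
    """Projected gradient descent inside an L-infinity ball around x0.

    grad_fn(x) returns the gradient of the attacker's loss w.r.t. x
    (in a real attack: d/dx of the VLM's loss on a target corpus).
    """
    x = x0.copy()
    for _ in range(steps):
        g = grad_fn(x)
        x = x - alpha * np.sign(g)          # signed-gradient descent step
        x = np.clip(x, x0 - eps, x0 + eps)  # project back into the eps-ball
        x = np.clip(x, 0.0, 1.0)            # keep pixels in valid range
    return x

# Toy stand-in for a VLM loss: a quadratic bowl whose minimum lies
# outside the eps-ball, so the attack pushes to the ball's boundary.
target = np.full((3, 4, 4), 0.9)
grad_fn = lambda x: 2.0 * (x - target)

x0 = np.full((3, 4, 4), 0.5)
x_adv = pgd_attack(x0, grad_fn)
print(np.abs(x_adv - x0).max() <= 8/255 + 1e-9)  # prints True
```

The clipping steps are what make the perturbation "imperceptible by budget": the adversarial image stays within `eps` of the clean image in every pixel, which is why the abstract can speak of a single small perturbation rather than an arbitrary replacement image.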
