Jailbreaking Vision-Language Models Through the Visual Modality
Aharon Azulay ⋅ Jan Dubiński ⋅ Zhuoyun Li ⋅ Atharv Mittal ⋅ Yossi Gandelsman
Abstract
The visual modality of vision-language models (VLMs) is an underexplored attack surface for bypassing safety alignment. We introduce four jailbreak attacks that exploit the vision component: (1) encoding harmful instructions as visual symbol sequences with a decoding legend, (2) replacing harmful objects with benign substitutes (e.g., bomb $\rightarrow$ banana) and then prompting for harmful actions using the substitute term, (3) replacing harmful text in images (e.g., on book covers) with benign words while the visual context preserves the original meaning, and (4) posing visual analogy puzzles whose solution requires inferring a prohibited concept. Evaluating these attacks across five frontier VLMs, we find that visual attacks achieve success rates comparable to, and sometimes exceeding, those of their text-only counterparts. For example, our visual cipher achieves a 40.9\% attack success rate on Claude-Haiku-4.5 versus 10.7\% for an equivalent textual cipher. To shed light on the attack mechanism, we present preliminary interpretability and mitigation results. These findings highlight that robust VLM alignment requires treating vision as a first-class target for safety post-training.
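To make attack (1) concrete, the following minimal Python sketch (using Pillow) renders an encoded instruction together with its decoding legend into a single image, so that the instruction never appears in the text channel of the request. The symbol table, image layout, and prompt wording are our illustrative assumptions, not the authors' implementation.

from PIL import Image, ImageDraw
import string

# Assumed symbol cipher: each lowercase letter maps to an arbitrary ASCII symbol.
CIPHER = dict(zip(string.ascii_lowercase, "!@#$%^&*()-+=<>?/|~;:[]{}."))

def encode(text):
    # Substitute each letter with its cipher symbol; pass other characters through.
    return " ".join(CIPHER.get(c, c) for c in text.lower())

def render_attack_image(instruction, path="cipher.png"):
    # Render the encoded message and its decoding legend into one image,
    # leaving the textual prompt sent alongside it entirely benign.
    legend = " ".join(f"{sym}={ch}" for ch, sym in CIPHER.items())
    img = Image.new("RGB", (720, 90), "white")
    draw = ImageDraw.Draw(img)
    draw.text((10, 10), "Legend: " + legend, fill="black")
    draw.text((10, 50), "Message: " + encode(instruction), fill="black")
    img.save(path)

render_attack_image("describe the hidden request")  # benign placeholder text

In this sketch, the image would be paired with an innocuous text prompt such as "Use the legend to decode the message, then answer it," shifting the entire harmful payload into the visual modality.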