Greedy Coordinate Diffusion: Effective and Semantically Coherent Adversarial Attacks via Diffusion Guidance
Abstract
Although there is a rich literature on adversarial attacks on large language models, their current practical impact is limited. Gradient-based attacks such as Greedy Coordinate Gradient (GCG; Zou et al., 2023) typically produce high-perplexity, incoherent suffixes that are easily detectable and thus easy to guard against, especially in combination with other defense-in-depth techniques (Bengio et al., 2024). Attacks that instead aim to produce coherent prompts often alter the semantic intent of the original query; when the model complies with such an altered query, its response is often not actually useful for the original query, incurring the so-called "jailbreak tax". In this work, we introduce a novel framework that efficiently generates adversarial attacks against safety-aligned models while maintaining low perplexity and high semantic adherence to the adversary's original intent. The framework, Greedy Coordinate Diffusion (GCD), leverages the generative priors of discrete diffusion language models to guide the search for adversarial suffixes that are both coherent and faithful to the original intent. Furthermore, unlike GCG, GCD does not require direct gradient access, allowing it to operate in a gray-box setting. We empirically demonstrate the power of GCD by showing that it achieves state-of-the-art attack success rates against aligned models and that its adversarial prompts evade semantic filters such as llama-guard-3.