Lookahead-GCG: Improving Multi-Model Gradient-Based Jailbreaking Attacks via Nesterov Momentum
Abstract
Transferable jailbreaking attacks enable red-teaming of black-box large language models by optimizing adversarial prompts on open-source surrogates. A natural approach to improve transferability is multi-model training---optimizing against multiple source models simultaneously. Yet this approach has been largely abandoned, as it yields only marginal gains with standard optimizers. We argue the root cause is poor generalization: standard gradient descent lacks stability when aggregating gradients from diverse models. Since GCG and its variants~\citep{zou2023universal, jia2024improved, yang2025guiding} implicitly perform SGD in discrete token space, they inherit this instability in multi-model settings. We address this with \textbf{Lookahead-GCG}, which combines: (1) Stochastic Nesterov Accelerated Gradient (SNAG), whose lookahead mechanism reduces sensitivity to individual gradient updates and stabilizes multi-model optimization; (2) embedding-space momentum accumulation, which makes SNAG applicable to discrete token optimization; and (3) maximally distant initialization, which exploits SNAG's improved generalization by starting from a universally beneficial point. Experiments show our method achieves a 50.37\% attack success rate (ASR) on open-source and 34.03\% on closed-source LLMs, outperforming GCG and TransferAttack, with multi-model optimization contributing a +11.78\% gain.
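As a rough sketch of the mechanism described above (with illustrative notation of our own, not the paper's: $e_t$ for the embedding of the adversarial suffix, $v_t$ for the momentum buffer, $\mu$ and $\eta$ for momentum coefficient and step size, and $\mathcal{L}_m$ for the loss on source model $m$), one plausible form of the per-step SNAG update with embedding-space momentum is:
\begin{align*}
  \tilde{e}_t &= e_t + \mu\, v_t
      && \text{(lookahead point)} \\
  \bar{g}_t   &= \frac{1}{M}\sum_{m=1}^{M} \nabla_{e}\,\mathcal{L}_m(\tilde{e}_t)
      && \text{(gradient aggregated over $M$ source models)} \\
  v_{t+1}     &= \mu\, v_t - \eta\, \bar{g}_t
      && \text{(momentum accumulated in embedding space)} \\
  e_{t+1}     &\in \text{token candidates near } e_t + v_{t+1}
      && \text{(GCG-style discrete candidate selection)}
\end{align*}
The last step is only one plausible way to map the continuous lookahead update back to discrete tokens; the abstract does not specify the projection rule.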