Broadening the Backdoor Basin: Understanding LLM Backdoor Collapse and Making Backdoors Persistent
Abstract
Large Language Models (LLMs) have been shown to be vulnerable to backdoor attacks, yet we observe that many LLM backdoors do not survive subsequent supervised fine-tuning (SFT) by end users. In this work, we provide a geometric explanation: by probing the backdoor objective under controlled weight perturbations, we find that conventional poisoning often drives the backdoor loss into a narrow, sharp basin; consequently, even modest parameter drift induced by downstream SFT can push the model out of the low-loss, high attack-success-rate (ASR) region, leading to rapid backdoor forgetting. Motivated by this insight, we propose BAD-BOOM, a resilient backdoor attack via broader smoothness minimization, which explicitly broadens and smooths the backdoor basin. BAD-BOOM extends sharpness-aware minimization with a Fisher-induced ellipsoidal constraint that allocates larger perturbation budgets to backdoor-sensitive parameters, encouraging solutions whose neighborhoods also maintain low backdoor loss. Across two threat settings (sentiment steering and targeted refusal), three attacks (AddSent, Sleeper, VPI), three open-source LLMs, and three trigger-free downstream SFT tasks (SST-2, GSM8K, and instruction following), BAD-BOOM consistently preserves high ASR while maintaining competitive clean-task utility.
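A minimal sketch of the broadened-basin objective implied above, written in our own notation (the exact constraint used by BAD-BOOM may differ): letting $\mathcal{L}_{\mathrm{bd}}$ denote the backdoor (poisoning) loss and $F(\theta)$ a diagonal Fisher information estimate of that loss, the attacker solves

$$\min_{\theta}\;\max_{\epsilon:\;\epsilon^{\top} F(\theta)^{-1}\epsilon \,\le\, \rho^{2}}\;\mathcal{L}_{\mathrm{bd}}(\theta+\epsilon),$$

a sharpness-aware objective in which the ellipsoidal constraint loosens along backdoor-sensitive directions (large Fisher values), granting them a larger perturbation budget, so the returned minimizer lies in a neighborhood whose backdoor loss remains low under the weight drift that downstream SFT induces.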