Jailbreaking Open-Weight LLM Defenses Without Fine-tuning
Abstract
Recent defenses for open-weight LLMs are designed to persist after adversarial fine-tuning. Prior work has shown that these defenses can be bypassed by adjusting the fine-tuning attack; we show that they are also vulnerable to even simpler attacks. We test two such attacks, abliteration and prefilling, both of which require minimal compute and no gradient computation. On three harmfulness evaluation benchmarks (BeaverTails, HarmBench, AdvBench), these attacks raise the attack success rate against open-weight defenses from under 10% to between 16% and 97%. Our results suggest that current open-weight defenses have not moved beyond shallow alignment, and that evaluations of such defenses should include a broader set of attacks than fine-tuning alone.