Decoy for the Judge: Disrupting Multi-Turn Jailbreaks using Semantics-Preserving Output Rewriting
Abstract
Multi-turn jailbreak attacks have emerged as a powerful threat to LLM safety: they leverage feedback from auxiliary judge models to iteratively refine harmful queries. Existing defenses mainly focus on detecting or blocking harmful content at the final turn, leaving the judge-driven refinement loop intact and allowing attackers to extract informative feedback from intermediate interactions. We introduce Decoy for the Judge (DJ), a semantics-preserving output rewriting approach that intervenes directly in this feedback loop by modifying the victim LLM’s responses before the attacker’s judge evaluates them. By misaligning the judge’s feedback signal, DJ degrades the attacker’s iterative optimization while leaving the semantic content of the original response intact. To enable robust and transferable rewriting, we construct a dataset that captures fine-grained distinctions among semantically similar responses carrying different harmfulness signals. Using this dataset, DJ is trained with supervised fine-tuning (SFT) followed by direct preference optimization (DPO), allowing it to reliably manipulate judge feedback across diverse judge LLMs. Experiments on HarmBench show that DJ substantially reduces the success rate of state-of-the-art multi-turn jailbreaks while preserving performance on benign multi-turn benchmarks.