Rapid Poison: Practical Poisoning Attacks Against the Rapid Response Framework
Abstract
The Rapid Response (RR) framework (Peng et al., 2024), deployed in production systems including Anthropic’s ASL-3 safeguards (Anthropic, 2025), dynamically adapts jailbreak detection classifiers by generating synthetic training data from emerging attacks. We reveal that prompt injection can infiltrate this pipeline to deliver poisoned samples into the classifier’s training set, enabling two attack objectives: (I) targeted poisoning attacks that create false positives by causing harmless samples bearing a specific desired feature (e.g., certain formatting, subject, or keyword) to be classified as jailbreaks; (II) concept-based backdoor attacks that induce false negatives on jailbreak inputs whenever the backdoor trigger is present, generalizing even to jailbreaks from attack strategies the defender explicitly trained against. Importantly, our threat model restricts adversaries to modifying only jailbreak samples (not benign data or labels), a constraint unexplored by prior work that makes the second objective particularly challenging. We address it with the Omission Attack, which exploits a new phenomenon: when trained on concept-absent unsafe samples, the classifier misassociates that concept’s presence with the safe label. Both attacks flip nearly all target labels at a poisoning rate of only 1%. Code: anonymous.tbd.