RLCracker: Evaluating the Worst-Case Vulnerability of LLM Watermarks with Adaptive RL Attacks
Abstract
Large language model (LLM) watermarking has shown promise for detecting AI-generated content and mitigating misuse, with prior work claiming robustness against paraphrasing and text editing. In this paper, we argue that existing evaluations are not sufficiently adversarial, obscuring critical vulnerabilities and overstating security. To address this, we introduce the adaptive robustness radius, a formal metric that quantifies the worst-case resilience of a watermark against adaptive adversaries. By lifting the paraphrase space into a KL-divergence ball, we approximate this radius and show theoretically that optimizing the attack context and model parameters can substantially reduce the approximated radius, leaving watermarks highly vulnerable to paraphrase attacks. Leveraging this insight, we propose RLCracker, a reinforcement learning (RL)-based adaptive attack that erases watermarks while preserving semantic fidelity. RLCracker requires only a limited number of watermarked examples and no access to the detector. Despite this weak supervision, it enables a 3B model to achieve a 98.5% removal success rate with minimal semantic shift on 1,500-token Unigram-watermarked texts after training on only 100 short samples, far exceeding the 6.75% achieved by GPT-4o, and it generalizes across five model sizes and over ten watermarking schemes.
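To fix ideas, a minimal sketch of how such a radius could be written down is given below; the symbols (a reference paraphrase distribution P_{x_w} around the watermarked text x_w, a detector score D, a detection threshold \tau, and an evasion tolerance \delta) are illustrative assumptions, not necessarily the paper's own notation.

% Illustrative formalization only; the notation here is assumed, not the paper's definition.
\[
  r(x_w) \;=\; \inf_{Q}\,\Bigl\{\, \mathrm{KL}\bigl(Q \,\|\, P_{x_w}\bigr)
  \;:\; \Pr_{x' \sim Q}\bigl[\, D(x') \ge \tau \,\bigr] \le \delta \,\Bigr\}
\]
% Reading: r(x_w) is the smallest KL budget an adaptive paraphrase distribution Q
% needs in order to evade the detector with probability at least 1 - \delta.
% A small r(x_w) means a low-divergence, semantics-preserving paraphrase already
% defeats detection, which is the regime an adaptive attacker such as RLCracker
% seeks to reach without ever querying the detector.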