

Oral in Workshop: 2nd ICML Workshop on New Frontiers in Adversarial Machine Learning

Baselines for Identifying Watermarked Large Language Models

Keywords: security, cryptography, watermarks, large language models


Abstract:

We consider the emerging problem of identifying the presence and use of watermarking schemes in widely used, publicly hosted, closed-source large language models (LLMs). That is, rather than determining whether a given text was generated by a watermarked language model, we seek to answer the question of whether the model itself is watermarked. To do so, we introduce a suite of baseline algorithms for identifying watermarks in LLMs that rely on analyzing the distributions of output tokens and logits produced by watermarked and unmarked LLMs. Notably, watermarked LLMs tend to produce distributions that diverge qualitatively and identifiably from those of standard models. Furthermore, we investigate the identifiability of watermarks at varying strengths and consider the tradeoffs of each of our identification mechanisms with respect to the watermarking scenario.
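The paper's baseline algorithms are not reproduced here, but the core idea described in the abstract, comparing the output-token distribution of a candidate model against an unmarked reference, can be sketched briefly. The snippet below is an illustrative sketch only: the sampling functions (sample_unmarked, sample_watermarked), the toy green-list watermark, and all parameter values are hypothetical stand-ins for repeated queries to a hosted model, not the authors' implementation.

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) gives KL divergence KL(p || q)

VOCAB = 1000
rng = np.random.default_rng(0)
base_logits = rng.normal(size=VOCAB)  # stand-in for a model's next-token logits


def token_frequency(samples, vocab_size):
    """Empirical next-token frequency over repeated completions of one prompt."""
    counts = np.bincount(samples, minlength=vocab_size).astype(float)
    return (counts + 1e-9) / (counts.sum() + 1e-9 * vocab_size)


def sample_unmarked(n):
    """Hypothetical stand-in for sampling next tokens from an unmarked model."""
    p = np.exp(base_logits) / np.exp(base_logits).sum()
    return rng.choice(VOCAB, size=n, p=p)


def sample_watermarked(n, delta=2.0, green_frac=0.5):
    """Toy green-list watermark: boost a fixed subset of tokens by delta."""
    green = rng.choice(VOCAB, size=int(green_frac * VOCAB), replace=False)
    logits = base_logits.copy()
    logits[green] += delta
    p = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(VOCAB, size=n, p=p)


# Compare the candidate model's token distribution against an unmarked reference.
ref = token_frequency(sample_unmarked(50_000), VOCAB)
test = token_frequency(sample_watermarked(50_000), VOCAB)
ctrl = token_frequency(sample_unmarked(50_000), VOCAB)

print("KL(watermarked || reference):", entropy(test, ref))
print("KL(unmarked   || reference):", entropy(ctrl, ref))
```

On this toy setup, the divergence of the watermarked samples from the reference is markedly larger than that of a second unmarked sample, which is the kind of qualitative, identifiable divergence the abstract refers to; the paper's actual baselines operate on real token and logit distributions rather than this simulation.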
