Skip to yearly menu bar Skip to main content


Poster
in
Workshop: 2nd ICML Workshop on New Frontiers in Adversarial Machine Learning

Baselines for Identifying Watermarked Large Language Models

Leonard Tang · Gavin Uberti · Tom Shlomi

Keywords: [ security ] [ cryptography ] [ watermarks ] [ large language models ]


Abstract:

We consider the emerging problem of identifying the presence of watermarking schemes in publicly hosted, closed source large language models (LLMs). Rather than determine if a given text is generated by a watermarked language model, we seek to answer the question of if the model itself is watermarked. We introduce a suite of baseline algorithms for identifying watermarks in LLMs that rely on analyzing distributions of output tokens and logits generated by watermarked and unmarked LLMs. Notably, watermarked LLMs tend to produce token distributions that diverge qualitatively and identifiably from standard models. Furthermore, we investigate the identifiability of watermarks at varying strengths and consider the tradeoffs of each of our identification mechanisms with respect to watermarking scenario.

Chat is not available.