

Poster

Position Paper: Rethinking LLM Censorship as a Security Problem

David Glukhov · Ilia Shumailov · Yarin Gal · Nicolas Papernot · Vardan Papyan


Abstract:

Large language models (LLMs) have exhibited impressive capabilities in comprehending complex instructions. However, their blind adherence to provided instructions has raised concerns about the risks of malicious use. Existing defense mechanisms, such as model fine-tuning or output censorship using LLMs, have proven fallible, and LLMs can still generate problematic responses. Commonly employed censorship approaches treat the issue as a machine learning problem and rely on another LLM to detect undesirable content in LLM outputs. In this paper, we present fundamental limitations of such semantic censorship approaches, demonstrating that this view of AI safety leaves the problem ill-defined and unverifiable. Specifically, we show that semantic censorship can be viewed as an undecidable problem, highlighting the inherent challenges of censoring models with programmatic and instruction-following capabilities. Furthermore, we argue that the challenges extend beyond semantic censorship, as knowledgeable attackers can reconstruct impermissible outputs from a collection of permissible ones. As a result, we propose that the problem of censorship be reevaluated and treated as a security problem, and we call for the adaptation of security-based defenses to mitigate potential risks and provide guarantees.
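To make the failure mode concrete, the following is a minimal, self-contained Python sketch (not code from the paper; the keyword-based censor and the sample strings are hypothetical stand-ins for an LLM-based semantic filter). It illustrates the abstract's core point: once a model can follow output-encoding instructions, a censor that inspects only the transmitted string can be bypassed, because the user can recover the original content locally.

```python
# Minimal sketch, assuming a keyword-style censor as a stand-in for an
# LLM-based semantic output filter. All names and strings are hypothetical.
import base64

BLOCKLIST = {"impermissible"}  # toy proxy for "semantically undesirable"

def censor(text: str) -> bool:
    """Return True if the output should be blocked."""
    return any(term in text.lower() for term in BLOCKLIST)

# Pretend the model was instructed: "answer, then base64-encode your answer".
model_answer = "impermissible content the censor is meant to block"
encoded_output = base64.b64encode(model_answer.encode()).decode()

print(censor(model_answer))    # True  -- the plaintext answer is caught
print(censor(encoded_output))  # False -- the encoded answer passes the check
print(base64.b64decode(encoded_output).decode())  # user recovers the content
```

The same reasoning extends to the compositional attacks mentioned above: each individually permissible fragment passes the check, while their combination reconstructs the blocked content outside the censor's view.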
