ICML Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models

Poster in Workshop: Next Generation of AI Safety

Bang An · Sicheng Zhu · Ruiyi Zhang · Michael-Andrei Panaitescu-Liess · Yuancheng Xu · Furong Huang

Keywords: [ controllable text generation ] [ pseudo-harmful prompts ] [ safety alignment ] [ false refusals ] [ usability-safety trade-off ] [ LLM ]

Abstract:

Aligned large language models (LLMs) can falsely refuse pseudo-harmful user prompts, like "how to kill a mosquito," which seem harmful but are actually not. Frequent false refusals not only affect user experience but also cause the public to disdain the values alignment seeks to protect. In this paper, we propose the first method for auto-generating pseudo-harmful prompts, leveraging a white-box LLM to generate natural, varied, and controllable prompts. Using this method, we construct an evaluation dataset called PHTest, which is ten times larger than existing datasets, covers more false refusal patterns, and separately annotates controversial samples. We evaluate 14 models, including Claude 3, on PHTest, uncovering new insights due to its scale and fine-grained annotations. Additionally, we reveal a trade-off between false refusals and safety against jailbreak attacks. Our method and dataset can help developers evaluate and fine-tune safer and more usable LLMs.
