ICML Large Language Models as Misleading Agents in Conversation

Poster
in
Workshop: Next Generation of AI Safety

Large Language Models as Misleading Agents in Conversation

Betty L Hou · Kejian Shi · Jason Phang · Steven Adler · James Aung · Rosie Campbell

Keywords: [ large language models ] [ persuasion ] [ manipulation ] [ deception ] [ alignment ]

[ Abstract ] [ Project Page ]

[ OpenReview]

Abstract:

Large Language Models (LLMs) are able to provide assistance on a wide range of information-seeking tasks. However, outputs may be misleading, whether unintentionally or in cases of intentional deception. We investigate the ability of LLMs to be deceptive in the context of providing assistance on a reading comprehension task to another LLM, comparing outcomes of when the model is prompted to provide truthful assistance versus when it is prompted to be subtly misleading, or when the model has been provided incorrect information. Our experiments show that GPT-4 can successfully mislead both GPT-3.5-turbo and GPT-4, with deceptive assistants resulting in reduced accuracy of up to 23% on the task compared to when a truthful assistant is used. We also find that providing the information-seeking model with additional context from the passage mitigates the influence of the deceptive model. This work highlights the risks associated with LLMs disseminating misleading information and the importance of developing robust methods to detect and prevent AI-driven deception.

Chat is not available.

Poster in Workshop: Next Generation of AI Safety

Large Language Models as Misleading Agents in Conversation

Betty L Hou · Kejian Shi · Jason Phang · Steven Adler · James Aung · Rosie Campbell

Poster
in
Workshop: Next Generation of AI Safety