Do LLMs selectively encode the goal of an agent's reach?
Laura Ruis · Arduin Findeis · Herbie Bradley · Hossein A. Rahmani · Kyoung Whan Choe · Edward Grefenstette · Tim Rocktäschel

Fri Jul 28 03:15 PM -- 04:30 PM (PDT)
In this work, we investigate whether large language models (LLMs) exhibit one of the earliest Theory of Mind-like behaviors: selectively encoding the goal object of an actor's reach (Woodward, 1998). We prompt state-of-the-art LLMs with ambiguous examples that can be explained by either an object or a location being the goal of an actor's reach, and evaluate the model's bias. We compare the magnitude of the bias in three situations: i) an agent is acting purposefully, ii) an inanimate object is acted upon, and iii) an agent is acting accidentally. We find that two models show a selective bias for agents acting purposefully, but are biased differently than humans. Additionally, the encoding is not robust to semantically equivalent prompt variations. We discuss how this bias compares to the bias infants show and provide a cautionary tale about evaluating machine Theory of Mind (ToM). We release our dataset and code.
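To make the comparison across the three situations concrete, the sketch below shows one way a per-condition bias could be tallied from a model's answers. It is not the authors' released code or metric: the condition labels, the `object_bias` function, and the choice of "fraction of object answers minus 0.5" as the bias measure are all assumptions made for illustration, and the toy answers are invented solely to exercise the function.

```python
from collections import Counter, defaultdict


def object_bias(records):
    """Return, per condition, the fraction of 'object' answers minus 0.5.

    `records` is an iterable of (condition, choice) pairs, where choice is
    "object" or "location". Positive values mean a lean toward the object
    reading, negative values a lean toward the location reading.
    This measure is an illustrative assumption, not the paper's metric.
    """
    counts = defaultdict(Counter)
    for condition, choice in records:
        if choice not in ("object", "location"):
            raise ValueError(f"unexpected choice: {choice!r}")
        counts[condition][choice] += 1

    bias = {}
    for condition, tally in counts.items():
        total = tally["object"] + tally["location"]
        bias[condition] = tally["object"] / total - 0.5
    return bias


# Made-up answers purely to exercise the function; not results from the paper.
# The condition names are hypothetical labels for the three situations.
toy_records = [
    ("purposeful_agent", "object"),
    ("purposeful_agent", "object"),
    ("purposeful_agent", "object"),
    ("purposeful_agent", "location"),
    ("inanimate_object", "object"),
    ("inanimate_object", "location"),
    ("accidental_agent", "location"),
    ("accidental_agent", "location"),
]

toy_bias = object_bias(toy_records)
```

Under this assumed measure, a selective encoding of the goal object would appear as a larger positive value for the purposeful-agent condition than for the other two. Checking robustness to prompt variations would amount to recomputing the same quantity for each paraphrase of the prompt.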

Author Information

Laura Ruis (University College London)
Arduin Findeis (University of Cambridge)

I am a PhD candidate in the Department of Computer Science at the University of Cambridge. My research focuses on the evaluation of applied machine learning (ML) systems. Much of my work is centred around creating standardised benchmark tools for specific problems, to help accelerate progress on these problems. My current work focuses on the evaluation of language models. Previously, I also worked on the evaluation of (meta) reinforcement learning (RL) methods in the context of building control systems. I am part of the AI4ER UKRI Centre for Doctoral Training (CDT). Prior to joining my current PhD programme, I completed an MPhil in machine learning and machine intelligence in Cambridge and an undergraduate degree in mathematics at the University of Edinburgh.

Herbie Bradley (EleutherAI / University of Cambridge)
Hossein A. Rahmani (University College London)

PhD student at UCL

Kyoung Whan Choe (Carper AI)
Edward Grefenstette (Facebook AI Research & UCL)
Tim Rocktäschel (Facebook AI Research & University College London)

