Do Language Models Track Entities Across State Changes?
Abstract
Entity tracking (ET), the ability to keep track of entity states, is a fundamental skill that underlies complex reasoning. An increasing amount of work investigates how transformer language models (LMs) solve entity binding without state changes; however, there is limited understanding of how non-toy LMs address ET problems of realistic difficulty expressed in natural language. To this end, we investigate the mechanisms underlying ET in more complex scenarios featuring multiple state-changing operations. We find that LMs do not build world states incrementally across tokens or layers, but instead retrieve and aggregate relevant information at the last token, once the query becomes evident. We further investigate the mechanisms of individual operations (PUT, REMOVE, MOVE) to elucidate how exactly tracking is implemented non-incrementally. Surprisingly, LMs implement the REMOVE operation with a fragile global suppression tag; we provide a mechanistic intervention that nullifies this tag to partially address this fragility. This global removal mechanism also predicts several additional failure modes, which we confirm behaviorally. Our findings suggest directions for training and finetuning toward more robust tracking mechanisms, and furthermore offer a mechanistic hypothesis for why chain-of-thought prompting improves ET.