Different Usage of Shared Components Explains Behavioral Variance in LLMs
Zhuonan Yang ⋅ Jacob Xiaochen Li ⋅ Francisco Velez ⋅ Eric Todd ⋅ David Bau ⋅ Michael L. Littman ⋅ Stephen Bach ⋅ Ellie Pavlick
Abstract
One of the most common complaints about large language models (LLMs) is their prompt sensitivity: their ability to perform a task or answer a question correctly can depend unpredictably on how the question is posed. We investigate this variation by comparing two very different but commonly used styles of prompting: prompts using $\textit{instructions}$, which describe the task in natural language, and prompts using in-context $\textit{examples}$, which provide few-shot demonstration pairs that illustrate the task. We find that, despite large variation in performance as a function of the prompt, all of the prompts we study engage the same underlying mechanisms. Specifically, we identify task-specific heads that are interpretable in vocabulary space, which we dub $\textit{lexical task heads}$, and show that these heads are shared across prompting styles and are essential for triggering subsequent answer production. We further find that behavioral variation between prompts can be explained by the degree to which these heads are activated, and that failures are at least sometimes due to competing task representations that dilute the signal of the target task. Together, our results present an increasingly clear picture of how LLMs' internal representations can explain behavior that otherwise seems idiosyncratic to users and developers.