On the Limits of LLM Adaptability: Impact of LLM Pre-Training on Annotation Task Performance
Abstract
Pre-trained Large Language Models (LLMs) are increasingly used for zero-shot annotation and LLM-as-a-judge tasks, yet their reliability hinges on how pre-trained priors interact with user-provided instructions. We investigate three dimensions of this interaction: (1) how an LLM’s familiarity with the data and task definitions affects performance, (2) whether additional information in prompts can correct zero-shot errors or whether initial decisions remain “sticky” (decision stickiness), and (3) model susceptibility to misaligned task definitions. Through experiments on toxicity detection across diverse datasets (spanning social media, gaming, news, and forums) using both dense and mixture-of-experts models, we find that nearly two-thirds of zero-shot errors resist correction: the overall rescue rate (the fraction of initial errors corrected by prompting) is only 36.4%. High-confidence errors prove especially resistant. When given misaligned definitions, LLMs follow them while reporting confidence levels indistinguishable from the aligned condition. Crucially, we introduce Definition-Specific Familiarity (DSF), a metric that measures the alignment between a model’s internal concept and the task definition. After controlling for dataset-level confounds, DSF shows a positive association with model performance (partial r = +0.34), whereas text memorization, as measured by ROUGE-L, shows no positive association (partial r = −0.19). Overall, these findings point to clear limits on prompt-based correction in annotation tasks and underscore the importance of definition alignment over text-level memorization.
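For concreteness, the rescue rate referenced above can be restated as a formula (the symbols here are illustrative, not the paper’s own notation):
\[
\text{rescue rate} \;=\; \frac{\bigl|\{\text{zero-shot errors corrected after prompting}\}\bigr|}{\bigl|\{\text{zero-shot errors}\}\bigr|} \;=\; 0.364,
\]
so the remaining 63.6% of initial errors, nearly two-thirds, persist despite the added prompt information.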