Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units
Abstract
Mechanistic interpretability has successfully identified functional circuits in Large Language Models (LLMs), yet the causal origins of these circuits in the training data remain poorly understood. We bridge this gap by introducing Mechanistic Data Attribution (MDA), a scalable framework that uses influence functions to trace the formation of specific interpretable units back to individual training samples. Through extensive pre-training experiments on the Pythia family, we causally validate that removing a small fraction of high-influence samples significantly hinders the emergence of targeted heads, whereas augmenting them accelerates formation; random interventions replicate neither effect. Leveraging MDA, we reveal that highly repetitive structured data, such as LaTeX and HTML, acts as a "catalyst" that markedly accelerates the emergence of induction heads. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model's in-context learning (ICL) capability, providing direct causal evidence for the long-hypothesized functional link between induction heads and ICL. Finally, we propose a mechanistic data augmentation pipeline that builds on these insights to consistently accelerate mechanistic convergence across diverse model scales, offering a principled methodology for understanding and steering the fine-grained development of LLM behaviors.
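As a rough illustration of the attribution step summarized above, the sketch below scores training samples by a first-order influence approximation (a gradient dot product against a differentiable proxy for induction-head strength) and ranks them as candidates for removal or upsampling. The toy model, the `induction_metric` proxy, and the omission of the inverse-Hessian term are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Minimal sketch of influence-style attribution for an interpretability target.
# Everything here (the toy model, `induction_metric`, the gradient-dot-product
# approximation in place of a full inverse-Hessian influence function) is an
# illustrative assumption, not the paper's implementation.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy "model": a single linear layer standing in for an attention head's parameters.
model = nn.Linear(8, 8)

def induction_metric(model: nn.Module) -> torch.Tensor:
    """Hypothetical differentiable proxy for induction-head strength."""
    probe = torch.randn(4, 8)
    return model(probe).tanh().mean()

def per_sample_loss(model: nn.Module, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return nn.functional.mse_loss(model(x), y)

def flat_grad(output: torch.Tensor, model: nn.Module) -> torch.Tensor:
    """Flatten gradients of a scalar output w.r.t. all model parameters."""
    grads = torch.autograd.grad(output, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

# First-order influence score: <grad of target metric, grad of per-sample loss>.
# A full influence function would additionally apply an inverse-Hessian product.
metric_grad = flat_grad(induction_metric(model), model)

train_samples = [(torch.randn(8), torch.randn(8)) for _ in range(32)]
scores = []
for i, (x, y) in enumerate(train_samples):
    loss_grad = flat_grad(per_sample_loss(model, x, y), model)
    scores.append((torch.dot(metric_grad, loss_grad).item(), i))

# High-influence samples: candidates for removal (ablation) or upsampling (augmentation).
top_k = sorted(scores, reverse=True)[:5]
print("most influential sample indices:", [i for _, i in top_k])
```

In practice the inverse-Hessian term is typically approximated (e.g., with iterative or low-rank methods) rather than dropped; the gradient-dot-product form is used here only to keep the sketch short and self-contained.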