Position: Causality is Key for Interpretability Claims to Generalise
Abstract
Interpretability research on large language models (LLMs) has produced methods that align model components with high-level concepts, yet their use has been accompanied by recurring failures: findings that do not generalise and causal language that outruns the evidence. Our position is that Pearl’s causal hierarchy formally defines what constitutes a good alignment, what data or assumptions it requires, and what inferences it supports. Specifically, observations of model behaviour support only associational claims; interventions enable cause-effect claims, but not necessarily predictions of model behaviour; and counterfactuals, that is, predictions of behaviour on unseen examples, are often unverifiable in current studies. We show how interpretability research can benefit from causal representation learning (CRL), which provides tools for provably extracting semantic variables and their relationships from activations, and we outline practical requirements for generalisable insights: robustness to distribution shifts, sensitivity analysis of assumptions, and compositionality of interventions. Our diagnostic framework helps practitioners select appropriate methods and mitigate failures, ensuring that claims match evidence and that findings generalise.
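As an illustrative sketch of the three rungs invoked above (standard Pearl notation, not this paper's own formalism):

\begin{align*}
\text{association (seeing):} \quad & P(y \mid x) \\
\text{intervention (doing):} \quad & P(y \mid \mathrm{do}(x)) \\
\text{counterfactual (imagining):} \quad & P(y_x \mid x', y')
\end{align*}

Here $P(y_x \mid x', y')$ asks what $Y$ would have been had $X$ been set to $x$, given that $X = x'$ and $Y = y'$ were actually observed; estimating such quantities requires assumptions beyond what interventional data alone provide, which is why counterfactual claims are the hardest to verify.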