Beyond Benchmarks: Toward Causally Faithful Evaluation of Large Language Models
Abstract
Current LLM evaluations often conflate benchmark performance with intrinsic model capability. This conflation is misleading: observed outcomes arise from the entire evaluation system, including datasets, prompting methods, decoding parameters, and the software–hardware stack, rather than from the model alone. When this system is underspecified, attribution becomes unreliable; in practice, evaluation choices alone can induce accuracy swings of up to 70\%. The attribution challenge is compounded by the open-ended nature of LLM evaluation: questions span languages, domains, and usage styles, forming highly variable and implicitly shifting datasets. Consequently, strong performance on static benchmarks may reflect alignment with surface patterns rather than robust underlying capability. Prior studies either focus on individual components, overlooking their interactions, or examine manually curated, small-scale question variants that lack a holistic perspective; in either case, intrinsic model capability cannot be precisely attributed amid these confounding influences. To address these limitations, we propose LLM evaluatology, a principled framework that grounds LLM evaluation in a causally informed system design. By jointly modeling evaluation components and structured question variations, it enables interpretable, reproducible, and causally faithful assessment of model capability, and establishes clear conditions under which evaluation results are meaningful and trustworthy.