You're reading LLM leaderboards wrong: Disentangling models from pipelines in engineering benchmarks
Abstract
LLM leaderboard scores are widely treated as measures of model capability. We argue that they are not: each score is a joint outcome of the model and the evaluation pipeline. We reproduce four benchmarks (MMLU, ScienceQA, SceMQA, MatSciBench) and show two concrete ways pipelines distort scores. First, prompt design shifts accuracy by 5-9 percentage points, and the direction of the effect reverses depending on task type. Second, removing tool access from MatSciBench drops o4-mini's accuracy from 74% to 38%. Engineering benchmarks are especially affected because they combine tool-dependent computation with multimodal inputs, which makes the pipeline contribution larger than in general NLP tasks. We call for benchmark papers to, at minimum, provide full pipeline specifications and key ablations for reproducibility, and ideally to report score ranges across reasonable pipeline variations rather than single point estimates.
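As a minimal sketch of what reporting a score range rather than a point estimate could look like, the Python snippet below summarizes one model's accuracy across pipeline variants. The only data used are the two MatSciBench figures for o4-mini quoted above. The names ScoreRange, score_range, and the variant labels are hypothetical and chosen for illustration; they are not part of any benchmark's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRange:
    """Lowest and highest accuracy observed across pipeline variants."""
    low: float
    high: float

    @property
    def spread_pp(self) -> float:
        # Width of the range in percentage points, rounded to one decimal.
        return round((self.high - self.low) * 100, 1)


def score_range(accuracy_by_pipeline: dict[str, float]) -> ScoreRange:
    """Summarize one model's accuracy over several pipeline variants."""
    if not accuracy_by_pipeline:
        raise ValueError("need at least one pipeline variant")
    values = list(accuracy_by_pipeline.values())
    return ScoreRange(low=min(values), high=max(values))


# Variant labels are illustrative; the accuracies are the o4-mini
# MatSciBench figures reported in the abstract (74% with tools, 38% without).
matscibench_o4_mini = {"with_tools": 0.74, "without_tools": 0.38}

result = score_range(matscibench_o4_mini)
print(f"o4-mini on MatSciBench: {result.low:.0%} to {result.high:.0%} "
      f"(spread: {result.spread_pp} pp)")
```

A leaderboard entry built this way would carry the range and the list of variants that produced it, so a reader can see how much of the headline number depends on the pipeline.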