Benchmark Scores Rank Methods, Not Capabilities: Theory, Evidence, and Protocols for the Saturation-Collapse Cycle
Dipam Paul
Abstract
When a leaderboard reports "95% on Benchmark X," it conflates two claims that require different evidence: method rankings need only reliability, while capability claims demand validity, that is, generalization beyond the benchmark distribution. We formalize this distinction using measurement theory and derive a falsifiable prediction: structure-preserving variants maintain rankings, whereas structure-altering variants produce score collapse. Empirically, within-cohort rankings survive paraphrasing ($\tau > 0.9$ across 264,761 paraphrased questions), while scores drop on structurally novel successors (ARC-AGI-2$\to$3: 84.6%$\to$0.37%; SWE-bench Verified$\to$Pro: 93.9%$\to$77.8%, rankings preserved). We introduce the Robustness Ratio (RR), a distribution-free diagnostic revealing that two models that both score 100% differ by a factor of $\sim$40 in paraphrase sensitivity, and we propose claim-type labeling protocols that match evidence to inference scope. Together, these contributions close the theory-benchmark loop and demonstrate the virtuous cycle this workshop seeks.
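As a rough illustration of the ranking-stability check, the sketch below computes a rank correlation between model scores on an original benchmark and on a paraphrased variant. It assumes that $\tau$ denotes Kendall's rank correlation, which the abstract does not state explicitly. The function name `kendall_tau` and all scores are hypothetical and chosen only for illustration; they are not results from the paper.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length score lists.

    Assumes no tied scores; tied pairs would count as neither
    concordant nor discordant in this simplified version.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(x)), 2):
        # Positive product: both benchmarks order models i and j the same way.
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical accuracies for five models (illustrative values only).
original = [0.95, 0.91, 0.88, 0.80, 0.72]
# Same models on a paraphrased variant: scores shift and one adjacent pair swaps.
paraphrased = [0.90, 0.87, 0.88, 0.70, 0.66]

tau = kendall_tau(original, paraphrased)
print(f"tau = {tau:.2f}")
```

With these made-up scores, nine of the ten model pairs keep their order and one pair flips, so the rankings are largely preserved even though every absolute score changed. That separation between stable rankings and shifting scores is the distinction the abstract draws between reliability and validity.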