Noise Tectonics: Measuring the Stability of AI Benchmark Ecosystems
Abstract
AI benchmark ecosystems compress rich evaluation data into aggregate leaderboard scores, but these scores contain substantial measurement noise whose sources and magnitudes remain unquantified. Without systematic methods to measure this noise and separate signal from artifact, it is unclear when benchmark rankings reflect genuine capability differences rather than measurement error. We introduce a psychometric framework to test hypotheses about benchmark ecosystem structure and to quantify the reliability of common benchmark-derived claims. Applying Confirmatory Factor Analysis and Generalizability Theory to more than 4,000 models from the Open LLM Leaderboard, we decompose the sources of ranking variance and find that human contributors account for more variance (9\%) than model architecture (4.8\%), revealing that benchmark noise stems at least as much from contributor practices as from model characteristics. We further demonstrate methods for assessing the reliability of scaling-law claims by controlling for model size and other confounds. Our findings provide actionable diagnostics for when benchmark rankings can be trusted and establish a measurement framework for evaluating the validity of AI evaluation claims.
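As a rough illustration of the kind of decomposition summarized above, the sketch below estimates how much of a leaderboard score's variance is attributable to contributor versus architecture facets using a crossed random-effects model, one common way to obtain Generalizability Theory variance components. It is a minimal sketch on synthetic data, not the paper's pipeline: the column names (contributor, architecture, score), the facet sizes, the effect magnitudes, and the use of statsmodels' MixedLM are all assumptions for illustration.

```python
# Hypothetical sketch (synthetic data, not the paper's pipeline): estimate how much
# leaderboard-score variance is attributable to contributors vs. model architecture,
# in the spirit of a Generalizability Theory decomposition, via crossed random effects.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_models, n_contrib, n_arch = 1500, 150, 20  # assumed sizes for the synthetic example

# Each row stands in for one leaderboard submission: an aggregate score tagged with
# the submitting contributor and the model's architecture family.
contrib = rng.integers(0, n_contrib, n_models)
arch = rng.integers(0, n_arch, n_models)
score = (rng.normal(0, 3.0, n_contrib)[contrib]   # contributor facet
         + rng.normal(0, 2.0, n_arch)[arch]       # architecture facet
         + rng.normal(0, 8.0, n_models))          # residual
df = pd.DataFrame({"contributor": contrib.astype(str),
                   "architecture": arch.astype(str),
                   "score": score,
                   "all": 1})                     # single group: the facets are crossed

# Contributor and architecture enter as crossed variance components.
vcf = {"contributor": "0 + C(contributor)", "architecture": "0 + C(architecture)"}
fit = sm.MixedLM.from_formula("score ~ 1", groups="all", vc_formula=vcf,
                              re_formula="0", data=df).fit()

# fit.vcomp holds the estimated variance of each component; fit.scale is the residual.
total = fit.vcomp.sum() + fit.scale
print(fit.summary())                              # labels each facet's estimated variance
print("facet variance shares:", np.round(fit.vcomp / total, 3),
      "| residual share:", round(float(fit.scale / total), 3))
```

Each facet's share of total variance is its estimated variance component divided by the sum of all components plus the residual; in the paper's actual data this yields the 9\% (contributor) versus 4.8\% (architecture) figures quoted in the abstract.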