Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation
Sang Truong ⋅ Yuheng Tu ⋅ Rylan Schaeffer ⋅ Sanmi Koyejo
Abstract
Scaling laws provide a fundamental framework for understanding the performance of Large Language Models (LLMs), yet deriving them requires prohibitively expensive evaluations across thousands of checkpoints or millions of inference samples. To address this, we introduce Item Response Scaling Laws (IRSL), a unified framework that integrates Item Response Theory (IRT) into the scaling law formulation. Unlike traditional approaches that treat each model-benchmark pair in isolation, IRSL disentangles latent model ability from question characteristics, factorizing scaling law estimation over $M$ models and $N$ questions and reducing parameter complexity from $O(M \times N)$ to $O(M + N)$. We propose Beta-IRT, a novel extension that leverages the empirical probability responses of LLMs, such as token probabilities in pre-training and pass rates in test-time sampling, to capture richer signal than binary responses. We validate our approach across two prevalent scaling paradigms: (1) pre-training downstream scaling, using 6,612 LLM checkpoints and 37,682 questions from 10 benchmarks; and (2) test-time scaling, using 12 LLMs and 120 questions from 4 benchmarks with up to 2,500 samples per question. In both cases, we demonstrate that IRSL yields more reliable scaling estimates under limited query budgets. Furthermore, we show that the estimated latent model abilities generalize, enabling accurate performance forecasting across benchmarks that share the same measurement objective.
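To make the complexity claim concrete, consider the standard two-parameter logistic (2PL) IRT model; the abstract does not specify IRSL's exact parameterization, so this is a minimal sketch under that assumption. The probability that model $i$ answers question $j$ correctly is
\[
p_{ij} = \sigma\bigl(a_j(\theta_i - b_j)\bigr), \qquad \sigma(x) = \frac{1}{1 + e^{-x}},
\]
which requires one ability parameter $\theta_i$ per model and two item parameters $(a_j, b_j)$ per question, i.e., $M + 2N = O(M + N)$ parameters in total, versus $O(M \times N)$ when each model-question pair is fit independently. A Beta-IRT-style extension can replace the Bernoulli likelihood with a Beta likelihood on an observed continuous response $p_{ij} \in (0, 1)$ (a token probability or pass rate), e.g.,
\[
p_{ij} \sim \mathrm{Beta}\bigl(\mu_{ij}\,\nu,\ (1 - \mu_{ij})\,\nu\bigr), \qquad \mu_{ij} = \sigma\bigl(a_j(\theta_i - b_j)\bigr),
\]
where $\nu > 0$ is a precision parameter; the mean-precision Beta parameterization shown here is one common choice, not necessarily the paper's.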