Combining Theory and Benchmarks: Towards A Virtuous Cycle to Understand and Guarantee Foundation Model Performance
Abstract
Benchmarks such as HELM [1] and Big-Bench [2] have significantly advanced quantitative model evaluation. However, current practice remains largely empirical: it measures performance but does not provide guarantees about what capabilities a model has, when those capabilities will or will not manifest, or why. Debates over emergent abilities at scale illustrate this lack of predictive understanding [3,4]. In parallel, a substantial body of theoretical work addresses scaling laws for pre-training [5,6], generalization [7,8,9], and benchmark predictability [10]. Yet these advances are often disconnected from real-world benchmarking, and as a result theoretical insights rarely inform benchmark design. This structural disconnect limits our ability to make reliable claims about model behavior, constrains trustworthy deployment, and slows the development of foundation models.

This workshop will focus on advancing a predictive science of foundation model performance. We structure the workshop around three key research challenges: 1) Quantification of capabilities across task levels: How can we move from benchmark scores to formal, quantitative guarantees on performance? 2) Foundations of generalization and composition: Which mathematical frameworks can explain when and why models generalize and compose capabilities? 3) Reliable and structured empirical evaluation: How should benchmarks be constructed to evaluate reasoning, robustness under distribution shift, and calibrated uncertainty?

The workshop will convene researchers across mathematics, statistics, machine learning, and industry to catalyze a new research agenda that tightly couples theory and empirical evaluation of foundation models. By advancing frameworks that make performance predictable and quantifiable, we aim to influence (i) how benchmarks are designed, (ii) how models are stress-tested, and (iii) how reliability claims are substantiated. These developments have direct implications for large-scale deployment, evaluation pipelines, and red-teaming practices in industry. More broadly, the workshop will help accelerate the emergence of a principled science of foundation models grounded in predictive theory, structured evaluation, and rigorous performance guarantees.