Expanding the AI Evaluation Toolbox with Statistical Models
Abstract
Benchmarks are widely used to evaluate and compare the performance of artificial intelligence systems. However, some approaches to computing benchmark metrics produce invalid uncertainty estimates or make unrecognized assumptions about the evaluation setting. We leverage statistical modeling to make two contributions to the practice of AI benchmarking. First, we formally distinguish measurements of benchmark accuracy from generalized accuracy (performance on all potential test items similar to those included in the benchmark). Then, in a simulated setting and in a large-scale evaluation of 22 API-access frontier large language models (LLMs) on 3 popular benchmarks, we show how analysis via a generalized linear mixed model (GLMM) can estimate generalized accuracy while quantifying uncertainty more efficiently than existing regression-free approaches. We also show how this approach equips evaluators with valuable context on evaluation results, including variance decompositions and item-difficulty estimates that illuminate important aspects of LLM performance and benchmark construction.
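Below is a minimal, hedged sketch (not the authors' code) of the kind of analysis the abstract describes: item-level correctness modeled with a logistic GLMM that has a fixed effect per LLM and a random intercept per benchmark item. The simulated data, the column names (model, item, correct), and the choice of statsmodels' variational-Bayes mixed GLM are illustrative assumptions.

```python
# Illustrative sketch: a logistic generalized linear mixed model (GLMM) for
# benchmark accuracy, with fixed effects per LLM and item-level random effects.
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

rng = np.random.default_rng(0)

# Simulated evaluation grid: each of 5 hypothetical "LLMs" answers the same 100 items.
n_models, n_items = 5, 100
model_ability = rng.normal(0.0, 0.5, n_models)    # per-model log-odds offset
item_difficulty = rng.normal(0.0, 1.0, n_items)   # per-item deviation in difficulty
rows = []
for m in range(n_models):
    for i in range(n_items):
        logit = model_ability[m] - item_difficulty[i]
        rows.append({"model": f"llm_{m}", "item": f"item_{i}",
                     "correct": rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))})
data = pd.DataFrame(rows)

# Logistic GLMM: one fixed effect per LLM, a variance component for items.
glmm = BinomialBayesMixedGLM.from_formula(
    "correct ~ 0 + C(model)",        # fixed effect per LLM (logit scale)
    {"item": "0 + C(item)"},         # random intercept per benchmark item
    data,
)
fit = glmm.fit_vb()                  # variational Bayes fit
print(fit.summary())

# The fixed effects give each LLM's ability on the logit scale; averaging the
# inverse-logit over the fitted item random-effect distribution yields an
# estimate of generalized accuracy. The "item" variance-component parameter
# quantifies how much of the outcome variance is attributable to item
# difficulty, and the posterior means of the item random effects serve as
# item-difficulty estimates.
```

This is only a sketch under the stated assumptions; the paper's actual model specification, estimation method, and uncertainty quantification may differ.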