Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity
Abstract
Large Language Models are commonly benchmarked on a dataset by evaluating all relevant models on all queries in the test set. This can be wasteful for a practitioner who wants to find the best model to deploy: if a model clearly performs worse than the others, there is no need to estimate its performance precisely. Best-arm identification algorithms can drastically reduce costs by dynamically allocating the evaluation budget. When applying these algorithms to language models, we can further exploit the fact that different models' responses to the same prompt are often very similar. While previous attempts to make use of this additional structure can exploit model similarity in some cases, they significantly underperform simple baselines in others. We propose Synchronized Successive Rejects (SySRs), which augments the classical Successive Rejects algorithm with paired comparisons. Unlike prior work, our approach is hyperparameter-free and comes with performance guarantees that improve with the degree of similarity between the evaluated models. Empirically, our method outperforms all baselines, both in terms of the average error rate on a suite of 15 standard benchmarks and in terms of the fraction of benchmark data required to reliably identify the best model on these benchmarks.
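To make the paired-comparison idea concrete, the following is a minimal sketch, assuming a scalar per-query scoring function and the phase-length schedule of the classical Successive Rejects algorithm; it illustrates synchronized evaluation, not the authors' exact SySR procedure, and the names `score_fn` and `queries` are hypothetical placeholders.

```python
# Minimal sketch: a Successive Rejects loop with synchronized (paired) evaluations.
# Illustration only, based on the abstract; the phase-length schedule is the classical
# Successive Rejects schedule and is an assumption, not a statement of the SySR method.
import math
import random


def synchronized_successive_rejects(score_fn, num_models, queries, budget):
    """Return the index of the last surviving model.

    score_fn(model_idx, query) -> float is a hypothetical per-query scoring function.
    In every phase, all surviving models are evaluated on the *same* sampled queries,
    so comparisons between models are paired rather than based on independent samples.
    """
    K = num_models
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    surviving = list(range(K))
    totals = {m: 0.0 for m in surviving}
    counts = {m: 0 for m in surviving}
    prev_n = 0
    for k in range(1, K):
        # Classical Successive Rejects phase length (an assumption in this sketch).
        n_k = math.ceil((budget - K) / (log_bar * (K + 1 - k)))
        batch_size = max(0, min(n_k - prev_n, len(queries)))
        prev_n = n_k
        # Same fresh batch of queries for every surviving model (synchronized evaluation).
        batch = random.sample(queries, batch_size)
        for q in batch:
            for m in surviving:
                totals[m] += score_fn(m, q)
                counts[m] += 1
        # Reject the surviving model with the lowest empirical mean score.
        worst = min(surviving, key=lambda m: totals[m] / max(counts[m], 1))
        surviving.remove(worst)
    return surviving[0]
```

Because every surviving model is scored on the same queries within a phase, the differences between empirical means have lower variance when the models' responses are similar, which is the kind of structure the abstract says SySRs is designed to exploit.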