How Can I Publish My LLM Benchmark Without Giving the True Answers Away?
Abstract
Publishing a large language model (LLM) benchmark, especially its ground-truth answers, on the Internet risks contaminating future LLMs and enabling evaluation gaming: the benchmark may be used, intentionally or not, to train or select a model, and publicly accessible labels invite overfitting and leaderboard hacking. A common mitigation is to keep the benchmark private and have participants submit their models or predictions to the organizers, but this still permits test-set overfitting through repeated feedback. To overcome this issue, we propose a way to publish benchmarks without fully disclosing the ground-truth answers, while preserving the ability to openly evaluate LLMs. The key idea is to lower the best possible accuracy, i.e., the Bayes accuracy, by injecting randomness into the answers: we prepare several logically correct answers to each question and include only one of them as the solution in the benchmark. Beyond keeping the full ground truth undisclosed, this also yields a test for detecting contamination: in principle, even a fully capable model cannot surpass the Bayes accuracy, so a model that exceeds this ceiling is a strong signal of leakage or gaming. We present theoretical and experimental evidence that our method accurately detects data contamination or hacking across a wide range of benchmarks, models, training methodologies, and scenarios, including feedback loops.
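To make the mechanism concrete, the following is a minimal Python sketch of the idea, not code from the paper: each item carries several logically correct answers, one of which is sampled uniformly at random as the published label, so a clean model's expected accuracy is capped at 1/k per item. All function names are illustrative, and the significance test assumes every item has the same number of correct answers k; the paper's actual construction and analysis may differ.

```python
import random
from math import comb
from statistics import mean

def publish_benchmark(items, seed=0):
    """items: list of dicts with 'question' and 'correct_answers'
    (all logically correct). Returns a public benchmark where each item
    carries a single randomly chosen label; the remaining correct
    answers stay private."""
    rng = random.Random(seed)
    return [
        {"question": it["question"],
         "label": rng.choice(it["correct_answers"]),
         # k is released so evaluators can compute the Bayes ceiling.
         "num_correct": len(it["correct_answers"])}
        for it in items
    ]

def bayes_accuracy(benchmark):
    """Best achievable accuracy: even a model that knows *all* correct
    answers can only guess which one was recorded, succeeding with
    probability 1/k on each item."""
    return mean(1.0 / it["num_correct"] for it in benchmark)

def contamination_pvalue(num_correct_preds, n_items, p_bayes):
    """One-sided binomial tail P(X >= observed) under the Bayes ceiling,
    assuming a common k across items (so p_bayes = 1/k). A small p-value
    signals leakage or gaming."""
    return sum(comb(n_items, j) * p_bayes**j * (1 - p_bayes)**(n_items - j)
               for j in range(num_correct_preds, n_items + 1))
```

For instance, an item like "Name a prime between 10 and 20" with correct answers {11, 13, 17, 19} yields a per-item ceiling of 0.25; a model that reliably matches the published labels well above this rate on many such items is, under this sketch's assumptions, likely to have seen them during training.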