Poster in Workshop: Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models
Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts
Jacob Haimes · Cenny Wenner · Kunvar Thaman · Vassil Tashev · Clement Neo · Esben Kran · Jason Schreiber, nĂ© Hoelscher-Obermaier
Recent research indicates that the integrity of public benchmarks is compromised: the training data of many Large Language Models (LLMs) is contaminated with test data, suggesting a gap between benchmark scores and actual capabilities. Ideally, a private holdout set could be used to accurately verify existing scores. Unfortunately, such datasets do not exist for most benchmarks, and post-hoc construction of sufficiently similar datasets is non-trivial. To address these issues, we introduce a systematic methodology for (i) retrospectively constructing a holdout dataset for a target dataset, (ii) demonstrating the statistical indistinguishability of this retro-holdout dataset from the original, and (iii) comparing LLMs on the two datasets to quantify the performance gap attributable solely to the original dataset's public availability. Applying these methods to TruthfulQA, we construct and release Retro-TruthfulQA, on which we evaluate 16 LLMs and find that some have inflated scores by as much as 10 percentage points. These results demonstrate that public benchmark scores should be treated with caution, and they point to a need for greater care in releasing evaluation datasets so that they retain long-term utility and reliability.
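Step (iii) reduces to comparing a model's accuracy on the public benchmark against its accuracy on the retro-holdout. The sketch below illustrates that comparison under stated assumptions; the `answer_fn` interface, the item schema, and the bootstrap interval are illustrative choices, not the paper's actual evaluation harness.

```python
import random


def item_scores(answer_fn, dataset):
    """Per-item correctness (1/0). `answer_fn` maps a question string to an
    answer string; this item schema is an assumption for illustration."""
    return [int(answer_fn(item["question"]) == item["answer"]) for item in dataset]


def performance_gap(public_scores, retro_scores):
    """Score inflation in percentage points: accuracy on the public benchmark
    minus accuracy on the retro-holdout."""
    acc = lambda scores: sum(scores) / len(scores)
    return 100.0 * (acc(public_scores) - acc(retro_scores))


def bootstrap_gap_ci(public_scores, retro_scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for the gap, resampling items within each
    dataset independently; the model is queried only once per item."""
    rng = random.Random(seed)
    gaps = sorted(
        performance_gap(
            rng.choices(public_scores, k=len(public_scores)),
            rng.choices(retro_scores, k=len(retro_scores)),
        )
        for _ in range(n_boot)
    )
    return gaps[int(alpha / 2 * n_boot)], gaps[int((1 - alpha / 2) * n_boot) - 1]
```

Precomputing per-item correctness keeps model queries out of the resampling loop, so the uncertainty estimate costs no additional inference.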