

Poster in Workshop: Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models

Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts

Jacob Haimes · Cenny Wenner · Kunvar Thaman · Vassil Tashev · Clement Neo · Esben Kran · Jason Schreiber, né Hoelscher-Obermaier


Abstract:

Recent research indicates that the integrity of public benchmarks is compromised because the training material of many Large Language Models (LLMs) is contaminated with test data, suggesting a gap between benchmark scores and actual capabilities. Ideally, a private holdout set could be used to accurately verify existing scores. Unfortunately, such datasets do not exist for most benchmarks, and post-hoc construction of sufficiently similar datasets is non-trivial. To address these issues, we introduce a systematic methodology for (i) retrospectively constructing a holdout dataset for a target dataset, (ii) demonstrating the statistical indistinguishability of this retro-holdout dataset from the original, and (iii) comparing LLMs on the two datasets to quantify the performance gap attributable solely to the dataset's public availability. Applying these methods to TruthfulQA, we construct and release Retro-TruthfulQA, on which we evaluate 16 LLMs and find that some scores are inflated by as much as 10 percentage points. These results demonstrate the importance of treating public benchmark scores with caution, and point to a need for greater care in the release of evaluation datasets to ensure their long-term utility and reliability.
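To make step (iii) concrete, the sketch below shows one way the gap between a model's score on a public benchmark and on its retro-holdout could be quantified. This is not the authors' released code: the function names are hypothetical, and the use of a two-proportion z-test for the significance check is an assumption, not necessarily the statistical procedure used in the paper.

```python
# Minimal sketch (assumed, not the paper's implementation): given per-item
# correctness flags for one model on the public benchmark and on the matched
# retro-holdout, compute the score gap in percentage points and a two-sided
# p-value from a two-proportion z-test.
from math import sqrt
from statistics import NormalDist


def accuracy(outcomes: list[bool]) -> float:
    """Fraction of items the model answered correctly."""
    return sum(outcomes) / len(outcomes)


def performance_gap(public_outcomes: list[bool], retro_outcomes: list[bool]):
    """Return (gap in percentage points, two-sided p-value)."""
    p1, n1 = accuracy(public_outcomes), len(public_outcomes)
    p2, n2 = accuracy(retro_outcomes), len(retro_outcomes)
    # Pooled accuracy under the null hypothesis that both sets are equally hard
    # for the model (i.e., no inflation from public availability).
    pooled = (sum(public_outcomes) + sum(retro_outcomes)) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return 100 * (p1 - p2), p_value


# Illustrative usage with hypothetical correctness flags:
# gap_pp, p = performance_gap(public_flags, retro_flags)
# print(f"inflation: {gap_pp:.1f} percentage points (p = {p:.3f})")
```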
