Position: LLM Benchmark Datasets Should Be Contamination-Resistant
Abstract
Benchmark datasets are critical for reproducible, reliable, and discriminative evaluation of LLMs. However, recent studies reveal that many benchmark datasets are included in pretraining corpora, i.e., contaminated, which diminishes their value as a reliable measure of model generalization. In this position paper, we argue that benchmark datasets should be contamination-resistant, i.e., unlearnable while still supporting inference. To this end, we first underline the wide prevalence of benchmark dataset contamination and outline the properties of contamination-resistant datasets. Second, we highlight how the asymmetry between the inference and training pipelines in the Transformer architecture can be leveraged to achieve contamination resistance. Third, we outline the mathematical advances needed to make such datasets interoperable across LLM architectures. Based on the above, we call on the community to ensure the reliability of LLM benchmarking by: (i) advancing novel contamination-resistant methodologies, (ii) developing supporting methods and platforms, and (iii) adopting contamination-resistant benchmarks in existing evaluation pipelines.
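As a deliberately simplistic illustration of the training/inference asymmetry invoked above (a sketch of the general idea, not the methodology this paper argues for), consider publishing benchmark items only in an encoded form: a crawler that folds the raw file into a pretraining corpus ingests scrambled text, while an evaluation harness that knows the encoding recovers the original item at inference time. The helpers below, including their Base85 encoding, are hypothetical stand-ins for whatever contamination-resistant encoding a benchmark might actually adopt.

```python
# Hypothetical sketch: benchmark items are published encoded, so a
# pretraining scraper sees non-natural-language text, while the
# evaluation harness decodes each item just before prompting the model.

import base64


def encode_item(plaintext: str) -> str:
    """Form in which a benchmark item would be published/stored."""
    return base64.b85encode(plaintext.encode("utf-8")).decode("ascii")


def decode_item(encoded: str) -> str:
    """Step applied only inside the evaluation harness, at inference time."""
    return base64.b85decode(encoded.encode("ascii")).decode("utf-8")


if __name__ == "__main__":
    question = "What is the capital of France?"
    published = encode_item(question)   # what a web crawler would ingest
    recovered = decode_item(published)  # what the model is actually prompted with
    print(published)
    assert recovered == question
```

Note that a fixed public encoding like Base85 is itself learnable by a sufficiently capable model; a genuinely contamination-resistant scheme would need stronger guarantees, which is precisely the kind of methodology the paper calls on the community to develop.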