Poster in Workshop: Combining Theory and Benchmarks: Towards A Virtuous Cycle to Understand and Guarantee Foundation Model Performance
Thu, Jul 9, 2026 • 7:00 PM – 8:00 PM PDT

Uplifting Human Decision Making in AI Evaluation by Automating Benchmark Validity Analysis

Rodolfo Corona ⋅ Sang Truong ⋅ Ritwik Gupta ⋅ Nhi N Truong ⋅ Atnafu Lambebo Tonja ⋅ Mena Attia ⋅ Fahim Faisal ⋅ Kaushal K Maurya ⋅ Fred Philippy ⋅ Belu Ticona ⋅ Sumaya Nur Adan ⋅ Fazl Barez ⋅ Omar Florez ⋅ Supheakmungkol Sarin ⋅ Aseem Srivastava ⋅ Xiaoyuan Yi ⋅ Nick Haber ⋅ Dan Klein ⋅ Thamar Solorio ⋅ Xing Xie ⋅ Sanmi Koyejo ⋅ Robert Trager

Abstract

Evaluations of modern AI systems largely originate from English-speaking, Western nations, posing adoption challenges for other regions due to language resource scarcity, misalignment in cultural values, and blind spots to region-specific problems, knowledge, and perspectives. In this paper, we present a framework and automated pipeline to assess the applicability of AI benchmarks in the context of specific deployment use cases and target populations. Our framework is structured around the ontological, instance-level, and representational components of benchmark inputs and outputs, specifying the conditions under which a benchmark's evaluation validity would be violated when the benchmark is transferred to a different cultural or geographic context. To enable scalable validity analysis, our automated pipeline leverages large language models to evaluate benchmark-use-population triplets across our six validity dimensions. We validate this pipeline through human expert studies and apply it to assess 24 benchmark-use-population triplets spanning five global regions, surfacing systematic patterns in how porting strategies affect validity. We conclude with policy recommendations for actors to improve the benchmarking ecosystem in their respective regional contexts.