Uplifting Human Decision Making in AI Evaluation by Automating Benchmark Validity Analysis
Abstract
Evaluations of modern AI systems largely originate from English-speaking, Western nations, posing adoption challenges for other regions due to scarce language resources, misaligned cultural values, and blind spots regarding region-specific problems, knowledge, and perspectives. In this paper, we present a framework and an automated pipeline for assessing the applicability of AI benchmarks to specific deployment use cases and target populations. Our framework is structured around the ontological, instance-level, and representational components of benchmark inputs and outputs, and specifies the conditions under which a benchmark's evaluation validity would be violated when the benchmark is transferred to a different cultural or geographic context. To enable scalable validity analysis, our automated pipeline uses large language models to evaluate benchmark-use-population triplets across our six validity dimensions. We validate this pipeline through human expert studies and apply it to 24 benchmark-use-population triplets spanning five global regions, surfacing systematic patterns in how porting strategies affect validity. We conclude with policy recommendations for actors seeking to improve the benchmarking ecosystem in their respective regional contexts.