Position: Stop Using Culturally Biased Human Cognitive Benchmarks to Evaluate LLMs
Abstract
Recent work uses human cognitive benchmarks to evaluate how LLMs represent concepts, claiming to assess "human-like" understanding. This position paper argues that the approach is misguided: these benchmarks are derived from narrow, typically Western populations yet are treated as universal standards, despite cross-cultural research showing that culture shapes how people think, not just what they think about. LLMs trained on global multilingual data should not be expected to mirror the thinking patterns of any single, narrow population. Moreover, LLM outputs can shift with minor changes in prompting, unlike the stable human mental structures these benchmarks were designed to measure. These problems surface as contradictory findings across studies, making benchmark results poor evidence for claims about how LLMs represent concepts. We call for evaluation approaches designed for what LLMs actually are (systems trained on diverse global data) rather than tests that measure how closely they match a single population’s way of thinking.