A Computational Framework for Evaluating Human-likeness in LLMs' Open-ended Human Behaviors
Abstract
Large Language Models (LLMs) have been widely applied and studied in scenarios such as role-playing and sociological simulation. Despite the growing use of LLM-based agents to simulate human activities, the extent to which their behaviors resemble real human behavior remains underexplored. As diverse LLMs proliferate, the traditional Turing test becomes ineffective for scalable evaluation and is prone to bias from human-crafted challenges, leading to unfair assessments. In this work, we propose a novel distribution-based framework that comprehensively evaluates the human-likeness and believability of AI behaviors by leveraging large-scale open-ended human behavior data from the web. To support thorough evaluation, we design generic metrics covering three principles: rationality, consistency, and diversity. Applied to three scenarios, online shopping, open-topic Q&A, and urban mobility, our framework reveals that even the best-performing current LLMs still exhibit a significant gap from real user behavior, underscoring the need for comprehensive research on and evaluation of AI's human-like capabilities.