Poster in Workshop: DataWorld: Unifying data curation frameworks across domains

SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

Tiancheng Hu ⋅ Joachim Baumann ⋅ Lorenzo Lupo ⋅ Nigel Collier ⋅ Dirk Hovy ⋅ Paul Röttger

Keywords: large language models, computational social science, human behavior simulation, human-AI alignment, benchmarking, human-centered AI, calibration

[Project Page] [OpenReview]

Abstract

Simulations of human behavior based on large language models (LLMs) have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Prior work across many disciplines has evaluated the simulation capabilities of specific LLMs in specific experimental settings, but often produced disparate results. To move towards a more robust understanding, we introduce SimBench, the first large-scale benchmark to evaluate how well LLMs can simulate group-level human behaviors across diverse settings and tasks. SimBench compiles 20 datasets in a unified format, measuring diverse types of behavior (e.g., decision-making vs. self-assessment) across hundreds of thousands of diverse participants from different parts of the world. Using SimBench, we can ask fundamental questions regarding when, how, and why LLM simulations succeed or fail. For example, we show that, while even the best LLMs today have limited simulation ability, there is a clear log-linear scaling relationship with model size, and a strong correlation between simulation and scientific reasoning abilities. We also show that base LLMs, on average, are better at simulating high-entropy response distributions, while the opposite holds for instruction-tuned LLMs. By making progress measurable, we hope that SimBench can accelerate the development of better LLM simulators in the future.
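
The abstract refers to group-level response distributions and to their entropy, but does not state how these are computed or scored. As a minimal illustrative sketch (not SimBench's actual metric), the Python below assumes a single multiple-choice question with hypothetical answer counts, computes the Shannon entropy of the human response distribution, and compares it with a model-predicted distribution using total variation distance. All names and numbers here are assumptions made for illustration.

```python
import math


def normalize(counts):
    """Turn raw answer counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return [c / total for c in counts]


def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability options contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def total_variation(p, q):
    """Total variation distance between two distributions over the same options."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same answer options")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical data: how many participants chose each of three options,
# and the shares a simulator assigned to the same options.
human = normalize([50, 30, 20])
model = normalize([70, 20, 10])

print(f"human entropy (bits): {entropy_bits(human):.3f}")
print(f"model entropy (bits): {entropy_bits(model):.3f}")
print(f"total variation distance: {total_variation(human, model):.3f}")
```

In this sketch, a question where participants split evenly across options has high entropy, while one with near-unanimous answers has entropy close to zero; a distance of zero would mean the simulated distribution matches the human one exactly.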
