CauSciBench: Evaluating LLM Causal Inference for Scientific Research
Abstract
Identifying and estimating causal relationships from data is a central component of scientific research because it enables researchers to understand how variables affect one another. While large language models (LLMs) show potential for assisting research workflows, their ability to perform causal inference in empirical studies remains underexplored, despite the importance of causality in domains such as medicine and public policy. To address this gap, we introduce CauSciBench, a benchmark that evaluates whether LLMs can autonomously perform end-to-end causal inference to answer causal questions that arise in empirical research. CauSciBench contains over 300 evaluation tasks derived from real-world studies across multiple disciplines, synthetic scenarios, and textbook datasets. Whereas prior causal inference benchmarks primarily evaluate whether LLMs can implement user-specified methods, CauSciBench evaluates performance across the full causal analysis pipeline, including variable selection, method selection, causal effect estimation, and statistical interpretation. We evaluate seven frontier models using several prompting and agentic strategies, including Chain-of-Thought, Program-of-Thought, and ReAct. Results show a clear performance gap between real-world and synthetic settings, highlighting limitations in current agentic capabilities for data-driven causal analysis.
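To make the pipeline concrete, the following is a minimal illustrative sketch, not taken from the benchmark, of the kind of causal effect estimation step a task might require: a synthetic confounded treatment-outcome setup in which a naive difference in means is biased and a backdoor adjustment recovers the true effect. All variable names and the true effect size here are hypothetical.

```python
# Illustrative sketch of a causal-effect-estimation step (hypothetical data, not from CauSciBench).
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Confounder Z affects both treatment T and outcome Y; the true causal effect of T on Y is 2.0.
z = rng.normal(size=n)
t = (z + rng.normal(size=n) > 0).astype(float)
y = 2.0 * t + 3.0 * z + rng.normal(size=n)

# Naive estimate: difference in means, biased by the confounder Z.
naive_ate = y[t == 1].mean() - y[t == 0].mean()

# Adjusted estimate: linear regression of Y on T and Z (backdoor adjustment).
X = np.column_stack([np.ones(n), t, z])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
adjusted_ate = coef[1]

print(f"naive ATE: {naive_ate:.2f}, adjusted ATE: {adjusted_ate:.2f}")  # adjusted estimate is close to 2.0
```

In a CauSciBench-style task, the model would be expected to carry out analogous reasoning end to end: choose the adjustment variables, select an appropriate estimator, compute the effect, and interpret the resulting statistics.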