Benchmarking World-Model Learning with Environment-Level Queries
Abstract
World models are central to building AI agents capable of flexible reasoning and planning. Yet current evaluations (i) test only properties measurable from observed interactions within an environment, such as next-frame prediction or task return, and (ii) do not test whether a learned model supports diverse queries about the same environment. In contrast, humans build \textit{general-purpose} models that can answer many different questions about an environment---including questions that require understanding global structure and counterfactual consequences. We propose WorldTest, a protocol for evaluating agents' ability to learn general-purpose world models. A WorldTest benchmark pairs environments with multiple environment-level queries---properties of the full environment---rather than objectives defined only on observed trajectories. Individually, these queries can target global and counterfactual properties (e.g., reachability or the effects of interventions) that are not determined by any single rollout distribution. Collectively, they assess model generality across query types. We instantiate WorldTest as AutumnBench, a minimal yet expressive benchmark of 43 interactive grid-world environments and 129 tasks spanning three query families, administered to both humans and learning agents. These query families evaluate prediction, counterfactual reasoning, and long-horizon planning. Experiments with 517 human participants and five frontier models show that humans substantially outperform these models, a gap we attribute to differences in exploration and belief updating. WorldTest and AutumnBench provide a rigorous framework for evaluating world-model learning and expose critical limitations in current approaches.