MapShift: Controlled Post-Intervention Evaluation for Embodied World Models
Aarav Sinha
Abstract
Embodied agents are often evaluated in the same environment that they explored or were trained in, making it difficult to assess whether a learned world model supports planning after the world changes. Existing evaluations can conflate memory of the explored environment, belief update after change, and planning under the updated belief. We introduce MapShift, an executable benchmark for controlled post-intervention evaluation (CPE): an agent explores a base environment without a task reward; the environment is modified by a controlled intervention in the metric, topology, dynamics, or semantics; and the agent is evaluated on post-intervention planning, inference, and adaptation tasks. The contribution is measurement infrastructure: matched base/intervened pairs, family-wise estimands, severity ladders, invariant validators, benchmark-health gates, protocol-comparison tooling, and reproducible artifact generation. In the expanded 24-motif release, health gates pass with zero fatal leakage, zero task rejections, perfect reference solvability, no intervention-validator failures, and no severity-magnitude failures. A deterministic mechanism diagnostic shows that same-environment evaluation underestimates the belief-update advantage by roughly 3x on topology shifts ($\Delta=0.304$ under CPE versus $0.102$ under same-environment evaluation) and by roughly 7x on semantic shifts ($\Delta=0.724$ versus $0.102$); planning slices also exhibit protocol-induced rank reversals.
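As a quick arithmetic check on the reported factors, using only the values quoted above (the subscripted labels $\Delta_{\text{CPE}}$ and $\Delta_{\text{same}}$ are introduced here for illustration and are not the paper's notation):

$$
\frac{\Delta_{\text{CPE}}^{\text{topology}}}{\Delta_{\text{same}}} = \frac{0.304}{0.102} \approx 2.98,
\qquad
\frac{\Delta_{\text{CPE}}^{\text{semantic}}}{\Delta_{\text{same}}} = \frac{0.724}{0.102} \approx 7.10
$$

Both ratios are consistent with the rounded factors of 3x and 7x.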