Adapting Vision-Language Models for Evaluating World Models
Abstract
World models -- generative models that simulate environment dynamics conditioned on past observations and actions -- are gaining prominence in planning, simulation, and embodied AI. However, evaluating their rollouts remains a fundamental challenge, requiring fine-grained, temporally grounded assessment of action alignment and semantic consistency -- capabilities not captured by existing metrics. Vision-Language Models (VLMs) have shown promise as automatic evaluators of generative content due to their strong multimodal reasoning abilities. Yet their use in fine-grained, temporally sensitive evaluation tasks remains limited and requires targeted adaptation. We introduce an evaluation protocol targeting two recognition tasks -- action recognition and character recognition -- each assessed across binary, multiple-choice, and open-ended formats. To support this, we present UNIVERSE (UNIfied Vision-language Evaluator for Rollouts in Simulated Environments), a method for adapting VLMs to rollout evaluation under data and compute constraints. The resulting unified evaluator matches the performance of task-specific baselines while using a single checkpoint. An accompanying human study further examines agreement with human judgments, establishing UNIVERSE as a scalable, semantics-aware evaluator for world models.
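To make the structure of the evaluation protocol concrete, the sketch below enumerates the six (task, format) cells described in the abstract: two recognition tasks, each posed as a binary, multiple-choice, or open-ended question about a rollout. It is only an illustration under stated assumptions; the names (`EvalItem`, `build_prompt`) and question templates are hypothetical and not the paper's actual implementation.

```python
# Hypothetical sketch of the 2-task x 3-format evaluation grid.
# All identifiers and prompt templates here are illustrative assumptions.
from dataclasses import dataclass
from itertools import product

TASKS = ("action_recognition", "character_recognition")
FORMATS = ("binary", "multiple_choice", "open_ended")


@dataclass
class EvalItem:
    task: str    # which recognition task the question probes
    fmt: str     # binary / multiple-choice / open-ended
    prompt: str  # question shown to the VLM alongside rollout frames


def build_prompt(task: str, fmt: str) -> str:
    """Assemble a question template for one (task, format) cell."""
    subject = (
        "action performed" if task == "action_recognition" else "character shown"
    )
    if fmt == "binary":
        return f"Does the rollout match the intended {subject}? Answer yes or no."
    if fmt == "multiple_choice":
        return f"Which option best describes the {subject}? Choose A, B, C, or D."
    return f"Describe the {subject} in the rollout."


# Enumerate all six cells of the protocol.
protocol = [EvalItem(t, f, build_prompt(t, f)) for t, f in product(TASKS, FORMATS)]

if __name__ == "__main__":
    for item in protocol:
        print(f"[{item.task} / {item.fmt}] {item.prompt}")
```

In this reading, a single adapted VLM checkpoint would answer every cell of the grid for each rollout, which is what allows one unified evaluator to stand in for task-specific baselines.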