Schema-Guided World Modeling for Understanding Hierarchical Visual Dynamics
Abstract
Multimodal LLMs lack a systematic understanding of the visual dynamics of complex human activities, which requires a model to predict or simulate dynamic constituents at multiple levels, such as the overall progression of actions and the associated changes in low-level details of the world. To address this challenge, we propose a dynamic visual schema-guided world model, DynaVieW, optimized for visual dynamic prediction and simulation. DynaVieW achieves an in-depth understanding of visual dynamics by learning interleaved state-transition sequences, where states cover broad visual scenes drawn from video keyframes, and transitions capture comprehensive dynamic constituents within a hierarchical schema. DynaVieW jointly models transition prediction and state simulation under a mixture-of-experts architecture, with cross-expert selective attention and a schema-token re-weighted loss, to ensure effective and robust learning. DynaVieW's superior visual dynamic understanding boosts its downstream performance on both visual narrative creation and world simulation, showing improved consistency and controllability in visual generation and better instruction-following ability.
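To make the schema-token re-weighted loss concrete, the sketch below shows one plausible formulation: a standard token-level cross-entropy in which tokens belonging to schema spans receive a larger weight. The function name, tensor shapes, and the scalar `schema_weight` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def schema_reweighted_loss(logits, targets, schema_mask, schema_weight=2.0):
    """Cross-entropy over a token sequence where schema tokens are up-weighted.

    logits:      (batch, seq, vocab) unnormalized scores
    targets:     (batch, seq) integer token ids
    schema_mask: (batch, seq) 1 for schema tokens, 0 otherwise
    NOTE: a hypothetical sketch of a "schema-token re-weighted loss";
    the actual loss in the paper may differ.
    """
    # numerically stable log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # negative log-likelihood of each target token
    b, t = targets.shape
    nll = -logp[np.arange(b)[:, None], np.arange(t)[None, :], targets]
    # schema tokens contribute with weight schema_weight, others with weight 1
    weights = np.where(schema_mask == 1, schema_weight, 1.0)
    return (weights * nll).sum() / weights.sum()
```

Under this formulation, setting `schema_weight=1.0` recovers the ordinary mean cross-entropy, so the re-weighting acts as a tunable emphasis on schema tokens rather than a structural change to the objective.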