Multi-view Consistent Latent Action Learning for World Modeling and Control
Abstract
The scalability of world models is currently bottlenecked by the scarcity of action annotations. While self-supervised latent action learning offers a potential solution, existing single-view paradigms, which rely on information bottlenecks or vector quantization (VQ), often conflate superficial 2D pixel displacements with the underlying physical and spatial dynamics of an action. Consequently, these methods remain highly susceptible to view-dependent noise, such as camera shake. We introduce MuCoLA (Multi-view Consistent Latent Action learning), a framework that learns robust, view-invariant action representations by enforcing semantic consistency across synchronized video streams. MuCoLA uses a student-teacher network with DINO-style self-distillation to align action distributions across viewpoints, effectively filtering high-frequency visual noise while preserving motion semantics. Theoretical analysis reveals that our multi-view objective functions as a spectral filter, isolating agent dynamics from environmental nuisances. Empirically, MuCoLA significantly outperforms baselines on action regression, video reconstruction, and downstream visual control tasks. Furthermore, we demonstrate that MuCoLA exhibits favorable scaling properties with respect to model capacity and data volume, paving the way for large-scale action-free world modeling.
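To make the cross-view alignment concrete, the sketch below illustrates a DINO-style self-distillation objective in which a teacher's latent-action distribution, computed from one camera view, supervises a student's distribution computed from another synchronized view; the teacher tracks the student via an exponential moving average. This is only an illustrative approximation under stated assumptions, not the paper's actual implementation: the function names (`cross_view_dino_loss`, `ema_update`), the temperatures, and the assumption that each encoder emits prototype logits over K discrete latent actions are hypothetical.

```python
# Minimal sketch of a DINO-style cross-view self-distillation loss for latent
# actions (illustrative; names and hyperparameters are assumptions, not the
# paper's API). Each encoder is assumed to output logits over K latent-action
# prototypes for the transition between two consecutive frames.

import torch
import torch.nn.functional as F


def cross_view_dino_loss(
    student_logits: torch.Tensor,   # (B, K) latent-action logits from view A (student encoder)
    teacher_logits: torch.Tensor,   # (B, K) latent-action logits from view B (teacher encoder)
    center: torch.Tensor,           # (K,) running center of teacher logits, used to avoid collapse
    student_temp: float = 0.1,
    teacher_temp: float = 0.04,     # sharper teacher distribution, as in DINO
    center_momentum: float = 0.9,
):
    """Align the student's action distribution with the teacher's from another view."""
    # Teacher target: centered, sharpened, and detached so no gradient flows to it.
    teacher_probs = F.softmax((teacher_logits - center) / teacher_temp, dim=-1).detach()
    # Student prediction: softmax at a higher temperature.
    student_log_probs = F.log_softmax(student_logits / student_temp, dim=-1)
    # Cross-entropy between teacher and student distributions, averaged over the batch.
    loss = -(teacher_probs * student_log_probs).sum(dim=-1).mean()
    # Update the running center with an exponential moving average of teacher logits.
    new_center = center_momentum * center + (1 - center_momentum) * teacher_logits.mean(dim=0)
    return loss, new_center


@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, momentum: float = 0.996):
    """Teacher weights track the student via an exponential moving average (no gradients)."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1 - momentum)
```

In this sketch, centering plus the low teacher temperature is the usual DINO recipe for preventing collapse onto a single latent action, while the EMA teacher provides a slowly varying cross-view target; a symmetric term swapping the roles of the two views would typically be added in practice.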