PanoWorld-X: Generating Explorable Panoramic Worlds via Sphere-Aware Video Diffusion
Abstract
Achieving a complete and explorable 360-degree visual world is a cornerstone of immersive content creation. While recent advances in video generation have achieved impressive results, they follow a 2D paradigm that treats content generation as transitions of 2D pixels, lacking an intrinsic understanding of the physical 3D world and therefore suffering frequent geometric inconsistencies. To achieve an explorable and physically consistent visual world, the generation process should shift to a 3D paradigm: the visual content is governed by the physical relationships of the entire 3D environment together with 3D motion signals. Under this setting, however, conventional modeling methods and control signals, such as spatial attention computed in 2D space, become unsuitable and ineffective. To address this, we propose PanoWorld-X for explorable 3D scene video generation. Our framework is built on the panoramic representation, which naturally maps a 3D scene into a standard format and provides an ideal basis for consistency. Specifically, we first develop a data curation pipeline to produce high-quality, large-motion 3D scene evolution sequences paired with movement trajectories. To achieve precise control, we design the Exploration Panoramic Plücker Embedding (PPE), a guidance signal tailored to 3D motion. Furthermore, leveraging the spherical geometry of panoramic data, we propose a sphere-aware attention mechanism that captures true geometric adjacency by reprojecting features onto the spherical surface. Extensive experiments demonstrate that PanoWorld-X achieves superior performance in motion range, control precision, and visual quality, underscoring its potential for real-world applications.
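The abstract does not give implementation details, but the motivation for sphere-aware attention can be illustrated with a minimal sketch: in an equirectangular panorama, 2D pixel distance misrepresents true geometric adjacency, especially near the poles. The helper functions below are our own hypothetical illustration (not the paper's code); they map an equirectangular pixel to a unit direction on the sphere and measure angular distance between two pixels.

```python
import math

def pixel_to_sphere(u, v, width, height):
    """Map an equirectangular pixel (u, v) to a unit direction on the sphere.

    Longitude spans [-pi, pi) across the image width; latitude spans
    [pi/2, -pi/2] from the top row to the bottom row.
    """
    lon = (u / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - (v / height) * math.pi
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )

def angular_distance(p, q):
    """Great-circle angle (radians) between two unit directions."""
    dot = sum(a * b for a, b in zip(p, q))
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot)))

# Two pixel pairs, each 256 columns apart in a 1024x512 panorama:
# near the top row (close to the pole) vs. on the middle row (equator).
d_pole = angular_distance(pixel_to_sphere(0, 1, 1024, 512),
                          pixel_to_sphere(256, 1, 1024, 512))
d_equator = angular_distance(pixel_to_sphere(0, 256, 1024, 512),
                             pixel_to_sphere(256, 256, 1024, 512))
# d_pole is tiny (~0.0087 rad) while d_equator is pi/2: the same 2D
# pixel offset corresponds to vastly different spherical adjacency.
```

This gap between 2D pixel neighborhoods and spherical neighborhoods is precisely what attention computed in flat 2D space cannot account for, and what reprojecting features onto the sphere restores.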