Native Spatio-Temporal 4D Variational Autoencoder
Abstract
Dynamic 3D content representation is crucial for generating moving 3D objects and scenes. Existing 4D variational autoencoders (VAEs) mostly operate on projected 2D pointmaps, which are incomplete, view-dependent observations that do not model the native 4D positional relations between points. This often leads to projection-induced distortions and irreversible token dislocation. In this paper, we introduce a novel 4D VAE that operates directly in native 4D space, namely a dynamic colored voxel space, without any 2D projection. This design preserves explicit spatio-temporal coordinates throughout the learned encoder and decoder, enabling both partial and complete 4D content encoding. To support flexible temporal compression ratios, we also design a novel spatio-temporal window attention module that performs attention within local 4D windows. Additionally, we propose a differentiable voxel rendering loss based on sparse voxel rasterization to improve geometry and color reconstruction quality. On 4D reconstruction tasks, our approach improves reconstruction fidelity over pointmap-based and flow-based VAEs while learning a more structurally consistent latent space. We further demonstrate the generative potential of our method by training a video-conditioned 4D diffusion model.
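
To make the "attention within local 4D windows" idea concrete, the following is a minimal, illustrative PyTorch sketch of self-attention restricted to non-overlapping (t, x, y, z) windows over a dense voxel feature grid. The class name, window size, head count, and dense-grid layout are assumptions made for exposition, not the paper's actual module or hyperparameters.

import torch
import torch.nn as nn


class SpatioTemporalWindowAttention(nn.Module):
    # Illustrative sketch: self-attention computed independently inside each
    # local 4D (t, x, y, z) window of a dense voxel feature grid. Window size
    # and head count are assumptions, not values from the paper.
    def __init__(self, dim: int, num_heads: int = 4, window=(2, 4, 4, 4)):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, X, Y, Z, C); each grid dim must be divisible by its window dim.
        B, T, X, Y, Z, C = x.shape
        wt, wx, wy, wz = self.window
        # Partition the grid into non-overlapping 4D windows.
        x = x.view(B, T // wt, wt, X // wx, wx, Y // wy, wy, Z // wz, wz, C)
        x = x.permute(0, 1, 3, 5, 7, 2, 4, 6, 8, 9)
        windows = x.reshape(-1, wt * wx * wy * wz, C)  # one row per window
        # Attend only among tokens that share a window.
        out, _ = self.attn(windows, windows, windows)
        # Reverse the partition back to the dense grid.
        out = out.view(B, T // wt, X // wx, Y // wy, Z // wz, wt, wx, wy, wz, C)
        out = out.permute(0, 1, 5, 2, 6, 3, 7, 4, 8, 9)
        return out.reshape(B, T, X, Y, Z, C)


# Usage on a tiny dynamic voxel feature grid:
feats = torch.randn(1, 4, 8, 8, 8, 32)
layer = SpatioTemporalWindowAttention(dim=32)
print(layer(feats).shape)  # torch.Size([1, 4, 8, 8, 8, 32])

Restricting attention to local windows keeps the cost linear in the number of voxels rather than quadratic, which is what makes flexible temporal compression ratios tractable on 4D grids.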