Boosting Video Diffusion Models via Masked Autoencoders as Tokenizers
Abstract
Latent diffusion models have become the dominant paradigm for video generation, making the video tokenizer a critical component of the pipeline. While most existing tokenizers are trained primarily for reconstruction, diffusion models are optimized to denoise heavily corrupted latents, which creates a mismatch between tokenizer training objectives and downstream generative learning. As a result, reconstruction metrics (e.g., rFVD) can be a poor proxy for generation quality (gFVD), and overly prioritizing reconstruction may even hinder diffusion training. We propose VideoMAETok, a simple family of ViT-based video tokenizers trained explicitly as corruption-inversion models for latent video diffusion. VideoMAETok builds on masked autoencoders: we (i) apply high-ratio token masking and encode only the visible spatiotemporal tokens for efficiency, and (ii) corrupt latent tokens with interpolative Gaussian noise to better match the denoising nature of diffusion generators. Training under such corruption encourages latents that remain informative and well-conditioned for downstream denoising. Extensive experiments show that VideoMAETok consistently improves generation quality when paired with off-the-shelf diffusion models (SiT and LightningDiT), achieving state-of-the-art gFVD on Kinetics-600 and UCF-101 while remaining compute-efficient.
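To make the two corruption mechanisms named above concrete, here is a minimal PyTorch sketch of (i) MAE-style high-ratio token masking that keeps only the visible tokens, and (ii) interpolative Gaussian corruption of latent tokens. The function names, tensor shapes, mask ratio, and the uniform interpolation schedule are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import torch

def mask_visible_tokens(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of tokens, as in masked autoencoders.

    tokens: (B, N, D) spatiotemporal patch tokens.
    Returns the visible tokens (B, N_keep, D) and the kept indices,
    so only the visible subset is passed through the encoder.
    """
    B, N, D = tokens.shape
    n_keep = int(N * (1.0 - mask_ratio))
    # Per-sample random permutation of token indices; keep the first n_keep.
    noise = torch.rand(B, N, device=tokens.device)
    keep_idx = noise.argsort(dim=1)[:, :n_keep]
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep_idx

def corrupt_latents(z: torch.Tensor) -> torch.Tensor:
    """Interpolative Gaussian corruption of latent tokens.

    Draws a per-sample level t in [0, 1] and returns (1 - t) * z + t * eps,
    mimicking the corrupted inputs a diffusion generator must denoise.
    The uniform sampling of t is an assumed schedule for illustration.
    """
    B = z.shape[0]
    t = torch.rand(B, *([1] * (z.dim() - 1)), device=z.device)
    eps = torch.randn_like(z)
    return (1.0 - t) * z + t * eps
```

In this sketch the tokenizer's decoder would be trained to reconstruct the clean video from the masked, noise-corrupted latents, so the learned latents stay informative even under the heavy corruption a downstream denoiser sees.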