WEVSR: Adapting Video Diffusion Generators to Real-World Video Super-Resolution with a Wavelet-Enhanced VAE Encoder
Abstract
Recent advances in video diffusion models have demonstrated remarkable generative capability, yet adapting these large pretrained text-to-video (T2V) models to video super-resolution (VSR) remains challenging: complex real-world degradations introduce artifacts, and the strong generative capacity of T2V models can compromise fidelity. We present WEVSR, a novel approach that adapts a pretrained flow-matching video diffusion transformer to real-world VSR (RealVSR). First, we design a task-oriented adaptation strategy that leverages timestep sampling and noise augmentation to enhance detail restoration while preserving structural stability. Second, we propose a lightweight multi-level discrete wavelet transform (DWT) front-end for the VAE encoder, which injects explicit frequency priors into the latent space without modifying the pretrained decoder. Extensive experiments on multiple RealVSR benchmarks show that WEVSR outperforms existing approaches, achieving state-of-the-art performance.
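The abstract does not specify the exact wavelet or decomposition depth used by the DWT front-end. As a rough, NumPy-only illustration of the general idea (a hypothetical multi-level 2D Haar decomposition, not the paper's implementation), the low-frequency band is recursively transformed while the high-frequency subbands at each level are collected as explicit frequency features:

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2D Haar DWT on a (H, W) array with even H and W.
    Returns (LL, LH, HL, HH) subbands, each at half resolution."""
    a = (x[0::2, :] + x[1::2, :]) / 2.0  # row-wise average
    d = (x[0::2, :] - x[1::2, :]) / 2.0  # row-wise difference
    LL = (a[:, 0::2] + a[:, 1::2]) / 2.0  # low-low: coarse approximation
    LH = (a[:, 0::2] - a[:, 1::2]) / 2.0  # horizontal detail
    HL = (d[:, 0::2] + d[:, 1::2]) / 2.0  # vertical detail
    HH = (d[:, 0::2] - d[:, 1::2]) / 2.0  # diagonal detail
    return LL, LH, HL, HH

def multilevel_dwt(x, levels=2):
    """Multi-level decomposition: keep transforming the LL band,
    collecting the high-frequency subbands at each level."""
    bands = []
    cur = x
    for _ in range(levels):
        LL, LH, HL, HH = haar_dwt2(cur)
        bands.append((LH, HL, HH))
        cur = LL
    return cur, bands

frame = np.random.rand(64, 64).astype(np.float32)
ll, bands = multilevel_dwt(frame, levels=2)
print(ll.shape)             # coarse band after 2 levels: (16, 16)
print(bands[0][0].shape)    # level-1 detail subbands: (32, 32)
```

In a front-end like the one described, such subbands would be concatenated as extra input channels (or fused by a small learned projection) before the frozen VAE encoder, so the latent space receives explicit frequency information without any change to the pretrained decoder.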