LiteVSR: Enabling Cross-Domain Fine-Grained Detail Generation in Light-Weight Transformers for Video Super-Resolution
Abstract
Large-scale pre-trained video generators offer powerful priors for Video Super-Resolution (VSR), yet adapting them remains computationally prohibitive. Full fine-tuning demands extensive resources, and ControlNet-style adapters lose their efficiency advantage under modern Diffusion Transformers (DiTs), since the absence of an encoder-decoder hierarchy forces duplication of the entire backbone. We observe that low-quality videos, despite degradation, retain reliable structural information such as layout and motion, and that such structural content is largely domain-agnostic. This suggests that a frozen generator can perform VSR when the input structure is properly aligned to its embedding space. Building on this insight, we propose LiteVSR, a minimalist framework that performs VSR using a completely frozen DiT with a lightweight State-Aware Adapter. The adapter employs a dual-stream architecture that jointly processes static structural cues from the low-quality input and dynamic cues from intermediate denoising states through time-dependent cross-attention, enabling an adaptive transition from structural alignment to texture refinement as denoising proceeds. LiteVSR achieves restoration quality comparable to fully fine-tuned baselines with only 12.68\% trainable parameters and 12 GPU-hours of training on a single A100.
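The dual-stream adapter described above can be sketched in PyTorch. This is a hypothetical minimal illustration, not the paper's implementation: the module names (`StateAwareAdapter`, `time_gate`), dimensions, and the sigmoid-gated fusion of the two streams are all assumptions made for clarity. It shows the core idea of the abstract: a static stream carrying low-quality (LQ) structural cues, a dynamic stream carrying the current denoising state, cross-attention between them, and a timestep-conditioned gate that shifts the blend as denoising proceeds.

```python
import torch
import torch.nn as nn


class StateAwareAdapter(nn.Module):
    """Hypothetical sketch of a dual-stream, time-dependent adapter.

    Static stream:  tokens from the low-quality input (layout/motion cues).
    Dynamic stream: tokens from the intermediate denoising state.
    A learned gate conditioned on the diffusion timestep controls how strongly
    the output leans on LQ structure vs. the evolving denoised content.
    """

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.static_proj = nn.Linear(dim, dim)   # projects LQ structural tokens
        self.dynamic_proj = nn.Linear(dim, dim)  # projects denoising-state tokens
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Maps a scalar timestep t in [0, 1] to a mixing weight g in (0, 1).
        self.time_gate = nn.Sequential(
            nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, 1), nn.Sigmoid()
        )

    def forward(self, lq_tokens, state_tokens, t):
        # lq_tokens, state_tokens: (B, L, dim); t: (B,) normalized timestep.
        s = self.static_proj(lq_tokens)
        d = self.dynamic_proj(state_tokens)
        # The denoising state queries the LQ structure (cross-attention).
        attended, _ = self.cross_attn(query=d, key=s, value=s)
        # Time-dependent gate, broadcast over tokens and channels: (B, 1, 1).
        g = self.time_gate(t.unsqueeze(-1).float()).unsqueeze(1)
        # Blend structure-aligned features with the dynamic stream, residually.
        return state_tokens + g * attended + (1.0 - g) * d


# Usage: a toy batch of 2 clips, 16 tokens each, 64-dim features.
adapter = StateAwareAdapter(dim=64, heads=4)
out = adapter(torch.randn(2, 16, 64), torch.randn(2, 16, 64), torch.rand(2))
print(out.shape)  # torch.Size([2, 16, 64])
```

Because the gate is learned rather than hand-scheduled, the adapter can discover its own transition from structure-dominated fusion at noisy timesteps to texture refinement near the end of denoising; since the backbone stays frozen, only these small projection, attention, and gate layers are trained.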