Video-SVD: Efficient Video Diffusion via Orthogonal Basis Composition
Zhang Wan ⋅ Yu Li ⋅ Tianze Huang ⋅ Haochen Li ⋅ Juan Cao ⋅ Sheng Tang
Abstract
Video Diffusion Transformers (VDiTs) represent the state of the art in video generation but are fundamentally constrained by the quadratic computational complexity of self-attention. To accelerate this critical computation, we analyze the pre-softmax matrix ($QK^T$) and reveal two key insights: (1) dense attention patterns inherently reside on a global low-rank manifold characterized by rapid singular value decay; and (2) real motion manifests as hybrid spatio-temporal patterns rather than rigid "spatial vs. temporal" classifications. Guided by these insights, we propose Video-SVD, a plug-and-play acceleration method that requires no alteration to the original network parameters: it extracts universal bases via offline SVD and employs a dynamic subspace projection strategy at inference, thereby bypassing the expensive full $QK^T$ computation entirely. To ensure high fidelity, we deploy layer-shared dual-stream MLPs that synthesize fine-grained textural details and recover high-frequency RoPE information. Video-SVD achieves significant end-to-end speedup while maintaining high visual quality, reaching 1.92$\times$ on HunyuanVideo, 1.75$\times$ on Wan2.1-1.3B, and 1.79$\times$ on Wan2.1-14B.
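To make insight (1) concrete, the following is a minimal PyTorch sketch, not the authors' released implementation, that measures how much of the pre-softmax score matrix's energy a small orthogonal basis captures. The token count, head dimension, retained rank, and the synthetic correlated features are illustrative assumptions; in Video-SVD the bases are extracted offline and reused across inputs rather than recomputed per matrix.

```python
# Minimal sketch (assumed setup, not the paper's code): check low-rank structure of Q K^T
# by projecting the scores onto the top-r right singular vectors and measuring the error.
import torch

torch.manual_seed(0)

n, d, r = 1024, 64, 32  # tokens, head dim, retained rank (hypothetical values)

# Toy stand-ins for Q/K. Real VDiT features are strongly correlated across tokens; we mimic
# that by mixing a few latent token patterns (an assumption, not data from the paper).
latent = torch.randn(n, 8)
Q = latent @ torch.randn(8, d) + 0.1 * torch.randn(n, d)
K = latent @ torch.randn(8, d) + 0.1 * torch.randn(n, d)

# Full pre-softmax score matrix: the O(n^2) computation the method aims to bypass.
S = Q @ K.T / d ** 0.5

# SVD of the scores; Video-SVD extracts universal bases offline, here we compute them
# on this single matrix purely for illustration.
U, sigma, Vh = torch.linalg.svd(S, full_matrices=False)

# Best rank-r approximation: project the scores onto the top-r right singular vectors.
V_r = Vh[:r].T                 # (n, r) orthonormal basis over key tokens
S_lowrank = (S @ V_r) @ V_r.T

rel_err = torch.linalg.norm(S - S_lowrank) / torch.linalg.norm(S)
energy = (sigma[:r] ** 2).sum() / (sigma ** 2).sum()
print(f"rank {r}/{min(n, d)}: relative Frobenius error {rel_err:.3f}, "
      f"spectral energy captured {energy:.3f}")
```

With correlated features the retained energy concentrates in the leading components, which is the property that lets a fixed basis replace the full $QK^T$ computation; the dynamic per-inference projection and the dual-stream refinement MLPs described in the abstract are not reproduced here.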