SpeedVFI: One-step Diffusion for Efficient Video Frame Interpolation
Abstract
Generative video diffusion models have shown strong robustness to large motion and occlusions for video frame interpolation (VFI). However, their inference efficiency lags significantly behind that of conventional learning-based methods, due to the structural redundancy of pairwise inference and the procedural latency of multi-step iterative denoising. To address these limitations, we propose SpeedVFI, a one-step diffusion framework that tackles both bottlenecks: it interpolates the entire video sequence in a single forward pass, eliminating pairwise overhead, and distills the generation trajectory into a one-step denoising process, bypassing iterative latency. To support this high-efficiency architecture, we introduce temporal RoPE alignment to ensure temporal consistency across the unified sequence, and noise-centric partial attention to reduce computational overhead while preserving global context. Extensive experiments demonstrate that SpeedVFI accelerates diffusion-based VFI by orders of magnitude while maintaining competitive quantitative and visual quality.
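As an illustrative sketch only (not the paper's implementation), the intuition behind temporal RoPE alignment can be captured by applying rotary position embeddings along the time axis with fractional frame indices, so that keyframes and to-be-interpolated frames share one consistent temporal coordinate system. The function name `temporal_rope` and the half-split rotation layout are assumptions for this example:

```python
import numpy as np

def temporal_rope(x, times, base=10000.0):
    """Rotate feature pairs by angles proportional to (possibly fractional)
    frame times, giving a shared temporal coordinate system.

    x     : (T, D) per-frame features, D even
    times : length-T sequence of frame indices (e.g. [0.0, 0.5, 1.0])
    """
    T, D = x.shape
    half = D // 2
    # Geometric frequency schedule, as in standard RoPE.
    freqs = base ** (-np.arange(half) / half)           # (half,)
    angles = np.asarray(times, float)[:, None] * freqs  # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2-D rotation of each (x1_i, x2_i) pair by its time-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=1)
```

A key property this preserves is that attention scores between two rotated frames depend only on their time difference, so an interpolated frame placed at t = 0.5 relates to its neighbors the same way wherever the window sits in the unified sequence.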