Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Video Generation
Abstract
To achieve real-time video generation, current approaches distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models. This process introduces an architectural gap, as it converts full attention into causal attention. In this paper, we demonstrate that existing methods fail to bridge this gap theoretically, leading to suboptimal performance. Specifically, these methods employ ODE distillation to initialize the AR student, a procedure whose key requirement is injectivity. We identify that for an AR student, injectivity must hold at the frame level: each noisy frame must map to a unique clean frame under the PF-ODE of the AR teacher. We theoretically prove that existing methods, which distill an AR student from a bidirectional teacher, violate this frame-level injectivity. Consequently, the student fails to recover the teacher's flow map and instead learns a conditional expectation, resulting in subpar performance. To address this issue, we propose Causal Forcing, which employs an AR teacher for ODE initialization, thereby effectively bridging the architectural gap. Empirical results show that our method outperforms all baselines across all metrics, surpassing the SOTA Self-Forcing by 19.3\% in Dynamic Degree, 8.7\% in VisionReward, and 16.7\% in Instruction Following.
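The frame-level injectivity requirement can be sketched as follows (the notation here is illustrative, not taken from the paper): let $\Psi_i$ denote the PF-ODE flow map of the AR teacher for frame $i$, conditioned on the clean context frames $x_{<i}^{(0)}$. Frame-level injectivity demands

\[
\Psi_i\!\left(x_i^{(1)} \,\middle|\, x_{<i}^{(0)}\right) = \Psi_i\!\left(\tilde{x}_i^{(1)} \,\middle|\, x_{<i}^{(0)}\right) \;\implies\; x_i^{(1)} = \tilde{x}_i^{(1)},
\]

i.e., distinct noisy frames must flow to distinct clean frames under the same context. When this fails, as argued above for a bidirectional teacher, a regression-based student cannot invert the map and instead converges to the conditional expectation $\mathbb{E}\!\left[x_i^{(0)} \,\middle|\, x_i^{(1)}, x_{<i}^{(0)}\right]$, which blurs across the multiple clean frames sharing a noisy preimage.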