Improving Video Sparse Attention with Fine-grained Router and Sparse Rebasing
Abstract
We present VSA2, a frontier trainable sparse attention mechanism for video DiTs. VSA2 includes a range of new architectural features and training procedures that we apply across all stages of the DiT development cycle -- including pretraining, RL, and inference -- to produce a 5.8B DiT with quality comparable to or better than a full-attention counterpart. Architecturally, VSA2 introduces a fine-grained router that identifies critical tokens more precisely and supports dynamic computation by allowing each query to attend to a variable number of key–value pairs. For training, we identify a Hard-to-Easy Curriculum: models trained under high sparsity and evaluated with lower sparsity at inference not only generalize effectively but also outperform models trained with full attention in motion quality. VSA2 is also flexible: it can replace full attention midway through progressive low-to-high-resolution pretraining, rebasing early-stage full-attention checkpoints onto sparse attention. Experiments show that VSA2 halves attention computation relative to VSA while achieving lower loss. On 720P videos, it accelerates attention by 8.9x and end-to-end generation by 4.62x over the FlashAttention3 baseline, while achieving comparable or better video quality.
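To make the routing idea concrete, below is a minimal PyTorch sketch of a fine-grained router in which each query keeps a variable number of key–value blocks via a cumulative-mass (top-p) rule. The pooling scheme, the names `block_size` and `keep_mass`, and the selection rule are illustrative assumptions, not the paper's exact formulation; the dense masked attention at the end is a reference implementation of the semantics, whereas the actual speedup would come from a block-sparse kernel that skips unselected blocks.

```python
# Hypothetical sketch of fine-grained routing for block-sparse attention.
# Assumptions (not from the paper): mean-pooled key blocks for routing,
# and a top-p rule so each query attends to a *variable* number of blocks.
import torch

def fine_grained_routed_attention(q, k, v, block_size=64, keep_mass=0.9):
    """q, k, v: (batch, heads, seq, dim); seq must be divisible by block_size."""
    B, H, S, D = q.shape
    nb = S // block_size
    # Mean-pool keys within each block to get coarse routing keys.
    k_blocks = k.reshape(B, H, nb, block_size, D).mean(dim=3)        # (B,H,nb,D)
    # Per-query routing scores over key blocks.
    scores = torch.einsum("bhsd,bhnd->bhsn", q, k_blocks) / D**0.5   # (B,H,S,nb)
    probs = scores.softmax(dim=-1)
    # Keep the smallest set of blocks whose probability mass >= keep_mass;
    # different queries therefore keep different numbers of KV blocks.
    sorted_p, order = probs.sort(dim=-1, descending=True)
    cum = sorted_p.cumsum(dim=-1)
    keep_sorted = cum - sorted_p < keep_mass        # always keeps the top block
    keep = torch.zeros_like(probs).scatter(-1, order, keep_sorted.float()).bool()
    # Expand the block-level decision to a token-level mask and run
    # masked attention (dense here purely for clarity; a real kernel
    # would never materialize or compute the masked-out blocks).
    mask = keep.repeat_interleave(block_size, dim=-1)                # (B,H,S,S)
    attn = torch.einsum("bhsd,bhtd->bhst", q, k) / D**0.5
    attn = attn.masked_fill(~mask, float("-inf")).softmax(dim=-1)
    return torch.einsum("bhst,bhtd->bhsd", attn, v)

# Example: 2 heads, 256 tokens, 32-dim heads.
q = torch.randn(1, 2, 256, 32)
out = fine_grained_routed_attention(q, torch.randn_like(q), torch.randn_like(q))
print(out.shape)  # torch.Size([1, 2, 256, 32])
```

Under this reading, raising `keep_mass` at inference time corresponds to the Hard-to-Easy Curriculum described above: the model is trained under an aggressive (high-sparsity) budget and then evaluated with a more permissive one.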