EchoAttention: Exploiting Token-Pair Redundancy and Frame-Block Similarity for Efficient Long Video Generation
Yifei Xia ⋅ Fangcheng Fu ⋅ Hao Yuan ⋅ Suhan Ling ⋅ Xupeng Miao ⋅ Huixia Li ⋅ Yuxi Ren ⋅ Xin Xia ⋅ Xuefeng Xiao ⋅ Bin Cui
Abstract
Diffusion Transformers (DiTs) are increasingly adopted for long video generation, yet inference is dominated by the quadratic cost of 3D full attention. Sparse attention mitigates this bottleneck by exploiting *token-pair redundancy* and pruning query-key interactions. Nevertheless, its effectiveness in long video generation is often limited by non-sparse attention heads, making it hard to strike a good balance between inference speed and generation quality. To address this, we identify another pervasive but overlooked redundancy specific to video DiTs: *frame-block similarity*, where frame-blocks in the attention weights exhibit highly similar distributions and can be well approximated by lightweight linear calibration. Motivated by this observation, we propose **EchoAttention**, which jointly leverages *token-pair redundancy* (via a *Sparse* operator) and *frame-block similarity* (via an *Echo* operator), together with a fine-grained routing policy learned via three-stage distillation. This design handles both sparse and non-sparse heads efficiently, overcoming the inherent ceiling of purely sparse attention and yielding a better speed-quality trade-off. Across public video DiTs, EchoAttention consistently improves the speed-quality frontier over SOTA sparse-attention baselines, reducing end-to-end latency by up to 2.42$\times$ with minimal quality loss.
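To make the *frame-block similarity* idea concrete, the following is a minimal sketch of how one frame-block's attention weights might be "echoed" from a similar reference block via a lightweight linear calibration. This is not the authors' implementation: the function name `echo_approximate`, the scalar scale-and-bias form of the calibration, and the least-squares fit on a small set of sampled entries are all illustrative assumptions.

```python
import torch

def echo_approximate(ref_weights: torch.Tensor,
                     ref_sample: torch.Tensor,
                     true_sample: torch.Tensor) -> torch.Tensor:
    """Approximate a target frame-block's attention weights from a
    similar reference frame-block via a linear calibration w ≈ a*ref + b.

    ref_weights : full attention weights of the reference frame-block
    ref_sample  : a small subset of entries from ref_weights
    true_sample : exactly computed target-block weights at those entries
    """
    # Fit the scale a and bias b by least squares on the sampled entries.
    x, y = ref_sample.flatten(), true_sample.flatten()
    x_mean, y_mean = x.mean(), y.mean()
    cov = ((x - x_mean) * (y - y_mean)).mean()
    var = ((x - x_mean) ** 2).mean().clamp_min(1e-8)
    a = cov / var
    b = y_mean - a * x_mean
    # "Echo" the reference block through the fitted calibration, avoiding
    # a full recomputation of the target block's attention weights.
    return a * ref_weights + b
```

In the full method as described above, calibration is paired with a learned routing policy: per head, attention is served either by the *Sparse* operator (pruned query-key interactions) or by the *Echo* operator, with the routing learned through three-stage distillation.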