Transform Trained Transformer for Accelerating Native 4K Video Generation
Jiangning Zhang ⋅ Junwei Zhu ⋅ Teng Hu ⋅ Yabiao Wang ⋅ Donghao Luo ⋅ Weijian Cao ⋅ Zhenye Gan ⋅ Xiaobin Hu ⋅ Zhucun Xue ⋅ Xiangtai Li ⋅ Chengjie Wang ⋅ Yong Liu
Abstract
Native 4K (2176$\times$3840) video generation remains a critical challenge: the quadratic cost of full attention explodes as spatiotemporal resolution increases, making it difficult for models to balance efficiency and quality. This paper proposes a novel Transformer retrofit strategy termed T3 (**T**ransform **T**rained **T**ransformer) that significantly reduces compute requirements by optimizing the forward logic of full-attention pretrained models without altering their core architecture. Specifically, T3-Video introduces a multi-scale weight-sharing window attention mechanism and, through hierarchical blocking combined with an axis-preserving full-attention design, transforms the “attention pattern” of a pretrained model using only modest compute and data. Results on 4K-VBench show that T3-Video substantially outperforms existing approaches: it improves quality (+4.29$\uparrow$ VQA and +0.08$\uparrow$ VTC) while accelerating native 4K video generation by more than 10$\times$. Demo and source code are available in \#Supp.
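To make the core idea concrete, below is a minimal PyTorch sketch of window attention with weights shared across multiple window scales, in the spirit of the mechanism the abstract describes. The class name, window sizes, and the averaging over scales are illustrative assumptions, not the paper's actual design; hierarchical blocking and the axis-preserving full-attention branch are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleSharedWindowAttention(nn.Module):
    """Sketch: one set of attention weights (QKV + output projection)
    reused across several spatiotemporal window sizes. Hypothetical
    module; scale mixing by simple averaging is an assumption."""

    def __init__(self, dim, num_heads=8, window_sizes=((4, 8, 8), (2, 16, 16))):
        super().__init__()
        self.num_heads = num_heads
        self.window_sizes = window_sizes
        # Shared weights: the same projections serve every scale.
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def _window_attend(self, x, win):
        # x: (B, T, H, W, C); win: (wt, wh, ww). Assumes T, H, W are
        # divisible by the window size along each axis.
        B, T, H, W, C = x.shape
        wt, wh, ww = win
        # Partition into non-overlapping spatiotemporal windows.
        x = x.view(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
        x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, C)
        N, L, _ = x.shape
        qkv = self.qkv(x).reshape(N, L, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(0)
        # Full attention, but only within each local window, so cost is
        # linear in the number of windows rather than quadratic in T*H*W.
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(N, L, C)
        out = self.proj(out)
        # Reverse the window partition back to (B, T, H, W, C).
        out = out.view(B, T // wt, H // wh, W // ww, wt, wh, ww, C)
        return out.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, C)

    def forward(self, x):
        # Run the shared attention at each scale and average the outputs.
        outs = [self._window_attend(x, win) for win in self.window_sizes]
        return torch.stack(outs).mean(0)

# Toy usage: 4-frame, 32x32 latent grid with 64 channels.
x = torch.randn(1, 4, 32, 32, 64)
attn = MultiScaleSharedWindowAttention(dim=64)
print(attn(x).shape)  # torch.Size([1, 4, 32, 32, 64])
```

Because the projections are shared across scales, such a retrofit can in principle reuse the pretrained full-attention weights directly, with only a short adaptation phase to account for the changed attention pattern.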