Linearizing Vision Transformer with Test-Time Training
Yining Li ⋅ Dongchen Han ⋅ Zeyu Liu ⋅ Hanyi Wang ⋅ Yulin Wang ⋅ Gao Huang
Abstract
While linear-complexity attention mechanisms offer a promising alternative to Softmax attention for overcoming the quadratic bottleneck, training such models from scratch remains prohibitively expensive. Inheriting weights from pretrained Transformers provides an appealing shortcut, yet the fundamental representational gap between Softmax and linear attention prevents effective weight transfer. In this work, we address this conversion challenge from two perspectives: architectural alignment and representational alignment. We discover that Test-Time Training (TTT) shares a similar structure with Softmax attention, enabling direct weight inheritance. To further align representational properties, namely shift-invariance with respect to keys and locality, we introduce instance normalization and a locality enhancement module to better approximate the pretrained feature space. We validate our approach by linearizing Stable Diffusion 3.5 and introduce SD3.5-T$^5$. With only 1 hour of fine-tuning on 4$\times$H20 GPUs, SD3.5-T$^5$ achieves a DPG-Bench score of 84.43 (vs.\ 83.83 for the original) while accelerating inference by 1.32$\times$ and 1.47$\times$ at 1K and 2K resolutions.
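To make the two claims above concrete, the sketch below contrasts Softmax attention (whose $N \times N$ score matrix causes the quadratic bottleneck) with a generic kernel-based linear attention, and checks that Softmax attention is invariant to a constant shift of the keys. This is a minimal NumPy illustration under assumed choices (a ReLU feature map `phi`, toy dimensions), not the paper's TTT formulation or the SD3.5-T$^5$ implementation.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard Softmax attention: materializes an N x N score matrix,
    # so cost grows quadratically with sequence length N.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Kernel-based linear attention: phi is an illustrative placeholder
    # feature map (NOT the paper's TTT formulation). Precomputing the
    # d x d summary K^T V makes the cost linear in N.
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                   # d x d, independent of sequence length
    Z = Qf @ Kf.sum(axis=0)         # per-query normalizer
    return (Qf @ KV) / Z[:, None]

N, d = 16, 8
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, N, d))

# Softmax attention is shift-invariant in the keys: adding a constant to K
# adds a per-query constant to every score, which cancels in the softmax.
shift_invariant = np.allclose(softmax_attention(Q, K + 3.0, V),
                              softmax_attention(Q, K, V))
print(shift_invariant)
```

The ReLU-based linear attention above does not share this shift-invariance, which is the kind of representational mismatch the abstract's instance normalization is meant to address.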