Rethinking Depth Pruning for Vision Transformers: A Heterogeneity-Aware Perspective
Zhenfeng Su ⋅ Kang Zhao ⋅ Han Bao ⋅ Tao Yuan ⋅ Zhongzhe Hu ⋅ Xianzhi Yu ⋅ Wenxuan Wang
Abstract
While prior studies have successfully compressed Vision Transformers (ViTs) through various pruning techniques, most have concentrated on width pruning to achieve significant reductions in model size. Depth pruning, which removes entire layers from a ViT, usually yields higher speedups than width pruning, yet recovering accuracy after layer removal is notoriously difficult. Consequently, existing joint approaches that combine width and depth pruning have exhibited limited acceleration ratios due to the inefficiency of previous depth pruning methods. In this work, we reveal that the failure of existing depth pruning methods stems from their neglect of the heterogeneity across layers. Through a comprehensive analysis of this heterogeneity, we introduce HetDPT, a method that accounts for heterogeneity during depth pruning while avoiding dimension mismatch. Extensive experiments on ImageNet-1K, CIFAR-100, COCO, and ADE20K validate our method. HetDPT achieves a 1.58$\times$ speedup for DeiT-B while maintaining accuracy, and a 1.39$\times$ speedup for DeiT-S with nearly no accuracy degradation. Furthermore, when combined with width pruning (HetDPT+), our method sets a new state of the art in extreme ViT pruning: HetDPT+ raises the acceleration ratio of the Isomorphic-Pruning-2.6G configuration from 4.24$\times$ to 5.19$\times$ while maintaining near-lossless accuracy.
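To make concrete what depth pruning entails, the sketch below drops whole encoder blocks from a DeiT model via `timm`. This is an illustration only, not the HetDPT selection criterion: the `keep_idx` indices are hypothetical and are assumed to come from some layer-importance measure.

```python
# Minimal sketch of depth pruning a ViT: removing entire encoder blocks.
# Assumes torch and timm are installed; keep_idx is a hypothetical choice.
import torch
import torch.nn as nn
import timm

model = timm.create_model("deit_small_patch16_224", pretrained=False)
print(f"original depth: {len(model.blocks)} blocks")  # 12 for DeiT-S

# Hypothetical indices of blocks to retain (in practice, chosen by an
# importance score; HetDPT's actual criterion is described in the paper).
keep_idx = [0, 1, 2, 4, 5, 7, 8, 10, 11]

# Each transformer block maps (B, N, D) -> (B, N, D), so deleting whole
# blocks never changes tensor shapes. Width pruning, by contrast, shrinks
# D and must also patch every adjacent layer to avoid dimension mismatch.
model.blocks = nn.Sequential(*[model.blocks[i] for i in keep_idx])
print(f"pruned depth: {len(model.blocks)} blocks")

# The pruned model still runs end to end without any shape surgery.
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    out = model(x)
print(out.shape)  # (1, 1000) logits, unchanged by block removal
```

Because removed blocks leave the hidden dimension untouched, the remaining weights can be fine-tuned directly; the hard part, as the abstract notes, is recovering the accuracy lost with the discarded layers.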