

Poster in Workshop: ES-FoMo II: 2nd Workshop on Efficient Systems for Foundation Models

Training-Free Acceleration of ViTs with Delayed Spatial Merging

Jung Hwan Heo · Seyedarmin Azizi · Arash Fayyazi · Massoud Pedram


Abstract: Token merging has emerged as a new paradigm that can accelerate the inference of Vision Transformers (ViTs) without any retraining or fine-tuning. To push the frontier of training-free acceleration in ViTs, we improve token merging by adding the perspectives of 1) activation outliers and 2) hierarchical representations. Through a careful analysis of the attention behavior in ViTs, we characterize a delayed onset of the convergent attention phenomenon, which makes token merging undesirable in the bottom blocks of ViTs. Moreover, we augment token merging with a hierarchical processing scheme to capture multi-scale redundancy between visual tokens. Combining these two insights, we build a unified inference framework called DSM: Delayed Spatial Merging. We extensively evaluate DSM on various ViT model scales (Tiny to Huge) and tasks (ImageNet-1k and transfer learning), achieving up to 1.8$\times$ FLOP reduction and 1.6$\times$ throughput speedup at a negligible accuracy loss, while being two orders of magnitude faster than existing methods.
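The abstract does not include an implementation, so the sketch below only illustrates the general idea it describes: token merging is skipped in the bottom ViT blocks (the delayed onset of convergent attention) and the merging budget is then spread over the later blocks. The function names, the even per-block budget split, and the simple pairwise-averaging merge are all hypothetical simplifications, not the authors' DSM code or the ToMe bipartite matching algorithm.

```python
# Minimal sketch (not the authors' implementation) of a delayed, per-block
# token-merging schedule for a ViT. All names are hypothetical.

import torch


def delayed_merge_schedule(num_blocks: int, start_block: int, total_merged: int) -> list:
    """How many token pairs to merge at each block.

    Blocks before `start_block` merge nothing (delayed onset); the remaining
    budget is spread evenly over the later blocks.
    """
    schedule = [0] * num_blocks
    active = max(num_blocks - start_block, 1)
    for b in range(start_block, num_blocks):
        schedule[b] = total_merged // active
    return schedule


def merge_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """Toy merge step: average the r most similar token pairs.

    x: (batch, tokens, dim). The CLS token (index 0) is never merged.
    """
    if r <= 0:
        return x
    cls_tok, toks = x[:, :1], x[:, 1:]
    a, b = toks[:, ::2], toks[:, 1::2]                 # split tokens into two sets
    n = min(a.shape[1], b.shape[1])
    a, b = a[:, :n], b[:, :n]
    sim = torch.nn.functional.cosine_similarity(a, b, dim=-1)   # (batch, n)
    idx = sim.topk(min(r, n), dim=1).indices                    # most redundant pairs
    rows = torch.arange(a.shape[0]).unsqueeze(1)
    merged = a.clone()
    merged[rows, idx] = 0.5 * (a[rows, idx] + b[rows, idx])     # average each pair
    keep_mask = torch.ones(a.shape[0], n, dtype=torch.bool)
    keep_mask[rows, idx] = False                                 # drop merged partners
    kept_b = b[keep_mask].view(a.shape[0], n - idx.shape[1], -1)
    return torch.cat([cls_tok, merged, kept_b], dim=1)


if __name__ == "__main__":
    x = torch.randn(2, 197, 384)                  # e.g. ViT-S: 196 patch tokens + CLS
    for r in delayed_merge_schedule(num_blocks=12, start_block=4, total_merged=96):
        x = merge_tokens(x, r)                    # merging only kicks in from block 4
    print(x.shape)                                # noticeably fewer tokens than input
```

In a real pipeline the merge step would sit inside each transformer block (typically between attention and the MLP) rather than being applied to raw token tensors as above; the point here is only the schedule shape, zero merging early, progressive merging later.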
