

Poster

Mobile Attention: Mobile-Friendly Linear-Attention for Vision Transformers

Zhiyu Yao · Jian Wang · Haixu Wu · Jingdong Wang · Mingsheng Long


Abstract: Vision Transformers (ViTs) excel in computer vision tasks due to their ability to capture global context among tokens. However, their quadratic complexity $O(N^2 D)$, with $N$ and $D$ being the number of tokens and features respectively, limits their efficiency on mobile devices and motivates more mobile-friendly ViT designs with reduced latency. Multi-head kernel-based linear attention is a promising alternative with linear complexity $O(N D d)$, where $d$ is the per-head dimension. Yet a large $d$ still incurs substantial computational cost. Reducing $d$ lowers the complexity and improves mobile-friendliness, but it can produce too many small heads that are weak at learning valuable subspaces, ultimately impeding the expressiveness of linear attention. To resolve this dilemma, we propose Mobile-Attention, a novel attention mechanism with head competition: it prevents trivial heads from overemphasizing unimportant subspaces while preserving essential ones, thereby maintaining model capability, and it achieves linear-time complexity on mobile devices by using a small per-head dimension $d$. By replacing the attention mechanism of ViTs with Mobile-Attention, the resulting Mobile-Attention ViTs deliver improved mobile efficiency and competitive performance across a range of computer vision tasks, achieving notable latency reductions on an iPhone 12.
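To make the complexity claim concrete, below is a minimal PyTorch sketch of generic multi-head kernel-based linear attention, showing where the $O(NDd)$ cost comes from: each head builds a $d \times d$ key-value summary once instead of an $N \times N$ attention matrix. The elu+1 feature map is a common choice from the linear-attention literature and is an assumption here; this sketch does not reproduce the paper's Mobile-Attention or its head-competition mechanism.

```python
# Minimal sketch of multi-head kernel-based linear attention.
# Per head: the key-value summary costs O(N * d^2) to build and O(N * d^2)
# to query; summed over h = D / d heads this gives O(N * D * d) overall.
# The feature map phi(x) = elu(x) + 1 is an assumption (a standard choice),
# NOT the authors' Mobile-Attention design.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: (batch, heads, N, d). Returns (batch, heads, N, d)."""
    phi_q = F.elu(q) + 1.0  # positive feature map, keeps normalizer > 0
    phi_k = F.elu(k) + 1.0
    # Key-value summary per head: (batch, heads, d, d); no N x N matrix.
    kv = torch.einsum('bhnd,bhne->bhde', phi_k, v)
    # Normalizer: key features summed over tokens, (batch, heads, d).
    z = phi_k.sum(dim=2)
    out = torch.einsum('bhnd,bhde->bhne', phi_q, kv)
    denom = torch.einsum('bhnd,bhd->bhn', phi_q, z).unsqueeze(-1) + eps
    return out / denom

# Usage: N = 196 tokens, D = 192 features split into 6 heads of d = 32.
x = torch.randn(2, 6, 196, 32)
y = linear_attention(x, x, x)
print(y.shape)  # torch.Size([2, 6, 196, 32])
```

The dilemma the abstract describes is visible in the shapes above: shrinking $d$ shrinks the per-head $d \times d$ summary (cheaper on mobile) but multiplies the number of small heads, which is what the paper's head competition is designed to counteract.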
