Workshop: ES-FoMo: Efficient Systems for Foundation Models

ViT Graph Head Attention for Small Sized Datasets

HyeongJin Kim · GyungHyun Lee · Byoung Chul Ko


In this paper, we propose a new type of vision transformer (ViT) based on graph head attention (GHA). The GHA builds a graph structure from the attention map generated over the input patches. Because the attention map captures the degree of concentration between image patches, it can be regarded as a set of inter-patch relationships and thus converted into a graph. To match the performance of multi-head attention (MHA) with fewer heads, we apply a graph attention network within the GHA, which ensures attention diversity and emphasizes the correlations between graph nodes. The proposed GHA preserves both the locality and the globality of the input patches while guaranteeing diverse attention. The resulting GHA-ViT consistently outperforms pure ViT-based models on small-sized datasets and on the medium-sized ImageNet-1K dataset when trained from scratch. GHA-B, a base model with approximately 29M parameters, achieves a top-1 accuracy of 81.7% on ImageNet-1K.
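The abstract describes converting a patch-level attention map into a graph and then applying graph attention over it. A minimal NumPy sketch of that idea is shown below; the specific choices (top-k sparsification of the attention map, a single simplified GAT-style layer with a LeakyReLU edge scorer) are illustrative assumptions, not the paper's actual GHA implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_map(patches):
    # patches: (N, d) patch embeddings; a single-head self-attention map
    # (assumed form -- the paper's heads may differ)
    d = patches.shape[1]
    return softmax(patches @ patches.T / np.sqrt(d))

def graph_from_attention(attn, top_k=4):
    # Treat each patch as a node and keep only its top_k strongest
    # attention edges, yielding a sparse adjacency matrix.
    adj = np.zeros_like(attn)
    nbrs = np.argsort(-attn, axis=1)[:, :top_k]
    for i, js in enumerate(nbrs):
        adj[i, js] = 1.0
    return adj

def graph_attention_layer(x, adj, W, a, slope=0.2):
    # Simplified GAT-style layer: edge scores are computed only where
    # adj permits, then normalized per node and used to mix features.
    h = x @ W                                   # (N, d_out)
    n = h.shape[0]
    scores = np.full((n, n), -np.inf)
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                s = a @ np.concatenate([h[i], h[j]])
                scores[i, j] = s if s > 0 else slope * s   # LeakyReLU
    alpha = softmax(scores, axis=1)             # attention over graph edges
    return alpha @ h
```

A quick usage pass: compute the attention map for a set of patch embeddings, sparsify it into a graph, and run one graph-attention step, which preserves the (N, d_out) shape of the patch features.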
