Skip to yearly menu bar Skip to main content


Understanding The Robustness in Vision Transformers

Zhou Daquan · Zhiding Yu · Enze Xie · Chaowei Xiao · Animashree Anandkumar · Jiashi Feng · Jose M. Alvarez

Hall E #210

Keywords: [ DL: Attention Mechanisms ] [ DL: Robustness ] [ APP: Computer Vision ]


Recent studies show that Vision Transformers (ViTs) exhibit strong robustness against various corruptions. Although this property is partly attributed to the self-attention mechanism, there is still a lack of an explanatory framework towards a more systematic understanding. In this paper, we examine the role of self-attention in learning robust representations. Our study is motivated by the intriguing properties of self-attention in visual grouping which indicate that self-attention could promote improved mid-level representation and robustness. We thus propose a family of fully attentional networks (FANs) that incorporate self-attention in both token mixing and channel processing. We validate the design comprehensively on various hierarchical backbones. Our model with a DeiT architecture achieves a state-of-the-art 47.6% mCE on ImageNet-C with 29M parameters. We also demonstrate significantly improved robustness in two downstream tasks: semantic segmentation and object detection

Chat is not available.