Transformer-based models are widely used in natural language processing (NLP). Their core component, self-attention, has attracted widespread interest. A direct way to understand the self-attention mechanism is to visualize the attention map of a pre-trained model. Based on the patterns observed, a series of efficient Transformers with different sparse attention masks have been proposed. From a theoretical perspective, the universal approximability of Transformer-based models has also recently been proved. However, this understanding and analysis of self-attention is based on a pre-trained model. To rethink the importance analysis in self-attention, we study the significance of different positions in the attention matrix during pre-training. A surprising result is that the diagonal elements of the attention map are the least important compared with other attention positions. We provide a proof showing that these diagonal elements can indeed be removed without deteriorating model performance. Furthermore, we propose a Differentiable Attention Mask (DAM) algorithm, which further guides the design of SparseBERT. Extensive experiments verify our findings and illustrate the effect of the proposed algorithm.
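To make the idea of removing diagonal attention elements concrete, here is a minimal NumPy sketch of single-head self-attention in which each token is prevented from attending to itself. It is an illustration only, not the authors' implementation: the function name `self_attention`, the `mask_diagonal` flag, the random weights, and the toy sizes are all assumptions, and the DAM algorithm itself is not shown.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v, mask_diagonal=False):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    x: (n_tokens, d_model); w_q, w_k, w_v: (d_model, d_head).
    mask_diagonal is a hypothetical flag: when True, position i cannot
    attend to itself, so the diagonal of the attention map is zero.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    if mask_diagonal:
        # -inf scores become exactly 0 after the softmax below.
        np.fill_diagonal(scores, -np.inf)
    # Numerically stable row-wise softmax (needs n_tokens >= 2 when masking).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy usage with random inputs (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = self_attention(x, w_q, w_k, w_v, mask_diagonal=True)
```

With the mask applied, each row of `attn` still sums to one, but the weight is redistributed over the other positions.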
Author Information
Han Shi (The Hong Kong University of Science and Technology)
Jiahui Gao (The University of Hong Kong)
Xiaozhe Ren (Huawei)
Hang Xu (Huawei Noah's Ark Lab)
Xiaodan Liang (Sun Yat-sen University)
Zhenguo Li (Huawei Tech. Investment, Co., Ltd)
James Kwok (Hong Kong University of Science and Technology)
Related Events (a corresponding poster, oral, or spotlight)
2021 Poster: SparseBERT: Rethinking the Importance Analysis in Self-attention
Thu. Jul 22nd 04:00 -- 06:00 PM Room Virtual
More from the Same Authors
2023 Poster: Effective Structured Prompting by Meta-Learning and Representative Verbalizer
Weisen Jiang · Yu Zhang · James Kwok
2023 Poster: Non-autoregressive Conditional Diffusion Models for Time Series Prediction
Lifeng Shen · James Kwok
2023 Poster: Explore and Exploit the Diverse Knowledge in Model Zoo for Domain Generalization
Yimeng Chen · Tianyang Hu · Fengwei Zhou · Zhenguo Li · Zhiming Ma
2023 Poster: Nonparametric Iterative Machine Teaching
CHEN ZHANG · Xiaofeng Cao · Weiyang Liu · Ivor Tsang · James Kwok
2023 Poster: A Study on Transformer Configuration and Training Objective
Fuzhao Xue · Jianghai Chen · Aixin Sun · Xiaozhe Ren · Zangwei Zheng · Xiaoxin He · Yongming Chen · Xin Jiang · Yang You
2022 Poster: Policy Diagnosis via Measuring Role Diversity in Cooperative Multi-agent RL
Siyi Hu · Chuanlong Xie · Xiaodan Liang · Xiaojun Chang
2022 Poster: Subspace Learning for Effective Meta-Learning
Weisen Jiang · James Kwok · Yu Zhang
2022 Spotlight: Policy Diagnosis via Measuring Role Diversity in Cooperative Multi-agent RL
Siyi Hu · Chuanlong Xie · Xiaodan Liang · Xiaojun Chang
2022 Spotlight: Subspace Learning for Effective Meta-Learning
Weisen Jiang · James Kwok · Yu Zhang
2022 Poster: Efficient Variance Reduction for Meta-learning
Hansi Yang · James Kwok
2022 Spotlight: Efficient Variance Reduction for Meta-learning
Hansi Yang · James Kwok
2020 Poster: Searching to Exploit Memorization Effect in Learning with Noisy Labels
QUANMING YAO · Hansi Yang · Bo Han · Gang Niu · James Kwok
2019 Workshop: Learning and Reasoning with Graph-Structured Representations
Ethan Fetaya · Zhiting Hu · Thomas Kipf · Yujia Li · Xiaodan Liang · Renjie Liao · Raquel Urtasun · Hao Wang · Max Welling · Eric Xing · Richard Zemel
2019 Poster: Efficient Nonconvex Regularized Tensor Completion with Structure-aware Proximal Iterations
Quanming Yao · James Kwok · Bo Han
2019 Oral: Efficient Nonconvex Regularized Tensor Completion with Structure-aware Proximal Iterations
Quanming Yao · James Kwok · Bo Han
2019 Poster: Multivariate-Information Adversarial Ensemble for Scalable Joint Distribution Matching
Ziliang Chen · ZHANFU YANG · Xiaoxi Wang · Xiaodan Liang · xiaopeng yan · Guanbin Li · Liang Lin
2019 Oral: Multivariate-Information Adversarial Ensemble for Scalable Joint Distribution Matching
Ziliang Chen · ZHANFU YANG · Xiaoxi Wang · Xiaodan Liang · xiaopeng yan · Guanbin Li · Liang Lin
2018 Poster: Online Convolutional Sparse Coding with Sample-Dependent Dictionary
Yaqing WANG · Quanming Yao · James Kwok · Lionel NI
2018 Poster: Lightweight Stochastic Optimization for Minimizing Finite Sums with Infinite Data
Shuai Zheng · James Kwok
2018 Oral: Lightweight Stochastic Optimization for Minimizing Finite Sums with Infinite Data
Shuai Zheng · James Kwok
2018 Oral: Online Convolutional Sparse Coding with Sample-Dependent Dictionary
Yaqing WANG · Quanming Yao · James Kwok · Lionel NI
2017 Poster: Follow the Moving Leader in Deep Learning
Shuai Zheng · James Kwok
2017 Talk: Follow the Moving Leader in Deep Learning
Shuai Zheng · James Kwok