Unsupervised pre-training is widely used in natural language processing: by designing suitable unsupervised prediction tasks, a deep neural network can be trained and shown to be effective on many downstream tasks. Since data is usually abundant, pre-trained models are generally huge, containing millions of parameters, so training efficiency becomes a critical issue even on high-performance hardware. In this paper, we explore an efficient training method for the state-of-the-art bidirectional Transformer (BERT) model. By visualizing the self-attention distributions of different layers at different positions in a well-trained BERT model, we find that in most layers the self-attention distribution concentrates locally around the token's own position and around the start-of-sentence token. Motivated by this observation, we propose a stacking algorithm that transfers knowledge from a shallow model to a deep model, and we apply stacking progressively to accelerate BERT training. Experimental results show that models trained with our strategy achieve performance similar to models trained from scratch, while our algorithm is much faster.
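The core stacking step from the abstract can be sketched in plain Python: a trained shallow model's layers are copied on top of themselves to initialize a model of twice the depth, and this doubling is repeated progressively. The layer representation and the `stack` helper below are illustrative assumptions, not the paper's actual implementation:

```python
import copy

def stack(layers):
    """Double model depth by appending a deep copy of the trained
    shallow layers on top of themselves (progressive stacking sketch)."""
    return layers + copy.deepcopy(layers)

# Hypothetical 3-layer "model": each layer is just a dict of parameters here.
shallow = [{"name": f"layer_{i}", "w": [0.1 * i]} for i in range(3)]

deep = stack(shallow)
assert len(deep) == 6

# The copies are independent, so further training of the deep model
# does not alias the original shallow parameters.
deep[3]["w"][0] = 99.0
assert shallow[0]["w"][0] == 0.0
```

In the progressive variant, one would train briefly at each depth, then call `stack` again (e.g. 3 → 6 → 12 layers), so most optimization steps run on a cheaper, shallower network.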
Author Information
Linyuan Gong (Peking University)
Di He (Peking University)
Zhuohan Li (Peking University)
Tao Qin (Microsoft Research Asia)
Liwei Wang (Peking University)
Tie-Yan Liu (Microsoft Research Asia)
Tie-Yan Liu is a principal researcher at Microsoft Research Asia, leading research on artificial intelligence and machine learning. He is well known for his pioneering work on learning to rank and computational advertising, and his recent research interests include deep learning, reinforcement learning, and distributed machine learning. Many of his technologies have been transferred to Microsoft's products and online services (such as Bing, Microsoft Advertising, and Azure), and open-sourced through Microsoft Cognitive Toolkit (CNTK), Microsoft Distributed Machine Learning Toolkit (DMTK), and Microsoft Graph Engine. He has also been actively contributing to academic communities. He is an adjunct/honorary professor at Carnegie Mellon University (CMU), the University of Nottingham, and several other universities in China. His papers have been cited tens of thousands of times in refereed conferences and journals. He has won a number of awards, including the best student paper award at SIGIR (2008), the most cited paper award at the Journal of Visual Communication and Image Representation (2004-2006), the research breakthrough award (2012) and research-team-of-the-year award (2017) at Microsoft Research, Top-10 Springer Computer Science books by Chinese authors (2015), and the most cited Chinese researcher by Elsevier (2017). He has served as general chair, program committee chair, local chair, or area chair for a dozen top conferences including SIGIR, WWW, KDD, ICML, NIPS, IJCAI, AAAI, ACL, and ICTIR, as well as associate editor of ACM Transactions on Information Systems, ACM Transactions on the Web, and Neurocomputing. Tie-Yan Liu is a fellow of the IEEE, a distinguished member of the ACM, and a vice chair of the CIPS information retrieval technical committee.
Related Events (a corresponding poster, oral, or spotlight)
- 2019 Oral: Efficient Training of BERT by Progressively Stacking »
  Thu. Jun 13th, 07:15 -- 07:20 PM, Room: Hall B
More from the Same Authors
- 2023 Poster: On the Power of Pre-training for Generalization in RL: Provable Benefits and Hardness »
  Haotian Ye · Xiaoyu Chen · Liwei Wang · Simon Du
- 2023 Poster: A Complete Expressiveness Hierarchy for Subgraph GNNs via Subgraph Weisfeiler-Lehman Tests »
  Bohang Zhang · Guhao Feng · Yiheng Du · Di He · Liwei Wang
- 2023 Oral: On the Power of Pre-training for Generalization in RL: Provable Benefits and Hardness »
  Haotian Ye · Xiaoyu Chen · Liwei Wang · Simon Du
- 2023 Poster: Offline Meta Reinforcement Learning with In-Distribution Online Adaptation »
  Jianhao Wang · Jin Zhang · Haozhe Jiang · Junyu Zhang · Liwei Wang · Chongjie Zhang
- 2023 Poster: Retrosynthetic Planning with Dual Value Networks »
  Guoqing Liu · Di Xue · Shufang Xie · Yingce Xia · Austin Tripp · Krzysztof Maziarz · Marwin Segler · Tao Qin · Zongzhang Zhang · Tie-Yan Liu
- 2022 Poster: Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets »
  Han Zhong · Wei Xiong · Jiyuan Tan · Liwei Wang · Tong Zhang · Zhaoran Wang · Zhuoran Yang
- 2022 Spotlight: Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets »
  Han Zhong · Wei Xiong · Jiyuan Tan · Liwei Wang · Tong Zhang · Zhaoran Wang · Zhuoran Yang
- 2022 Poster: SE(3) Equivariant Graph Neural Networks with Complete Local Frames »
  weitao du · He Zhang · Yuanqi Du · Qi Meng · Wei Chen · Nanning Zheng · Bin Shao · Tie-Yan Liu
- 2022 Spotlight: SE(3) Equivariant Graph Neural Networks with Complete Local Frames »
  weitao du · He Zhang · Yuanqi Du · Qi Meng · Wei Chen · Nanning Zheng · Bin Shao · Tie-Yan Liu
- 2022 Poster: Analyzing and Mitigating Interference in Neural Architecture Search »
  Jin Xu · Xu Tan · Kaitao Song · Renqian Luo · Yichong Leng · Tao Qin · Tie-Yan Liu · Jian Li
- 2022 Poster: Nearly Optimal Policy Optimization with Stable at Any Time Guarantee »
  Tianhao Wu · Yunchang Yang · Han Zhong · Liwei Wang · Simon Du · Jiantao Jiao
- 2022 Poster: Supervised Off-Policy Ranking »
  Yue Jin · Yue Zhang · Tao Qin · Xudong Zhang · Jian Yuan · Houqiang Li · Tie-Yan Liu
- 2022 Poster: Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation »
  Xiaoyu Chen · Han Zhong · Zhuoran Yang · Zhaoran Wang · Liwei Wang
- 2022 Spotlight: Nearly Optimal Policy Optimization with Stable at Any Time Guarantee »
  Tianhao Wu · Yunchang Yang · Han Zhong · Liwei Wang · Simon Du · Jiantao Jiao
- 2022 Spotlight: Supervised Off-Policy Ranking »
  Yue Jin · Yue Zhang · Tao Qin · Xudong Zhang · Jian Yuan · Houqiang Li · Tie-Yan Liu
- 2022 Spotlight: Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation »
  Xiaoyu Chen · Han Zhong · Zhuoran Yang · Zhaoran Wang · Liwei Wang
- 2022 Spotlight: Analyzing and Mitigating Interference in Neural Architecture Search »
  Jin Xu · Xu Tan · Kaitao Song · Renqian Luo · Yichong Leng · Tao Qin · Tie-Yan Liu · Jian Li
- 2021: Discussion Panel #1 »
  Hang Su · Matthias Hein · Liwei Wang · Sven Gowal · Jan Hendrik Metzen · Henry Liu · Yisen Wang
- 2021: Invited Talk #1 »
  Liwei Wang
- 2021 Poster: Large Scale Private Learning via Low-rank Reparametrization »
  Da Yu · Huishuai Zhang · Wei Chen · Jian Yin · Tie-Yan Liu
- 2021 Poster: Towards Certifying L-infinity Robustness using Neural Networks with L-inf-dist Neurons »
  Bohang Zhang · Tianle Cai · Zhou Lu · Di He · Liwei Wang
- 2021 Spotlight: Towards Certifying L-infinity Robustness using Neural Networks with L-inf-dist Neurons »
  Bohang Zhang · Tianle Cai · Zhou Lu · Di He · Liwei Wang
- 2021 Spotlight: Large Scale Private Learning via Low-rank Reparametrization »
  Da Yu · Huishuai Zhang · Wei Chen · Jian Yin · Tie-Yan Liu
- 2021 Poster: Near-Optimal Representation Learning for Linear Bandits and Linear RL »
  Jiachen Hu · Xiaoyu Chen · Chi Jin · Lihong Li · Liwei Wang
- 2021 Poster: The Implicit Bias for Adaptive Optimization Algorithms on Homogeneous Neural Networks »
  Bohan Wang · Qi Meng · Wei Chen · Tie-Yan Liu
- 2021 Spotlight: Near-Optimal Representation Learning for Linear Bandits and Linear RL »
  Jiachen Hu · Xiaoyu Chen · Chi Jin · Lihong Li · Liwei Wang
- 2021 Oral: The Implicit Bias for Adaptive Optimization Algorithms on Homogeneous Neural Networks »
  Bohan Wang · Qi Meng · Wei Chen · Tie-Yan Liu
- 2021 Poster: On Reinforcement Learning with Adversarial Corruption and Its Application to Block MDP »
  Tianhao Wu · Yunchang Yang · Simon Du · Liwei Wang
- 2021 Poster: How could Neural Networks understand Programs? »
  Dinglan Peng · Shuxin Zheng · Yatao Li · Guolin Ke · Di He · Tie-Yan Liu
- 2021 Spotlight: On Reinforcement Learning with Adversarial Corruption and Its Application to Block MDP »
  Tianhao Wu · Yunchang Yang · Simon Du · Liwei Wang
- 2021 Spotlight: How could Neural Networks understand Programs? »
  Dinglan Peng · Shuxin Zheng · Yatao Li · Guolin Ke · Di He · Tie-Yan Liu
- 2021 Poster: Temporally Correlated Task Scheduling for Sequence Learning »
  Xueqing Wu · Lewen Wang · Yingce Xia · Weiqing Liu · Lijun Wu · Shufang Xie · Tao Qin · Tie-Yan Liu
- 2021 Poster: GraphNorm: A Principled Approach to Accelerating Graph Neural Network Training »
  Tianle Cai · Shengjie Luo · Keyulu Xu · Di He · Tie-Yan Liu · Liwei Wang
- 2021 Spotlight: GraphNorm: A Principled Approach to Accelerating Graph Neural Network Training »
  Tianle Cai · Shengjie Luo · Keyulu Xu · Di He · Tie-Yan Liu · Liwei Wang
- 2021 Spotlight: Temporally Correlated Task Scheduling for Sequence Learning »
  Xueqing Wu · Lewen Wang · Yingce Xia · Weiqing Liu · Lijun Wu · Shufang Xie · Tao Qin · Tie-Yan Liu
- 2020 Poster: On Layer Normalization in the Transformer Architecture »
  Ruibin Xiong · Yunchang Yang · Di He · Kai Zheng · Shuxin Zheng · Chen Xing · Huishuai Zhang · Yanyan Lan · Liwei Wang · Tie-Yan Liu
- 2020 Poster: Sequence Generation with Mixed Representations »
  Lijun Wu · Shufang Xie · Yingce Xia · Yang Fan · Jian-Huang Lai · Tao Qin · Tie-Yan Liu
- 2020 Poster: (Locally) Differentially Private Combinatorial Semi-Bandits »
  Xiaoyu Chen · Kai Zheng · Zixin Zhou · Yunchang Yang · Wei Chen · Liwei Wang
- 2019 Poster: MASS: Masked Sequence to Sequence Pre-training for Language Generation »
  Kaitao Song · Xu Tan · Tao Qin · Jianfeng Lu · Tie-Yan Liu
- 2019 Poster: Almost Unsupervised Text to Speech and Automatic Speech Recognition »
  Yi Ren · Xu Tan · Tao Qin · Sheng Zhao · Zhou Zhao · Tie-Yan Liu
- 2019 Poster: Towards a Deep and Unified Understanding of Deep Neural Models in NLP »
  Chaoyu Guan · Xiting Wang · Quanshi Zhang · Runjin Chen · Di He · Xing Xie
- 2019 Oral: MASS: Masked Sequence to Sequence Pre-training for Language Generation »
  Kaitao Song · Xu Tan · Tao Qin · Jianfeng Lu · Tie-Yan Liu
- 2019 Oral: Almost Unsupervised Text to Speech and Automatic Speech Recognition »
  Yi Ren · Xu Tan · Tao Qin · Sheng Zhao · Zhou Zhao · Tie-Yan Liu
- 2019 Oral: Towards a Deep and Unified Understanding of Deep Neural Models in NLP »
  Chaoyu Guan · Xiting Wang · Quanshi Zhang · Runjin Chen · Di He · Xing Xie
- 2019 Poster: Gradient Descent Finds Global Minima of Deep Neural Networks »
  Simon Du · Jason Lee · Haochuan Li · Liwei Wang · Xiyu Zhai
- 2019 Oral: Gradient Descent Finds Global Minima of Deep Neural Networks »
  Simon Du · Jason Lee · Haochuan Li · Liwei Wang · Xiyu Zhai
- 2018 Poster: Towards Binary-Valued Gates for Robust LSTM Training »
  Zhuohan Li · Di He · Fei Tian · Wei Chen · Tao Qin · Liwei Wang · Tie-Yan Liu
- 2018 Oral: Towards Binary-Valued Gates for Robust LSTM Training »
  Zhuohan Li · Di He · Fei Tian · Wei Chen · Tao Qin · Liwei Wang · Tie-Yan Liu
- 2018 Poster: Dropout Training, Data-dependent Regularization, and Generalization Bounds »
  Wenlong Mou · Yuchen Zhou · Jun Gao · Liwei Wang
- 2018 Poster: Model-Level Dual Learning »
  Yingce Xia · Xu Tan · Fei Tian · Tao Qin · Nenghai Yu · Tie-Yan Liu
- 2018 Oral: Model-Level Dual Learning »
  Yingce Xia · Xu Tan · Fei Tian · Tao Qin · Nenghai Yu · Tie-Yan Liu
- 2018 Oral: Dropout Training, Data-dependent Regularization, and Generalization Bounds »
  Wenlong Mou · Yuchen Zhou · Jun Gao · Liwei Wang
- 2017 Poster: Collect at Once, Use Effectively: Making Non-interactive Locally Private Learning Possible »
  Kai Zheng · Wenlong Mou · Liwei Wang
- 2017 Talk: Collect at Once, Use Effectively: Making Non-interactive Locally Private Learning Possible »
  Kai Zheng · Wenlong Mou · Liwei Wang
- 2017 Poster: Dual Supervised Learning »
  Yingce Xia · Tao Qin · Wei Chen · Jiang Bian · Nenghai Yu · Tie-Yan Liu
- 2017 Talk: Dual Supervised Learning »
  Yingce Xia · Tao Qin · Wei Chen · Jiang Bian · Nenghai Yu · Tie-Yan Liu