Timezone: »

On Layer Normalization in the Transformer Architecture
Ruibin Xiong · Yunchang Yang · Di He · Kai Zheng · Shuxin Zheng · Chen Xing · Huishuai Zhang · Yanyan Lan · Liwei Wang · Tie-Yan Liu

Thu Jul 16 06:00 PM -- 06:45 PM & Fri Jul 17 04:00 AM -- 04:45 AM (PDT) @ None #None

The Transformer is widely used in natural language processing tasks. To train a Transformer however, one usually needs a carefully designed learning rate warm-up stage, which is shown to be crucial to the final performance but will slow down the optimization and bring more hyper-parameter tunings. In this paper, we first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters. Specifically, we prove with mean field theory that at initialization, for the original-designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large. Therefore, using a large learning rate on those gradients makes the training unstable. The warm-up stage is practically helpful for avoiding this problem. On the other hand, our theory also shows that if the layer normalization is put inside the residual blocks (recently proposed as Pre-LN Transformer), the gradients are well-behaved at initialization. This motivates us to remove the warm-up stage for the training of Pre-LN Transformers. We show in our experiments that Pre-LN Transformers without the warm-up stage can reach comparable results with baselines while requiring significantly less training time and hyper-parameter tuning on a wide range of applications.

Author Information

Ruibin Xiong (Institute of Computing Technology, Chinese Academy of Sciences)
Yunchang Yang (Center for Data Science, Peking University)
Di He (Microsoft Research)
Kai Zheng (Peking University)
Shuxin Zheng (microsoft.com)
Chen Xing (Nankai University)
Huishuai Zhang (Microsoft)
Yanyan Lan (Institute of Computing Technology)
Liwei Wang (Peking University)
Tie-Yan Liu (Microsoft Research Asia)

Tie-Yan Liu is a principal researcher of Microsoft Research Asia, leading the research on artificial intelligence and machine learning. He is very well known for his pioneer work on learning to rank and computational advertising, and his recent research interests include deep learning, reinforcement learning, and distributed machine learning. Many of his technologies have been transferred to Microsoft’s products and online services (such as Bing, Microsoft Advertising, and Azure), and open-sourced through Microsoft Cognitive Toolkit (CNTK), Microsoft Distributed Machine Learning Toolkit (DMTK), and Microsoft Graph Engine. On the other hand, he has been actively contributing to academic communities. He is an adjunct/honorary professor at Carnegie Mellon University (CMU), University of Nottingham, and several other universities in China. His papers have been cited for tens of thousands of times in refereed conferences and journals. He has won quite a few awards, including the best student paper award at SIGIR (2008), the most cited paper award at Journal of Visual Communications and Image Representation (2004-2006), the research break-through award (2012) and research-team-of-the-year award (2017) at Microsoft Research, and Top-10 Springer Computer Science books by Chinese authors (2015), and the most cited Chinese researcher by Elsevier (2017). He has been invited to serve as general chair, program committee chair, local chair, or area chair for a dozen of top conferences including SIGIR, WWW, KDD, ICML, NIPS, IJCAI, AAAI, ACL, ICTIR, as well as associate editor of ACM Transactions on Information Systems, ACM Transactions on the Web, and Neurocomputing. Tie-Yan Liu is a fellow of the IEEE, a distinguished member of the ACM, and a vice chair of the CIPS information retrieval technical committee.

More from the Same Authors