Poster
Stabilizing Transformer Training by Preventing Attention Entropy Collapse
Shuangfei Zhai · Tatiana Likhomanenko · Etai Littwin · Dan Busbridge · Jason Ramapuram · Yizhe Zhang · Jiatao Gu · Joshua M Susskind
Training stability is of great importance to Transformers. In this work, we investigate the training dynamics of Transformers by examining the evolution of the attention layers. In particular, we track the attention entropy for each attention head during the course of training, which serves as a proxy for model sharpness. We identify a common pattern across different architectures and tasks, in which low attention entropy is accompanied by high training instability, which can take the form of oscillating loss or divergence. We denote this pathologically low attention entropy, corresponding to highly concentrated attention scores, as $\textit{entropy collapse}$. As a remedy, we propose $\sigma$Reparam, a simple and efficient solution in which we reparametrize all linear layers with spectral normalization and an additional learned scalar. We demonstrate that $\sigma$Reparam successfully prevents entropy collapse in the attention layers, promoting more stable training. Additionally, we prove a tight lower bound on the attention entropy, which decreases exponentially fast with the spectral norm of the attention logits, providing additional motivation for our approach. We conduct experiments with $\sigma$Reparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks. We show that $\sigma$Reparam provides stability and robustness with respect to the choice of hyperparameters, going so far as to enable training (a) a Vision Transformer to competitive performance without warmup, weight decay, layer normalization, or adaptive optimizers; (b) deep architectures in machine translation; and (c) speech recognition models to competitive performance without warmup and adaptive optimizers. Code is available at https://github.com/apple/ml-sigma-reparam.
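The reparametrization described in the abstract — rescaling each linear layer's weight by a learned scalar divided by an estimate of its spectral norm — can be sketched as follows. This is a minimal, framework-free illustration using NumPy and power iteration; the function and variable names are assumptions for this sketch, not the authors' API (their implementation is at the linked repository).

```python
import numpy as np

def power_iteration_step(w, u):
    """One power-iteration step on the weight matrix w (out_dim x in_dim).
    Returns an estimate of the spectral norm sigma(w) and the updated
    left singular-vector estimate u. Names here are illustrative."""
    v = w.T @ u
    v /= np.linalg.norm(v)
    u = w @ v
    u /= np.linalg.norm(u)
    return u @ w @ v, u

def sigma_reparam(w, gamma, u):
    """sigmaReparam sketch: W_hat = (gamma / sigma(w)) * w, where gamma is
    a learned scalar (initialized to 1) and sigma(w) is the spectral norm,
    tracked across forward passes by power iteration."""
    sigma, u = power_iteration_step(w, u)
    return (gamma / sigma) * w, u

# Toy usage: after enough power-iteration steps, the reparametrized
# weight's spectral norm is controlled directly by gamma.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8))
u = rng.standard_normal(4)
gamma = 1.0
for _ in range(200):
    w_hat, u = sigma_reparam(w, gamma, u)
print(np.linalg.norm(w_hat, 2))  # spectral norm of w_hat, close to gamma
```

In training, `gamma` would be a learnable parameter and `u` a persistent (non-trainable) buffer updated once per forward pass, so the per-step overhead is a pair of matrix-vector products per linear layer.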
Author Information
Shuangfei Zhai (Apple)
Tatiana Likhomanenko (Apple)
Etai Littwin (Apple)
Dan Busbridge (Apple)
Jason Ramapuram (Apple)
Yizhe Zhang (Machine Learning Research @ Apple)

I am a research scientist at Apple MLR, working primarily on natural language processing and machine learning. Before joining Apple, I was at Meta AI and Microsoft Research, working on natural language generation and NLP pre-training.
Jiatao Gu (Apple (MLR))
Joshua M Susskind (Apple, Inc.)
More from the Same Authors
- 2021 : Implicit Acceleration and Feature Learning in Infinitely Wide Neural Networks with Bottlenecks
  Etai Littwin · Omid Saremi · Shuangfei Zhai · Vimal Thilak · Hanlin Goh · Joshua M Susskind · Greg Yang
- 2021 : Implicit Greedy Rank Learning in Autoencoders via Overparameterized Linear Networks
  Shih-Yu Sun · Vimal Thilak · Etai Littwin · Omid Saremi · Joshua M Susskind
- 2023 : BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping
  Jiatao Gu · Shuangfei Zhai · Yizhe Zhang · Lingjie Liu · Joshua M Susskind
- 2023 Poster: The Role of Entropy and Reconstruction in Multi-View Self-Supervised Learning
  Borja Rodríguez Gálvez · Arno Blaas · Pau Rodriguez · Adam Golinski · Xavi Suau · Jason Ramapuram · Dan Busbridge · Luca Zappella
- 2023 Poster: DUET: 2D Structured and Approximately Equivariant Representations
  Xavi Suau · Federico Danieli · T. Anderson Keller · Arno Blaas · Chen Huang · Jason Ramapuram · Dan Busbridge · Luca Zappella
- 2023 Poster: NerfDiff: Single-image View Synthesis with NeRF-guided Distillation from 3D-aware Diffusion
  Jiatao Gu · Alex Trevithick · Kai-En Lin · Joshua M Susskind · Christian Theobalt · Lingjie Liu · Ravi Ramamoorthi
- 2022 Poster: Efficient Representation Learning via Adaptive Context Pooling
  Chen Huang · Walter Talbott · Navdeep Jaitly · Joshua M Susskind
- 2022 Spotlight: Efficient Representation Learning via Adaptive Context Pooling
  Chen Huang · Walter Talbott · Navdeep Jaitly · Joshua M Susskind
- 2022 Poster: Position Prediction as an Effective Pretraining Strategy
  Shuangfei Zhai · Navdeep Jaitly · Jason Ramapuram · Dan Busbridge · Tatiana Likhomanenko · Joseph Cheng · Walter Talbott · Chen Huang · Hanlin Goh · Joshua M Susskind
- 2022 Spotlight: Position Prediction as an Effective Pretraining Strategy
  Shuangfei Zhai · Navdeep Jaitly · Jason Ramapuram · Dan Busbridge · Tatiana Likhomanenko · Joseph Cheng · Walter Talbott · Chen Huang · Hanlin Goh · Joshua M Susskind
- 2021 Poster: Tensor Programs IIb: Architectural Universality Of Neural Tangent Kernel Training Dynamics
  Greg Yang · Etai Littwin
- 2021 Spotlight: Tensor Programs IIb: Architectural Universality Of Neural Tangent Kernel Training Dynamics
  Greg Yang · Etai Littwin
- 2021 Poster: Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning
  Yue Wu · Shuangfei Zhai · Nitish Srivastava · Joshua M Susskind · Jian Zhang · Ruslan Salakhutdinov · Hanlin Goh
- 2021 Spotlight: Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning
  Yue Wu · Shuangfei Zhai · Nitish Srivastava · Joshua M Susskind · Jian Zhang · Ruslan Salakhutdinov · Hanlin Goh
- 2020 Poster: Equivariant Neural Rendering
  Emilien Dupont · Miguel Angel Bautista Martin · Alex Colburn · Aditya Sankar · Joshua M Susskind · Qi Shan
- 2019 Poster: Addressing the Loss-Metric Mismatch with Adaptive Loss Alignment
  Chen Huang · Shuangfei Zhai · Walter Talbott · Miguel Angel Bautista Martin · Shih-Yu Sun · Carlos Guestrin · Joshua M Susskind
- 2019 Oral: Addressing the Loss-Metric Mismatch with Adaptive Loss Alignment
  Chen Huang · Shuangfei Zhai · Walter Talbott · Miguel Angel Bautista Martin · Shih-Yu Sun · Carlos Guestrin · Joshua M Susskind