Efficient Representation Learning via Adaptive Context Pooling
Self-attention mechanisms model long-range context by using pairwise attention between all input tokens. In doing so, they assume a fixed attention granularity defined by the individual tokens (e.g., text characters or image pixels), which may not be optimal for modeling complex dependencies at higher levels. In this paper, we propose ContextPool to address this problem by adapting the attention granularity for each token. Inspired by the success of ConvNets combined with pooling to capture long-range dependencies, we learn to pool neighboring features for each token before computing attention in a given attention layer. The pooling weights and support size are adaptively determined, allowing the pooled features to encode meaningful context at varying scales. We show that ContextPool makes attention models more expressive, often achieving strong performance with fewer layers and thus significantly reduced cost. Experiments validate that our ContextPool module, when plugged into transformer models, matches or surpasses state-of-the-art performance using less compute on several language and image benchmarks, outperforms recent works with learned context sizes or sparse attention patterns, and is also applicable to ConvNets for efficient feature learning.
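To make the mechanism concrete, below is a minimal PyTorch sketch of the idea the abstract describes: pooling neighboring features for each token with learned, per-token pooling weights and support sizes, before the pooled features enter an attention layer. This is an illustration, not the authors' implementation; the Gaussian windowing, the sigmoid weighting, and the names ContextPool, proj_w, and proj_s are assumptions.

```python
# Minimal sketch of adaptive context pooling (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextPool(nn.Module):
    """Pools neighboring features for each token using learned, per-token
    pooling weights and support sizes, before attention is computed."""

    def __init__(self, dim):
        super().__init__()
        self.proj_w = nn.Linear(dim, 1)  # per-token pooling weight (assumed head)
        self.proj_s = nn.Linear(dim, 1)  # per-token support size (assumed head)

    def forward(self, x):
        # x: (batch, seq_len, dim)
        B, N, D = x.shape
        w = torch.sigmoid(self.proj_w(x))      # (B, N, 1) per-token pooling weights
        s = F.softplus(self.proj_s(x)) + 1e-2  # (B, N, 1) positive support sizes

        pos = torch.arange(N, device=x.device, dtype=x.dtype)
        dist = (pos[None, :] - pos[:, None]).abs()  # (N, N) token distances
        # Soft local window: token i pools neighbors j at a scale set by s_i.
        window = torch.exp(-dist[None] ** 2 / (2.0 * s ** 2))  # (B, N, N)
        # Weight each neighbor by its learned pooling weight, then normalize.
        weights = window * w.transpose(1, 2)                   # (B, N, N)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        return torch.bmm(weights, x)                           # (B, N, D)

# Usage: pooled features replace x as input to the attention layer's Q/K/V.
x = torch.randn(2, 128, 64)
pooled = ContextPool(64)(x)  # (2, 128, 64)
```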
Author Information
Chen Huang (Apple)
Walter Talbott (Apple)
Navdeep Jaitly (Apple)
Joshua M Susskind (Apple)
Related Events (a corresponding poster, oral, or spotlight)
- 2022 Spotlight: Efficient Representation Learning via Adaptive Context Pooling
  Thu. Jul 21st, 06:00–06:05 PM, Room Hall F
More from the Same Authors
- 2021: Implicit Acceleration and Feature Learning in Infinitely Wide Neural Networks with Bottlenecks
  Etai Littwin · Omid Saremi · Shuangfei Zhai · Vimal Thilak · Hanlin Goh · Joshua M Susskind · Greg Yang
- 2021: Implicit Greedy Rank Learning in Autoencoders via Overparameterized Linear Networks
  Shih-Yu Sun · Vimal Thilak · Etai Littwin · Omid Saremi · Joshua M Susskind
- 2023: BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping
  Jiatao Gu · Shuangfei Zhai · Yizhe Zhang · Lingjie Liu · Joshua M Susskind
- 2023 Poster: Stabilizing Transformer Training by Preventing Attention Entropy Collapse
  Shuangfei Zhai · Tatiana Likhomanenko · Etai Littwin · Dan Busbridge · Jason Ramapuram · Yizhe Zhang · Jiatao Gu · Joshua M Susskind
- 2023 Poster: NerfDiff: Single-image View Synthesis with NeRF-guided Distillation from 3D-aware Diffusion
  Jiatao Gu · Alex Trevithick · Kai-En Lin · Joshua M Susskind · Christian Theobalt · Lingjie Liu · Ravi Ramamoorthi
- 2022 Poster: Position Prediction as an Effective Pretraining Strategy
  Shuangfei Zhai · Navdeep Jaitly · Jason Ramapuram · Dan Busbridge · Tatiana Likhomanenko · Joseph Cheng · Walter Talbott · Chen Huang · Hanlin Goh · Joshua M Susskind
- 2022 Spotlight: Position Prediction as an Effective Pretraining Strategy
  Shuangfei Zhai · Navdeep Jaitly · Jason Ramapuram · Dan Busbridge · Tatiana Likhomanenko · Joseph Cheng · Walter Talbott · Chen Huang · Hanlin Goh · Joshua M Susskind
- 2021 Poster: Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning
  Yue Wu · Shuangfei Zhai · Nitish Srivastava · Joshua M Susskind · Jian Zhang · Ruslan Salakhutdinov · Hanlin Goh
- 2021 Spotlight: Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning
  Yue Wu · Shuangfei Zhai · Nitish Srivastava · Joshua M Susskind · Jian Zhang · Ruslan Salakhutdinov · Hanlin Goh
- 2020 Poster: Equivariant Neural Rendering
  Emilien Dupont · Miguel Angel Bautista Martin · Alex Colburn · Aditya Sankar · Joshua M Susskind · Qi Shan
- 2019 Poster: Addressing the Loss-Metric Mismatch with Adaptive Loss Alignment
  Chen Huang · Shuangfei Zhai · Walter Talbott · Miguel Angel Bautista Martin · Shih-Yu Sun · Carlos Guestrin · Joshua M Susskind
- 2019 Oral: Addressing the Loss-Metric Mismatch with Adaptive Loss Alignment
  Chen Huang · Shuangfei Zhai · Walter Talbott · Miguel Angel Bautista Martin · Shih-Yu Sun · Carlos Guestrin · Joshua M Susskind