Transformer models with multi-head attention require caching intermediate results for efficient inference in generation tasks. However, the cache brings new memory-related costs and prevents the use of larger batch sizes for higher speed. We propose memory-efficient lossless attention (called EL-attention) to address this issue. It avoids the heavy operations for building multi-head keys and values, so no cache for them is needed. EL-attention constructs an ensemble of attention results by expanding the query while keeping the key and value shared. It produces the same result as multi-head attention with less GPU memory and faster inference speed. We conduct extensive experiments on Transformer, BART, and GPT-2 for summarization and question generation tasks. The results show EL-attention speeds up existing models by 1.6x to 5.3x without accuracy loss.
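The losslessness the abstract claims follows from linearity: the per-head key projection can be folded into the query before the dot product, and the per-head value projection can be applied after the attention-weighted sum, so the unprojected hidden states can serve as shared keys and values. Below is a minimal NumPy sketch of that identity; the shapes, weight names, and single-step decoding setup are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the EL-attention identity from the abstract (illustrative, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads = 64, 4
d_head = d_model // n_heads
L = 10  # source sequence length

X = rng.standard_normal((L, d_model))  # cached hidden states, shared by all heads
q = rng.standard_normal((1, d_model))  # query input for the current decoding step

# Hypothetical per-head projection matrices.
Wq = rng.standard_normal((n_heads, d_model, d_head))
Wk = rng.standard_normal((n_heads, d_model, d_head))
Wv = rng.standard_normal((n_heads, d_model, d_head))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

heads_std, heads_el = [], []
for h in range(n_heads):
    # Standard multi-head attention: project X into per-head keys and values
    # (these are the tensors that would normally be built and cached).
    K, V = X @ Wk[h], X @ Wv[h]
    p = softmax((q @ Wq[h]) @ K.T / np.sqrt(d_head))
    heads_std.append(p @ V)

    # EL-attention: fold the key projection into an expanded query, so the
    # shared X is used directly as keys and values; only the query is per-head.
    q_el = (q @ Wq[h]) @ Wk[h].T           # expanded query, back in d_model space
    p = softmax(q_el @ X.T / np.sqrt(d_head))
    heads_el.append((p @ X) @ Wv[h])       # value projection applied after attention

# Both formulations agree, so the result is lossless.
assert np.allclose(np.concatenate(heads_std, axis=-1),
                   np.concatenate(heads_el, axis=-1))
```

In this formulation only the single hidden-state tensor X needs to be kept, rather than per-head key and value caches, which is where the memory saving and larger feasible batch sizes come from.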
Author Information
Yu Yan (Microsoft)
Jiusheng Chen (Microsoft)
Weizhen Qi (University of Science and Technology of China)
Nikhil Bhendawade (Microsoft)
Yeyun Gong (Microsoft Research Asia)
Nan Duan (Microsoft Research)
Ruofei Zhang (Microsoft)
Related Events (a corresponding poster, oral, or spotlight)
- 2021 Spotlight: EL-Attention: Memory Efficient Lossless Attention for Generation
  Thu. Jul 22nd 02:45 -- 02:50 PM
More from the Same Authors
- 2023 Poster: Text Generation with Diffusion Language Models: A Pre-training Approach with Continuous Paragraph Denoise
  Zhenghao Lin · Yeyun Gong · Yelong Shen · Tong Wu · Zhihao Fan · Chen Lin · Nan Duan · Weizhu Chen
- 2023 Poster: LongCoder: A Long-Range Pre-trained Language Model for Code Completion
  Daya Guo · Canwen Xu · Nan Duan · Jian Yin · Julian McAuley
- 2023 Poster: Synthetic Prompting: Generating Chain-of-Thought Demonstrations for Large Language Models
  Zhihong Shao · Yeyun Gong · Yelong Shen · Minlie Huang · Nan Duan · Weizhu Chen
- 2021 Poster: BANG: Bridging Autoregressive and Non-autoregressive Generation with Large Scale Pretraining
  Weizhen Qi · Yeyun Gong · Jian Jiao · Yu Yan · Weizhu Chen · Dayiheng Liu · Kewen Tang · Houqiang Li · Jiusheng Chen · Ruofei Zhang · Ming Zhou · Nan Duan
- 2021 Spotlight: BANG: Bridging Autoregressive and Non-autoregressive Generation with Large Scale Pretraining
  Weizhen Qi · Yeyun Gong · Jian Jiao · Yu Yan · Weizhu Chen · Dayiheng Liu · Kewen Tang · Houqiang Li · Jiusheng Chen · Ruofei Zhang · Ming Zhou · Nan Duan
- 2021 Poster: Poolingformer: Long Document Modeling with Pooling Attention
  Hang ZHANG · Yeyun Gong · Yelong Shen · Weisheng Li · Jiancheng Lv · Nan Duan · Weizhu Chen
- 2021 Spotlight: Poolingformer: Long Document Modeling with Pooling Attention
  Hang ZHANG · Yeyun Gong · Yelong Shen · Weisheng Li · Jiancheng Lv · Nan Duan · Weizhu Chen