Poster in Workshop: Next Generation of Sequence Modeling Architectures
Viewing Attention as a Recurrent Neural Network
Leo Feng · Frederick Tung · Hossein Hajimirsadeghi · Mohamed Osama Ahmed · Yoshua Bengio · Greg Mori
The advent of Transformers, a highly performant, parallelizable architecture, marked a significant breakthrough in sequence modelling. However, Transformers are computationally expensive at inference time, limiting their applications, particularly in low-resource settings (e.g., embedded devices). Addressing this, we (1) show that attention can be viewed as a special Recurrent Neural Network (RNN) with the ability to compute its many-to-one RNN output efficiently. For sequence modelling, we (2) introduce a new efficient method of computing attention's many-to-many RNN output based on the parallel prefix scan algorithm. Building on the new attention formulation, we (3) introduce Aaren, an attention-based module that can not only (i) be trained in parallel (like Transformers) but also (ii) be updated efficiently with new tokens, requiring only constant memory (like traditional RNNs). Empirically, we show that Aarens achieve performance comparable to Transformers on reinforcement learning and event forecasting tasks while being more time- and memory-efficient.
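The computational idea behind the abstract is that a softmax-attention output for a query over the tokens seen so far can be maintained as a running numerator/denominator pair (with a running max of the logits for numerical stability), giving a constant-memory, one-token-at-a-time update; and since combining two such partial summaries is associative, all prefix outputs can be produced with a parallel prefix scan. The sketch below illustrates both views in JAX. It is a minimal reconstruction under these assumptions: the function names (`update`, `prefix_attention`, `combine`) and the exact running-max stabilisation are illustrative, not the authors' code or the full Aaren module.

```python
import jax
import jax.numpy as jnp

# ---- Many-to-one attention as a constant-memory recurrence ---------------
# State: (running max of logits, running softmax numerator, running
# softmax denominator). The running max keeps the exponentials stable.

def init_state(d_v):
    return (-jnp.inf, jnp.zeros(d_v), 0.0)

def update(state, q, k_t, v_t):
    m_prev, a_prev, c_prev = state
    s_t = jnp.dot(q, k_t)              # logit of the new token
    m_t = jnp.maximum(m_prev, s_t)
    scale = jnp.exp(m_prev - m_t)      # rescale old accumulators to the new max
    w_t = jnp.exp(s_t - m_t)           # weight of the new token
    return (m_t, a_prev * scale + w_t * v_t, c_prev * scale + w_t)

def readout(state):
    _, a_t, c_t = state
    return a_t / c_t                   # softmax(q K^T) V over tokens seen so far

# ---- Many-to-many outputs via a parallel prefix scan ----------------------
# Each token contributes the triple (logit, value, 1); combining two triples
# is associative, so every prefix output can be computed in parallel.

def combine(left, right):
    m_l, a_l, c_l = left
    m_r, a_r, c_r = right
    m = jnp.maximum(m_l, m_r)
    sl, sr = jnp.exp(m_l - m), jnp.exp(m_r - m)
    return (m, a_l * sl[..., None] + a_r * sr[..., None], c_l * sl + c_r * sr)

def prefix_attention(q, K, V):
    # q: (d_k,), K: (T, d_k), V: (T, d_v) -> (T, d_v); row t attends to tokens 0..t
    s = K @ q
    m, a, c = jax.lax.associative_scan(combine, (s, V, jnp.ones_like(s)))
    return a / c[:, None]

# Sanity check against standard softmax attention on random data.
q = jax.random.normal(jax.random.PRNGKey(0), (8,))
K = jax.random.normal(jax.random.PRNGKey(1), (16, 8))
V = jax.random.normal(jax.random.PRNGKey(2), (16, 4))

state = init_state(4)
for k_t, v_t in zip(K, V):
    state = update(state, q, k_t, v_t)

full = jax.nn.softmax(K @ q) @ V
assert jnp.allclose(readout(state), full, atol=1e-4)
assert jnp.allclose(prefix_attention(q, K, V)[-1], full, atol=1e-4)
```

The `update` path reflects the constant-memory token-by-token inference highlighted in the abstract, while the `associative_scan` path reflects the parallel prefix-scan computation used for training-time, many-to-many outputs.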