

Poster

Gated Linear Attention Transformers with Hardware-Efficient Training

Songlin Yang · Bailin Wang · Yikang Shen · Rameswar Panda · Yoon Kim


Abstract:

Transformers with linear attention allow for efficient parallel training but can simultaneously be formulated as an RNN with 2D (matrix-valued) hidden states, thus enjoying linear-time inference complexity. However, linear attention generally underperforms ordinary softmax attention. Moreover, current algorithms for linear attention generally lack IO-awareness and are thus slower than highly optimized softmax attention implementations such as FlashAttention-2. This work describes a hardware-efficient algorithm for linear attention that trades off memory movement against parallelizability. We then generalize this algorithm to a more expressive variant of linear attention with data-dependent gates. The resulting gated linear attention (GLA) Transformer is found to perform competitively against the LLaMA-architecture Transformer as well as recent linear-time-inference baselines such as RetNet and Mamba on moderate-scale language modeling. GLA is especially effective at length generalization, enabling a model trained on sequences of length 2K to generalize to 28K without significant perplexity degradation. In terms of training speed, our Triton-based GLA layer outperforms FlashAttention-2, while the GLA Transformer has higher throughput than Mamba.
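To make the recurrent view in the abstract concrete, below is a minimal sketch (in PyTorch) of vanilla linear attention as an RNN with a matrix-valued hidden state, alongside a gated variant with a data-dependent decay. This is the naive sequential form for exposition only, not the paper's hardware-efficient chunkwise Triton algorithm, and the gate parameterization, names, and shapes here are illustrative assumptions rather than the exact GLA formulation.

```python
import torch


def linear_attention_recurrent(q, k, v):
    """Vanilla linear attention as an RNN with a matrix-valued state.

    q, k: (T, d_k); v: (T, d_v). The state S has shape (d_k, d_v).
    """
    T, d_k = q.shape
    d_v = v.shape[-1]
    S = torch.zeros(d_k, d_v)
    outputs = []
    for t in range(T):
        # Additive state update: S_t = S_{t-1} + k_t v_t^T
        S = S + torch.outer(k[t], v[t])
        # Readout: o_t = S_t^T q_t
        outputs.append(S.T @ q[t])
    return torch.stack(outputs)


def gated_linear_attention_recurrent(q, k, v, alpha):
    """Gated variant: a data-dependent gate decays the matrix-valued state.

    alpha: (T, d_k) with entries in (0, 1); here the gate is broadcast over
    the value dimension, i.e. S_t = diag(alpha_t) S_{t-1} + k_t v_t^T
    (one possible gating form, assumed for illustration).
    """
    T, d_k = q.shape
    d_v = v.shape[-1]
    S = torch.zeros(d_k, d_v)
    outputs = []
    for t in range(T):
        S = alpha[t].unsqueeze(-1) * S + torch.outer(k[t], v[t])
        outputs.append(S.T @ q[t])
    return torch.stack(outputs)


if __name__ == "__main__":
    T, d_k, d_v = 8, 4, 4
    q, k, v = torch.randn(T, d_k), torch.randn(T, d_k), torch.randn(T, d_v)
    # Placeholder data-dependent gate; the actual GLA gate is computed from the input.
    alpha = torch.sigmoid(torch.randn(T, d_k))
    print(linear_attention_recurrent(q, k, v).shape)             # (T, d_v)
    print(gated_linear_attention_recurrent(q, k, v, alpha).shape)  # (T, d_v)
```

Because the state update is a per-step recurrence, inference cost is linear in sequence length; the paper's contribution is training this recurrence efficiently on modern hardware by trading off memory movement against parallelizability.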
