Poster in Workshop: Next Generation of Sequence Modeling Architectures
Latte: Latent Attention for Linear Time Transformers
Rares Dolga · Marius Cobzarenco · Ahmed Shahin · David Barber
The time complexity of the standard attention mechanism in a transformer scales quadratically with the length of the sequence. We introduce a method that reduces this to linear scaling in the sequence length by defining attention via latent vectors. The method is readily usable as a drop-in replacement for the standard attention mechanism. Our "Latte Transformer" model can be implemented for both bidirectional and unidirectional (causal) tasks, and the causal version admits a fast recurrent implementation that is memory- and time-efficient at inference time for language generation tasks. Whereas next-token prediction with a standard transformer takes time linear in the sequence length, a Latte Transformer computes the next token in constant time. We also extend our method with a new definition of rotation-based relative embeddings, which integrates easily into our formulation. The empirical performance of our method is comparable to that of standard attention, yet it allows scaling to context windows much larger than are practical with standard attention.
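To make the constant-time claim concrete, here is a minimal NumPy sketch of one way causal attention can be routed through latent vectors. The abstract does not spell out the formulation, so everything here is an assumption for illustration: the factorisation (each position t mixes over L latents, and each latent mixes over positions s ≤ t), the function names, and the weight matrices `Wq`, `Wk`, `Wv`. The recurrent version keeps only running sums per latent, and the quadratic version builds the full T × T attention matrix as a reference to check against.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def latent_attention_recurrent(x, Wq, Wk, Wv):
    """Causal latent attention with O(L * D) work per token (hypothetical sketch)."""
    T = x.shape[0]
    L = Wq.shape[1]
    v = x @ Wv                               # (T, D) values
    p_latent = softmax(x @ Wq, axis=-1)      # (T, L) weight of each latent at position t
    w = np.exp(x @ Wk)                       # (T, L) unnormalised weight of position s per latent
    num = np.zeros((L, v.shape[1]))          # running sum over s <= t of w[s, l] * v[s]
    den = np.zeros(L)                        # running sum over s <= t of w[s, l]
    out = np.empty_like(v)
    for t in range(T):
        num += w[t][:, None] * v[t][None, :]
        den += w[t]
        # State size is independent of t, so each step costs the same.
        out[t] = p_latent[t] @ (num / den[:, None])
    return out

def latent_attention_quadratic(x, Wq, Wk, Wv):
    """Reference: the same attention via an explicit T x T matrix."""
    v = x @ Wv
    p_latent = softmax(x @ Wq, axis=-1)
    w = np.exp(x @ Wk)
    den = np.cumsum(w, axis=0)               # den[t, l] = sum over s <= t of w[s, l]
    # A[t, s] = sum_l p_latent[t, l] * w[s, l] / den[t, l], masked to s <= t
    A = np.tril((p_latent / den) @ w.T)
    return A @ v, A

# Small random example; sizes and scales are arbitrary.
rng = np.random.default_rng(0)
T, D, L = 12, 8, 4
x = rng.normal(size=(T, D))
Wq, Wk, Wv = (0.1 * rng.normal(size=s) for s in [(D, L), (D, L), (D, D)])
y_rec = latent_attention_recurrent(x, Wq, Wk, Wv)
y_ref, A = latent_attention_quadratic(x, Wq, Wk, Wv)
```

In this sketch the recurrent state (`num` and `den`) has size L × D + L regardless of how many tokens have been processed, which is what makes the per-token cost constant. The exponentials are not stabilised here; a practical implementation would likely track a running maximum per latent to avoid overflow.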