Transformer Quality in Linear Time
Weizhe Hua · Zihang Dai · Hanxiao Liu · Quoc Le

Thu Jul 21 11:40 AM -- 11:45 AM (PDT) @ Hall F

We challenge some key design choices in Transformers by presenting a simpler yet more powerful layer named the gated attention unit (GAU). The layer offers several desirable properties that enable a high-performance, accelerator-friendly approximate attention mechanism. The resulting architecture family, named FLASH, simultaneously achieves Transformer quality and linear cost over a wide range of context lengths (from 512 to 8K). Experiments on bidirectional and auto-regressive language modeling tasks demonstrate that FLASH can compete with fully augmented Transformers in quality, while being substantially faster to train than existing efficient attention methods.
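To make the gated attention unit concrete, here is a minimal NumPy sketch of one quadratic-attention GAU forward pass, following the paper's description: the gate (U) and value (V) branches come from expanded projections, queries and keys are cheap per-dimension transforms of a shared low-dimensional base Z, attention uses a relu-squared score instead of softmax, and the attended values are gated elementwise before the output projection. All weight names, dimensions, and initializations below are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def silu(x):
    # SiLU / Swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def gau_forward(x, params):
    """One gated attention unit (quadratic variant), sketched for clarity."""
    n, d = x.shape
    u = silu(x @ params["wu"])            # gating branch, shape (n, e)
    v = silu(x @ params["wv"])            # value branch, shape (n, e)
    z = silu(x @ params["wz"])            # shared low-dim base, shape (n, s)
    q = z * params["gq"] + params["bq"]   # queries: per-dim scale and offset of z
    k = z * params["gk"] + params["bk"]   # keys: another cheap transform of z
    scores = np.maximum(q @ k.T, 0.0) ** 2 / n   # relu^2 attention, no softmax
    return (u * (scores @ v)) @ params["wo"]     # gate attended values, project out

# Illustrative sizes: model dim d, expanded dim e, shared head dim s, length n
rng = np.random.default_rng(0)
d, e, s, n = 64, 128, 16, 10
params = {
    "wu": rng.normal(0, 0.02, (d, e)),
    "wv": rng.normal(0, 0.02, (d, e)),
    "wz": rng.normal(0, 0.02, (d, s)),
    "gq": np.ones(s), "bq": np.zeros(s),
    "gk": np.ones(s), "bk": np.zeros(s),
    "wo": rng.normal(0, 0.02, (e, d)),
}
x = rng.normal(0, 1.0, (n, d))
y = gau_forward(x, params)
print(y.shape)  # (10, 64): output keeps the input's (length, model-dim) shape
```

The gating is what lets a single weak attention head suffice; the linear-time variant in the paper replaces the full n-by-n score matrix with a chunked approximation, which this sketch omits.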

Author Information

Weizhe Hua (Cornell University / Google Brain)
Zihang Dai
Hanxiao Liu (Google Brain)
Quoc Le (Google Brain)
