

Spotlight Poster

Transformers, parallel computation, and logarithmic depth

Clayton Sanford · Daniel Hsu · Matus Telgarsky

Hall C 4-9 #400
Wed 24 Jul 4:30 a.m. PDT — 6 a.m. PDT

Abstract:

We show that a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of Massively Parallel Computation (MPC). As a consequence, we show that logarithmic depth is sufficient for transformers to solve basic computational tasks that cannot be efficiently solved by several other neural sequence models and sub-quadratic transformer approximations. We thus establish parallelism as a key distinguishing property of transformers.
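A rough schematic of the stated equivalence, in our own notation rather than the paper's (it suppresses the precise width and efficiency conditions): an MPC protocol using R communication rounds corresponds, in both simulation directions, to a transformer with on the order of R self-attention layers, which is what makes logarithmic depth sufficient for the tasks above.

% Informal schematic only; constants and conditions are illustrative, not the paper's exact statements.
\[
  \text{MPC protocol with } R \text{ rounds}
  \;\;\Longleftrightarrow\;\;
  \text{transformer with } \Theta(R) \text{ self-attention layers},
\]
\[
  \text{so } R = O(\log n) \text{ rounds}
  \;\Longrightarrow\;
  \text{transformer depth } O(\log n) \text{ suffices.}
\]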
