Revisiting Padded Transformer Expressivity: Which Architectural Choices Matter and Which Don't
Anej Svete ⋅ William Merrill ⋅ Ryan Cotterell ⋅ Ashish Sabharwal
Abstract
Recent work describes what transformers can and cannot compute through connections to Boolean circuits, but existing results lack exact characterizations and are sensitive to modeling choices. Padded transformers---whose input is padded with filler symbols such as ``...''---emerge as a useful gadget for establishing equivalences to circuit classes by providing polynomial space for adaptive parallel computation. However, only a limited set of padded transformer idealizations has been studied, leaving open how robust these equivalences are to choices such as attention type, model width, and uniformity. We find that, under practical assumptions, padded transformers are surprisingly robust to all of these choices, and we identify numeric precision and model depth as the main factors affecting expressivity. Concretely, we prove that polynomially padded L-uniform constant-precision transformers with growing width equal L-uniform $\text{AC}^{0}$, while growing-precision ones achieve L-uniform $\text{TC}^{0}$ regardless of width. Furthermore, looping enables sequential processing analogous to circuit depth: $\log^d N$-looped constant-precision transformers reach FO-uniform $\text{AC}^{d}$, and growing-precision ones reach FO-uniform $\text{TC}^{d}$. Interestingly, growing width or precision beyond logarithmic does not increase expressivity, and all our results hold for both softmax and average hard attention transformers.
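For illustration only, one way to formalize the padding gadget described above is as follows; the notation ($\mathcal{T}$, the filler symbol $\texttt{\#}$, and the exponent $k$) is an assumption introduced here for exposition rather than the paper's own definitions. A transformer $\mathcal{T}$ could be said to recognize a language $L \subseteq \Sigma^{*}$ with polynomial padding if, for some fixed $k$,
\[
  w \in L \iff \mathcal{T}\bigl(w \cdot \texttt{\#}^{\,|w|^{k}}\bigr) \text{ accepts}, \qquad \text{for all } w \in \Sigma^{*},
\]
so the $|w|^{k}$ filler positions supply the polynomial amount of extra space available for adaptive parallel computation.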