The Expressivity Limits of Transformers
Abstract
We study the fundamental expressivity limits of transformer models by formalizing the notion of accessible sequences---those that a transformer can produce for some prompt---and characterizing how accessibility depends on prompt length and model parameters. Our analysis provides a theoretical explanation for previously observed empirical failures of transformers on simple sequence tasks---such as copying and cramming---and yields both qualitative and quantitative predictions that hold across a wide range of architectures and model sizes. We prove that (i) the maximal length of accessible sequences grows linearly with the prompt length, (ii) beyond a critical threshold, the proportion of accessible sequences decays exponentially with sequence length, and (iii) the linear coefficient relating prompt length to accessible sequence length admits a theoretical upper bound. Notably, these results hold even with unbounded context and computation time. Experiments using a “cramming” procedure confirm the linear scaling, the post-threshold exponential decay, and the tightness of the theoretical upper bound across different sizes of the Pythia, Llama, and Qwen architectures.
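For orientation, the three results can be sketched informally as follows; all symbols here ($n$, $L_{\max}$, $A$, $\ell^{*}$, $\alpha$, $\bar{\alpha}$, $C$, $c$) are illustrative placeholders rather than the paper's own notation. Writing $n$ for the prompt length, $L_{\max}(n)$ for the maximal length of an accessible sequence, and $A(\ell)$ for the proportion of length-$\ell$ sequences that are accessible,
% Hedged sketch of claims (i)-(iii); notation is assumed, not taken from the paper.
\begin{align*}
\text{(i)} \quad & L_{\max}(n) \approx \alpha\, n, \\
\text{(ii)} \quad & A(\ell) \le C\, e^{-c\,\ell} \quad \text{for all } \ell > \ell^{*}, \\
\text{(iii)} \quad & \alpha \le \bar{\alpha},
\end{align*}
where $C, c > 0$ are constants, $\ell^{*}$ is the critical threshold, and $\bar{\alpha}$ is the theoretical upper bound on the linear coefficient.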