Length Generalization Bounds for Transformers
Anthony Lin ⋅ Pascal Bergsträßer ⋅ Georg Zetzsche ⋅ Andy Yang ⋅ David Chiang
Abstract
Length generalization is a key property of a learning algorithm that enables it to make correct predictions on inputs of unbounded length, given finite training data. To provide such a guarantee, one needs to be able to compute a length generalization bound, beyond which the model is guaranteed to generalize. This paper concerns the open problem of the computability of such generalization bounds for $\mathsf{C}$-$\mathsf{RASP}$, a formalism closely linked to transformers. A positive partial result was recently shown by Chen et al. whenever the concept is definable in $\mathsf{C}$-$\mathsf{RASP}$ with only one layer and, under some restrictions, also with two layers. We provide complete answers to this open problem. Our main result is the non-existence of computable length generalization bounds for $\mathsf{C}$-$\mathsf{RASP}$ (already with two layers) and hence for transformers. To complement this, we provide a computable bound for concepts representable in the positive fragment of $\mathsf{C}$-$\mathsf{RASP}$, which we show to be equivalent to fixed-precision transformers. For both positive $\mathsf{C}$-$\mathsf{RASP}$ and fixed-precision transformers, we show that the length complexity is exponential and prove that the bounds are optimal.
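For concreteness, a length generalization bound can be formalized as follows (a sketch of one standard formulation; the precise definition used in the paper may differ in details such as the dependence on the learning algorithm). For a concept class $\mathcal{C}$ over an alphabet $\Sigma$, a length generalization bound is a computable function $N \colon \mathcal{C} \to \mathbb{N}$ such that for all $c, h \in \mathcal{C}$,
\[
  \bigl(\forall w \in \Sigma^{\le N(c)} \colon h(w) = c(w)\bigr)
  \;\Longrightarrow\;
  \bigl(\forall w \in \Sigma^{*} \colon h(w) = c(w)\bigr),
\]
that is, any hypothesis $h$ agreeing with the target concept $c$ on all inputs of length at most $N(c)$ must agree with $c$ on inputs of every length.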