On the Intrinsic Limits of Transformer Image Embeddings in Non-Solvable Spatial Reasoning
Siyi Lyu ⋅ Quan Liu ⋅ Feng Yan
Abstract
Vision Transformers (ViTs) excel in semantic recognition but exhibit systematic failures in spatial reasoning tasks such as mental rotation. While these failures are often attributed to insufficient data scale, this work argues that the limitation arises from the intrinsic circuit complexity of the architecture. By formalizing spatial understanding as a **Group Homomorphism Problem**—requiring that latent embeddings preserve the algebraic composition of physical transformations acting on images—a fundamental computational bottleneck is identified. Specifically, for non-solvable groups (e.g., $\mathrm{SO}(3)$), maintaining such structure-preserving embeddings is lower-bounded by the Word Problem, which is $\mathsf{NC^1}$-complete. In contrast, constant-depth ViTs with polynomial precision are strictly bounded by the complexity class $\mathsf{TC^0}$. Under the standard conjecture $\mathsf{TC^0} \subsetneq \mathsf{NC^1}$, a **complexity boundary** emerges: constant-depth architectures lack the logical depth required to capture non-solvable spatial structures. This theoretical gap is empirically validated via the **Latent Space Algebra (LSA)** benchmark, which reveals a structural collapse in ViT representations as the compositional depth of non-solvable tasks increases.
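The group-compositional structure at the heart of the abstract can be made concrete with a minimal sketch (not from the paper; the rotation-matrix helpers below are illustrative assumptions): a structure-preserving embedding $\phi$ must satisfy $\phi(g \cdot h) = \phi(g) \odot \phi(h)$, and for $\mathrm{SO}(3)$ the order of composition matters, so the embedding cannot simply average or commute transformations. The snippet demonstrates this non-commutativity; non-solvability of $\mathrm{SO}(3)$ is a strictly stronger property than the non-commutativity shown here, but the latter already rules out the simplest order-insensitive embeddings.

```python
import numpy as np

def rot_x(theta: float) -> np.ndarray:
    """Rotation about the x-axis, an element of SO(3)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0,   c,  -s],
                     [0.0,   s,   c]])

def rot_y(theta: float) -> np.ndarray:
    """Rotation about the y-axis, an element of SO(3)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[  c, 0.0,   s],
                     [0.0, 1.0, 0.0],
                     [ -s, 0.0,   c]])

a = np.pi / 2  # 90-degree rotations

# Composing in different orders yields different group elements:
# SO(3) is non-abelian, so an embedding that ignores order of
# composition cannot be a group homomorphism.
xy = rot_x(a) @ rot_y(a)
yx = rot_y(a) @ rot_x(a)
print("order matters:", not np.allclose(xy, yx))
```

Tracking such compositions to depth $n$ amounts to evaluating a word of length $n$ in the group, which is the Word Problem the abstract invokes as the $\mathsf{NC^1}$-complete lower bound.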