Transformers learn factored representations
Abstract
Transformers pretrained via next-token prediction learn to factor their world into parts, representing these factors in orthogonal subspaces of the residual stream. We formalize two representational hypotheses: (1) a representation in the product space of all factors, whose dimension grows exponentially with the number of parts, and (2) a factored representation in orthogonal subspaces, whose dimension grows linearly. Both track context-induced uncertainty over the latent parts, but the factored representation sacrifices fidelity when the factors are not conditionally independent. For each hypothesis we derive precise predictions about the geometric structure of activations, including the number of subspaces, their dimensionality, and the arrangement of context embeddings within them. We test these hypotheses on transformers trained on synthetic processes with known latent structure. When factors are conditionally independent, models learn factored representations; when noise or dependencies break this structure, models gradually expand their effective dimensionality over training to recover fidelity. These results provide a principled explanation for why transformers decompose the world into parts and suggest that interpretable low-dimensional structure may persist even in models trained on complex data.
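As a minimal sketch of the scaling claim (the symbols k and m, and the simplex-counting argument, are illustrative assumptions rather than the paper's formal setup): with k latent factors of m states each, tracking uncertainty over the joint configuration requires a number of dimensions that grows like m^k, whereas tracking each factor in its own subspace requires only about k·m dimensions,
\[
\underbrace{m^{k}-1}_{\text{belief simplex over joint configurations}}
\quad\text{vs.}\quad
\underbrace{k\,(m-1)}_{\text{one belief simplex per factor}} .
\]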