The Extra Tokens Matter: Disentangled Representation Learning with Vision Transformers
Abstract
Vision Transformers increasingly incorporate extra tokens beyond patch tokens—from class tokens for aggregation to register tokens for artifact mitigation. While effective for their intended purposes, these tokens typically lack semantic structure. We ask a more ambitious question: can we design regularization constraints that transform extra tokens into disentangled representations, enabling them to decompose images into semantic parts (e.g., heads, bodies, legs) without explicit supervision? We propose XTRA, an intuitive yet powerful framework that augments Vision Transformers with dedicated ``factor tokens'' and enforces disentanglement via a novel Minimum Volume Constraint (MVC). A multi-stage aggregation process further refines these factor tokens into semantically pure components, preventing the token collapse that often occurs when training with the MVC alone. On ImageNet-1K, XTRA achieves superior disentanglement (an 8.4× improvement in SEPIN@1 over DINOv2) while simultaneously improving representation quality: KNN accuracy improves by 5.8\% and linear-probe accuracy by 2.3\%.
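To make the two core ingredients of the abstract concrete, the PyTorch sketch below illustrates (i) appending learnable factor tokens to a ViT's patch-token sequence and (ii) one plausible minimum-volume-style regularizer based on the log-determinant of the factor Gram matrix. This is a minimal illustration under stated assumptions, not the paper's implementation: the class and function names (\texttt{FactorTokenViT}, \texttt{minimum\_volume\_penalty}), the choice of \texttt{num\_factors}, and the exact MVC formulation are all hypothetical.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorTokenViT(nn.Module):
    """Illustrative sketch: a ViT encoder augmented with K learnable
    "factor tokens" processed alongside the patch tokens. The encoder,
    embed_dim, and num_factors are assumptions, not the paper's values."""

    def __init__(self, encoder: nn.Module, embed_dim: int = 768,
                 num_factors: int = 8):
        super().__init__()
        self.encoder = encoder  # any stack of transformer blocks over (B, T, D)
        self.factor_tokens = nn.Parameter(torch.zeros(1, num_factors, embed_dim))
        nn.init.trunc_normal_(self.factor_tokens, std=0.02)

    def forward(self, patch_tokens: torch.Tensor):
        # patch_tokens: (B, N, D) embedded image patches
        B = patch_tokens.size(0)
        factors = self.factor_tokens.expand(B, -1, -1)
        x = torch.cat([factors, patch_tokens], dim=1)  # prepend factor tokens
        x = self.encoder(x)
        K = self.factor_tokens.size(1)
        return x[:, :K], x[:, K:]  # (factor outputs, patch outputs)


def minimum_volume_penalty(factors: torch.Tensor,
                           eps: float = 1e-4) -> torch.Tensor:
    """One plausible minimum-volume-style regularizer (an assumption; the
    paper's MVC may differ): penalize the log-volume spanned by the
    unit-normalized factor tokens via the log-determinant of their Gram
    matrix. Used alone, minimizing this can drive token collapse, which
    matches the abstract's motivation for multi-stage aggregation."""
    f = F.normalize(factors, dim=-1)     # (B, K, D), unit-norm factors
    gram = f @ f.transpose(1, 2)         # (B, K, K) cosine similarities
    eye = torch.eye(gram.size(-1), device=gram.device, dtype=gram.dtype)
    return torch.logdet(gram + eps * eye).mean()
\end{verbatim}

In training, such a penalty would simply be added to the task objective with a small weight, e.g.\ \texttt{loss = task\_loss + lambda\_mvc * minimum\_volume\_penalty(factor\_out)}; the weight \texttt{lambda\_mvc} is likewise a hypothetical hyperparameter.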