MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality
Abstract
Unified visual tokenization faces a fundamental trade-off: optimizing for high-fidelity pixel reconstruction (spatial equivariance) inherently conflicts with semantic abstraction (conceptual invariance). We identify the root cause as Manifold Misalignment, where naive joint optimization produces conflicting gradients that force a zero-sum game between the two objectives. In this paper, we propose MUSE, a framework that resolves this deadlock via Topological Orthogonality. Recognizing structure as the orthogonal bridge between the two objectives, MUSE explicitly decouples their optimization subspaces within Transformers: structural gradients are routed to refine the attention topology, while semantic gradients update the feature values, transforming destructive interference into mutual reinforcement. Extensive experiments demonstrate that MUSE breaks the trade-off, matching state-of-the-art generation quality (gFID 3.08) while notably outperforming its own teacher InternViT-300M in linear probing (85.2% vs. 82.5%), showing that structurally aligned reconstruction actively refines semantic perception.
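The gradient routing described above can be illustrated with a minimal PyTorch sketch. This is not MUSE's actual implementation; it only shows one standard way (via `Tensor.detach`) to make a "structural" loss update only the query/key projections that shape the attention topology, while a "semantic" loss updates only the value projection. All module and variable names here are illustrative.

```python
import torch
import torch.nn as nn


class RoutedAttention(nn.Module):
    """Toy single-head attention with decoupled gradient paths.

    The structural branch detaches the values, so its loss can only
    reach the Q/K projections (the attention topology). The semantic
    branch detaches the attention map, so its loss can only reach V.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        scale = x.shape[-1] ** 0.5
        attn = torch.softmax(self.q(x) @ self.k(x).transpose(-2, -1) / scale, dim=-1)
        values = self.v(x)
        out_structural = attn @ values.detach()   # grads flow to Q/K only
        out_semantic = attn.detach() @ values     # grads flow to V only
        return out_structural, out_semantic


dim = 8
block = RoutedAttention(dim)
x = torch.randn(2, 4, dim)
out_structural, out_semantic = block(x)

# A "structural" loss touches only Q/K; V receives no gradient from it.
out_structural.sum().backward()
assert block.q.weight.grad is not None
assert block.v.weight.grad is None

# A "semantic" loss then touches only V.
out_semantic.sum().backward()
assert block.v.weight.grad is not None
```

Because the two losses touch disjoint parameter subsets, their gradients cannot destructively interfere in parameter space, which is the orthogonality property the abstract refers to.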