MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality
Abstract
Unified visual tokenization faces a fundamental trade-off: optimizing for high-fidelity pixel reconstruction (spatial equivariance) inherently conflicts with semantic abstraction (conceptual invariance). We identify the root cause as Manifold Misalignment, where naive joint optimization produces conflicting gradients that force a zero-sum game between the two objectives. In this paper, we propose MUSE, a framework that resolves this deadlock via Topological Orthogonality. Recognizing structure as the orthogonal bridge between the two objectives, MUSE explicitly decouples their optimization subspaces within Transformers: structural gradients are routed to refine the attention topology, while semantic gradients update the feature values, transforming destructive interference into mutual reinforcement. Extensive experiments demonstrate that MUSE breaks the trade-off, matching state-of-the-art generation quality (gFID 3.08) while notably outperforming its own teacher InternViT-300M in linear probing (85.2% vs. 82.5%), showing that structurally aligned reconstruction actively refines semantic perception.
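The gradient routing described above can be illustrated with a minimal PyTorch sketch. This is not MUSE's actual implementation; it only shows one standard way (via `Tensor.detach`) to make a "structural" loss update only the query/key projections that shape the attention topology, while a "semantic" loss updates only the value projection. All module and variable names here are illustrative.

```python
import torch
import torch.nn as nn


class RoutedAttention(nn.Module):
    """Toy single-head attention with decoupled gradient paths.

    The structural branch detaches the values, so its loss can only
    reach the Q/K projections (the attention topology). The semantic
    branch detaches the attention map, so its loss can only reach V.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        scale = x.shape[-1] ** 0.5
        attn = torch.softmax(self.q(x) @ self.k(x).transpose(-2, -1) / scale, dim=-1)
        values = self.v(x)
        out_structural = attn @ values.detach()   # grads flow to Q/K only
        out_semantic = attn.detach() @ values     # grads flow to V only
        return out_structural, out_semantic


dim = 8
block = RoutedAttention(dim)
x = torch.randn(2, 4, dim)
out_structural, out_semantic = block(x)

# A "structural" loss touches only Q/K; V receives no gradient from it.
out_structural.sum().backward()
assert block.q.weight.grad is not None
assert block.v.weight.grad is None

# A "semantic" loss then touches only V.
out_semantic.sum().backward()
assert block.v.weight.grad is not None
```

Because the two losses touch disjoint parameter subsets, their gradients cannot destructively interfere in parameter space, which is the orthogonality property the abstract refers to.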