Concept-Guided Tokenization: Closing the Gap Between Reconstruction and Generation
Abstract
Recent advances in image generation have been driven largely by image tokenization, which compresses raw pixels into compact latent representations. While existing tokenizers excel at preserving low-level visual detail through reconstruction-based training, they typically lack explicit semantic guidance, which limits their ability to form semantically structured representations and thus hinders downstream tasks such as image generation. To overcome this limitation, we propose a novel tokenization framework that incorporates high-level semantics through two key innovations: (1) a text-integrated encoder that jointly processes images and textual descriptions to produce semantically enriched latent representations, and (2) a concept-guided training objective that leverages sparse autoencoders to decompose pre-trained vision-language model features into a semantic concept space, using the resulting sparse, disentangled concept indices as guidance. Our approach achieves strong alignment with semantic concepts while maintaining high reconstruction fidelity, reaching an rFID of 1.39 on ImageNet, a gFID of 2.65 on class-conditional image generation, and a gFID of 10.73 on text-to-image generation. By infusing high-level semantic structure into tokens that retain low-level visual fidelity, our method bridges the reconstruction-generation divide and provides a strong foundation for generative modeling.
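To make the concept-guided objective concrete, the following is a minimal sketch (not the authors' implementation) of how a sparse autoencoder can decompose frozen vision-language features into a sparse concept space and how tokenizer latents might be aligned with those concept codes. All module names, dimensions, the top-k sparsity scheme, and the alignment loss below are illustrative assumptions rather than the paper's exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    # Assumed SAE: decomposes a VLM feature (e.g., a CLIP image embedding)
    # into a high-dimensional, sparse, non-negative concept activation vector.
    def __init__(self, feat_dim=768, concept_dim=4096, k=32):
        super().__init__()
        self.encoder = nn.Linear(feat_dim, concept_dim)
        self.decoder = nn.Linear(concept_dim, feat_dim)
        self.k = k  # number of active concepts kept per sample (top-k sparsity)

    def forward(self, feats):
        codes = F.relu(self.encoder(feats))           # non-negative concept activations
        topk = torch.topk(codes, self.k, dim=-1)      # keep only the k strongest concepts
        sparse = torch.zeros_like(codes).scatter_(-1, topk.indices, topk.values)
        recon = self.decoder(sparse)                  # reconstruct the original VLM feature
        return sparse, recon

def concept_guidance_loss(tokenizer_latent, vlm_feature, sae, proj):
    # Align tokenizer latents with sparse concept codes derived from frozen VLM features.
    with torch.no_grad():                             # the SAE acts as a fixed concept oracle here
        target_codes, _ = sae(vlm_feature)
    pred_codes = F.relu(proj(tokenizer_latent))       # project tokenizer latent into concept space
    return F.mse_loss(pred_codes, target_codes)

if __name__ == "__main__":
    # Toy usage with random tensors standing in for real features.
    sae = SparseAutoencoder()
    proj = nn.Linear(256, 4096)                       # tokenizer latent dim -> concept dim (assumed)
    z = torch.randn(8, 256)                           # tokenizer latents for a batch of 8 images
    f = torch.randn(8, 768)                           # frozen VLM image features
    print(concept_guidance_loss(z, f, sae, proj).item())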