A Flat Vocabulary or a Rich Hierarchy? Re-introducing Intrinsic Structure Transforms Autoregressive Image Generation
Abstract
Autoregressive (AR) models have shown great promise in image generation, yet they suffer from a fundamental inefficiency stemming from their core component: a vast, unstructured vocabulary of visual tokens. By treating tokens as a flat set, standard models overlook the manifold structure of the embedding space, in which geometric proximity reflects semantic similarity. This oversight makes the prediction task unnecessarily difficult, hindering training efficiency and limiting generation quality. To resolve this, we propose Manifold-Aligned Semantic Clustering (MASC), a principled framework that constructs a hierarchical semantic tree directly from the intrinsic geometry of the codebook. By combining a geometry-aware distance metric with density-driven agglomerative construction, MASC faithfully models the token embedding manifold. Transforming the flat, high-dimensional prediction problem into a structured hierarchical task introduces a powerful inductive bias that simplifies learning. Designed as a plug-and-play module, MASC accelerates training by up to 71\% and significantly improves generation quality, reducing LlamaGen-XL's FID from 2.87 to 2.49. Crucially, MASC also serves as a convergence enabler for complex architectures. These results establish that structuring the prediction space is as vital as architectural innovation, elevating existing AR frameworks to state-of-the-art performance. Our code is available at \url{https://anonymous.4open.science/r/anonymous_MASC-50F6/}.