Hierarchical Image Tokenization for Multi-Scale Image Super Resolution
Abstract
We introduce a multi-scale Image Super Resolution (ISR) method building on recent advances in Visual Auto-Regressive (VAR) modeling. VAR models have recently challenged the dominance of diffusion-based models by adopting a next-scale prediction paradigm: they iteratively estimate the residual in latent space between gradually increasing image scales, an approach that aligns perfectly with our target ISR task. Previous works taking advantage of this synergy suffer from two main shortcomings. First, due to the limitations of the residual quantizers used in VAR models, they typically generate images only at a predefined fixed scale, i.e., they fail to map intermediate outputs to the corresponding image scales. Second, to achieve better performance, they rely on large backbones and either an external VLM for guidance or a large corpus of carefully annotated external data. To address both shortcomings, we introduce two novel components to VAR training for ISR, aiming to increase its flexibility and reduce its complexity. In particular, we introduce a) a Hierarchical Image Tokenization (HIT) approach with a multi-scale image tokenizer that progressively represents images at different scales while enforcing token overlap across scales, and b) a Direct Preference Optimization (DPO) regularization term that, relying solely on the (LR, HR) pair, encourages the transformer to produce the latter over the former. The resulting model can denoise the LR image and super-resolve it at different upscale factors in a single forward pass, achieving state-of-the-art results with a relatively small model (300M parameters vs. ~1B for VARSR) and without the need for external training data.
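To make the next-scale residual paradigm concrete, the following is a minimal illustrative sketch (not the paper's actual tokenizer): a latent vector is greedily decomposed into per-scale residual tokens, where each scale encodes what the coarser scales could not represent. Here `down`/`up` are simple average-pool and nearest-neighbour resampling stand-ins for the quantizer's scale transforms, and the 1D setting is an assumption for brevity.

```python
import numpy as np

def down(x, s):
    # Average-pool a length-N vector to length s (assumes s divides len(x)).
    return x.reshape(s, -1).mean(axis=1)

def up(x, n):
    # Nearest-neighbour upsample back to length n (assumes len(x) divides n).
    return np.repeat(x, n // len(x))

def multiscale_residual_tokens(latent, scales):
    """Greedy multi-scale residual decomposition, in the spirit of VAR's
    next-scale prediction: each scale s models the residual left over after
    all coarser scales have been accounted for."""
    residual = latent.astype(float).copy()
    tokens = []
    for s in scales:
        coarse = down(residual, s)          # represent residual at scale s
        tokens.append(coarse)
        residual = residual - up(coarse, len(latent))  # what remains for finer scales
    return tokens

def reconstruct(tokens, n):
    # Summing the upsampled per-scale tokens recovers the latent.
    return sum(up(t, n) for t in tokens)
```

Because the final scale equals the full resolution, the decomposition is exact: summing the upsampled tokens reconstructs the original latent, which is the property that lets a VAR-style model refine an image scale by scale.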