ViTok-v2: Scaling Native-Resolution Autoencoders to 5B
Abstract
Vision Transformer (ViT) tokenizers offer a scalable alternative to convolutional autoencoders, yet current architectures have two key limitations: their performance degrades when images vary in aspect ratio or resolution, and their reliance on adversarial losses makes them harder to train at scale. To address this, we introduce ViTok-v2, a ViT tokenizer building on ViTok. We add native-resolution support via NaFlex with 2D RoPE and stabilize training by replacing the standard LPIPS-plus-discriminator objective with a novel DINO perceptual loss. We scale our model to 5B parameters, training the largest ViT-based image compression autoencoder to date, and demonstrate continued improvements with scale. In downstream generation experiments with flow matching models, we find that smaller generators perform best with aggressive channel compression, while larger generators effectively leverage higher channel counts. ViTok-v2 matches state-of-the-art reconstruction at 256p and outperforms across benchmarks at 512p and higher resolutions, while remaining compatible with any pipeline requiring flexible aspect ratios.
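To make the loss substitution concrete: a perceptual loss of this kind measures reconstruction error in the feature space of a frozen pretrained encoder rather than in pixel space. The sketch below illustrates only that general idea; the actual ViTok-v2 objective would use a frozen DINO ViT as the encoder, and the choice of feature layers and distance here is our assumption, not the paper's specification. A fixed random linear map stands in for the encoder so the example stays self-contained.

```python
import numpy as np

def perceptual_loss(recon, target, encoder):
    """Mean squared distance between encoder features of two images."""
    return float(np.mean((encoder(recon) - encoder(target)) ** 2))

# Stand-in "frozen encoder": a fixed random linear map over flattened
# pixels. In practice this role would be played by a frozen DINO ViT
# (hypothetical setup; the abstract does not specify layers or metric).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 3 * 8 * 8)) / 8.0
encoder = lambda img: W @ img.reshape(-1)

target = rng.standard_normal((3, 8, 8))          # "ground-truth" image
recon = target + 0.1 * rng.standard_normal((3, 8, 8))  # noisy reconstruction

print(perceptual_loss(target, target, encoder))  # exact match -> 0.0
print(perceptual_loss(recon, target, encoder))   # imperfect match -> > 0
```

Because the encoder is frozen, gradients flow only into the reconstruction, so this term can replace an adversarial discriminator without the instability of a second trained network.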