Scalable GANs with Transformers
Abstract
Scalability has driven recent advances in generative modeling, yet it remains underexplored for adversarial learning. We study the scaling behavior of Generative Adversarial Networks through two design choices: training in a compact Variational Autoencoder latent space and using purely transformer-based generators and discriminators. While this setup is efficient and scales well with compute, scaling it naively exposes failure modes: underutilization of the generator's early layers and growing optimization instability. We address these issues with lightweight intermediate supervision and width-aware learning-rate adjustment. Our Generative Adversarial Transformers (GAT) train reliably from small (S) to extra-large (XL) model sizes, and our GAT-XL model achieves state-of-the-art single-step class-conditional generation on ImageNet at 256×256 resolution (FID of 2.18) in 60 epochs, 4× fewer than strong baselines.
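The abstract names width-aware learning-rate adjustment without detailing it. Below is a minimal, assumption-based sketch of one common form of such a rule: scaling each weight matrix's learning rate inversely with its fan-in relative to a base width, in the spirit of muP-style scaling. The function name, the base_lr/base_width values, and the toy block are illustrative, not the paper's exact recipe.

```python
import torch
import torch.nn as nn


def width_aware_param_groups(model: nn.Module, base_lr: float, base_width: int):
    """Build optimizer param groups whose learning rate shrinks as layer width grows."""
    groups = []
    for name, p in model.named_parameters():
        if p.ndim >= 2:
            # Weight matrices: scale lr by base_width / fan_in (assumed rule).
            fan_in = p.shape[1]
            lr = base_lr * base_width / fan_in
        else:
            # Biases and norm parameters: keep the base learning rate.
            lr = base_lr
        groups.append({"params": [p], "lr": lr})
    return groups


# Usage with a toy wide MLP block; an actual transformer generator/discriminator
# would simply pass its own module here.
block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
optimizer = torch.optim.AdamW(width_aware_param_groups(block, base_lr=1e-4, base_width=256))
```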