RobuQ: Pushing DiTs to W1.58A2 via Robust Activation Quantization
Abstract
Diffusion Transformers (DiTs) have emerged as a powerful backbone for image generation, offering superior scalability over U-Nets. However, their practical deployment is hindered by significant computational costs. While Quantization-Aware Training (QAT) shows promise, its application to DiTs is challenged by the high sensitivity and complex distributions of activations. Identifying activation quantization as the primary bottleneck in low-bit settings, we propose RobuQ, a systematic QAT framework. We first establish a strong ternary-weight (W1.58A4) baseline. Building on this, we introduce RobustQuantizer, which applies the Hadamard transform to convert unknown per-token activation distributions into approximately normal ones that are easier to quantize. Furthermore, we propose AMPN, the first Activation-only Mixed-Precision Network pipeline, which applies ternary weights globally while allocating layer-specific activation precisions to eliminate information bottlenecks. Extensive experiments demonstrate that RobuQ achieves state-of-the-art performance on ImageNet-1K, representing the first stable image generation with activations quantized to an average of 2 bits.
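To make the core idea behind RobustQuantizer concrete, the following is a minimal, hypothetical PyTorch sketch of Hadamard-rotated per-token activation quantization. It is an illustration under stated assumptions, not the paper's implementation: the function names, the max-based per-token scale, the symmetric rounding scheme, and the power-of-two hidden size are all assumptions for this example.

```python
import torch


def hadamard_matrix(n: int) -> torch.Tensor:
    """Orthonormal n x n Hadamard matrix via Sylvester's construction (n must be a power of two)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / H.shape[0] ** 0.5  # scale so that H @ H.T = I


def quantize_activations(x: torch.Tensor, bits: int = 2) -> torch.Tensor:
    """Rotate per-token activations with a Hadamard matrix, quantize, then rotate back.

    The rotation mixes channels so each token's values become approximately
    normally distributed, which makes a simple uniform quantizer far more robust
    to outliers than quantizing the raw activations directly.
    """
    d = x.shape[-1]
    H = hadamard_matrix(d).to(dtype=x.dtype, device=x.device)
    x_rot = x @ H                                      # per-token rotation
    qmax = 2 ** (bits - 1) - 1                         # e.g. levels {-1, 0, 1} for 2-bit symmetric
    scale = x_rot.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    x_q = torch.clamp(torch.round(x_rot / scale), -qmax, qmax) * scale
    return x_q @ H.T                                   # inverse rotation (H is orthonormal)


# toy usage: a batch of 8 tokens with hidden size 64
x = torch.randn(8, 64)
print(quantize_activations(x, bits=2).shape)  # torch.Size([8, 64])
```

In an actual QAT setting, the rounding step would typically be paired with a straight-through estimator so gradients can flow through the quantizer during training; that detail is omitted here for brevity.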