FourTune: Towards Fully 4-Bit Efficient Post-Training for Diffusion Models
Bowen Xue ⋅ Zihan Min ⋅ Xingyang Li ⋅ Muyang Li ⋅ Yujun Lin ⋅ Zhekai Zhang ⋅ Haocheng Xi ⋅ Lvmin Zhang ⋅ Maneesh Agrawala ⋅ Jun-Yan Zhu ⋅ Song Han
Abstract
Diffusion models have become a dominant paradigm for high-quality generative modeling, and post-training is essential for adapting them to diverse downstream applications. However, post-training large diffusion models remains challenging due to prohibitive memory footprints and slow training speed, which existing parameter-efficient fine-tuning methods only partially address. To overcome these limitations, we propose FourTune, an efficient post-training framework for diffusion models built on an end-to-end W4A4G4 (4-bit weights, activations, and gradients) paradigm. FourTune introduces a triple-branch hybrid pipeline that augments the standard LoRA architecture with a frozen numerical stabilizer to isolate quantization-sensitive outliers, enabling stable training under native 4-bit computation. In addition, FourTune employs hardware-efficient block-wise quantization and customized fused kernels to support efficient quantized backpropagation and reduce memory-bandwidth overhead. Across customization, reinforcement learning, and distillation tasks, FourTune matches the quality of full-precision fine-tuning. On FLUX.1-dev (12B), FourTune reduces memory overhead by $2.25\times$ and increases end-to-end training throughput by $2.27\times$ compared to BF16 LoRA.
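A minimal sketch of the triple-branch idea described in the abstract, assuming a PyTorch-style layer: a frozen base weight passed through simulated block-wise 4-bit quantization, a frozen high-precision stabilizer branch holding the most outlier-heavy input channels, and a trainable LoRA branch. All names here (`TripleBranchLinear`, `blockwise_quant_dequant`, the rank, block, and outlier counts) are illustrative assumptions, not FourTune's actual implementation, which runs on native 4-bit kernels with quantized gradients rather than fake quantization.

```python
import torch
import torch.nn as nn


def blockwise_quant_dequant(w: torch.Tensor, block: int = 64) -> torch.Tensor:
    """Simulate symmetric 4-bit block-wise quantization (fake quant).

    Assumes in_features is divisible by `block`.
    """
    out_features, in_features = w.shape
    wb = w.reshape(out_features, in_features // block, block)
    # One scale per block; symmetric int4 range is [-8, 7].
    scale = wb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(wb / scale), -8, 7)
    return (q * scale).reshape(out_features, in_features)


class TripleBranchLinear(nn.Module):
    """Hypothetical triple-branch layer: 4-bit base + stabilizer + LoRA."""

    def __init__(self, base: nn.Linear, rank: int = 16, n_outlier: int = 8):
        super().__init__()
        w = base.weight.data.clone()  # (out_features, in_features)
        # Stabilizer branch: keep the most outlier-heavy input channels
        # frozen in high precision so they bypass the 4-bit path.
        idx = w.abs().amax(dim=0).topk(n_outlier).indices
        self.register_buffer("outlier_idx", idx)
        self.register_buffer("w_stab", w[:, idx].clone())
        w[:, idx] = 0.0
        # Base branch: frozen, block-wise fake-quantized to 4 bits.
        self.register_buffer("w_q", blockwise_quant_dequant(w))
        # LoRA branch: the only trainable parameters.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = x @ self.w_q.t()                                # quantized base
        y = y + x[..., self.outlier_idx] @ self.w_stab.t()  # stabilizer
        y = y + (x @ self.lora_a.t()) @ self.lora_b.t()     # LoRA update
        return y


# Example: wrap one linear layer of a transformer block.
layer = TripleBranchLinear(nn.Linear(4096, 4096), rank=16, n_outlier=8)
out = layer(torch.randn(2, 4096))
```

Because only `lora_a` and `lora_b` require gradients, optimizer state stays small. The sketch only shows how the stabilizer branch removes outliers from the 4-bit path; it does not capture the A4/G4 part of the paradigm, in which activations and gradients are also quantized via fused kernels during backpropagation.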