Layer-wise Quantization for Quantized Optimistic Dual Averaging
Abstract
Lay Summary
Training advanced AI models across many computers often stalls because of the huge amount of information that must be exchanged. We introduce a technique that compresses the data shared during training by assigning a different compression level to each layer based on its importance: key layers keep more precision, while others are represented with fewer bits. We integrate this scheme into a training algorithm called Quantized Optimistic Dual Averaging (QODA), which works seamlessly with compressed data and skips extra synchronization steps. We rigorously prove that, despite the reduced communication, our method converges as reliably as standard uncompressed training. In experiments on image generation and large language models across dozens of GPUs, our approach more than doubles training speed while matching final accuracy. By cutting communication costs and speeding up each training round, our work makes distributed deep learning more scalable and more energy-efficient.
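To make the layer-wise idea concrete, the sketch below shows one simple way per-layer bit allocation and gradient quantization could look in PyTorch. The function names (assign_bits, quantize, dequantize), the norm-based importance heuristic, and the 4-/8-bit choices are illustrative assumptions for exposition only, not the actual QODA quantizer or the paper's layer-wise allocation rule.

```python
import torch

def assign_bits(grad_norms, low_bits=4, high_bits=8, top_fraction=0.25):
    """Toy heuristic (assumption): give the highest-norm fraction of
    layers more bits; all remaining layers get the low bit-width."""
    k = max(1, int(len(grad_norms) * top_fraction))
    ranked = sorted(range(len(grad_norms)), key=lambda i: -grad_norms[i])
    bits = [low_bits] * len(grad_norms)
    for i in ranked[:k]:
        bits[i] = high_bits
    return bits

def quantize(tensor, bits):
    """Uniform symmetric quantization of one layer's gradient."""
    levels = 2 ** (bits - 1) - 1
    scale = tensor.abs().max().clamp(min=1e-12) / levels
    q = torch.round(tensor / scale).clamp(-levels, levels).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct an approximate gradient from its quantized form."""
    return q.float() * scale

# Usage sketch: compress per-layer gradients before they are communicated,
# then reconstruct them on the receiving side.
grads = [torch.randn(256, 256), torch.randn(128), torch.randn(512, 256)]
bits = assign_bits([g.norm().item() for g in grads])
payload = [quantize(g, b) for g, b in zip(grads, bits)]
recovered = [dequantize(q, s) for q, s in payload]
```

Here the saving comes from sending small integers plus one scale per layer instead of full-precision values; the important layers trade a little bandwidth for noticeably lower quantization error.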