Layer-wise Quantization for Quantized Optimistic Dual Averaging
Abstract
Lay Summary
Training advanced AI models across many computers often stalls because of the huge amount of information that must be exchanged. We introduce a technique that compresses the data shared during training by assigning a different compression level to each layer based on its importance: key layers keep more precision, while others are represented with fewer bits. We integrate this scheme into a training algorithm called Quantized Optimistic Dual Averaging (QODA), which works seamlessly with compressed data and skips extra synchronization steps. We rigorously prove that, despite the reduced communication, our method converges as reliably as standard uncompressed training. In experiments on image generation and large language models across dozens of GPUs, our approach more than doubles training speed while matching final accuracy. By cutting communication costs and speeding up each training round, our work makes distributed deep learning more scalable and more energy-efficient.
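To make the layer-wise idea concrete, the sketch below shows one simple way per-layer bit allocation and gradient quantization could look in PyTorch. The function names (assign_bits, quantize, dequantize), the norm-based importance heuristic, and the 4-/8-bit choices are illustrative assumptions for exposition only, not the actual QODA quantizer or the paper's layer-wise allocation rule.

```python
import torch

def assign_bits(grad_norms, low_bits=4, high_bits=8, top_fraction=0.25):
    """Toy heuristic (assumption): give the highest-norm fraction of
    layers more bits; all remaining layers get the low bit-width."""
    k = max(1, int(len(grad_norms) * top_fraction))
    ranked = sorted(range(len(grad_norms)), key=lambda i: -grad_norms[i])
    bits = [low_bits] * len(grad_norms)
    for i in ranked[:k]:
        bits[i] = high_bits
    return bits

def quantize(tensor, bits):
    """Uniform symmetric quantization of one layer's gradient."""
    levels = 2 ** (bits - 1) - 1
    scale = tensor.abs().max().clamp(min=1e-12) / levels
    q = torch.round(tensor / scale).clamp(-levels, levels).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct an approximate gradient from its quantized form."""
    return q.float() * scale

# Usage sketch: compress per-layer gradients before they are communicated,
# then reconstruct them on the receiving side.
grads = [torch.randn(256, 256), torch.randn(128), torch.randn(512, 256)]
bits = assign_bits([g.norm().item() for g in grads])
payload = [quantize(g, b) for g, b in zip(grads, bits)]
recovered = [dequantize(q, s) for q, s in payload]
```

Here the saving comes from sending small integers plus one scale per layer instead of full-precision values; the important layers trade a little bandwidth for noticeably lower quantization error.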