CocktailSGD: Fine-tuning Foundation Models over 500Mbps Networks
Jue Wang · Yucheng Lu · Binhang Yuan · Beidi Chen · Percy Liang · Chris De Sa · Christopher Re · Ce Zhang

Wed Jul 26 02:00 PM -- 03:30 PM (PDT) @ Exhibit Hall 1 #111
Distributed training of foundation models, especially large language models (LLMs), is communication-intensive and has therefore relied heavily on centralized data centers with fast interconnects. Can we train on slow networks and unlock the potential of decentralized infrastructure for foundation models? In this paper, we propose CocktailSGD, a novel communication-efficient training framework that combines three distinct compression techniques -- random sparsification, top-K sparsification, and quantization -- to achieve much greater compression than any one of these techniques alone. We justify the benefit of such a hybrid approach through a theoretical analysis of convergence. Empirically, we show that CocktailSGD achieves up to 117$\times$ compression in fine-tuning LLMs of up to 20 billion parameters without hurting convergence. On a 500 Mbps network, CocktailSGD incurs only a $\sim$1.2$\times$ slowdown compared with data center networks.
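
To make the idea of stacking the three compressors concrete, the following is a minimal illustrative sketch in plain Python, not the authors' implementation. The function names, the order in which the compressors are applied, the keep ratios, the bit width, and the uniform symmetric quantizer are all assumptions chosen for illustration; the abstract states only that the three techniques are combined.

```python
import random


def random_sparsify(values, keep_ratio, rng):
    """Keep a uniformly random subset of coordinates (hypothetical helper)."""
    n = len(values)
    k = max(1, int(n * keep_ratio))
    idx = sorted(rng.sample(range(n), k))
    return idx, [values[i] for i in idx]


def top_k_sparsify(indices, values, keep_ratio):
    """Among the surviving coordinates, keep those with the largest magnitude."""
    k = max(1, int(len(values) * keep_ratio))
    order = sorted(range(len(values)), key=lambda j: abs(values[j]), reverse=True)[:k]
    order.sort()  # restore ascending coordinate order
    return [indices[j] for j in order], [values[j] for j in order]


def quantize(values, bits):
    """Uniform symmetric quantization to signed integer codes (assumed scheme)."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / levels or 1.0  # avoid a zero scale
    return [round(v / scale) for v in values], scale


def cocktail_compress(update, random_ratio, topk_ratio, bits, seed):
    """Chain the three compressors; the ordering here is an assumption."""
    rng = random.Random(seed)
    idx, vals = random_sparsify(update, random_ratio, rng)
    idx, vals = top_k_sparsify(idx, vals, topk_ratio)
    codes, scale = quantize(vals, bits)
    return idx, codes, scale


def decompress(n, idx, codes, scale):
    """Rebuild a dense vector; dropped coordinates are treated as zero."""
    dense = [0.0] * n
    for i, c in zip(idx, codes):
        dense[i] = c * scale
    return dense


if __name__ == "__main__":
    data_rng = random.Random(0)
    update = [data_rng.gauss(0.0, 1.0) for _ in range(1000)]
    idx, codes, scale = cocktail_compress(update, 0.1, 0.2, bits=4, seed=1)
    print(len(update), "values reduced to", len(codes), "4-bit codes")
```

With these illustrative settings, 1000 values are first reduced to 100 by random sampling, then to 20 by magnitude, and each survivor is stored as a small integer code plus one shared scale. The multiplicative effect of the stages is the intuition behind combining them; the actual ratios and any index-encoding costs in CocktailSGD are not specified here.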

Author Information

Jue Wang (ETH Zürich)
Yucheng Lu (Cornell University)
Binhang Yuan (Swiss Federal Institute of Technology)
Beidi Chen (CMU / FAIR)
Percy Liang (Stanford University)
Chris De Sa (Cornell University)
Christopher Re (Stanford University)
Ce Zhang (ETH Zürich)

More from the Same Authors