Poster in Workshop: ES-FoMo II: 2nd Workshop on Efficient Systems for Foundation Models
Enhancing Stability for Large Models Training in Constrained Bandwidth Networks
Yun Dai · Tejas Dharamsi · Pin-Lun Hsu · Tao Song · Hamed Firooz
Training extremely large language models with billions of parameters is a computationally intensive task that pushes the limits of current data-parallel training systems. While techniques like ZeRO++ (Wang et al., 2024) have enabled efficient distributed training of such giant models on inexpensive low-bandwidth clusters, they can suffer from convergence issues due to potential race conditions in the hierarchical partitioning (hpZ) scheme employed to reduce cross-machine communication. In this work, we first show how these race conditions cause instability when training models with billions of parameters. We then propose a modification to the partitioning algorithm that addresses these convergence challenges while maintaining competitive training efficiency. Empirical evaluation on training the multi-billion-parameter Falcon and Llama-2 models demonstrates the updated algorithm’s ability to achieve reliable convergence on these massive models, where stock ZeRO++ hpZ fails to converge. The updated algorithm enables robust training of larger models while retaining 98% of the throughput and model training speed improvement, without sacrificing the quality of convergence.
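To make the failure mode concrete, the sketch below illustrates the kind of race the abstract describes: in hpZ-style hierarchical partitioning, updated parameters are staged into a node-local secondary shard and then all-gathered within the node, and if the staging copy and the all-gather are not ordered, ranks can gather stale or partially written weights. This is a minimal, hypothetical illustration in PyTorch, not the authors’ implementation or the ZeRO++ code; names such as `secondary_shard`, `copy_stream`, and `intra_node_group` are assumptions introduced for the example.

```python
# Hypothetical sketch of ordering an async shard update before an intra-node
# all-gather; not the paper's actual fix, only an illustration of the idea.
import torch
import torch.distributed as dist


def update_secondary_and_gather(primary_shard, secondary_shard, gathered,
                                copy_stream, intra_node_group):
    """Copy freshly updated primary weights into the node-local secondary
    shard, then all-gather within the node, ordering the two steps with a
    CUDA event so the gather never reads a partially written buffer."""
    done = torch.cuda.Event()

    # Stage the updated parameters into the secondary (intra-node) shard on a
    # side stream so the copy can overlap with other work.
    with torch.cuda.stream(copy_stream):
        secondary_shard.copy_(primary_shard, non_blocking=True)
        done.record(copy_stream)

    # Without this wait, the all-gather below could race with the copy above
    # and reconstruct stale or half-updated weights on some ranks.
    torch.cuda.current_stream().wait_event(done)

    # Intra-node all-gather of the secondary shards rebuilds the full
    # parameter on every rank over the fast local interconnect.
    dist.all_gather_into_tensor(gathered, secondary_shard,
                                group=intra_node_group)
```

Under this reading, the stability fix amounts to enforcing such an ordering between the secondary-partition update and the collectives that consume it, which costs little relative to the cross-machine communication hpZ already avoids.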