Identifying and Mitigating Errors in Gradient Aggregation of Distributed Data Parallel Training
Zhenheng Tang ⋅ Junlin Huang ⋅ Zichen Tang ⋅ Xueze Kang ⋅ Yuxin Wang ⋅ Peijie Dong ⋅ Shaohuai Shi ⋅ Xiaowen Chu ⋅ Bo Li
Abstract
Hardware-related silent data corruptions during gradient aggregation pose significant challenges to fault-tolerant distributed training, often leading to slow or failed convergence. To address this, we first mathematically formulate these errors as gradient inconsistency and theoretically analyze how they result in accumulated model divergence. Guided by this analysis, we introduce PAFT, a fault-tolerant distributed training system designed with dynamic and asynchronous parameter synchronization. PAFT comprises two core components: PAFT-Sync, which mitigates divergence via periodic synchronization, and PAFT-Dyn, which minimizes overhead by overlapping synchronization with training and scheduling the synchronization frequency dynamically. Furthermore, the system's synchronization mechanism is optimized to support standard optimizers, including SGD, SGD with momentum, and Adam. We implement PAFT on PyTorch Distributed, and experimental results from training ResNet, GPT-2, and LLaMA-2 on 4$\sim$32 GPUs demonstrate that it efficiently defends against aggregation errors while maintaining training performance.
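To illustrate the core idea behind periodic parameter synchronization, the sketch below averages model parameters across ranks every few steps using PyTorch Distributed, so that replicas whose gradients were silently corrupted during aggregation are pulled back toward consensus. This is a minimal illustration, not the paper's implementation: the helper name `periodic_param_sync` and the fixed `sync_period` knob are assumptions, and PAFT-Dyn in fact schedules the frequency dynamically and overlaps synchronization with training.

```python
import torch
import torch.distributed as dist

def periodic_param_sync(model: torch.nn.Module, step: int, sync_period: int) -> None:
    """Average model parameters across all ranks every `sync_period` steps.

    A minimal sketch in the spirit of PAFT-Sync: even if some gradient
    all-reduces were silently corrupted, re-averaging the parameters bounds
    the accumulated model divergence across replicas. `sync_period` stands in
    for the dynamic schedule that PAFT-Dyn would choose.
    """
    if step % sync_period != 0:
        return
    world_size = dist.get_world_size()
    with torch.no_grad():
        for param in model.parameters():
            # Sum each parameter tensor over all ranks, then divide by the
            # world size to obtain the cross-replica average in place.
            dist.all_reduce(param.data, op=dist.ReduceOp.SUM)
            param.data.div_(world_size)
            # A full treatment for SGD with momentum or Adam would also
            # reconcile the per-rank optimizer state, as the abstract notes.
```

Calling `periodic_param_sync(model, step, sync_period=100)` inside the training loop after `optimizer.step()` would realize the periodic variant; the trade-off is that a shorter period tightens the divergence bound at the cost of extra communication, which is precisely the overhead PAFT-Dyn's scheduling targets.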