Abstract
The size of Transformer models is growing at an unprecedented rate. It has taken less than one year to reach trillion-level parameters since the release of GPT-3 (175B). Training such models requires both substantial engineering efforts and enormous computing resources, which are luxuries most research teams cannot afford. In this paper, we propose PipeTransformer, which leverages automated elastic pipelining for efficient distributed training of Transformer models. In PipeTransformer, we design an adaptive on-the-fly freeze algorithm that can identify and freeze some layers gradually during training, and an elastic pipelining system that can dynamically allocate resources to train the remaining active layers. More specifically, PipeTransformer automatically excludes frozen layers from the pipeline, packs active layers into fewer GPUs, and forks more replicas to increase data-parallel width. We evaluate PipeTransformer using Vision Transformer (ViT) on ImageNet and BERT on SQuAD and GLUE datasets. Our results show that compared to the state-of-the-art baseline, PipeTransformer attains up to 2.83-fold speedup without losing accuracy. We also provide various performance analyses for a more comprehensive understanding of our algorithmic and system-wise design. Finally, we have modularized our training system with flexible APIs and made the source code publicly available at https://DistML.ai.
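To make the freeze-and-repack idea in the abstract concrete, the following is a minimal PyTorch sketch (not the actual PipeTransformer API; all function names and parameters here are hypothetical illustrations). It shows how layers flagged by a freeze decision could be excluded from gradient computation and how the remaining active layers could be packed into fewer pipeline stages, freeing devices for additional data-parallel replicas.

```python
# Illustrative sketch only: assumes a simple layer list and a freeze decision
# made elsewhere; it is not PipeTransformer's actual implementation.
import torch.nn as nn


def freeze_layers(layers, num_frozen):
    """Mark the first `num_frozen` layers as frozen (no gradients, eval mode)."""
    for layer in layers[:num_frozen]:
        for p in layer.parameters():
            p.requires_grad = False
        layer.eval()


def active_pipeline(layers, num_frozen, num_stages):
    """Pack only the active (unfrozen) layers into `num_stages` pipeline stages."""
    active = list(layers[num_frozen:])
    per_stage = max(1, (len(active) + num_stages - 1) // num_stages)
    return [nn.Sequential(*active[i:i + per_stage])
            for i in range(0, len(active), per_stage)]


if __name__ == "__main__":
    # A toy 12-layer Transformer-like stack.
    layers = nn.ModuleList([nn.TransformerEncoderLayer(d_model=64, nhead=4)
                            for _ in range(12)])

    # Suppose the freeze algorithm decides the first 6 layers have converged.
    freeze_layers(layers, num_frozen=6)

    # The 6 remaining active layers now fit into 2 pipeline stages instead of 4;
    # the freed GPUs could then host extra data-parallel replicas.
    stages = active_pipeline(layers, num_frozen=6, num_stages=2)
    print(f"{sum(len(s) for s in stages)} active layers in {len(stages)} stages")
```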
Author Information
Chaoyang He (University of Southern California)
Chaoyang He is a Ph.D. Candidate in the CS department at the University of Southern California, Los Angeles, USA. He is advised by Salman Avestimehr (USC), Professor Mahdi Soltanolkotabi (USC), Professor Murali Annavaram (USC), and Professor Tong Zhang (HKUST). He also works closely with researchers and engineers at Google, Facebook, Amazon, and Tencent. Previously, he was an R&D Team Manager and Staff Software Engineer at Tencent (2014-2018), a Team Leader and Senior Software Engineer at Baidu (2012-2014), and a Software Engineer at Huawei (2011-2012). His research focuses on distributed/federated machine learning algorithms, systems, and applications. Chaoyang He has received a number of awards in academia and industry, including the Amazon ML Fellowship (2021-2022), Qualcomm Innovation Fellowship (2021-2022), Tencent Outstanding Staff Award (2015-2016), WeChat Special Award for Innovation (2016), Baidu LBS Group Star Award (2013), and Huawei Golden Network Award (2012). During his Ph.D. studies, he has published papers at ICML, NeurIPS, CVPR, ICLR, and MLSys, among others. Besides pure research, he also has R&D experience with Internet products and businesses such as Tencent Cloud, Tencent WeChat Automotive / AI in Car, Tencent Games, Tencent Maps, Baidu Maps, and Huawei Smartphone. He obtained three years of experience in R&D team management at Tencent between 2016-2018. With his advisors, he also co-founded FedML.ai, which builds on a paper that won the Best Paper Award at the NeurIPS 2020 FL workshop. More details are available at his homepage: https://ChaoyangHe.com
Shen Li (Facebook AI Applied Research)
Mahdi Soltanolkotabi (University of Southern California)
Mahdi Soltanolkotabi is an assistant professor in the Ming Hsieh Department of Electrical and Computer Engineering and the Department of Computer Science at the University of Southern California, where he holds an Andrew and Erna Viterbi Early Career Chair. Prior to joining USC, he completed his PhD in electrical engineering at Stanford in 2014. He was a postdoctoral researcher in the EECS department at UC Berkeley during the 2014-2015 academic year. His research focuses on developing the mathematical foundations of data analysis at the confluence of optimization, machine learning, signal processing, high dimensional statistics, computational imaging, and artificial intelligence. Mahdi is the recipient of the Packard Fellowship in Science and Engineering, a Sloan Research Fellowship, an NSF CAREER Award, an Air Force Office of Scientific Research Young Investigator Award (AFOSR YIP), and a Google Faculty Research Award.
Salman Avestimehr (University of Southern California)
Related Events (a corresponding poster, oral, or spotlight)
- 2021 Poster: PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models
  Thu. Jul 22nd, 04:00 -- 06:00 PM
More from the Same Authors
- 2021 : SpreadGNN: Serverless Multi-task Federated Learning for Graph Neural Networks
  Chaoyang He · Emir Ceyani
- 2023 : Distributed Architecture Search over Heterogeneous Distributions
  Erum Mushtaq · Chaoyang He · Jie Ding · Salman Avestimehr
- 2023 : Don’t Memorize; Mimic The Past: Federated Class Incremental Learning Without Episodic Memory
  Sara Babakniya · Zalan Fabian · Chaoyang He · Mahdi Soltanolkotabi · Salman Avestimehr
- 2023 : Privacy-Preserving Federated Heavy Hitter Analytics for Non-IID Data
  Jiaqi Shao · Shanshan Han · Chaoyang He · Bing Luo
- 2021 : Securing Secure Aggregation: Mitigating Multi-Round Privacy Leakage in Federated Learning (Q&A)
  Salman Avestimehr
- 2021 : Securing Secure Aggregation: Mitigating Multi-Round Privacy Leakage in Federated Learning
  Salman Avestimehr
- 2021 Poster: Generalization Guarantees for Neural Architecture Search with Train-Validation Split
  Samet Oymak · Mingchen Li · Mahdi Soltanolkotabi
- 2021 Spotlight: Generalization Guarantees for Neural Architecture Search with Train-Validation Split
  Samet Oymak · Mingchen Li · Mahdi Soltanolkotabi
- 2021 Poster: Data augmentation for deep learning based accelerated MRI reconstruction with limited data
  Zalan Fabian · Reinhard Heckel · Mahdi Soltanolkotabi
- 2021 Spotlight: Data augmentation for deep learning based accelerated MRI reconstruction with limited data
  Zalan Fabian · Reinhard Heckel · Mahdi Soltanolkotabi
- 2020 Poster: Compressive sensing with un-trained neural networks: Gradient descent finds a smooth approximation
  Reinhard Heckel · Mahdi Soltanolkotabi
- 2019 : Salman Avestimehr: Lagrange Coded Computing: Optimal Design for Resilient, Secure, and Private Distributed Learning
  Salman Avestimehr
- 2019 Poster: Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path?
  Samet Oymak · Mahdi Soltanolkotabi
- 2019 Oral: Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path?
  Samet Oymak · Mahdi Soltanolkotabi