Many areas of deep learning benefit from using increasingly larger neural networks trained on public data, as is the case for pre-trained models for NLP and computer vision. Training such models requires a lot of computational resources (e.g., HPC clusters) that are not available to small research groups and independent researchers. One way to address this problem is for several smaller groups to pool their computational resources and train a model that benefits all participants. Unfortunately, in this case, any participant can jeopardize the entire training run by sending incorrect updates, whether deliberately or by mistake. Training in the presence of such peers requires specialized distributed training algorithms with Byzantine tolerance. These algorithms often sacrifice efficiency by introducing redundant communication or by passing all updates through a trusted server, making them infeasible for large-scale deep learning, where models can have billions of parameters. In this work, we propose a novel protocol for secure (Byzantine-tolerant) decentralized training that emphasizes communication efficiency.
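To illustrate the failure mode the abstract describes, the sketch below contrasts naive gradient averaging with a simple Byzantine-robust aggregator (coordinate-wise median). This is not the paper's protocol; the peer counts, gradient dimension, and attack model are illustrative assumptions only.

```python
# Minimal sketch (assumptions, not the paper's method): a few malicious peers
# can arbitrarily corrupt a naive average of gradients, while a robust
# aggregator such as the coordinate-wise median stays close to the true value.
import numpy as np

rng = np.random.default_rng(0)

true_grad = np.ones(4)                                  # "true" gradient direction
honest = true_grad + 0.1 * rng.standard_normal((9, 4))  # 9 honest, noisy updates
byzantine = 100.0 * np.ones((3, 4))                     # 3 adversarial updates
updates = np.vstack([honest, byzantine])

naive = updates.mean(axis=0)          # heavily skewed by the malicious peers
robust = np.median(updates, axis=0)   # close to the true gradient

print("naive mean      :", naive.round(2))
print("coordinate median:", robust.round(2))
```

Robust rules like this come at a cost in communication or trust assumptions, which is the efficiency gap the proposed protocol targets.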
Author Information
Eduard Gorbunov (Moscow Institute of Physics and Technology)
Alexander Borzunov (HSE University, Yandex)
Michael Diskin (Yandex)
Max Ryabinin (Yandex, HSE University)
Related Events (a corresponding poster, oral, or spotlight)
- 2022 Poster: Secure Distributed Training at Scale
  Thu, Jul 21 through Fri, Jul 22, Hall E #715
More from the Same Authors
- 2023 Poster: FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
  Ying Sheng · Lianmin Zheng · Binhang Yuan · Zhuohan Li · Max Ryabinin · Beidi Chen · Percy Liang · Christopher Re · Ion Stoica · Ce Zhang
- 2023 Oral: FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
  Ying Sheng · Lianmin Zheng · Binhang Yuan · Zhuohan Li · Max Ryabinin · Beidi Chen · Percy Liang · Christopher Re · Ion Stoica · Ce Zhang
- 2023 Poster: SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient
  Max Ryabinin · Tim Dettmers · Michael Diskin · Alexander Borzunov
- 2022 Poster: 3PC: Three Point Compressors for Communication-Efficient Distributed Training and a Better Theory for Lazy Aggregation
  Peter Richtarik · Igor Sokolov · Elnur Gasanov · Ilyas Fatkhullin · Zhize Li · Eduard Gorbunov
- 2022 Spotlight: 3PC: Three Point Compressors for Communication-Efficient Distributed Training and a Better Theory for Lazy Aggregation
  Peter Richtarik · Igor Sokolov · Elnur Gasanov · Ilyas Fatkhullin · Zhize Li · Eduard Gorbunov
- 2021 Poster: MARINA: Faster Non-Convex Distributed Learning with Compression
  Eduard Gorbunov · Konstantin Burlachenko · Zhize Li · Peter Richtarik
- 2021 Spotlight: MARINA: Faster Non-Convex Distributed Learning with Compression
  Eduard Gorbunov · Konstantin Burlachenko · Zhize Li · Peter Richtarik