With deep learning models rapidly growing in size, training large models has become one of the most important challenges in modern AI. We present Amazon SageMaker model parallelism (SMP), a software library that integrates with PyTorch and enables easy training of large models using model parallelism and memory-saving features. The implementation of the SMP library is generic and flexible, in that it can automatically partition arbitrary model architectures, is modular enough to be easily applied to new training scripts, and preserves the native PyTorch experience to a greater degree than existing solutions. In this talk, we will dive deep into the techniques used in SMP, which include pipeline parallelism, tensor parallelism, activation checkpointing/offloading, and optimizer state sharding. We will also demonstrate how to integrate SMP with a native PyTorch script.
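To give a flavor of what such an integration looks like, below is a minimal sketch of adapting a native PyTorch training loop to SMP. The specific names used here (`smp.init`, `smp.DistributedModel`, `smp.DistributedOptimizer`, the `@smp.step` decorator, and `reduce_mean` on the step output) are assumptions based on the library's public documentation and may differ between versions; the model, optimizer, and data are placeholders.

```python
# Sketch only: module names below are assumed from SMP's documented v1 API.
import torch
import torch.nn as nn
import torch.nn.functional as F
import smdistributed.modelparallel.torch as smp  # available inside SageMaker training containers

smp.init()  # initialize the model-parallel runtime (configuration comes from the SageMaker job)

# A toy model and dataset standing in for a real large model and data pipeline.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))
model = smp.DistributedModel(model)  # wrap the model so SMP can partition it across devices
optimizer = smp.DistributedOptimizer(torch.optim.Adam(model.parameters(), lr=1e-4))

dataset = torch.utils.data.TensorDataset(torch.randn(256, 1024), torch.randint(0, 10, (256,)))
loader = torch.utils.data.DataLoader(dataset, batch_size=32)

@smp.step  # splits each batch into microbatches and drives the pipeline schedule
def train_step(model, data, target):
    output = model(data)
    loss = F.cross_entropy(output, target)
    model.backward(loss)  # replaces loss.backward() so SMP can manage the backward pass
    return loss

for data, target in loader:
    optimizer.zero_grad()
    loss = train_step(model, data, target)  # returns a per-microbatch StepOutput
    optimizer.step()
    print(loss.reduce_mean().item())  # average the loss over microbatches
```

Compared to a plain PyTorch script, the changes are limited to initializing the runtime, wrapping the model and optimizer, and moving the forward/backward computation into a decorated step function, which is the sense in which SMP aims to preserve the native PyTorch experience.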
Sun 11:00 a.m. - 11:45 a.m.
Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training (Talk)
Fei Wu