Recent innovations in hardware (e.g., Nvidia A100) have motivated learning N:M structured sparsity masks from scratch for fast model inference. However, state-of-the-art learning recipes in this regime (e.g., SR-STE) were proposed for non-adaptive optimizers such as momentum SGD, and incur a non-trivial accuracy drop on Adam-trained models such as attention-based LLMs. In this paper, we first demonstrate that this gap originates from the poorly estimated second moment (i.e., variance) in the Adam states computed from the masked weights. We conjecture that learning N:M masks with Adam should take this critical regime of variance estimation into account. In light of this, we propose STEP, an Adam-aware recipe that learns N:M masks in two phases: first, STEP computes a reliable variance estimate (the precondition phase); subsequently, the variance is frozen and used as a preconditioner while the N:M masks are learned (the mask-learning phase). STEP automatically identifies the switching point between the two phases by dynamically sampling variance changes over the training trajectory and testing the concentration of the samples. Empirically, we evaluate STEP and other baselines such as ASP and SR-STE on multiple tasks, including CIFAR classification, machine translation, and LLM fine-tuning (BERT-Base, GPT-2). We show that STEP mitigates the accuracy drop of baseline recipes and is robust to aggressive structured sparsity ratios.
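For readers unfamiliar with N:M structured sparsity, the sketch below illustrates the basic building block the abstract refers to: selecting a binary mask that keeps the N largest-magnitude weights in every consecutive group of M (e.g., 2:4 sparsity as supported by the A100's sparse tensor cores). This is a minimal illustrative sketch, not the authors' released implementation; the function name `nm_mask` and the PyTorch usage are assumptions made for illustration.

```python
import torch

def nm_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Binary mask keeping the n largest-magnitude entries in each
    consecutive group of m along the last dimension (hypothetical helper)."""
    assert weight.shape[-1] % m == 0, "last dim must be divisible by m"
    groups = weight.abs().reshape(-1, m)        # (num_groups, m)
    kept = groups.topk(n, dim=-1).indices       # indices of surviving weights
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, kept, 1.0)                # mark the top-n entries per group
    return mask.reshape(weight.shape)

# Toy usage: prune a 4x8 weight matrix to 2:4 sparsity.
w = torch.randn(4, 8)
mask = nm_mask(w, n=2, m=4)
print((mask.reshape(-1, 4).sum(dim=-1) == 2).all())  # every group of 4 keeps exactly 2
```

Recipes such as SR-STE and STEP learn masks of this form during training (rather than applying them once post hoc); the paper's contribution is how to do this reliably when the optimizer is Adam.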
Author Information
Yucheng Lu (Cornell University)
Shivani Agrawal (Google)
Suvinay Subramanian (Google)

[Suvinay Subramanian](http://suvinay.com) is a computer architect at Google building hardware systems (TPUs) to accelerate machine learning and AI. His expertise is in hardware-software co-design and sparsity in deep neural networks. He received a Ph.D. from MIT and a B.Tech from IIT Madras. He also co-hosts the [Computer Architecture Podcast](https://comparchpodcast.podbean.com/).
Oleg Rybakov (Google)
Chris De Sa (Cornell)
Amir Yazdanbakhsh (Google DeepMind)
My name is Amir Yazdanbakhsh. I joined Google Research as a Research Scientist in 2019, following a one-year AI residency. I am the co-founder and co-lead of the Machine Learning for Computer Architecture team, where we leverage recent machine learning methods and advances to design better hardware accelerators. Our team's work has been covered by media outlets including WIRED, ZDNet, Analytics Insight, and InfoQ. I am also interested in designing large-scale distributed systems for training machine learning applications. To that end, I led the development of a massively scalable distributed reinforcement learning system that scales to TPU Pods and efficiently manages thousands of actors to solve complex, real-world tasks. As a case study, our team demonstrated how this highly scalable system enables reinforcement learning to accomplish chip placement in roughly an hour instead of the days or weeks of human effort. I received my Ph.D. in computer science from the Georgia Institute of Technology. My Ph.D. work has been recognized by various awards, including the Microsoft PhD Fellowship and the Qualcomm Innovation Fellowship.
More from the Same Authors
- 2023: CD-GraB: Coordinating Distributed Example Orders for Provably Accelerated Training »
  A. Feder Cooper · Wentao Guo · Duc Khiem Pham · Tiancheng Yuan · Charlie Ruan · Yucheng Lu · Chris De Sa
- 2023 Poster: InfoDiffusion: Representation Learning Using Information Maximizing Diffusion Models »
  Yingheng Wang · Yair Schiff · Aaron Gokaslan · Weishen Pan · Fei Wang · Chris De Sa · Volodymyr Kuleshov
- 2023 Poster: CocktailSGD: Fine-tuning Foundation Models over 500Mbps Networks »
  Jue Wang · Yucheng Lu · Binhang Yuan · Beidi Chen · Percy Liang · Chris De Sa · Christopher Re · Ce Zhang
- 2021 Poster: Variance Reduced Training with Stratified Sampling for Forecasting Models »
  Yucheng Lu · Youngsuk Park · Lifan Chen · Yuyang Wang · Christopher De Sa · Dean Foster
- 2021 Spotlight: Variance Reduced Training with Stratified Sampling for Forecasting Models »
  Yucheng Lu · Youngsuk Park · Lifan Chen · Yuyang Wang · Christopher De Sa · Dean Foster
- 2021 Poster: Optimal Complexity in Decentralized Training »
  Yucheng Lu · Christopher De Sa
- 2021 Oral: Optimal Complexity in Decentralized Training »
  Yucheng Lu · Christopher De Sa
- 2020 Poster: Moniqua: Modulo Quantized Communication in Decentralized SGD »
  Yucheng Lu · Christopher De Sa