Large neural networks excel in many domains, but they are expensive to train and fine-tune. A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones (e.g., sparse, low-rank, Fourier transform). These methods have not seen widespread adoption (1) in end-to-end training due to unfavorable efficiency-quality tradeoffs, and (2) in dense-to-sparse fine-tuning due to lack of tractable algorithms to approximate a given dense weight matrix. To address these issues, we propose a class of matrices (Monarch) that is hardware-efficient (they are parameterized as products of two block-diagonal matrices for better hardware utilization) and expressive (they can represent many commonly used transforms). Surprisingly, the problem of approximating a dense weight matrix with a Monarch matrix, though nonconvex, has an analytical optimal solution. These properties of Monarch matrices unlock new ways to train and fine-tune sparse and dense models. We empirically validate that Monarch can achieve favorable accuracy-efficiency tradeoffs in several end-to-end sparse training applications: speeding up ViT and GPT-2 training on ImageNet classification and WikiText-103 language modeling by 2x with comparable model quality, and reducing the error on PDE solving and MRI reconstruction tasks by 40%. In sparse-to-dense training, with a simple technique called "reverse sparsification," Monarch matrices serve as a useful intermediate representation to speed up GPT-2 pretraining on OpenWebText by 2x without quality drop. The same technique brings 23% faster BERT pretraining than even the very optimized implementation from Nvidia that set the MLPerf 1.1 record. In dense-to-sparse fine-tuning, as a proof-of-concept, our Monarch approximation algorithm speeds up BERT fine-tuning on GLUE by 1.7x with comparable accuracy.
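To make the two key properties concrete, here is a minimal PyTorch sketch of a Monarch matrix-vector product. It assumes one particular convention, M = Pᵀ · L · P · R, where L and R are block-diagonal with m blocks of size m×m (so n = m²) and P is the "reshape to an m×m grid and transpose" (perfect-shuffle) permutation; the function name, argument layout, and permutation convention are illustrative assumptions, not necessarily the paper's exact parameterization.

```python
import torch

def monarch_matvec(L, R, x, m):
    """y = M x for M = P^T @ blockdiag(L) @ P @ blockdiag(R),
    where P is the "reshape to (m, m), then transpose" permutation and
    L, R have shape (m, m, m), holding the m diagonal blocks of each factor."""
    x = x.reshape(m, m)
    # Right block-diagonal factor: an independent (m x m) matvec per block.
    x = torch.einsum('kij,kj->ki', R, x)
    # Permutation P: transposing the (m, m) grid shuffles the coordinates.
    x = x.transpose(0, 1)
    # Left block-diagonal factor.
    x = torch.einsum('kij,kj->ki', L, x)
    # P^T undoes the shuffle; flatten back to a length n = m*m vector.
    return x.transpose(0, 1).reshape(m * m)
```

Under this convention, entry M[(a,b),(c,d)] = L[b][a,c] · R[c][b,d], so every fixed (b,c) slice of M is rank-1. That is what makes the dense-to-Monarch projection analytically solvable: the Frobenius error decomposes over disjoint slices, and each slice's optimal rank-1 approximation is its top singular pair (Eckart-Young). Below is a hedged sketch of this projection, again under the assumed convention rather than the paper's exact algorithm:

```python
def project_to_monarch(A, m):
    """Frobenius-optimal Monarch factors for a dense (m*m, m*m) matrix A,
    assuming M[(a,b),(c,d)] = L[b][a,c] * R[c][b,d]: each fixed (b, c)
    slice must be rank-1, so project slice-by-slice via the SVD."""
    A4 = A.reshape(m, m, m, m)  # A4[a, b, c, d] == A[a*m + b, c*m + d]
    L = torch.zeros(m, m, m, dtype=A.dtype)
    R = torch.zeros(m, m, m, dtype=A.dtype)
    for b in range(m):
        for c in range(m):
            U, S, Vh = torch.linalg.svd(A4[:, b, c, :])  # (a, d) slice
            # Best rank-1 approximation is S[0] * U[:, 0] @ Vh[0, :];
            # split the scale evenly between the two factors.
            L[b][:, c] = U[:, 0] * S[0].sqrt()
            R[c][b, :] = Vh[0, :] * S[0].sqrt()
    return L, R
```

As a quick sanity check, monarch_matvec(*project_to_monarch(A, m), x, m) should reproduce A @ x exactly when A is itself a Monarch matrix of this form, and otherwise gives the action of the closest such matrix in Frobenius norm.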
Author Information
Tri Dao (Stanford)
Beidi Chen (Stanford University)
Nimit Sohoni (Stanford University)
Arjun Desai (Stanford University)
Michael Poli (Stanford University)
Jessica Grogan (University at Buffalo)
Current Ph.D. student at the University at Buffalo, interested in Theoretical Computer Science.
Alexander Liu (University of Michigan)
Aniruddh Rao (University of Michigan)
Atri Rudra (University at Buffalo, SUNY)
Christopher Re (Stanford University)
Related Events (a corresponding poster, oral, or spotlight)

- 2022 Poster: Monarch: Expressive Structured Matrices for Efficient and Accurate Training »
  Tue. Jul 19th through Wed the 20th, Room Hall E #315
More from the Same Authors

- 2021 : A Standardized Data Collection Toolkit for Model Benchmarking »
  Avanika Narayan · Piero Molino · Karan Goel · Christopher Re
- 2022 : BARACK: Partially Supervised Group Robustness With Guarantees »
  Nimit Sohoni · Maziar Sanjabi · Nicolas Ballas · Aditya Grover · Shaoliang Nie · Hamed Firooz · Christopher Re
- 2022 : Contrastive Adapters for Foundation Model Group Robustness »
  Michael Zhang · Christopher Re
- 2022 : The Importance of Background Information for Out of Distribution Generalization »
  Jupinder Parmar · Khaled Saab · Brian Pogatchnik · Daniel Rubin · Christopher Ré
- 2022 : Efficient Continuous Spatio-Temporal Simulation with Graph Spline Networks »
  Chuanbo HUA · Federico Berto · Michael Poli · Stefano Massaroli · Jinkyoo Park
- 2022 : Transform Once: Efficient Operator Learning in Frequency Domain »
  Michael Poli · Stefano Massaroli · Federico Berto · Jinkyoo Park · Tri Dao · Christopher Re · Stefano Ermon
- 2023 : Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models »
  Mayee Chen · Nicholas Roberts · Kush Bhatia · Jue Wang · Ce Zhang · Frederic Sala · Christopher Ré
- 2023 : Prospectors: Leveraging Short Contexts to Mine Salient Objects in High-dimensional Imagery »
  Gautam Machiraju · Arjun Desai · James Zou · Christopher Re · Parag Mallick
- 2023 : Accelerating LLM Inference with Staged Speculative Decoding »
  Benjamin F Spector · Christopher Re
- 2023 : H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models »
  Zhenyu Zhang · Ying Sheng · Tianyi Zhou · Tianlong Chen · Lianmin Zheng · Ruisi Cai · Zhao Song · Yuandong Tian · Christopher Re · Clark Barrett · Zhangyang “Atlas” Wang · Beidi Chen
- 2023 Workshop: ES-FoMo: Efficient Systems for Foundation Models »
  Julien Launay · Daniel Y Fu · Tri Dao · Daniel Hesslow · Beidi Chen · Azalia Mirhoseini · Percy Liang
- 2023 Oral: Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time »
  Zichang Liu · Jue Wang · Tri Dao · Tianyi Zhou · Binhang Yuan · Zhao Song · Anshumali Shrivastava · Ce Zhang · Yuandong Tian · Christopher Re · Beidi Chen
- 2023 Oral: Hyena Hierarchy: Towards Larger Convolutional Language Models »
  Michael Poli · Stefano Massaroli · Eric Nguyen · Daniel Y Fu · Tri Dao · Stephen Baccus · Yoshua Bengio · Stefano Ermon · Christopher Re
- 2023 Poster: Simple Hardware-Efficient Long Convolutions for Sequence Modeling »
  Daniel Y Fu · Elliot L Epstein · Eric Nguyen · Armin Thomas · Michael Zhang · Tri Dao · Atri Rudra · Christopher Re
- 2023 Poster: FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU »
  Ying Sheng · Lianmin Zheng · Binhang Yuan · Zhuohan Li · Max Ryabinin · Beidi Chen · Percy Liang · Christopher Re · Ion Stoica · Ce Zhang
- 2023 Oral: FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU »
  Ying Sheng · Lianmin Zheng · Binhang Yuan · Zhuohan Li · Max Ryabinin · Beidi Chen · Percy Liang · Christopher Re · Ion Stoica · Ce Zhang
- 2023 Poster: Hyena Hierarchy: Towards Larger Convolutional Language Models »
  Michael Poli · Stefano Massaroli · Eric Nguyen · Daniel Y Fu · Tri Dao · Stephen Baccus · Yoshua Bengio · Stefano Ermon · Christopher Re
- 2023 Poster: CocktailSGD: Fine-tuning Foundation Models over 500Mbps Networks »
  Jue Wang · Yucheng Lu · Binhang Yuan · Beidi Chen · Percy Liang · Chris De Sa · Christopher Re · Ce Zhang
- 2023 Poster: Deep Latent State Space Models for Time-Series Generation »
  Linqi Zhou · Michael Poli · Winnie Xu · Stefano Massaroli · Stefano Ermon
- 2023 Poster: Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time »
  Zichang Liu · Jue Wang · Tri Dao · Tianyi Zhou · Binhang Yuan · Zhao Song · Anshumali Shrivastava · Ce Zhang · Yuandong Tian · Christopher Re · Beidi Chen
- 2022 : FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness »
  Tri Dao · Daniel Y Fu · Stefano Ermon · Atri Rudra · Christopher Re
- 2022 Poster: It’s Raw! Audio Generation with State-Space Models »
  Karan Goel · Albert Gu · Chris Donahue · Christopher Re
- 2022 Oral: It’s Raw! Audio Generation with State-Space Models »
  Karan Goel · Albert Gu · Chris Donahue · Christopher Re
- 2022 Poster: Perfectly Balanced: Improving Transfer and Robustness of Supervised Contrastive Learning »
  Mayee Chen · Daniel Y Fu · Avanika Narayan · Michael Zhang · Zhao Song · Kayvon Fatahalian · Christopher Re
- 2022 Spotlight: Perfectly Balanced: Improving Transfer and Robustness of Supervised Contrastive Learning »
  Mayee Chen · Daniel Y Fu · Avanika Narayan · Michael Zhang · Zhao Song · Kayvon Fatahalian · Christopher Re
- 2022 Poster: ButterflyFlow: Building Invertible Layers with Butterfly Matrices »
  Chenlin Meng · Linqi Zhou · Kristy Choi · Tri Dao · Stefano Ermon
- 2022 Poster: Correct-N-Contrast: a Contrastive Approach for Improving Robustness to Spurious Correlations »
  Michael Zhang · Nimit Sohoni · Hongyang Zhang · Chelsea Finn · Christopher Re
- 2022 Oral: Correct-N-Contrast: a Contrastive Approach for Improving Robustness to Spurious Correlations »
  Michael Zhang · Nimit Sohoni · Hongyang Zhang · Chelsea Finn · Christopher Re
- 2022 Spotlight: ButterflyFlow: Building Invertible Layers with Butterfly Matrices »
  Chenlin Meng · Linqi Zhou · Kristy Choi · Tri Dao · Stefano Ermon
- 2021 Poster: HoroPCA: Hyperbolic Dimensionality Reduction via Horospherical Projections »
  Ines Chami · Albert Gu · Dat P Nguyen · Christopher Re
- 2021 Spotlight: HoroPCA: Hyperbolic Dimensionality Reduction via Horospherical Projections »
  Ines Chami · Albert Gu · Dat P Nguyen · Christopher Re
- 2021 Poster: Mandoline: Model Evaluation under Distribution Shift »
  Mayee Chen · Karan Goel · Nimit Sohoni · Fait Poms · Kayvon Fatahalian · Christopher Re
- 2021 Spotlight: Mandoline: Model Evaluation under Distribution Shift »
  Mayee Chen · Karan Goel · Nimit Sohoni · Fait Poms · Kayvon Fatahalian · Christopher Re
- 2021 Poster: Catformer: Designing Stable Transformers via Sensitivity Analysis »
  Jared Quincy Davis · Albert Gu · Krzysztof Choromanski · Tri Dao · Christopher Re · Chelsea Finn · Percy Liang
- 2021 Poster: A Tale of Two Efficient and Informative Negative Sampling Distributions »
  Shabnam Daghaghi · Tharun Medini · Nicholas Meisburger · Beidi Chen · Mengnan Zhao · Anshumali Shrivastava
- 2021 Spotlight: Catformer: Designing Stable Transformers via Sensitivity Analysis »
  Jared Quincy Davis · Albert Gu · Krzysztof Choromanski · Tri Dao · Christopher Re · Chelsea Finn · Percy Liang
- 2021 Oral: A Tale of Two Efficient and Informative Negative Sampling Distributions »
  Shabnam Daghaghi · Tharun Medini · Nicholas Meisburger · Beidi Chen · Mengnan Zhao · Anshumali Shrivastava
- 2020 Poster: Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods »
  Daniel Y Fu · Mayee Chen · Frederic Sala · Sarah Hooper · Kayvon Fatahalian · Christopher Re
- 2020 Poster: Angular Visual Hardness »
  Beidi Chen · Weiyang Liu · Zhiding Yu · Jan Kautz · Anshumali Shrivastava · Animesh Garg · Anima Anandkumar
- 2020 Poster: On the Generalization Effects of Linear Transformations in Data Augmentation »
  Sen Wu · Hongyang Zhang · Gregory Valiant · Christopher Re
- 2019 Poster: Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations »
  Tri Dao · Albert Gu · Matthew Eichhorn · Atri Rudra · Christopher Re
- 2019 Poster: Learning Dependency Structures for Weak Supervision Models »
  Paroma Varma · Frederic Sala · Ann He · Alexander J Ratner · Christopher Re
- 2019 Oral: Learning Dependency Structures for Weak Supervision Models »
  Paroma Varma · Frederic Sala · Ann He · Alexander J Ratner · Christopher Re
- 2019 Oral: Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations »
  Tri Dao · Albert Gu · Matthew Eichhorn · Atri Rudra · Christopher Re
- 2019 Poster: A Kernel Theory of Modern Data Augmentation »
  Tri Dao · Albert Gu · Alexander J Ratner · Virginia Smith · Christopher De Sa · Christopher Re
- 2019 Oral: A Kernel Theory of Modern Data Augmentation »
  Tri Dao · Albert Gu · Alexander J Ratner · Virginia Smith · Christopher De Sa · Christopher Re
- 2018 Poster: Representation Tradeoffs for Hyperbolic Embeddings »
  Frederic Sala · Christopher De Sa · Albert Gu · Christopher Re
- 2018 Oral: Representation Tradeoffs for Hyperbolic Embeddings »
  Frederic Sala · Christopher De Sa · Albert Gu · Christopher Re
- 2017 Poster: Learning the Structure of Generative Models without Labeled Data »
  Stephen Bach · Bryan He · Alexander J Ratner · Christopher Re
- 2017 Talk: Learning the Structure of Generative Models without Labeled Data »
  Stephen Bach · Bryan He · Alexander J Ratner · Christopher Re