Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware -- accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method. FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3× speedup on GPT-2 (seq. length 1K), and 2.4× speedup on long-range arena (seq. length 1K-4K). FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).
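The tiling idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' GPU kernel and it does not model HBM or SRAM traffic. It only shows, under stated assumptions, how exact attention can be computed one key/value block at a time without ever forming the full N × N score matrix. The sketch assumes a running-softmax rescaling (a per-row running maximum and running sum) to keep the result exact across blocks. The function names and the `block_size` parameter are hypothetical, chosen for illustration.

```python
import numpy as np


def standard_attention(Q, K, V):
    """Reference implementation: materializes the full N x N score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V


def tiled_attention(Q, K, V, block_size=64):
    """Exact attention computed over key/value blocks (illustrative sketch).

    Only an N x block_size slice of scores exists at any time. Running
    per-row statistics rescale earlier partial results so the final output
    matches the standard softmax.
    """
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, V.shape[-1]))      # unnormalized output accumulator
    row_max = np.full((n, 1), -np.inf)    # running max of scores per query row
    row_sum = np.zeros((n, 1))            # running softmax denominator per row

    for start in range(0, K.shape[0], block_size):
        K_blk = K[start:start + block_size]
        V_blk = V[start:start + block_size]

        S = (Q @ K_blk.T) * scale         # scores for this block only
        new_max = np.maximum(row_max, S.max(axis=-1, keepdims=True))

        # Rescale previously accumulated values to the new running max.
        # On the first block row_max is -inf, so the correction is exactly 0.
        correction = np.exp(row_max - new_max)
        P = np.exp(S - new_max)

        row_sum = row_sum * correction + P.sum(axis=-1, keepdims=True)
        out = out * correction + P @ V_blk
        row_max = new_max

    return out / row_sum


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((256, 32))
    K = rng.standard_normal((256, 32))
    V = rng.standard_normal((256, 32))
    # The blockwise result should agree with the reference up to rounding.
    print(np.allclose(tiled_attention(Q, K, V), standard_attention(Q, K, V)))
```

In this sketch the saving is in peak intermediate memory only. The wall-clock gains reported in the abstract come from reducing reads and writes between GPU memory levels, which a NumPy example cannot reproduce.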
Author Information
Tri Dao (Stanford)
Daniel Y Fu (Stanford University)
Stefano Ermon (Stanford University)
Atri Rudra (University at Buffalo, SUNY)
Christopher Re (Stanford University)
More from the Same Authors
-
2021 : A Standardized Data Collection Toolkit for Model Benchmarking »
Avanika Narayan · Piero Molino · Karan Goel · Christopher Re -
2022 : BARACK: Partially Supervised Group Robustness With Guarantees »
Nimit Sohoni · Maziar Sanjabi · Nicolas Ballas · Aditya Grover · Shaoliang Nie · Hamed Firooz · Christopher Re -
2022 : Contrastive Adapters for Foundation Model Group Robustness »
Michael Zhang · Christopher Re -
2022 : The Importance of Background Information for Out of Distribution Generalization »
Jupinder Parmar · Khaled Saab · Brian Pogatchnik · Daniel Rubin · Christopher Ré -
2022 : Transform Once: Efficient Operator Learning in Frequency Domain »
Michael Poli · Stefano Massaroli · Federico Berto · Jinkyoo Park · Tri Dao · Christopher Re · Stefano Ermon -
2023 : The Role of Linguistic Priors in Measuring Compositional Generalization of Vision-language Models »
Chenwei Wu · Li Li · Stefano Ermon · Patrick Haffner · Rong Ge · Zaiwei Zhang -
2023 : Parallel Sampling of Diffusion Models »
Andy Shih · Suneel Belkhale · Stefano Ermon · Dorsa Sadigh · Nima Anari -
2023 : Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models »
Mayee Chen · Nicholas Roberts · Kush Bhatia · Jue Wang · Ce Zhang · Frederic Sala · Christopher Ré -
2023 : Prospectors: Leveraging Short Contexts to Mine Salient Objects in High-dimensional Imagery »
Gautam Machiraju · Arjun Desai · James Zou · Christopher Re · Parag Mallick -
2023 : On the Equivalence of Consistency-Type Models: Consistency Models, Consistent Diffusion Models, and Fokker-Planck Regularization »
Chieh-Hsin Lai · Yuhta Takida · Toshimitsu Uesaka · Naoki Murata · Yuki Mitsufuji · Stefano Ermon -
2023 : Accelerating LLM Inference with Staged Speculative Decoding »
Benjamin F Spector · Christopher Re -
2023 : H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models »
Zhenyu Zhang · Ying Sheng · Tianyi Zhou · Tianlong Chen · Lianmin Zheng · Ruisi Cai · Zhao Song · Yuandong Tian · Christopher Re · Clark Barrett · Zhangyang “Atlas” Wang · Beidi Chen -
2023 : Direct Preference Optimization: Your Language Model is Secretly a Reward Model »
Rafael Rafailov · Archit Sharma · Eric Mitchell · Stefano Ermon · Christopher Manning · Chelsea Finn -
2023 Workshop: ES-FoMo: Efficient Systems for Foundation Models »
Julien Launay · Daniel Y Fu · Tri Dao · Daniel Hesslow · Beidi Chen · Azalia Mirhoseini · Percy Liang -
2023 : Invited Talk by Stefano Ermon »
Stefano Ermon -
2023 Workshop: Differentiable Almost Everything: Differentiable Relaxations, Algorithms, Operators, and Simulators »
Felix Petersen · Marco Cuturi · Mathias Niepert · Hilde Kuehne · Michael Kagan · Willie Neiswanger · Stefano Ermon -
2023 Oral: Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time »
Zichang Liu · Jue Wang · Tri Dao · Tianyi Zhou · Binhang Yuan · Zhao Song · Anshumali Shrivastava · Ce Zhang · Yuandong Tian · Christopher Re · Beidi Chen -
2023 Oral: Hyena Hierarchy: Towards Larger Convolutional Language Models »
Michael Poli · Stefano Massaroli · Eric Nguyen · Daniel Y Fu · Tri Dao · Stephen Baccus · Yoshua Bengio · Stefano Ermon · Christopher Re -
2023 Poster: Geometric Latent Diffusion Models for 3D Molecule Generation »
Minkai Xu · Alexander Powers · Ron Dror · Stefano Ermon · Jure Leskovec -
2023 Poster: Simple Hardware-Efficient Long Convolutions for Sequence Modeling »
Daniel Y Fu · Elliot L Epstein · Eric Nguyen · Armin Thomas · Michael Zhang · Tri Dao · Atri Rudra · Christopher Re -
2023 Poster: Reflected Diffusion Models »
Aaron Lou · Stefano Ermon -
2023 Poster: FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU »
Ying Sheng · Lianmin Zheng · Binhang Yuan · Zhuohan Li · Max Ryabinin · Beidi Chen · Percy Liang · Christopher Re · Ion Stoica · Ce Zhang -
2023 Poster: Long Horizon Temperature Scaling »
Andy Shih · Dorsa Sadigh · Stefano Ermon -
2023 Oral: FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU »
Ying Sheng · Lianmin Zheng · Binhang Yuan · Zhuohan Li · Max Ryabinin · Beidi Chen · Percy Liang · Christopher Re · Ion Stoica · Ce Zhang -
2023 Poster: Hyena Hierarchy: Towards Larger Convolutional Language Models »
Michael Poli · Stefano Massaroli · Eric Nguyen · Daniel Y Fu · Tri Dao · Stephen Baccus · Yoshua Bengio · Stefano Ermon · Christopher Re -
2023 Poster: GibbsDDRM: A Partially Collapsed Gibbs Sampler for Solving Blind Inverse Problems with Denoising Diffusion Restoration »
Naoki Murata · Koichi Saito · Chieh-Hsin Lai · Yuhta Takida · Toshimitsu Uesaka · Yuki Mitsufuji · Stefano Ermon -
2023 Poster: CocktailSGD: Fine-tuning Foundation Models over 500Mbps Networks »
Jue Wang · Yucheng Lu · Binhang Yuan · Beidi Chen · Percy Liang · Chris De Sa · Christopher Re · Ce Zhang -
2023 Poster: FP-Diffusion: Improving Score-based Diffusion Models by Enforcing the Underlying Score Fokker-Planck Equation »
Chieh-Hsin Lai · Yuhta Takida · Naoki Murata · Toshimitsu Uesaka · Yuki Mitsufuji · Stefano Ermon -
2023 Poster: Deep Latent State Space Models for Time-Series Generation »
Linqi Zhou · Michael Poli · Winnie Xu · Stefano Massaroli · Stefano Ermon -
2023 Oral: GibbsDDRM: A Partially Collapsed Gibbs Sampler for Solving Blind Inverse Problems with Denoising Diffusion Restoration »
Naoki Murata · Koichi Saito · Chieh-Hsin Lai · Yuhta Takida · Toshimitsu Uesaka · Yuki Mitsufuji · Stefano Ermon -
2023 Poster: Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time »
Zichang Liu · Jue Wang · Tri Dao · Tianyi Zhou · Binhang Yuan · Zhao Song · Anshumali Shrivastava · Ce Zhang · Yuandong Tian · Christopher Re · Beidi Chen -
2023 Poster: CSP: Self-Supervised Contrastive Spatial Pre-Training for Geospatial-Visual Representations »
Gengchen Mai · Ni Lao · Yutong He · Jiaming Song · Stefano Ermon -
2022 : Generative Modeling with Stochastic Differential Equations »
Stefano Ermon -
2022 : Neural Geometric Embedding Flows »
Aaron Lou · Yang Song · Jiaming Song · Stefano Ermon -
2022 Workshop: Adaptive Experimental Design and Active Learning in the Real World »
Mojmir Mutny · Willie Neiswanger · Ilija Bogunovic · Stefano Ermon · Yisong Yue · Andreas Krause -
2022 Poster: Imitation Learning by Estimating Expertise of Demonstrators »
Mark Beliaev · Andy Shih · Stefano Ermon · Dorsa Sadigh · Ramtin Pedarsani -
2022 Poster: It’s Raw! Audio Generation with State-Space Models »
Karan Goel · Albert Gu · Chris Donahue · Christopher Re -
2022 Spotlight: Imitation Learning by Estimating Expertise of Demonstrators »
Mark Beliaev · Andy Shih · Stefano Ermon · Dorsa Sadigh · Ramtin Pedarsani -
2022 Oral: It’s Raw! Audio Generation with State-Space Models »
Karan Goel · Albert Gu · Chris Donahue · Christopher Re -
2022 Poster: A General Recipe for Likelihood-free Bayesian Optimization »
Jiaming Song · Lantao Yu · Willie Neiswanger · Stefano Ermon -
2022 Poster: Perfectly Balanced: Improving Transfer and Robustness of Supervised Contrastive Learning »
Mayee Chen · Daniel Y Fu · Avanika Narayan · Michael Zhang · Zhao Song · Kayvon Fatahalian · Christopher Re -
2022 Oral: A General Recipe for Likelihood-free Bayesian Optimization »
Jiaming Song · Lantao Yu · Willie Neiswanger · Stefano Ermon -
2022 Spotlight: Perfectly Balanced: Improving Transfer and Robustness of Supervised Contrastive Learning »
Mayee Chen · Daniel Y Fu · Avanika Narayan · Michael Zhang · Zhao Song · Kayvon Fatahalian · Christopher Re -
2022 Poster: ButterflyFlow: Building Invertible Layers with Butterfly Matrices »
Chenlin Meng · Linqi Zhou · Kristy Choi · Tri Dao · Stefano Ermon -
2022 Poster: Bit Prioritization in Variational Autoencoders via Progressive Coding »
Rui Shu · Stefano Ermon -
2022 Poster: Monarch: Expressive Structured Matrices for Efficient and Accurate Training »
Tri Dao · Beidi Chen · Nimit Sohoni · Arjun Desai · Michael Poli · Jessica Grogan · Alexander Liu · Aniruddh Rao · Atri Rudra · Christopher Re -
2022 Poster: Modular Conformal Calibration »
Charles Marx · Shengjia Zhao · Willie Neiswanger · Stefano Ermon -
2022 Poster: Correct-N-Contrast: a Contrastive Approach for Improving Robustness to Spurious Correlations »
Michael Zhang · Nimit Sohoni · Hongyang Zhang · Chelsea Finn · Christopher Re -
2022 Spotlight: Bit Prioritization in Variational Autoencoders via Progressive Coding »
Rui Shu · Stefano Ermon -
2022 Spotlight: Modular Conformal Calibration »
Charles Marx · Shengjia Zhao · Willie Neiswanger · Stefano Ermon -
2022 Oral: Correct-N-Contrast: a Contrastive Approach for Improving Robustness to Spurious Correlations »
Michael Zhang · Nimit Sohoni · Hongyang Zhang · Chelsea Finn · Christopher Re -
2022 Oral: Monarch: Expressive Structured Matrices for Efficient and Accurate Training »
Tri Dao · Beidi Chen · Nimit Sohoni · Arjun Desai · Michael Poli · Jessica Grogan · Alexander Liu · Aniruddh Rao · Atri Rudra · Christopher Re -
2022 Spotlight: ButterflyFlow: Building Invertible Layers with Butterfly Matrices »
Chenlin Meng · Linqi Zhou · Kristy Choi · Tri Dao · Stefano Ermon -
2021 : Invited Talk 5 (Stefano Ermon): Maximum Likelihood Training of Score-Based Diffusion Models »
Stefano Ermon -
2021 Poster: Temporal Predictive Coding For Model-Based Planning In Latent Space »
Tung Nguyen · Rui Shu · Tuan Pham · Hung Bui · Stefano Ermon -
2021 Poster: HoroPCA: Hyperbolic Dimensionality Reduction via Horospherical Projections »
Ines Chami · Albert Gu · Dat P Nguyen · Christopher Re -
2021 Spotlight: HoroPCA: Hyperbolic Dimensionality Reduction via Horospherical Projections »
Ines Chami · Albert Gu · Dat P Nguyen · Christopher Re -
2021 Spotlight: Temporal Predictive Coding For Model-Based Planning In Latent Space »
Tung Nguyen · Rui Shu · Tuan Pham · Hung Bui · Stefano Ermon -
2021 Poster: Bayesian Algorithm Execution: Estimating Computable Properties of Black-box Functions Using Mutual Information »
Willie Neiswanger · Ke Alexander Wang · Stefano Ermon -
2021 Spotlight: Bayesian Algorithm Execution: Estimating Computable Properties of Black-box Functions Using Mutual Information »
Willie Neiswanger · Ke Alexander Wang · Stefano Ermon -
2021 Poster: Mandoline: Model Evaluation under Distribution Shift »
Mayee Chen · Karan Goel · Nimit Sohoni · Fait Poms · Kayvon Fatahalian · Christopher Re -
2021 Poster: Accelerating Feedforward Computation via Parallel Nonlinear Equation Solving »
Yang Song · Chenlin Meng · Renjie Liao · Stefano Ermon -
2021 Spotlight: Mandoline: Model Evaluation under Distribution Shift »
Mayee Chen · Karan Goel · Nimit Sohoni · Fait Poms · Kayvon Fatahalian · Christopher Re -
2021 Spotlight: Accelerating Feedforward Computation via Parallel Nonlinear Equation Solving »
Yang Song · Chenlin Meng · Renjie Liao · Stefano Ermon -
2021 Poster: Reward Identification in Inverse Reinforcement Learning »
Kuno Kim · Shivam Garg · Kirankumar Shiragur · Stefano Ermon -
2021 Spotlight: Reward Identification in Inverse Reinforcement Learning »
Kuno Kim · Shivam Garg · Kirankumar Shiragur · Stefano Ermon -
2021 Poster: Catformer: Designing Stable Transformers via Sensitivity Analysis »
Jared Quincy Davis · Albert Gu · Krzysztof Choromanski · Tri Dao · Christopher Re · Chelsea Finn · Percy Liang -
2021 Spotlight: Catformer: Designing Stable Transformers via Sensitivity Analysis »
Jared Quincy Davis · Albert Gu · Krzysztof Choromanski · Tri Dao · Christopher Re · Chelsea Finn · Percy Liang -
2020 Poster: Predictive Coding for Locally-Linear Control »
Rui Shu · Tung Nguyen · Yinlam Chow · Tuan Pham · Khoat Than · Mohammad Ghavamzadeh · Stefano Ermon · Hung Bui -
2020 Poster: Bridging the Gap Between f-GANs and Wasserstein GANs »
Jiaming Song · Stefano Ermon -
2020 Poster: Individual Calibration with Randomized Forecasting »
Shengjia Zhao · Tengyu Ma · Stefano Ermon -
2020 Poster: Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods »
Daniel Y Fu · Mayee Chen · Frederic Sala · Sarah Hooper · Kayvon Fatahalian · Christopher Re -
2020 Poster: Domain Adaptive Imitation Learning »
Kuno Kim · Yihong Gu · Jiaming Song · Shengjia Zhao · Stefano Ermon -
2020 Poster: Training Deep Energy-Based Models with f-Divergence Minimization »
Lantao Yu · Yang Song · Jiaming Song · Stefano Ermon -
2020 Poster: On the Generalization Effects of Linear Transformations in Data Augmentation »
Sen Wu · Hongyang Zhang · Gregory Valiant · Christopher Re -
2020 Poster: Fair Generative Modeling via Weak Supervision »
Kristy Choi · Aditya Grover · Trisha Singh · Rui Shu · Stefano Ermon -
2019 Poster: Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations »
Tri Dao · Albert Gu · Matthew Eichhorn · Atri Rudra · Christopher Re -
2019 Poster: Learning Dependency Structures for Weak Supervision Models »
Paroma Varma · Frederic Sala · Ann He · Alexander J Ratner · Christopher Re -
2019 Poster: Calibrated Model-Based Deep Reinforcement Learning »
Ali Malik · Volodymyr Kuleshov · Jiaming Song · Danny Nemer · Harlan Seymour · Stefano Ermon -
2019 Poster: Graphite: Iterative Generative Modeling of Graphs »
Aditya Grover · Aaron Zweig · Stefano Ermon -
2019 Poster: Adaptive Antithetic Sampling for Variance Reduction »
Hongyu Ren · Shengjia Zhao · Stefano Ermon -
2019 Oral: Learning Dependency Structures for Weak Supervision Models »
Paroma Varma · Frederic Sala · Ann He · Alexander J Ratner · Christopher Re -
2019 Oral: Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations »
Tri Dao · Albert Gu · Matthew Eichhorn · Atri Rudra · Christopher Re -
2019 Oral: Adaptive Antithetic Sampling for Variance Reduction »
Hongyu Ren · Shengjia Zhao · Stefano Ermon -
2019 Oral: Graphite: Iterative Generative Modeling of Graphs »
Aditya Grover · Aaron Zweig · Stefano Ermon -
2019 Oral: Calibrated Model-Based Deep Reinforcement Learning »
Ali Malik · Volodymyr Kuleshov · Jiaming Song · Danny Nemer · Harlan Seymour · Stefano Ermon -
2019 Poster: A Kernel Theory of Modern Data Augmentation »
Tri Dao · Albert Gu · Alexander J Ratner · Virginia Smith · Christopher De Sa · Christopher Re -
2019 Poster: Multi-Agent Adversarial Inverse Reinforcement Learning »
Lantao Yu · Jiaming Song · Stefano Ermon -
2019 Poster: Neural Joint Source-Channel Coding »
Kristy Choi · Kedar Tatwawadi · Aditya Grover · Tsachy Weissman · Stefano Ermon -
2019 Oral: A Kernel Theory of Modern Data Augmentation »
Tri Dao · Albert Gu · Alexander J Ratner · Virginia Smith · Christopher De Sa · Christopher Re -
2019 Oral: Neural Joint Source-Channel Coding »
Kristy Choi · Kedar Tatwawadi · Aditya Grover · Tsachy Weissman · Stefano Ermon -
2019 Oral: Multi-Agent Adversarial Inverse Reinforcement Learning »
Lantao Yu · Jiaming Song · Stefano Ermon -
2018 Poster: Modeling Sparse Deviations for Compressed Sensing using Generative Models »
Manik Dhar · Aditya Grover · Stefano Ermon -
2018 Oral: Modeling Sparse Deviations for Compressed Sensing using Generative Models »
Manik Dhar · Aditya Grover · Stefano Ermon -
2018 Poster: Representation Tradeoffs for Hyperbolic Embeddings »
Frederic Sala · Christopher De Sa · Albert Gu · Christopher Re -
2018 Poster: Accelerating Natural Gradient with Higher-Order Invariance »
Yang Song · Jiaming Song · Stefano Ermon -
2018 Poster: Accurate Uncertainties for Deep Learning Using Calibrated Regression »
Volodymyr Kuleshov · Nathan Fenner · Stefano Ermon -
2018 Oral: Representation Tradeoffs for Hyperbolic Embeddings »
Frederic Sala · Christopher De Sa · Albert Gu · Christopher Re -
2018 Oral: Accelerating Natural Gradient with Higher-Order Invariance »
Yang Song · Jiaming Song · Stefano Ermon -
2018 Oral: Accurate Uncertainties for Deep Learning Using Calibrated Regression »
Volodymyr Kuleshov · Nathan Fenner · Stefano Ermon -
2017 Poster: Learning the Structure of Generative Models without Labeled Data »
Stephen Bach · Bryan He · Alexander J Ratner · Christopher Re -
2017 Poster: Learning Hierarchical Features from Deep Generative Models »
Shengjia Zhao · Jiaming Song · Stefano Ermon -
2017 Talk: Learning Hierarchical Features from Deep Generative Models »
Shengjia Zhao · Jiaming Song · Stefano Ermon -
2017 Talk: Learning the Structure of Generative Models without Labeled Data »
Stephen Bach · Bryan He · Alexander J Ratner · Christopher Re