To reach top-tier performance, deep learning models usually require a large number of parameters and operations, consuming considerable power and memory. Several methods have been proposed to tackle this problem by leveraging quantization of parameters, pruning, clustering of parameters, decompositions of convolutions, or distillation. However, most of these works focus mainly on improving efficiency at inference time, disregarding the training cost. In practice, though, most of the energy footprint of deep learning results from training. Hence, this workshop focuses on reducing the training complexity of deep neural networks. Our aim is to encourage submissions specifically concerning the reduction of energy, time, or memory usage at training time. Topics of interest include but are not limited to: (i) compression methods for memory and complexity reduction during training, (ii) energy-efficient hardware architectures, (iii) energy-efficient training algorithms, (iv) novel energy models or energy-efficiency training benchmarks, (v) practical applications of low-energy training.
Sat 5:45 a.m. - 6:00 a.m. | Opening welcome speech (Intro)
Sat 6:00 a.m. - 6:30 a.m. | Melika Payvand: Brain-inspired hardware and algorithm co-design for low power online training on the edge (Keynote)
Sat 6:30 a.m. - 7:00 a.m. | Alexander Keller: How Computer Graphics advances Hardware Aware Efficient Training (Keynote)
Sat 7:00 a.m. - | Not All Lotteries Are Made Equal (Poster)
The Lottery Ticket Hypothesis (LTH) states that a reasonably sized neural network contains a sub-network that, when trained from the same initialization, performs at least as well as its dense counterpart. This work investigates the relation between model size and the ease of finding these sparse sub-networks. We show through experiments that, surprisingly, under a finite budget, smaller models benefit more from Ticket Search (TS).
Surya Kant Sahu · Sai Mitheran · Somya Suhans Mahapatra
Sat 7:00 a.m. - | Revisiting Architecture-aware Knowledge Distillation: Smaller Models and Faster Search (Poster)
Knowledge Distillation (KD) has recently emerged as a popular method for compressing neural networks. Recent studies have proposed generalized distillation methods that jointly find the parameters and architectures of student models. However, such searches require substantial computation and consider only convolutional blocks in their search space. This paper introduces a new algorithm, coined Trust Region Aware architecture search to Distill knowledge Effectively (TRADE), that rapidly finds effective student architectures from several state-of-the-art architectures using a trust-region Bayesian optimization approach. Experimental results show our proposed TRADE algorithm consistently outperforms both the conventional NAS approach and pre-defined architectures under KD training.
Taehyeon Kim · Heesoo Myeong · Se-Young Yun
Sat 7:00 a.m. - | Efficient Fine-Tuning of Compressed Language Models with Learners (Poster)
Fine-tuning BERT-based models is resource-intensive in memory, computation, and time. While many prior works aim to improve inference efficiency via compression techniques, e.g., pruning, these works do not explicitly address the computational challenges of training on downstream tasks. We introduce the Learner module, a novel method for fine-tuning that exploits the overparameterization of pre-trained language models to gain benefits in convergence speed and resource utilization. Learner modules navigate the double bind of 1) training efficiently by fine-tuning a subset of parameters, and 2) training effectively by ensuring quick convergence and high metric scores. Our results on DistilBERT demonstrate that Learners perform on par with or surpass the baselines. Learners train 7x fewer parameters than state-of-the-art methods on GLUE. On CoLA, Learners fine-tune 20% faster and have significantly lower resource utilization.
Danilo Vucetic · Mohammadreza Tayaranian · Maryam Zia · James J. Clark · Brett Meyer · Warren Gross
Sat 7:00 a.m. - | Cut Inner Layers: A Structured Pruning Strategy for Efficient U-Net GANs (Poster)
Pruning effectively compresses overparameterized models. Despite the success of pruning methods on discriminative models, applying them to generative models has received comparatively little attention. This study conducts structured pruning on the U-Net generators of conditional GANs. A per-layer sensitivity analysis confirms that many unnecessary filters exist in the innermost layers near the bottleneck and can be substantially pruned. Based on this observation, we prune these filters from multiple inner layers or suggest alternative architectures that eliminate the layers entirely. We evaluate our approach with Pix2Pix for image-to-image translation and Wav2Lip for speech-driven talking face generation. Our method outperforms global pruning baselines, demonstrating the importance of properly considering where to prune for U-Net generators.
Bo-Kyeong Kim · Shinkook Choi · Hancheol Park
Sat 7:00 a.m. - | A 28nm 8-bit Floating-Point CNN Training Processor with Hardware-Efficient Dynamic Sparsification and 4.7X Training Speedup (Poster)
We present an 8-bit floating-point (FP8) training processor which implements (1) highly parallel tensor cores that maintain high utilization throughout the forward propagation (FP), backward propagation (BP), and weight update (WU) phases of the training process, (2) hardware-efficient channel gating for dynamic output activation sparsity, (3) dynamic weight sparsity based on group Lasso, and (4) gradient skipping based on FP prediction error. We develop a custom ISA to flexibly support different CNN topologies and training parameters. The 28nm prototype chip demonstrates large improvements in FLOPs reduction (7.3X), energy efficiency (16.4 TFLOPS/W), and overall training latency speedup (4.7X), for both supervised and self-supervised training tasks.
Shreyas Kolala Venkataramanaiah · Jian Meng · Han-Sok Suh · Injune Yeo · Jyotishman Saikia · Sai Kiran Cherupally · Yichi Zhang · Zhiru Zhang · Jae-sun Seo
Sat 7:00 a.m. - | MobileTL: On-device Transfer Learning with Inverted Residual Blocks (Poster)
We present MobileTL, a memory- and computationally efficient on-device transfer learning method for models built with Inverted Residual Blocks (IRBs). An IRB splits a full convolution into depthwise and pointwise convolutions, leading to more stacked layers. Though efficient for inference, IRBs require additional activation maps to be stored in memory during back-propagation. To address this issue, MobileTL updates only the biases of the internal normalization layers, avoiding the need to store activation maps. Additionally, MobileTL approximates memory-intensive activation layers (e.g., Hard-Swish and ReLU6) with a signed function, thereby enabling the use of a binary mask during the backward pass. MobileTL fine-tunes only a few high-level, task-specific blocks rather than propagating the gradient through the whole network, reducing the computation cost. Our method reduces training memory usage by 46% and 53% for MobileNetV2 and V3 IRBs, respectively. For MobileNetV3, we find a 36% reduction in floating-point operations when fine-tuning 5 blocks, while incurring only a 0.6% accuracy drop on CIFAR-10. Extensive experiments on multiple datasets show that our method is Pareto-optimal under given hardware constraints when compared to prior work. Code will be available at: https://github.com/enyac-group.
Hung-Yueh Chiang · Natalia Frumkin · Feng Liang · Diana Marculescu
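For illustration, a minimal PyTorch sketch of the bias-only update idea is given below: all weights are frozen and only normalization-layer biases plus the classifier remain trainable. It assumes torchvision's mobilenet_v2 and omits MobileTL's activation approximation and per-block selection.

    import torch.nn as nn
    from torchvision.models import mobilenet_v2

    model = mobilenet_v2(weights=None)

    # Freeze everything, then re-enable only the normalization-layer biases
    # and the final classifier; no activation maps need to be stored for the
    # frozen weights during back-propagation.
    for p in model.parameters():
        p.requires_grad = False
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d) and m.bias is not None:
            m.bias.requires_grad = True
    for p in model.classifier.parameters():
        p.requires_grad = True

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable parameters: {trainable}")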
Sat 7:00 a.m. - | Low-Bit DNN Training with Hardware-Efficient Stochastic Rounding Unit Design (Poster)
Stochastic rounding is crucial in the training of low-bit deep neural networks (DNNs) to achieve high accuracy. Unfortunately, prior studies require a large number of high-precision stochastic rounding units (SRUs) to guarantee the low-bit DNN accuracy, which involves considerable hardware overhead. In this paper, we propose an automated framework to explore hardware-efficient low-bit SRUs (ESRUs) that can still generate high-quality random numbers to guarantee the accuracy of low-bit DNN training. Experimental results using state-of-the-art DNN models demonstrate that, compared to the prior 24-bit SRU with a 24-bit pseudo random number generator (PRNG), our 8-bit ESRU with a 3-bit PRNG reduces SRU resource usage by 9.75x while achieving higher accuracy.
Sung-En Chang · Geng Yuan · Alec Lu · Mengshu Sun · Yanyu Li · Xiaolong Ma · Yanyue Xie · Minghai Qin · Xue Lin · Zhenman Fang · Yanzhi Wang
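As background for this and the related generator-free poster below, the following NumPy sketch shows the plain stochastic rounding primitive that such SRU designs implement in hardware; the ESRU circuit itself is not reproduced here.

    import numpy as np

    def stochastic_round(x, rng):
        """Round each entry up or down with probability proportional to its
        distance from the two neighbouring grid points, so the rounding is
        unbiased: E[stochastic_round(x)] == x."""
        lo = np.floor(x)
        return lo + (rng.random(x.shape) < (x - lo)).astype(x.dtype)

    rng = np.random.default_rng(0)
    x = np.array([0.1, 0.5, 2.7], dtype=np.float32)
    print(stochastic_round(x, rng))   # each entry is floor(x) or floor(x) + 1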
Sat 7:00 a.m. - | Investigating the Not-So-Obvious Effects of Structured Pruning (Poster)
Structured pruning is a popular method to reduce the cost of convolutional neural networks. However, depending on the architecture, pruning introduces dimensional discrepancies which prevent the actual reduction of pruned networks and mask their true complexity. Most papers in the literature overlook these issues. We propose a method that systematically solves them and generates an operational network. We show through experiments the gap between the theoretical pruning ratio and the actual complexity revealed by our method.
Hugo Tessier · Vincent Gripon · Mathieu Léonardon · Matthieu Arzel · David Bertrand · Thomas Hannagan
Sat 7:00 a.m. - | OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning (Poster)
Large-scale deep learning models contribute to significant performance improvements on a variety of downstream tasks. Current data and model parallelism approaches use model replication and partitioning techniques to support the distributed training of ultra-large models. However, directly deploying these systems often leads to sub-optimal training efficiency due to complex model architectures and strict device memory constraints. In this paper, we propose Optimal Sharded Data Parallel (OSDP), an automated parallel training system that combines the advantages of both data and model parallelism. Given the model description and the device information, OSDP trades off memory consumption against hardware utilization, automatically generating the distributed computation graph that maximizes overall system throughput. In addition, OSDP introduces operator splitting to further alleviate peak memory footprints during training with negligible overhead, enabling the training of larger models as well as higher throughput. Extensive experimental results on multiple kinds of large-scale models demonstrate that the proposed strategy outperforms the state-of-the-art in multiple regards. Our code is available at https://anonymous.4open.science/r/OptimalShardedDataParallel-751F.
Youhe Jiang · Xupeng Miao · Xiaonan Nie · Bin Cui
Sat 7:00 a.m. - | Studying the impact of magnitude pruning on contrastive learning methods (Poster)
We study the impact of different versions of magnitude pruning on the representations learned by deep models trained with supervised and supervised contrastive learning methods. We discover that at high sparsity, contrastive learning results in a higher number of misclassified examples than supervised learning. We use the number of PIEs (Hooker et al., 2019), the Q-Score (Kalibhat et al., 2022), and the PD-Score (Baldock et al., 2021) to understand the impact of pruning on the learned representation quality. Our analysis suggests that popular pruning methods are oblivious to representation learning: misclassified examples are largely unique to each combination of learning and pruning methods. The negative impact of sparsity on the quality of the learned representation is highest early in the training phase.
Francesco Corti · Rahim Entezari · Sara Hooker · Davide Bacciu · Olga Saukh
Sat 7:00 a.m. - | RevBiFPN: The Fully Reversible Bidirectional Feature Pyramid Network (Poster)
This work introduces RevSilo, the first reversible module for bidirectional multi-scale feature fusion. Like other reversible methods, RevSilo eliminates the need to store hidden activations by recomputing them. Existing reversible methods, however, do not apply to multi-scale feature fusion and are therefore not applicable to a large class of networks. Bidirectional multi-scale feature fusion promotes local and global coherence and has become a de facto design principle for networks targeting spatially sensitive tasks, e.g., HRNet and EfficientDet. When paired with high-resolution inputs, these networks achieve state-of-the-art results across various computer vision tasks, but training them requires substantial accelerator memory for saving large, multi-resolution activations. These memory requirements cap network size and limit progress. Using reversible recomputation, RevSilo alleviates memory issues while still operating across resolution scales. Stacking RevSilos, we create RevBiFPN, a fully reversible bidirectional feature pyramid network. For classification, RevBiFPN is competitive with networks such as EfficientNet while using up to 19.8x less training memory. When fine-tuned on COCO, RevBiFPN provides up to a 2.5% boost in AP over HRNet using fewer MACs and a 2.4x reduction in training-time memory.
Vitaliy Chiley · Vithursan Thangarasa · Abhay Gupta · Anshul Samar · Joel Hestness · Dennis DeCoste
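The memory saving of reversible methods comes from being able to recompute a block's inputs from its outputs instead of storing them. The sketch below shows the standard RevNet-style additive coupling that this line of work builds on; RevSilo's multi-scale variant is not reproduced.

    import torch
    import torch.nn as nn

    class ReversibleBlock(nn.Module):
        """RevNet-style additive coupling: inputs are exactly recoverable from
        the outputs, so intermediate activations need not be stored."""
        def __init__(self, f: nn.Module, g: nn.Module):
            super().__init__()
            self.f, self.g = f, g

        def forward(self, x1, x2):
            y1 = x1 + self.f(x2)
            y2 = x2 + self.g(y1)
            return y1, y2

        def inverse(self, y1, y2):
            x2 = y2 - self.g(y1)
            x1 = y1 - self.f(x2)
            return x1, x2

    blk = ReversibleBlock(nn.Linear(8, 8), nn.Linear(8, 8))
    x1, x2 = torch.randn(2, 8), torch.randn(2, 8)
    y1, y2 = blk(x1, x2)
    r1, r2 = blk.inverse(y1, y2)
    print(torch.allclose(r1, x1, atol=1e-6), torch.allclose(r2, x2, atol=1e-6))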
Sat 7:00 a.m. - | Finding Structured Winning Tickets with Early Pruning (Poster)
Early in the training of a neural network, there exist sparse subnetworks ("winning lottery tickets") that can be trained in isolation to match the accuracy of full, dense training (Frankle & Carbin, 2019; Frankle et al., 2020a). While this behavior was first observed for unstructured pruning, it is less clear whether such subnetworks also appear under different structured pruning regimes, which have the advantage of being more computationally efficient than unstructured pruning. In this work, we show that a simple method of kernel pruning by mean magnitude, which outperforms the better-studied method of filter pruning, can also identify structured winning tickets, much like filter pruning or unstructured pruning. Moreover, we demonstrate that applying mean-magnitude kernel pruning to networks early in training can achieve a higher accuracy-to-FLOPs ratio than training dense networks, filter-pruned networks, or networks pruned at initialization.
Udbhav Bamba · Devin Kwok · Gintare Karolina Dziugaite · David Rolnick
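To make the scoring rule concrete, the sketch below ranks each (output-channel, input-channel) kernel of a convolution by its mean absolute weight and masks out the lowest-scoring fraction; the pruning schedule and retraining procedure of the paper are not shown.

    import torch

    def mean_magnitude_kernel_mask(conv_weight: torch.Tensor, sparsity: float):
        """conv_weight: (out_ch, in_ch, kh, kw). Returns a 0/1 mask zeroing the
        fraction `sparsity` of kernels with the lowest mean |w|."""
        scores = conv_weight.abs().mean(dim=(2, 3))          # one score per kernel
        k = int(round(scores.numel() * sparsity))
        if k == 0:
            return torch.ones_like(conv_weight)
        threshold = scores.flatten().kthvalue(k).values
        mask = (scores > threshold).to(conv_weight.dtype)
        return mask[:, :, None, None].expand_as(conv_weight)

    w = torch.randn(16, 8, 3, 3)
    mask = mean_magnitude_kernel_mask(w, sparsity=0.5)
    print(1.0 - mask.mean().item())   # roughly 0.5 of the kernels are pruned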
Sat 7:00 a.m. - 8:30 a.m. | Poster session I: open discussion and coffee break. (Poster session)
Sat 8:30 a.m. - 9:00 a.m. | Damien Querlioz: Memory-Centric Machine Learning (Keynote)
Sat 9:00 a.m. - 10:15 a.m. | Lunch
Sat 10:15 a.m. - 10:45 a.m. | Fabien Cardinaux: DNN Quantization with Mixed Precision and Trained Lookup Tables (Keynote)
Sat 10:45 a.m. - 11:15 a.m. | Tien-Ju Yang: Neural Network Design and Training for Efficient On-Device Learning (Keynote)
Sat 11:15 a.m. - 11:45 a.m. | Jian Tang: Neural Bellman-Ford Networks: An Efficient and General Path-based Method for Link Prediction based on GNNs (Keynote)
Sat 11:45 a.m. - 12:00 p.m. | Best paper award presentation (Presentation)
Sat 12:00 p.m. - | Rethinking Pareto Frontier for Performance Evaluation of Deep Neural Networks (Poster)
Performance optimization of deep learning models is conducted either manually, through automatic architecture search, or by a combination of both. At the same time, their performance strongly depends on the target hardware and on how successfully the models were trained. We propose to use a multi-dimensional Pareto frontier to re-define the efficiency measure of candidate deep learning models, where several variables such as training cost, inference latency, and accuracy play a relative role in defining a dominant model. Furthermore, a random version of the multi-dimensional Pareto frontier is introduced to mitigate the uncertainty of accuracy, latency, and throughput of deep learning models in different experimental setups. These two complementary methods can be combined to perform objective benchmarking of deep learning models. Our proposed method is applied to a wide range of deep image classification models trained on ImageNet data. Our method combines competing variables with stochastic nature in a single relative efficiency measure. This allows ranking deep learning models that run efficiently on different hardware, and objectively combining inference efficiency with training efficiency.
Vahid Partovi Nia · Alireza Ghaffari · Mahdi Zolnouri · Yvon Savaria
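As a concrete reference point, the sketch below computes a plain (non-random) multi-dimensional Pareto front over hypothetical model metrics; the randomized frontier described in the abstract is not reproduced.

    import numpy as np

    def pareto_front(costs: np.ndarray) -> np.ndarray:
        """costs: (n_models, n_objectives), every column a cost to minimize
        (e.g. training energy, inference latency, 1 - accuracy).
        Returns indices of non-dominated models."""
        keep = []
        for i, c in enumerate(costs):
            others = np.delete(costs, i, axis=0)
            dominated = np.any(np.all(others <= c, axis=1) & np.any(others < c, axis=1))
            if not dominated:
                keep.append(i)
        return np.array(keep)

    # hypothetical (training GPU-hours, latency in ms, top-1 error) for four models
    costs = np.array([[120, 4.0, 0.24],
                      [300, 3.5, 0.21],
                      [ 80, 6.0, 0.30],
                      [310, 3.6, 0.22]])
    print(pareto_front(costs))   # [0 1 2]: the last model is dominated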
Sat 12:00 p.m. - | GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures (Poster)
Attention-based language models have high computational requirements due to large parameter counts, dense operations, and large volumes of data. We modify the structure of a Transformer layer by introducing grouped transformations into the dense feed-forward layers and adding a grouped convolution module. The resulting architecture shows superior computational and task performance compared to the BERT model family, which translates to time and energy savings for model training.
Ivan Chelombiev · Daniel Justus · Douglas Orr · Anastasia Dietrich · Frithjof Gressmann · Alexandros Koliousis · Carlo Luschi
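To illustrate what a grouped transformation looks like, the sketch below replaces a Transformer feed-forward block's dense projections with block-diagonal (grouped) ones implemented as 1x1 grouped convolutions. It is only a sketch of the idea with assumed dimensions, not the GroupBERT layer, which additionally includes a grouped convolution module and output mixing.

    import torch
    import torch.nn as nn

    class GroupedFFN(nn.Module):
        """Feed-forward block whose dense matmuls are replaced by grouped
        (block-diagonal) transformations, cutting parameters and FLOPs by
        roughly the number of groups."""
        def __init__(self, d_model: int = 768, d_ff: int = 3072, groups: int = 4):
            super().__init__()
            self.up = nn.Conv1d(d_model, d_ff, kernel_size=1, groups=groups)
            self.act = nn.GELU()
            self.down = nn.Conv1d(d_ff, d_model, kernel_size=1, groups=groups)

        def forward(self, x):                 # x: (batch, seq, d_model)
            x = x.transpose(1, 2)             # Conv1d expects (batch, channels, seq)
            x = self.down(self.act(self.up(x)))
            return x.transpose(1, 2)

    ffn = GroupedFFN()
    print(sum(p.numel() for p in ffn.parameters()))   # ~4x fewer than a dense FFN
    out = ffn(torch.randn(2, 128, 768))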
Sat 12:00 p.m. - | Principal Component Networks: Parameter Reduction Early in Training (Poster)
In this paper, we show that hidden layer activations in overparameterized neural networks for image classification exist primarily in subspaces smaller than the actual model width. We further show that these subspaces can be identified early in training. Based on these observations, we show how to efficiently find small networks that exhibit similar accuracy to their overparameterized counterparts after only a few training epochs. We term these network architectures Principal Component Networks (PCNs). We evaluate PCNs on CIFAR-10 and ImageNet for VGG and ResNet style architectures and find that PCNs consistently reduce parameter counts with little accuracy loss, thus providing the potential to reduce the computational costs of deep neural network training.
Roger Waleffe · Theodoros Rekatsinas
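The claim that activations occupy a low-dimensional subspace can be checked directly; the sketch below estimates, via PCA, how many principal components of a layer's activations are needed to explain most of their variance (the threshold and synthetic data are illustrative, not the paper's procedure).

    import numpy as np

    def effective_dim(acts: np.ndarray, var_threshold: float = 0.99) -> int:
        """acts: (n_samples, width) hidden activations. Returns the number of
        principal components needed to explain `var_threshold` of the variance."""
        acts = acts - acts.mean(axis=0)
        s = np.linalg.svd(acts, compute_uv=False)
        explained = np.cumsum(s ** 2) / np.sum(s ** 2)
        return int(np.searchsorted(explained, var_threshold) + 1)

    rng = np.random.default_rng(0)
    acts = rng.normal(size=(1000, 32)) @ rng.normal(size=(32, 512))
    print(effective_dim(acts))   # about 32, far below the layer width of 512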
Sat 12:00 p.m. - | TT-PINN: A Tensor-Compressed Neural PDE Solver for Edge Computing (Poster)
Physics-informed neural networks (PINNs) have been increasingly employed due to their capability of modeling complex physical systems. To achieve better expressiveness, increasingly large network sizes are required in many problems. This poses challenges for training PINNs on edge devices with limited memory, computing, and energy resources. To enable training PINNs on edge devices, this paper proposes an end-to-end compressed PINN based on Tensor-Train decomposition. In solving a Helmholtz equation, our proposed model significantly outperforms the original PINN while using far fewer parameters, achieving satisfactory predictions with up to a 15x overall parameter reduction.
Ziyue Liu · Xinling Yu · Zheng Zhang
Sat 12:00 p.m. - | Get the Random Number on the fly: A Low-Precision DNN Training Framework using Stochastic Rounding without the Random Number Generator (Poster)
Stochastic rounding is a critical technique used in low-precision deep neural network (DNN) training to ensure good model accuracy. However, it requires a large number of random numbers generated on the fly. This is not a trivial task on hardware platforms such as FPGAs and ASICs. The widely used solution is to introduce random number generators, at extra hardware cost. In this paper, we propose to exploit the stochastic properties of the DNN training process itself and directly extract random numbers from DNNs in a self-sufficient manner. We propose different methods to obtain random numbers from different sources in neural networks, and a generator-free framework is proposed for low-precision DNN training on a variety of deep learning tasks. Moreover, we evaluate the quality of the extracted random numbers and find that high-quality random numbers widely exist in DNNs, with quality that can even pass the NIST test suite.
Geng Yuan · Sung-En Chang · Alec Lu · Jun Liu · Yanyu Li · Yushu Wu · Zhenglun Kong · Yanyue Xie · Peiyan Dong · Minghai Qin · Xiaolong Ma · Zhenman Fang · Yanzhi Wang
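One way to picture extracting randomness from the network itself (an illustrative interpretation, not the paper's specific extraction scheme) is to reuse the low-order mantissa bits of tensors that training already produces, such as gradients, as the random source for stochastic rounding:

    import numpy as np

    def low_order_bits(t: np.ndarray, n_bits: int = 3) -> np.ndarray:
        """Reinterpret the low-order mantissa bits of an existing float32 tensor
        (e.g. activations or gradients) as small integers, avoiding a dedicated
        pseudo random number generator. Illustrative only."""
        raw = np.ascontiguousarray(t, dtype=np.float32).view(np.uint32)
        return raw & np.uint32((1 << n_bits) - 1)

    grads = np.random.default_rng(0).normal(size=8).astype(np.float32)
    print(low_order_bits(grads))   # integers in [0, 7], usable as rounding noise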
Sat 12:00 p.m. - | Efficient Training of Deep Equilibrium Models (Poster)
Deep equilibrium models (DEQs) have proven to be very powerful for learning data representations. The idea is to replace traditional (explicit) feedforward neural networks with an implicit fixed-point equation, which allows decoupling the forward and backward passes. In particular, training DEQ layers becomes very memory-efficient via the implicit function theorem. However, backpropagation through DEQ layers still requires solving an expensive Jacobian-based equation. In this paper, we introduce a simple but effective strategy to avoid this computational burden. Our method re-uses the Jacobian approximation built by Broyden's method during the forward pass to compute the gradients in the backward pass. Experiments show that simply re-using this approximation can significantly speed up training while not causing any performance degradation.
Bac Nguyen · Lukas Mauch
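For context, the sketch below shows the basic DEQ pattern the abstract refers to: a layer defined by the fixed point z* = f(z*, x), solved here by naive iteration, with a cheap one-step (Jacobian-free) backward approximation in place of the exact implicit-function-theorem solve or the Broyden re-use proposed in the paper.

    import torch
    import torch.nn as nn

    class TinyDEQ(nn.Module):
        """Layer defined implicitly by the fixed point z* = f(z*, x)."""
        def __init__(self, dim: int):
            super().__init__()
            self.f = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())

        def forward(self, x: torch.Tensor, iters: int = 30) -> torch.Tensor:
            z = torch.zeros_like(x)
            with torch.no_grad():                     # solve for the fixed point
                for _ in range(iters):
                    z = self.f(torch.cat([z, x], dim=-1))
            # one differentiable step through f: a cheap, approximate backward pass
            return self.f(torch.cat([z, x], dim=-1))

    layer = TinyDEQ(dim=16)
    out = layer(torch.randn(4, 16))
    out.sum().backward()                              # gradients reach f's parameters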
Sat 12:00 p.m. - | Locally Supervised Learning with Periodic Global Guidance (Poster)
Locally supervised learning aims to train a neural network based on a local estimation of the global loss function at each decoupled module of the network. Auxiliary networks are typically appended to the modules to approximate the gradient updates based on the greedy local losses. Despite being advantageous in terms of parallelism and reduced memory consumption, this training paradigm severely degrades the generalization performance of neural networks. In this paper, we propose Periodically Guided local Learning (PGL), which periodically reinstates the global objective into the local-loss-based training of neural networks, primarily to enhance the model's generalization capability. We show that a simple periodic guidance scheme yields significant performance gains while having a low memory footprint. We conduct extensive experiments on various datasets and networks to demonstrate the effectiveness of PGL, especially in configurations with numerous decoupled modules.
Hasnain Irshad Bhatti · Jaekyun Moon
Sat 12:00 p.m. - | Energy-aware Network Operator Search in Deep Neural Networks (Poster)
This work proposes a novel Energy-aware Network Operator Search (ENOS) approach to address the energy-accuracy trade-offs of a deep neural network (DNN) accelerator. The proposed ENOS framework allows an optimal layer-wise integration of inference operators with optimal precision to maintain high prediction accuracy along with high energy efficiency. The search is formulated as a continuous optimization problem, solvable using gradient descent methods, thereby minimally increasing the training cost when learning both layer-wise inference operators and weights. We discuss multiply-accumulate (MAC) cores for digital spatial architectures that can be reconfigured to different operators and varying computing precision. ENOS training methods with single and bi-level optimization objectives are discussed and compared. We also discuss a sequential operator assignment strategy and a stochastic mode of ENOS. ENOS, characterized on ShuffleNet and SqueezeNet using CIFAR-10 and CIFAR-100, improves accuracy by 10-20% compared to conventional uni-operator approaches and by 3-5% compared to mixed-precision uni-operator implementations for the same energy budget.
Shamma Nasrin
Sat 12:00 p.m. - | TrimBERT: Tailoring BERT for Trade-offs (Poster)
Models based on BERT have been extremely successful in solving a variety of natural language processing (NLP) tasks. Unfortunately, many of these large models require a great deal of computational resources and/or time for pre-training and fine-tuning, which limits wider adoption. While self-attention layers have been well-studied, a strong justification for including the intermediate layers which follow them remains missing in the literature. In this work, we show that reducing the number of intermediate layers in BERT-Base results in minimal fine-tuning accuracy loss on downstream tasks while significantly decreasing model size and training time. We further mitigate two key bottlenecks by replacing softmax operations in the self-attention layers with a computationally simpler alternative and removing half of all layernorm operations. This further increases throughput while maintaining a high level of fine-tuning accuracy.
Sharath Nittur Sridhar · Anthony Sarah · Sairam Sundaresan
Sat 12:00 p.m. - | QReg: On Regularization Effects of Quantization (Poster)
In this paper we study the effects of quantization in DNN training. We hypothesize that weight quantization is a form of regularization and that the amount of regularization is correlated with the quantization level (precision). We confirm our hypothesis with an analytical study and empirical results. By modeling weight quantization as a form of additive noise on the weights, we explore how this noise propagates through the network at training time. We then show that the magnitude of this noise is correlated with the level of quantization. To confirm our analytical study, we performed an extensive set of experiments, summarized in this paper, in which we show that the regularization effects of quantization can be seen in various vision tasks and models over various datasets. Based on our study, we propose that 8-bit quantization provides a reliable form of regularization in different vision tasks and models.
MohammadHossein AskariHemmat · Reyhane Askari Hemmat · Alexander Hoffman · Ivan Lazarevich · Ehsan Saboori · Olivier Mastropietro · Sudhakar Sah · Yvon Savaria · Jean-Pierre David
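The additive-noise view is easy to probe numerically: the sketch below uniformly quantizes a weight tensor at several bit widths and reports the standard deviation of the induced perturbation, which grows as precision drops (an illustration of the hypothesis, not the paper's analysis).

    import numpy as np

    def quantization_noise_std(w: np.ndarray, bits: int) -> float:
        """Uniformly quantize w to `bits` bits (symmetric, per-tensor) and return
        the std of the induced perturbation w_q - w."""
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(w).max() / qmax
        w_q = np.round(w / scale) * scale
        return float((w_q - w).std())

    w = np.random.default_rng(0).normal(size=10_000).astype(np.float32)
    for bits in (8, 6, 4):
        print(bits, quantization_noise_std(w, bits))   # noise grows as bits shrink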
Sat 12:00 p.m. - | MCTensor: A High-Precision Deep Learning Library with Multi-Component Floating-Point (Poster)
In this paper, we introduce MCTensor, a library based on PyTorch that provides general-purpose, high-precision arithmetic for DL training. MCTensor is used in the same way as a PyTorch Tensor: we implement multiple basic, matrix-level computation operators and NN modules for MCTensor with an interface identical to PyTorch's. Our algorithms achieve high-precision computation and also benefit from heavily-optimized PyTorch floating-point arithmetic. We evaluate MCTensor arithmetic against PyTorch native arithmetic for a series of tasks, where models using MCTensor in float16 match or outperform PyTorch models with float32 or float64 precision.
Tao Yu · Wentao Guo · Canal Li · Tiancheng Yuan · Christopher De Sa
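Multi-component formats represent one high-precision value as an unevaluated sum of several machine floats. The classic building block is the error-free TwoSum transformation sketched below (a generic illustration of the arithmetic, not MCTensor's API).

    def two_sum(a: float, b: float) -> tuple:
        """Knuth's TwoSum: returns (s, e) with s = fl(a + b) and a + b == s + e
        exactly, so the rounding error e can be carried as a second component."""
        s = a + b
        b_virtual = s - a
        a_virtual = s - b_virtual
        return s, (a - a_virtual) + (b - b_virtual)

    s, e = two_sum(1.0, 1e-17)
    print(s, e)   # 1.0 1e-17: the part lost to float64 rounding survives in e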
Sat 12:00 p.m. - | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Poster)
Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware, accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method. FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3× speedup on GPT-2 (seq. length 1K), and 2.4× speedup on long-range arena (seq. length 1K-4K). FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).
Tri Dao · Daniel Y Fu · Stefano Ermon · Atri Rudra · Christopher Re
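The core trick, processing attention in key/value tiles without ever materializing the full score matrix, can be demonstrated for a single query with a streaming softmax in NumPy. This is only a numerical toy of the tiling idea; the actual kernel fuses it on-chip and also handles batching, masking, and the backward pass.

    import numpy as np

    def tiled_attention(q, k, v, block=64):
        """Attention output for one query q of shape (d,) against k, v of shape
        (n, d), streaming over blocks so only O(block) scores are held at once."""
        n, d = k.shape
        scale = 1.0 / np.sqrt(d)
        running_max, denom = -np.inf, 0.0
        acc = np.zeros(d)
        for start in range(0, n, block):
            kb, vb = k[start:start + block], v[start:start + block]
            s = kb @ q * scale                          # scores for this block only
            new_max = max(running_max, s.max())
            rescale = np.exp(running_max - new_max)     # correct old partial sums
            p = np.exp(s - new_max)
            acc = acc * rescale + p @ vb
            denom = denom * rescale + p.sum()
            running_max = new_max
        return acc / denom

    rng = np.random.default_rng(0)
    q, k, v = rng.normal(size=64), rng.normal(size=(512, 64)), rng.normal(size=(512, 64))
    s = k @ q / np.sqrt(64)
    w = np.exp(s - s.max())
    print(np.allclose(tiled_attention(q, k, v), (w / w.sum()) @ v))   # True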
Sat 12:00 p.m. - 1:30 p.m. | Poster session II: open discussion and coffee break. (Poster session)
Sat 1:30 p.m. - 2:15 p.m. | Panel (Discussion Panel)
Sat 2:15 p.m. - 2:30 p.m. | Closing remarks. (Closing message)
Author Information
Gonçalo Mordido (Mila & Polytechnique Montreal)
Yoshua Bengio (Mila - Quebec AI Institute)
Ghouthi BOUKLI HACENE (SONY)
Vincent Gripon (IMT Atlantique)
François Leduc-Primeau (Polytechnique Montreal)
Vahid Partovi Nia (Polytechnique Montreal)
Julie Grollier (Unité Mixte CNRS/Thalès)