Resurrecting Recurrent Neural Networks for Long Sequences
Recurrent Neural Networks (RNNs) offer fast inference on long sequences but are hard to optimize and slow to train. Deep state-space models (SSMs) have recently been shown to perform remarkably well on long sequence modeling tasks, and have the added benefits of fast parallelizable training and RNN-like fast inference. However, while SSMs are superficially similar to RNNs, there are important differences that make it unclear where their performance boost over RNNs comes from. We show that careful design of deep RNNs using standard signal propagation arguments can recover the impressive performance of deep SSMs on long-range reasoning tasks, while matching their training speed. To achieve this, we analyze and ablate a series of changes to standard RNNs including linearizing and diagonalizing the recurrence, using better parameterizations and initializations, and ensuring careful normalization of the forward pass. Our results provide new insights on the origins of the impressive performance of deep SSMs, and introduce an RNN block called the Linear Recurrent Unit (or LRU) that matches both their performance on the Long Range Arena benchmark and their computational efficiency.
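The recipe sketched in the abstract (a linear, diagonal recurrence with an exponential parameterization of the eigenvalues, initialization on a ring close to the unit circle, and normalization of the forward pass) can be illustrated in a few lines of JAX. The snippet below is a minimal sketch of that idea under stated assumptions, not the authors' released implementation: the function names (`init_lru_params`, `lru_forward`), the ring radii `r_min`/`r_max`, the small-phase initialization, and the single complex readout are illustrative choices.

```python
# Minimal sketch of a diagonal linear recurrence in the spirit of the LRU (JAX).
# NOT the authors' released code: names, ring radii, the small-phase init, and
# the single complex readout are illustrative assumptions.

import jax
import jax.numpy as jnp


def init_lru_params(key, state_dim, input_dim, r_min=0.9, r_max=0.999):
    """Eigenvalues sampled on a ring near |lambda| = 1, plus input/output maps."""
    k1, k2, k3, k4 = jax.random.split(key, 4)
    u1 = jax.random.uniform(k1, (state_dim,))
    u2 = jax.random.uniform(k2, (state_dim,), minval=1e-4)
    # Exponential parameterization: |lambda| = exp(-exp(nu_log)) lies in [r_min, r_max].
    nu_log = jnp.log(-0.5 * jnp.log(u1 * (r_max**2 - r_min**2) + r_min**2))
    # Phases kept small at initialization (an assumption; the best range is task-dependent).
    theta_log = jnp.log(u2 * jnp.pi / 10.0)
    # Complex input and readout matrices with simple variance-scaling initialization.
    B = jax.random.normal(k3, (state_dim, input_dim), dtype=jnp.complex64) / jnp.sqrt(2 * input_dim)
    C = jax.random.normal(k4, (input_dim, state_dim), dtype=jnp.complex64) / jnp.sqrt(state_dim)
    return dict(nu_log=nu_log, theta_log=theta_log, B=B, C=C)


def lru_forward(params, u):
    """Run x_k = lambda * x_{k-1} + gamma * B u_k over a sequence u of shape (T, input_dim)."""
    lam = jnp.exp(-jnp.exp(params["nu_log"]) + 1j * jnp.exp(params["theta_log"]))
    gamma = jnp.sqrt(1.0 - jnp.abs(lam) ** 2)               # normalizes the forward pass
    Bu = (u.astype(jnp.complex64) @ params["B"].T) * gamma  # (T, state_dim)

    def step(x, bu):
        x = lam * x + bu                                    # element-wise: the recurrence is diagonal
        return x, x

    _, xs = jax.lax.scan(step, jnp.zeros_like(Bu[0]), Bu)
    # Real part of the readout; a nonlinear projection/MLP would follow in a full block.
    return jnp.real(xs @ params["C"].T)


# Example usage on a random sequence of length 64 with 8 input channels.
params = init_lru_params(jax.random.PRNGKey(0), state_dim=32, input_dim=8)
y = lru_forward(params, jax.random.normal(jax.random.PRNGKey(1), (64, 8)))
```

Because the recurrence is linear, it also admits a parallel scan over the sequence dimension, which is what gives SSM-like training speed; the sequential `jax.lax.scan` above is used only for clarity.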
Author Information
Antonio Orvieto (ETH Zurich)
Samuel Smith (Google DeepMind)
Albert Gu (Carnegie Mellon University, DeepMind)
Anushan Fernando (DeepMind)
Caglar Gulcehre (Google DeepMind)
Razvan Pascanu (DeepMind)
Soham De (Google DeepMind)
Related Events (a corresponding poster, oral, or spotlight)
- 2023 Oral: Resurrecting Recurrent Neural Networks for Long Sequences
  Fri. Jul 28th, 01:16 -- 01:24 AM, Meeting Room 316 A-C
More from the Same Authors
- 2022: Towards a General Purpose CNN for Long Range Dependencies in $N$D
  David Romero · David Knigge · Albert Gu · Erik Bekkers · Efstratios Gavves · Jakub Tomczak · Mark Hoogendoorn
- 2022: Should You Follow the Gradient Flow? Insights from Runge-Kutta Gradient Descent
  Xiang Li · Antonio Orvieto
- 2022: Enhancing Unit-tests for Invariance Discovery
  Piersilvio De Bartolomeis · Antonio Orvieto · Giambattista Parascandolo
- 2022: Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet?
  Nenad Tomasev · Ioana Bica · Brian McWilliams · Lars Buesing · Razvan Pascanu · Charles Blundell · Jovana Mitrovic
- 2023: On the Universality of Linear Recurrences Followed by Nonlinear Projections
  Antonio Orvieto · Soham De · Razvan Pascanu · Caglar Gulcehre · Samuel Smith
- 2023: An Adaptive Method for Minimizing Non-negative Losses
  Antonio Orvieto · Lin Xiao
- 2023: On the Advantage of Lion Compared to signSGD with Momentum
  Alessandro Noiato · Luca Biggio · Antonio Orvieto
- 2023: Structured State Space Models for In-Context Reinforcement Learning
  Christopher Lu · Yannick Schroecker · Albert Gu · Emilio Parisotto · Jakob Foerster · Satinder Singh · Feryal Behbahani
- 2023: Latent Space Representations of Neural Algorithmic Reasoners
  Vladimir V. Mirjanić · Razvan Pascanu · Petar Veličković
- 2023: Asynchronous Algorithmic Alignment with Cocycles
  Andrew Dudzik · Tamara von Glehn · Razvan Pascanu · Petar Veličković
- 2023 Oral: Understanding Plasticity in Neural Networks
  Clare Lyle · Zeyu Zheng · Evgenii Nikishin · Bernardo Avila Pires · Razvan Pascanu · Will Dabney
- 2023 Poster: Understanding Plasticity in Neural Networks
  Clare Lyle · Zeyu Zheng · Evgenii Nikishin · Bernardo Avila Pires · Razvan Pascanu · Will Dabney
- 2023 Poster: An SDE for Modeling SAM: Theory and Insights
  Enea Monzio Compagnoni · Luca Biggio · Antonio Orvieto · Frank Proske · Hans Kersting · Aurelien Lucchi
- 2022 Poster: Wide Neural Networks Forget Less Catastrophically
  Seyed Iman Mirzadeh · Arslan Chaudhry · Dong Yin · Huiyi Hu · Razvan Pascanu · Dilan Gorur · Mehrdad Farajtabar
- 2022 Poster: It’s Raw! Audio Generation with State-Space Models
  Karan Goel · Albert Gu · Chris Donahue · Christopher Re
- 2022 Spotlight: Wide Neural Networks Forget Less Catastrophically
  Seyed Iman Mirzadeh · Arslan Chaudhry · Dong Yin · Huiyi Hu · Razvan Pascanu · Dilan Gorur · Mehrdad Farajtabar
- 2022 Oral: It’s Raw! Audio Generation with State-Space Models
  Karan Goel · Albert Gu · Chris Donahue · Christopher Re
- 2022 Poster: The CLRS Algorithmic Reasoning Benchmark
  Petar Veličković · Adrià Puigdomenech Badia · David Budden · Razvan Pascanu · Andrea Banino · Misha Dashevskiy · Raia Hadsell · Charles Blundell
- 2022 Spotlight: The CLRS Algorithmic Reasoning Benchmark
  Petar Veličković · Adrià Puigdomenech Badia · David Budden · Razvan Pascanu · Andrea Banino · Misha Dashevskiy · Raia Hadsell · Charles Blundell
- 2022 Poster: Anticorrelated Noise Injection for Improved Generalization
  Antonio Orvieto · Hans Kersting · Frank Proske · Francis Bach · Aurelien Lucchi
- 2022 Spotlight: Anticorrelated Noise Injection for Improved Generalization
  Antonio Orvieto · Hans Kersting · Frank Proske · Francis Bach · Aurelien Lucchi
- 2021: Invited Talk #4
  Razvan Pascanu
- 2021: Panel Discussion1
  Razvan Pascanu · Irina Rish
- 2021 Poster: HoroPCA: Hyperbolic Dimensionality Reduction via Horospherical Projections
  Ines Chami · Albert Gu · Dat P Nguyen · Christopher Re
- 2021 Spotlight: HoroPCA: Hyperbolic Dimensionality Reduction via Horospherical Projections
  Ines Chami · Albert Gu · Dat P Nguyen · Christopher Re
- 2021 Poster: Spectral Normalisation for Deep Reinforcement Learning: An Optimisation Perspective
  Florin Gogianu · Tudor Berariu · Mihaela Rosca · Claudia Clopath · Lucian Busoniu · Razvan Pascanu
- 2021 Spotlight: Spectral Normalisation for Deep Reinforcement Learning: An Optimisation Perspective
  Florin Gogianu · Tudor Berariu · Mihaela Rosca · Claudia Clopath · Lucian Busoniu · Razvan Pascanu
- 2021 Poster: Catformer: Designing Stable Transformers via Sensitivity Analysis
  Jared Quincy Davis · Albert Gu · Krzysztof Choromanski · Tri Dao · Christopher Re · Chelsea Finn · Percy Liang
- 2021 Poster: High-Performance Large-Scale Image Recognition Without Normalization
  Andy Brock · Soham De · Samuel Smith · Karen Simonyan
- 2021 Spotlight: High-Performance Large-Scale Image Recognition Without Normalization
  Andy Brock · Soham De · Samuel Smith · Karen Simonyan
- 2021 Spotlight: Catformer: Designing Stable Transformers via Sensitivity Analysis
  Jared Quincy Davis · Albert Gu · Krzysztof Choromanski · Tri Dao · Christopher Re · Chelsea Finn · Percy Liang
- 2020: Invited Talk: Razvan Pascanu "Continual Learning from an Optimization/Learning-dynamics perspective"
  Razvan Pascanu
- 2020 Workshop: Workshop on Continual Learning
  Haytham Fayek · Arslan Chaudhry · David Lopez-Paz · Eugene Belilovsky · Jonathan Richard Schwarz · Marc Pickett · Rahaf Aljundi · Sayna Ebrahimi · Razvan Pascanu · Puneet Dokania
- 2020 Poster: On the Generalization Benefit of Noise in Stochastic Gradient Descent
  Samuel Smith · Erich Elsen · Soham De
- 2020 Poster: The Impact of Neural Network Overparameterization on Gradient Confusion and Stochastic Gradient Descent
  Karthik Abinav Sankararaman · Soham De · Zheng Xu · W. Ronny Huang · Tom Goldstein
- 2020 Poster: An Accelerated DFO Algorithm for Finite-sum Convex Functions
  Yuwen Chen · Antonio Orvieto · Aurelien Lucchi
- 2020 Poster: Stabilizing Transformers for Reinforcement Learning
  Emilio Parisotto · Francis Song · Jack Rae · Razvan Pascanu · Caglar Gulcehre · Siddhant Jayakumar · Max Jaderberg · Raphael Lopez Kaufman · Aidan Clark · Seb Noury · Matthew Botvinick · Nicolas Heess · Raia Hadsell
- 2020 Poster: Improving the Gating Mechanism of Recurrent Neural Networks
  Albert Gu · Caglar Gulcehre · Thomas Paine · Matthew Hoffman · Razvan Pascanu
- 2019: Poster discussion
  Roman Novak · Maxime Gabella · Frederic Dreyer · Siavash Golkar · Anh Tong · Irina Higgins · Mirco Milletari · Joe Antognini · Sebastian Goldt · Adín Ramírez Rivera · Roberto Bondesan · Ryo Karakida · Remi Tachet des Combes · Michael Mahoney · Nicholas Walker · Stanislav Fort · Samuel Smith · Rohan Ghosh · Aristide Baratin · Diego Granziol · Stephen Roberts · Dmitry Vetrov · Andrew Wilson · César Laurent · Valentin Thomas · Simon Lacoste-Julien · Dar Gilboa · Daniel Soudry · Anupam Gupta · Anirudh Goyal · Yoshua Bengio · Erich Elsen · Soham De · Stanislaw Jastrzebski · Charles H Martin · Samira Shabanian · Aaron Courville · Shorato Akaho · Lenka Zdeborova · Ethan Dyer · Maurice Weiler · Pim de Haan · Taco Cohen · Max Welling · Ping Luo · zhanglin peng · Nasim Rahaman · Loic Matthey · Danilo J. Rezende · Jaesik Choi · Kyle Cranmer · Lechao Xiao · Jaehoon Lee · Yasaman Bahri · Jeffrey Pennington · Greg Yang · Jiri Hron · Jascha Sohl-Dickstein · Guy Gur-Ari
- 2019 Poster: Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations
  Tri Dao · Albert Gu · Matthew Eichhorn · Atri Rudra · Christopher Re
- 2019 Oral: Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations
  Tri Dao · Albert Gu · Matthew Eichhorn · Atri Rudra · Christopher Re
- 2019 Poster: A Kernel Theory of Modern Data Augmentation
  Tri Dao · Albert Gu · Alexander J Ratner · Virginia Smith · Christopher De Sa · Christopher Re
- 2019 Oral: A Kernel Theory of Modern Data Augmentation
  Tri Dao · Albert Gu · Alexander J Ratner · Virginia Smith · Christopher De Sa · Christopher Re
- 2018 Poster: Progress & Compress: A scalable framework for continual learning
  Jonathan Richard Schwarz · Wojciech Czarnecki · Jelena Luketina · Agnieszka Grabska-Barwinska · Yee Teh · Razvan Pascanu · Raia Hadsell
- 2018 Poster: Mix & Match - Agent Curricula for Reinforcement Learning
  Wojciech Czarnecki · Siddhant Jayakumar · Max Jaderberg · Leonard Hasenclever · Yee Teh · Nicolas Heess · Simon Osindero · Razvan Pascanu
- 2018 Oral: Progress & Compress: A scalable framework for continual learning
  Jonathan Richard Schwarz · Wojciech Czarnecki · Jelena Luketina · Agnieszka Grabska-Barwinska · Yee Teh · Razvan Pascanu · Raia Hadsell
- 2018 Oral: Mix & Match - Agent Curricula for Reinforcement Learning
  Wojciech Czarnecki · Siddhant Jayakumar · Max Jaderberg · Leonard Hasenclever · Yee Teh · Nicolas Heess · Simon Osindero · Razvan Pascanu
- 2018 Poster: Representation Tradeoffs for Hyperbolic Embeddings
  Frederic Sala · Christopher De Sa · Albert Gu · Christopher Re
- 2018 Oral: Representation Tradeoffs for Hyperbolic Embeddings
  Frederic Sala · Christopher De Sa · Albert Gu · Christopher Re
- 2018 Poster: Been There, Done That: Meta-Learning with Episodic Recall
  Samuel Ritter · Jane Wang · Zeb Kurth-Nelson · Siddhant Jayakumar · Charles Blundell · Razvan Pascanu · Matthew Botvinick
- 2018 Oral: Been There, Done That: Meta-Learning with Episodic Recall
  Samuel Ritter · Jane Wang · Zeb Kurth-Nelson · Siddhant Jayakumar · Charles Blundell · Razvan Pascanu · Matthew Botvinick
- 2017 Poster: Sharp Minima Can Generalize For Deep Nets
  Laurent Dinh · Razvan Pascanu · Samy Bengio · Yoshua Bengio
- 2017 Talk: Sharp Minima Can Generalize For Deep Nets
  Laurent Dinh · Razvan Pascanu · Samy Bengio · Yoshua Bengio