Momentum Residual Neural Networks
Abstract
The training of deep residual neural networks (ResNets) with backpropagation has a memory cost that increases linearly with the depth of the network. A simple way to circumvent this issue is to use reversible architectures. In this paper, we propose to change the forward rule of a ResNet by adding a momentum term. The resulting networks, momentum residual neural networks (MomentumNets), are invertible. Unlike previous invertible architectures, they can be used as a drop-in replacement for any existing ResNet block. We show that MomentumNets can be interpreted in the infinitesimal step size regime as second-order ordinary differential equations (ODEs) and exactly characterize how adding momentum progressively increases the representation capabilities of MomentumNets: they can learn any linear mapping up to a multiplicative factor, while ResNets cannot. In a learning-to-optimize setting, where convergence to a fixed point is required, we show theoretically and empirically that our method succeeds while existing invertible architectures fail. We show on CIFAR and ImageNet that MomentumNets have the same accuracy as ResNets, while having a much smaller memory footprint, and show that pre-trained MomentumNets are promising for fine-tuning models.
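To make the forward rule concrete, below is a minimal sketch of one momentum residual step and its algebraic inverse, assuming the update v ← γ v + (1 − γ) f(x), x ← x + v with momentum γ ∈ (0, 1) described in the paper. The helper names (`momentum_forward`, `momentum_inverse`) and the linear layer standing in for a residual branch are illustrative choices, not the authors' code.

```python
import torch

def momentum_forward(x, v, f, gamma=0.9):
    # One momentum residual step: the velocity is a convex combination
    # of the previous velocity and the residual branch output f(x).
    v = gamma * v + (1 - gamma) * f(x)
    x = x + v
    return x, v

def momentum_inverse(x, v, f, gamma=0.9):
    # Exact inverse of the step above: recover the previous activation
    # first, then the previous velocity. Since activations can be
    # recomputed this way during the backward pass, they need not be
    # stored, which is the source of the memory savings.
    x = x - v
    v = (v - (1 - gamma) * f(x)) / gamma
    return x, v

if __name__ == "__main__":
    f = torch.nn.Linear(4, 4)       # stands in for a residual block
    x0 = torch.randn(2, 4)
    v0 = torch.zeros_like(x0)
    x1, v1 = momentum_forward(x0, v0, f)
    x_rec, v_rec = momentum_inverse(x1, v1, f)
    print(torch.allclose(x_rec, x0, atol=1e-5),
          torch.allclose(v_rec, v0, atol=1e-5))  # True True
```

Note that the inversion is exact in real arithmetic; in floating point, the division by γ only introduces a small roundoff error. Setting γ = 0 recovers the standard (non-invertible) ResNet update x ← x + f(x).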
Author Information
Michael Sander (ENS and CNRS)
Pierre Ablin (CNRS and ENS)
Mathieu Blondel (Google)
Gabriel Peyré (CNRS and ENS)
Related Events (a corresponding poster, oral, or spotlight)
- 2021 Poster: Momentum Residual Neural Networks
  Tue, Jul 20, 04:00 – 06:00 PM, Virtual Room
More from the Same Authors
- 2023 Poster: Fast, Differentiable and Sparse Top-k: a Convex Analysis Perspective
  Michael Sander · Joan Puigcerver · Josip Djolonga · Gabriel Peyré · Mathieu Blondel
- 2023 Poster: Monge, Bregman and Occam: Interpretable Optimal Transport in High-Dimensions with Feature-Sparse Maps
  Marco Cuturi · Michal Klein · Pierre Ablin
- 2022 Poster: Unsupervised Ground Metric Learning Using Wasserstein Singular Vectors
  Geert-Jan Huizing · Laura Cantini · Gabriel Peyré
- 2022 Spotlight: Unsupervised Ground Metric Learning Using Wasserstein Singular Vectors
  Geert-Jan Huizing · Laura Cantini · Gabriel Peyré
- 2022 Poster: Linear-Time Gromov Wasserstein Distances using Low Rank Couplings and Costs
  Meyer Scetbon · Gabriel Peyré · Marco Cuturi
- 2022 Spotlight: Linear-Time Gromov Wasserstein Distances using Low Rank Couplings and Costs
  Meyer Scetbon · Gabriel Peyré · Marco Cuturi
- 2021 Poster: Kernel Stein Discrepancy Descent
  Anna Korba · Pierre-Cyril Aubin-Frankowski · Szymon Majewski · Pierre Ablin
- 2021 Oral: Kernel Stein Discrepancy Descent
  Anna Korba · Pierre-Cyril Aubin-Frankowski · Szymon Majewski · Pierre Ablin
- 2021 Poster: Low-Rank Sinkhorn Factorization
  Meyer Scetbon · Marco Cuturi · Gabriel Peyré
- 2021 Spotlight: Low-Rank Sinkhorn Factorization
  Meyer Scetbon · Marco Cuturi · Gabriel Peyré
- 2020 Poster: Super-efficiency of automatic differentiation for functions defined as a minimum
  Pierre Ablin · Gabriel Peyré · Thomas Moreau
- 2020 Poster: Fast Differentiable Sorting and Ranking
  Mathieu Blondel · Olivier Teboul · Quentin Berthet · Josip Djolonga
- 2020 Poster: Implicit differentiation of Lasso-type models for hyperparameter optimization
  Quentin Bertrand · Quentin Klopfenstein · Mathieu Blondel · Samuel Vaiter · Alexandre Gramfort · Joseph Salmon
- 2019 Poster: Geometric Losses for Distributional Learning
  Arthur Mensch · Mathieu Blondel · Gabriel Peyré
- 2019 Oral: Geometric Losses for Distributional Learning
  Arthur Mensch · Mathieu Blondel · Gabriel Peyré
- 2019 Poster: Stochastic Deep Networks
  Gwendoline De Bie · Gabriel Peyré · Marco Cuturi
- 2019 Oral: Stochastic Deep Networks
  Gwendoline De Bie · Gabriel Peyré · Marco Cuturi
- 2018 Poster: Differentiable Dynamic Programming for Structured Prediction and Attention
  Arthur Mensch · Mathieu Blondel
- 2018 Oral: Differentiable Dynamic Programming for Structured Prediction and Attention
  Arthur Mensch · Mathieu Blondel
- 2018 Poster: SparseMAP: Differentiable Sparse Structured Inference
  Vlad Niculae · Andre Filipe Torres Martins · Mathieu Blondel · Claire Cardie
- 2018 Oral: SparseMAP: Differentiable Sparse Structured Inference
  Vlad Niculae · Andre Filipe Torres Martins · Mathieu Blondel · Claire Cardie
- 2017 Poster: Soft-DTW: a Differentiable Loss Function for Time-Series
  Marco Cuturi · Mathieu Blondel
- 2017 Talk: Soft-DTW: a Differentiable Loss Function for Time-Series
  Marco Cuturi · Mathieu Blondel