Deep learning networks are typically trained by Stochastic Gradient Descent (SGD) methods that iteratively improve the model parameters by estimating a gradient on a very small fraction of the training data. A major roadblock when increasing the batch size to a substantial fraction of the training data in order to reduce training time is the persistent degradation in performance (the generalization gap). To address this issue, recent work proposes adding small perturbations to the model parameters when computing the stochastic gradients, and reports improved generalization performance due to smoothing effects. However, this approach is poorly understood, and it often requires model-specific noise and fine-tuning. To alleviate these drawbacks, we propose to instead use computationally efficient extrapolation (extragradient) to stabilize the optimization trajectory while still benefiting from smoothing to avoid sharp minima. This principled approach is well grounded from an optimization perspective, and we show that a host of variations can be covered by the unified framework we propose. We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer models. We demonstrate in a variety of experiments that the scheme allows scaling to much larger batch sizes than before while reaching or surpassing state-of-the-art (SOTA) accuracy.
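To make the extrapolation idea concrete, here is a minimal sketch of a generic extragradient-style update on a toy quadratic. It is not the authors' algorithm or their unified framework: the function names (`extragradient_step`, `quadratic_grad`), the separate `extrap_lr` parameter, and the toy objective are all assumptions chosen for illustration. The gradient is evaluated at a look-ahead point, and that gradient is then applied to the original parameters.

```python
def extragradient_step(w, grad_fn, lr, extrap_lr):
    """One generic extragradient-style update on a list of floats (illustrative only)."""
    # Step 1: gradient at the current point.
    g = grad_fn(w)
    # Step 2: extrapolate to a look-ahead point along the negative gradient.
    w_look = [wi - extrap_lr * gi for wi, gi in zip(w, g)]
    # Step 3: gradient at the look-ahead point.
    g_look = grad_fn(w_look)
    # Step 4: update the ORIGINAL parameters with the look-ahead gradient.
    return [wi - lr * gi for wi, gi in zip(w, g_look)]


def quadratic_grad(w):
    # Hypothetical toy objective: f(w) = 0.5 * (1 * w0^2 + 10 * w1^2).
    return [1.0 * w[0], 10.0 * w[1]]


w = [1.0, 1.0]
for _ in range(100):
    w = extragradient_step(w, quadratic_grad, lr=0.05, extrap_lr=0.05)

final_loss = 0.5 * (1.0 * w[0] ** 2 + 10.0 * w[1] ** 2)
print(final_loss)
```

In a stochastic large-batch setting, `grad_fn` would return a mini-batch gradient estimate rather than the exact gradient used here. Each update costs two gradient evaluations.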
Author Information
Tao Lin (EPFL)
Lingjing Kong (EPFL)
Sebastian Stich (EPFL)
Martin Jaggi (EPFL)
More from the Same Authors
- 2021: iFedAvg – Interpretable Data-Interoperability for Federated Learning
  David Roschewitz · Mary-Anne Hartley · Luca Corinzia · Martin Jaggi
- 2022: The Gap Between Continuous and Discrete Gradient Descent
  Amirkeivan Mohtashami · Martin Jaggi · Sebastian Stich
- 2023: Layerwise Linear Mode Connectivity
  Linara Adilova · Asja Fischer · Martin Jaggi
- 2023: Landmark Attention: Random-Access Infinite Context Length for Transformers
  Amirkeivan Mohtashami · Martin Jaggi
- 2023: 🎤 Fast Causal Attention with Dynamic Sparsity
  Daniele Paliotta · Matteo Pagliardini · Martin Jaggi · François Fleuret
- 2023 Oral: Second-Order Optimization with Lazy Hessians
  Nikita Doikov · El Mahdi Chayti · Martin Jaggi
- 2023 Poster: Second-Order Optimization with Lazy Hessians
  Nikita Doikov · El Mahdi Chayti · Martin Jaggi
- 2023 Poster: Special Properties of Gradient Descent with Large Learning Rates
  Amirkeivan Mohtashami · Martin Jaggi · Sebastian Stich
- 2021: Exact Optimization of Conformal Predictors via Incremental and Decremental Learning (Spotlight #13)
  Giovanni Cherubin · Konstantinos Chatzikokolakis · Martin Jaggi
- 2021: Algorithms for Efficient Federated and Decentralized Learning (Q&A)
  Sebastian Stich
- 2021: Algorithms for Efficient Federated and Decentralized Learning
  Sebastian Stich
- 2021 Poster: Exact Optimization of Conformal Predictors via Incremental and Decremental Learning
  Giovanni Cherubin · Konstantinos Chatzikokolakis · Martin Jaggi
- 2021 Poster: Consensus Control for Decentralized Deep Learning
  Lingjing Kong · Tao Lin · Anastasiia Koloskova · Martin Jaggi · Sebastian Stich
- 2021 Poster: Quasi-global Momentum: Accelerating Decentralized Deep Learning on Heterogeneous Data
  Tao Lin · Sai Praneeth Reddy Karimireddy · Sebastian Stich · Martin Jaggi
- 2021 Spotlight: Exact Optimization of Conformal Predictors via Incremental and Decremental Learning
  Giovanni Cherubin · Konstantinos Chatzikokolakis · Martin Jaggi
- 2021 Spotlight: Quasi-global Momentum: Accelerating Decentralized Deep Learning on Heterogeneous Data
  Tao Lin · Sai Praneeth Reddy Karimireddy · Sebastian Stich · Martin Jaggi
- 2021 Spotlight: Consensus Control for Decentralized Deep Learning
  Lingjing Kong · Tao Lin · Anastasiia Koloskova · Martin Jaggi · Sebastian Stich
- 2021 Poster: Learning from History for Byzantine Robust Optimization
  Sai Praneeth Reddy Karimireddy · Lie He · Martin Jaggi
- 2021 Spotlight: Learning from History for Byzantine Robust Optimization
  Sai Praneeth Reddy Karimireddy · Lie He · Martin Jaggi
- 2020 Poster: Optimizer Benchmarking Needs to Account for Hyperparameter Tuning
  Prabhu Teja Sivaprasad · Florian Mai · Thijs Vogels · Martin Jaggi · François Fleuret
- 2020 Poster: A Unified Theory of Decentralized SGD with Changing Topology and Local Updates
  Anastasiia Koloskova · Nicolas Loizou · Sadra Boreiri · Martin Jaggi · Sebastian Stich
- 2020 Poster: SCAFFOLD: Stochastic Controlled Averaging for Federated Learning
  Sai Praneeth Reddy Karimireddy · Satyen Kale · Mehryar Mohri · Sashank Jakkam Reddi · Sebastian Stich · Ananda Theertha Suresh
- 2020 Poster: Is Local SGD Better than Minibatch SGD?
  Blake Woodworth · Kumar Kshitij Patel · Sebastian Stich · Zhen Dai · Brian Bullins · Brendan McMahan · Ohad Shamir · Nati Srebro
- 2019 Poster: Exploring interpretable LSTM neural networks over multi-variable data
  Tian Guo · Tao Lin · Nino Antulov-Fantulin
- 2019 Oral: Exploring interpretable LSTM neural networks over multi-variable data
  Tian Guo · Tao Lin · Nino Antulov-Fantulin
- 2019 Poster: Overcoming Multi-model Forgetting
  Yassine Benyahia · Kaicheng Yu · Kamil Bennani-Smires · Martin Jaggi · Anthony C. Davison · Mathieu Salzmann · Claudiu Musat
- 2019 Oral: Overcoming Multi-model Forgetting
  Yassine Benyahia · Kaicheng Yu · Kamil Bennani-Smires · Martin Jaggi · Anthony C. Davison · Mathieu Salzmann · Claudiu Musat
- 2019 Poster: Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication
  Anastasiia Koloskova · Sebastian Stich · Martin Jaggi
- 2019 Poster: Error Feedback Fixes SignSGD and other Gradient Compression Schemes
  Sai Praneeth Reddy Karimireddy · Quentin Rebjock · Sebastian Stich · Martin Jaggi
- 2019 Oral: Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication
  Anastasiia Koloskova · Sebastian Stich · Martin Jaggi
- 2019 Oral: Error Feedback Fixes SignSGD and other Gradient Compression Schemes
  Sai Praneeth Reddy Karimireddy · Quentin Rebjock · Sebastian Stich · Martin Jaggi
- 2018 Poster: On Matching Pursuit and Coordinate Descent
  Francesco Locatello · Anant Raj · Sai Praneeth Reddy Karimireddy · Gunnar Ratsch · Bernhard Schölkopf · Sebastian Stich · Martin Jaggi
- 2018 Oral: On Matching Pursuit and Coordinate Descent
  Francesco Locatello · Anant Raj · Sai Praneeth Reddy Karimireddy · Gunnar Ratsch · Bernhard Schölkopf · Sebastian Stich · Martin Jaggi
- 2018 Poster: A Distributed Second-Order Algorithm You Can Trust
  Celestine Mendler-Dünner · Aurelien Lucchi · Matilde Gargiani · Yatao Bian · Thomas Hofmann · Martin Jaggi
- 2018 Oral: A Distributed Second-Order Algorithm You Can Trust
  Celestine Mendler-Dünner · Aurelien Lucchi · Matilde Gargiani · Yatao Bian · Thomas Hofmann · Martin Jaggi
- 2017 Poster: Approximate Steepest Coordinate Descent
  Sebastian Stich · Anant Raj · Martin Jaggi
- 2017 Talk: Approximate Steepest Coordinate Descent
  Sebastian Stich · Anant Raj · Martin Jaggi