Poster
On the Generalization Benefit of Noise in Stochastic Gradient Descent
Samuel Smith · Erich Elsen · Soham De

Wed Jul 15 01:00 PM -- 01:45 PM & Thu Jul 16 01:00 AM -- 01:45 AM (PDT) @ Virtual

It has long been argued that minibatch stochastic gradient descent can generalize better than large-batch gradient descent in deep neural networks. However, recent papers have questioned this claim, arguing that this effect is simply a consequence of suboptimal hyperparameter tuning or insufficient compute budgets when the batch size is large. In this paper, we perform carefully designed experiments and rigorous hyperparameter sweeps on a range of popular models, which verify that small or moderately large batch sizes can substantially outperform very large batches on the test set. This occurs even when both models are trained for the same number of iterations and large batches achieve smaller training losses. Our results confirm that the noise in stochastic gradients can enhance generalization. We study how the optimal learning rate schedule changes as the epoch budget grows, and we provide a theoretical account of our observations based on the stochastic differential equation perspective of SGD dynamics.
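
The "noise in stochastic gradients" mentioned above can be illustrated with a small toy sketch. It is not taken from the paper: the one-parameter linear regression, the helper names (`make_data`, `minibatch_grad`, `grad_variance`), and the batch sizes are all assumptions chosen for illustration. It only shows the standard fact that, when examples are sampled independently with replacement, the variance of a minibatch gradient shrinks in proportion to 1 / batch size, so small batches inject more noise per step.

```python
import random
import statistics


def make_data(n, seed=0):
    """Hypothetical toy dataset: y = 2x + Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def minibatch_grad(w, xs, ys, batch_size, rng):
    """Gradient of the mean squared error w.r.t. w on one random minibatch."""
    # Sample indices with replacement so the draws are independent.
    idx = [rng.randrange(len(xs)) for _ in range(batch_size)]
    return sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in idx) / batch_size


def grad_variance(w, xs, ys, batch_size, trials=2000, seed=1):
    """Empirical variance of the minibatch gradient over many random batches."""
    rng = random.Random(seed)
    grads = [minibatch_grad(w, xs, ys, batch_size, rng) for _ in range(trials)]
    return statistics.pvariance(grads)


xs, ys = make_data(1000)
var_small = grad_variance(0.0, xs, ys, batch_size=8)
var_large = grad_variance(0.0, xs, ys, batch_size=128)

# With independent sampling, the ratio should be close to 128 / 8 = 16.
ratio = var_small / var_large
print(f"variance at B=8:   {var_small:.4f}")
print(f"variance at B=128: {var_large:.4f}")
print(f"ratio:             {ratio:.2f}")
```

This sketch says nothing about generalization by itself; it only makes concrete the quantity (gradient noise as a function of batch size) whose effect the abstract discusses.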

Author Information

Samuel Smith (DeepMind)
Erich Elsen (Google)
Soham De (DeepMind)

More from the Same Authors

  • 2023 Poster: Resurrecting Recurrent Neural Networks for Long Sequences »
    Antonio Orvieto · Samuel Smith · Albert Gu · Anushan Fernando · Caglar Gulchere · Razvan Pascanu · Soham De
  • 2023 Oral: Resurrecting Recurrent Neural Networks for Long Sequences »
    Antonio Orvieto · Samuel Smith · Albert Gu · Anushan Fernando · Caglar Gulchere · Razvan Pascanu · Soham De
  • 2022 Poster: The State of Sparse Training in Deep Reinforcement Learning »
    Laura Graesser · Utku Evci · Erich Elsen · Pablo Samuel Castro
  • 2022 Spotlight: The State of Sparse Training in Deep Reinforcement Learning »
    Laura Graesser · Utku Evci · Erich Elsen · Pablo Samuel Castro
  • 2022 Poster: Improving Language Models by Retrieving from Trillions of Tokens »
    Sebastian Borgeaud · Arthur Mensch · Jordan Hoffmann · Trevor Cai · Eliza Rutherford · Katie Millican · George van den Driessche · Jean-Baptiste Lespiau · Bogdan Damoc · Aidan Clark · Diego de Las Casas · Aurelia Guy · Jacob Menick · Roman Ring · Tom Hennigan · Saffron Huang · Loren Maggiore · Chris Jones · Albin Cassirer · Andy Brock · Michela Paganini · Geoffrey Irving · Oriol Vinyals · Simon Osindero · Karen Simonyan · Jack Rae · Erich Elsen · Laurent Sifre
  • 2022 Poster: Unified Scaling Laws for Routed Language Models »
    Aidan Clark · Diego de Las Casas · Aurelia Guy · Arthur Mensch · Michela Paganini · Jordan Hoffmann · Bogdan Damoc · Blake Hechtman · Trevor Cai · Sebastian Borgeaud · George van den Driessche · Eliza Rutherford · Tom Hennigan · Matthew Johnson · Albin Cassirer · Chris Jones · Elena Buchatskaya · David Budden · Laurent Sifre · Simon Osindero · Oriol Vinyals · Marc'Aurelio Ranzato · Jack Rae · Erich Elsen · Koray Kavukcuoglu · Karen Simonyan
  • 2022 Spotlight: Improving Language Models by Retrieving from Trillions of Tokens »
    Sebastian Borgeaud · Arthur Mensch · Jordan Hoffmann · Trevor Cai · Eliza Rutherford · Katie Millican · George van den Driessche · Jean-Baptiste Lespiau · Bogdan Damoc · Aidan Clark · Diego de Las Casas · Aurelia Guy · Jacob Menick · Roman Ring · Tom Hennigan · Saffron Huang · Loren Maggiore · Chris Jones · Albin Cassirer · Andy Brock · Michela Paganini · Geoffrey Irving · Oriol Vinyals · Simon Osindero · Karen Simonyan · Jack Rae · Erich Elsen · Laurent Sifre
  • 2022 Oral: Unified Scaling Laws for Routed Language Models »
    Aidan Clark · Diego de Las Casas · Aurelia Guy · Arthur Mensch · Michela Paganini · Jordan Hoffmann · Bogdan Damoc · Blake Hechtman · Trevor Cai · Sebastian Borgeaud · George van den Driessche · Eliza Rutherford · Tom Hennigan · Matthew Johnson · Albin Cassirer · Chris Jones · Elena Buchatskaya · David Budden · Laurent Sifre · Simon Osindero · Oriol Vinyals · Marc'Aurelio Ranzato · Jack Rae · Erich Elsen · Koray Kavukcuoglu · Karen Simonyan
  • 2021 Poster: High-Performance Large-Scale Image Recognition Without Normalization »
    Andy Brock · Soham De · Samuel Smith · Karen Simonyan
  • 2021 Spotlight: High-Performance Large-Scale Image Recognition Without Normalization »
    Andy Brock · Soham De · Samuel Smith · Karen Simonyan
  • 2020 Poster: Rigging the Lottery: Making All Tickets Winners »
    Utku Evci · Trevor Gale · Jacob Menick · Pablo Samuel Castro · Erich Elsen
  • 2020 Poster: The Impact of Neural Network Overparameterization on Gradient Confusion and Stochastic Gradient Descent »
    Karthik Abinav Sankararaman · Soham De · Zheng Xu · W. Ronny Huang · Tom Goldstein
  • 2019 : Poster discussion »
    Roman Novak · Maxime Gabella · Frederic Dreyer · Siavash Golkar · Anh Tong · Irina Higgins · Mirco Milletari · Joe Antognini · Sebastian Goldt · Adín Ramírez Rivera · Roberto Bondesan · Ryo Karakida · Remi Tachet des Combes · Michael Mahoney · Nicholas Walker · Stanislav Fort · Samuel Smith · Rohan Ghosh · Aristide Baratin · Diego Granziol · Stephen Roberts · Dmitry Vetrov · Andrew Wilson · César Laurent · Valentin Thomas · Simon Lacoste-Julien · Dar Gilboa · Daniel Soudry · Anupam Gupta · Anirudh Goyal · Yoshua Bengio · Erich Elsen · Soham De · Stanislaw Jastrzebski · Charles H Martin · Samira Shabanian · Aaron Courville · Shorato Akaho · Lenka Zdeborova · Ethan Dyer · Maurice Weiler · Pim de Haan · Taco Cohen · Max Welling · Ping Luo · zhanglin peng · Nasim Rahaman · Loic Matthey · Danilo J. Rezende · Jaesik Choi · Kyle Cranmer · Lechao Xiao · Jaehoon Lee · Yasaman Bahri · Jeffrey Pennington · Greg Yang · Jiri Hron · Jascha Sohl-Dickstein · Guy Gur-Ari
  • 2019 Poster: The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study »
    Daniel Park · Jascha Sohl-Dickstein · Quoc Le · Samuel Smith
  • 2019 Oral: The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study »
    Daniel Park · Jascha Sohl-Dickstein · Quoc Le · Samuel Smith
  • 2018 Poster: Efficient Neural Audio Synthesis »
    Nal Kalchbrenner · Erich Elsen · Karen Simonyan · Seb Noury · Norman Casagrande · Edward Lockhart · Florian Stimberg · Aäron van den Oord · Sander Dieleman · Koray Kavukcuoglu
  • 2018 Oral: Efficient Neural Audio Synthesis »
    Nal Kalchbrenner · Erich Elsen · Karen Simonyan · Seb Noury · Norman Casagrande · Edward Lockhart · Florian Stimberg · Aäron van den Oord · Sander Dieleman · Koray Kavukcuoglu