The early phase of training a deep neural network has a dramatic effect on the local curvature of the loss function. For instance, using a small learning rate does not guarantee stable optimization, because the optimization trajectory has a tendency to steer towards regions of the loss surface with increasing local curvature. We ask whether this tendency is connected to the widely observed phenomenon that the choice of the learning rate strongly influences generalization. We first show that stochastic gradient descent (SGD) implicitly penalizes the trace of the Fisher Information Matrix (FIM), a measure of the local curvature, from the start of training. We argue that this is an implicit regularizer in SGD by showing that explicitly penalizing the trace of the FIM can significantly improve generalization. We highlight that poor final generalization coincides with the trace of the FIM attaining a large value early in training, a phenomenon we refer to as catastrophic Fisher explosion. Finally, to gain insight into the regularization effect of penalizing the trace of the FIM, we show that it limits memorization by reducing the learning speed of examples with noisy labels more than that of examples with clean labels.
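The quantity penalized above, the trace of the FIM, admits a closed form for simple models. The sketch below computes the exact Tr(FIM) for a softmax (multinomial logistic) regression, using the identity that the gradient of log p(y|x) with respect to the weight matrix is outer(x, onehot(y) − p), so its squared norm factorizes. This is only an illustrative toy: the paper's neural-network setting instead relies on a stochastic mini-batch estimator with labels sampled from the model, and the function name `fisher_trace` and the tiny setup here are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fisher_trace(W, X):
    """Exact Tr(FIM) for softmax regression, averaged over the rows of X.

    grad_W log p(y|x) = outer(x, onehot(y) - p), hence its squared norm
    is ||x||^2 * ||onehot(y) - p||^2, and the expectation over y ~ p(y|x)
    can be summed in closed form.
    """
    P = softmax(X @ W)                                   # (N, K) predictions
    N, K = P.shape
    eye = np.eye(K)
    # sq_dev[n, y] = ||onehot(y) - p_n||^2 for every candidate label y
    sq_dev = ((eye[None, :, :] - P[:, None, :]) ** 2).sum(-1)
    # expectation over y ~ p(y|x), then average the ||x||^2-weighted result
    per_example = np.einsum('nk,nk->n', P, sq_dev)
    return float(np.mean((X ** 2).sum(axis=1) * per_example))

# Toy usage: add the penalty to a cross-entropy loss with strength lam
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
W = rng.normal(size=(3, 4))
lam = 0.1
penalty = lam * fisher_trace(W, X)
```

A convenient sanity check: with W = 0 the predictions are uniform over K classes, and the trace reduces to mean(||x||²) · (1 − 1/K).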
Author Information
Stanislaw Jastrzebski (Molecule.one / Jagiellonian University)
Devansh Arpit (Salesforce Research)
Oliver Astrand
Giancarlo Kerg (MILA)
Huan Wang (Salesforce Research)
Caiming Xiong (Salesforce)
Richard Socher (Salesforce)
Kyunghyun Cho (New York University)

Kyunghyun Cho is an associate professor of computer science and data science at New York University and a CIFAR Fellow in the Learning in Machines & Brains program. He is also a senior director of frontier research on the Prescient Design team within Genentech Research & Early Development (gRED). He was a research scientist at Facebook AI Research from June 2017 to May 2020, and a postdoctoral fellow at the University of Montreal until summer 2015 under the supervision of Prof. Yoshua Bengio, after receiving his MSc and PhD degrees from Aalto University in April 2011 and April 2014, respectively, under the supervision of Prof. Juha Karhunen, Dr. Tapani Raiko, and Dr. Alexander Ilin. He received the Samsung Ho-Am Prize in Engineering in 2021. He tries his best to find a balance among machine learning, natural language processing, and life, but almost always fails to do so.
Krzysztof J Geras (New York University)
Related Events (a corresponding poster, oral, or spotlight)
-
2021 Poster: Catastrophic Fisher Explosion: Early Phase Fisher Matrix Impacts Generalization »
Thu. Jul 22nd 04:00 -- 06:00 AM
More from the Same Authors
-
2020 : Learning Long-term Dependencies Using Cognitive Inductive Biases in Self-attention RNNs »
Giancarlo Kerg -
2021 : True Few-Shot Learning with Language Models »
Ethan Perez · Douwe Kiela · Kyunghyun Cho -
2021 : Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning »
Tengyang Xie · Nan Jiang · Huan Wang · Caiming Xiong · Yu Bai -
2021 : Sample-Efficient Learning of Stackelberg Equilibria in General-Sum Games »
Yu Bai · Chi Jin · Huan Wang · Caiming Xiong -
2022 : Linear Connectivity Reveals Generalization Strategies »
Jeevesh Juneja · Rachit Bansal · Kyunghyun Cho · João Sedoc · Naomi Saphra -
2023 : Latent State Transitions in Training Dynamics »
Michael Hu · Angelica Chen · Naomi Saphra · Kyunghyun Cho -
2023 : Separating multimodal modeling from multidimensional modeling for multimodal learning »
Divyam Madaan · Taro Makino · Sumit Chopra · Kyunghyun Cho -
2023 : Antibody DomainBed: Towards robust predictions using invariant representations of biological sequences carrying complex distribution shifts »
Natasa Tagasovska · Ji Won Park · Stephen Ra · Kyunghyun Cho -
2023 : Concept Bottleneck Generative Models »
Aya Ismail · Julius Adebayo · Hector Corrada Bravo · Stephen Ra · Kyunghyun Cho -
2023 : Protein Design with Guided Discrete Diffusion »
Nate Gruver · Samuel Stanton · Nathan Frey · Tim G. J. Rudner · Isidro Hotzel · Julien Lafrance-Vanasse · Arvind Rajpal · Kyunghyun Cho · Andrew Wilson -
2023 Poster: Towards Understanding and Improving GFlowNet Training »
Max Shen · Emmanuel Bengio · Ehsan Hajiramezanali · Andreas Loukas · Kyunghyun Cho · Tommaso Biancalani -
2023 Panel: ICML Education Outreach Panel »
Andreas Krause · Barbara Engelhardt · Emma Brunskill · Kyunghyun Cho -
2022 Poster: BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation »
Junnan Li · Dongxu Li · Caiming Xiong · Steven Hoi -
2022 Poster: Characterizing and Overcoming the Greedy Nature of Learning in Multi-modal Deep Neural Networks »
Nan Wu · Stanislaw Jastrzebski · Kyunghyun Cho · Krzysztof J Geras -
2022 Spotlight: BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation »
Junnan Li · Dongxu Li · Caiming Xiong · Steven Hoi -
2022 Spotlight: Characterizing and Overcoming the Greedy Nature of Learning in Multi-modal Deep Neural Networks »
Nan Wu · Stanislaw Jastrzebski · Kyunghyun Cho · Krzysztof J Geras -
2021 : Sample-Efficient Learning of Stackelberg Equilibria in General-Sum Games »
Yu Bai · Chi Jin · Huan Wang · Caiming Xiong -
2021 Poster: Rissanen Data Analysis: Examining Dataset Characteristics via Description Length »
Ethan Perez · Douwe Kiela · Kyunghyun Cho -
2021 Spotlight: Rissanen Data Analysis: Examining Dataset Characteristics via Description Length »
Ethan Perez · Douwe Kiela · Kyunghyun Cho -
2021 Poster: How Important is the Train-Validation Split in Meta-Learning? »
Yu Bai · Minshuo Chen · Pan Zhou · Tuo Zhao · Jason Lee · Sham Kakade · Huan Wang · Caiming Xiong -
2021 Spotlight: How Important is the Train-Validation Split in Meta-Learning? »
Yu Bai · Minshuo Chen · Pan Zhou · Tuo Zhao · Jason Lee · Sham Kakade · Huan Wang · Caiming Xiong -
2021 Poster: Don’t Just Blame Over-parametrization for Over-confidence: Theoretical Analysis of Calibration in Binary Classification »
Yu Bai · Song Mei · Huan Wang · Caiming Xiong -
2021 Spotlight: Don’t Just Blame Over-parametrization for Over-confidence: Theoretical Analysis of Calibration in Binary Classification »
Yu Bai · Song Mei · Huan Wang · Caiming Xiong -
2020 Poster: Explore, Discover and Learn: Unsupervised Discovery of State-Covering Skills »
Victor Campos · Alexander Trott · Caiming Xiong · Richard Socher · Xavier Giro-i-Nieto · Jordi Torres -
2019 : Deep Neural Networks Improve Radiologists' Performance in Breast Cancer Screening »
Krzysztof J Geras · Nan Wu -
2019 Workshop: Workshop on Multi-Task and Lifelong Reinforcement Learning »
Sarath Chandar · Shagun Sodhani · Khimya Khetarpal · Tom Zahavy · Daniel J. Mankowitz · Shie Mannor · Balaraman Ravindran · Doina Precup · Chelsea Finn · Abhishek Gupta · Amy Zhang · Kyunghyun Cho · Andrei A Rusu · Rob Fergus -
2019 : Poster discussion »
Roman Novak · Maxime Gabella · Frederic Dreyer · Siavash Golkar · Anh Tong · Irina Higgins · Mirco Milletari · Joe Antognini · Sebastian Goldt · Adín Ramírez Rivera · Roberto Bondesan · Ryo Karakida · Remi Tachet des Combes · Michael Mahoney · Nicholas Walker · Stanislav Fort · Samuel Smith · Rohan Ghosh · Aristide Baratin · Diego Granziol · Stephen Roberts · Dmitry Vetrov · Andrew Wilson · César Laurent · Valentin Thomas · Simon Lacoste-Julien · Dar Gilboa · Daniel Soudry · Anupam Gupta · Anirudh Goyal · Yoshua Bengio · Erich Elsen · Soham De · Stanislaw Jastrzebski · Charles H Martin · Samira Shabanian · Aaron Courville · Shorato Akaho · Lenka Zdeborova · Ethan Dyer · Maurice Weiler · Pim de Haan · Taco Cohen · Max Welling · Ping Luo · zhanglin peng · Nasim Rahaman · Loic Matthey · Danilo J. Rezende · Jaesik Choi · Kyle Cranmer · Lechao Xiao · Jaehoon Lee · Yasaman Bahri · Jeffrey Pennington · Greg Yang · Jiri Hron · Jascha Sohl-Dickstein · Guy Gur-Ari -
2019 : Poster Session 1 (all papers) »
Matilde Gargiani · Yochai Zur · Chaim Baskin · Evgenii Zheltonozhskii · Liam Li · Ameet Talwalkar · Xuedong Shang · Harkirat Singh Behl · Atilim Gunes Baydin · Ivo Couckuyt · Tom Dhaene · Chieh Lin · Wei Wei · Min Sun · Orchid Majumder · Michele Donini · Yoshihiko Ozaki · Ryan P. Adams · Christian Geißler · Ping Luo · zhanglin peng · · Ruimao Zhang · John Langford · Rich Caruana · Debadeepta Dey · Charles Weill · Xavi Gonzalvo · Scott Yang · Scott Yak · Eugen Hotaj · Vladimir Macko · Mehryar Mohri · Corinna Cortes · Stefan Webb · Jonathan Chen · Martin Jankowiak · Noah Goodman · Aaron Klein · Frank Hutter · Mojan Javaheripi · Mohammad Samragh · Sungbin Lim · Taesup Kim · SUNGWOONG KIM · Michael Volpp · Iddo Drori · Yamuna Krishnamurthy · Kyunghyun Cho · Stanislaw Jastrzebski · Quentin de Laroussilhe · Mingxing Tan · Xiao Ma · Neil Houlsby · Andrea Gesmundo · Zalán Borsos · Krzysztof Maziarz · Felipe Petroski Such · Joel Lehman · Kenneth Stanley · Jeff Clune · Pieter Gijsbers · Joaquin Vanschoren · Felix Mohr · Eyke Hüllermeier · Zheng Xiong · Wenpeng Zhang · Wenwu Zhu · Weijia Shao · Aleksandra Faust · Michal Valko · Michael Y Li · Hugo Jair Escalante · Marcel Wever · Andrey Khorlin · Tara Javidi · Anthony Francis · Saurajit Mukherjee · Jungtaek Kim · Michael McCourt · Saehoon Kim · Tackgeun You · Seungjin Choi · Nicolas Knudde · Alexander Tornede · Ghassen Jerfel -
2019 Poster: Non-Monotonic Sequential Text Generation »
Sean Welleck · Kiante Brantley · Hal Daumé III · Kyunghyun Cho -
2019 Poster: Parameter-Efficient Transfer Learning for NLP »
Neil Houlsby · Andrei Giurgiu · Stanislaw Jastrzebski · Bruna Morrone · Quentin de Laroussilhe · Andrea Gesmundo · Mona Attariyan · Sylvain Gelly -
2019 Poster: On the Spectral Bias of Neural Networks »
Nasim Rahaman · Aristide Baratin · Devansh Arpit · Felix Draxler · Min Lin · Fred Hamprecht · Yoshua Bengio · Aaron Courville -
2019 Oral: Non-Monotonic Sequential Text Generation »
Sean Welleck · Kiante Brantley · Hal Daumé III · Kyunghyun Cho -
2019 Oral: On the Spectral Bias of Neural Networks »
Nasim Rahaman · Aristide Baratin · Devansh Arpit · Felix Draxler · Min Lin · Fred Hamprecht · Yoshua Bengio · Aaron Courville -
2019 Oral: Parameter-Efficient Transfer Learning for NLP »
Neil Houlsby · Andrei Giurgiu · Stanislaw Jastrzebski · Bruna Morrone · Quentin de Laroussilhe · Andrea Gesmundo · Mona Attariyan · Sylvain Gelly -
2019 Poster: Learn to Grow: A Continual Structure Learning Framework for Overcoming Catastrophic Forgetting »
Xilai Li · Yingbo Zhou · Tianfu Wu · Richard Socher · Caiming Xiong -
2019 Poster: Taming MAML: Efficient unbiased meta-reinforcement learning »
Hao Liu · Richard Socher · Caiming Xiong -
2019 Poster: On the Generalization Gap in Reparameterizable Reinforcement Learning »
Huan Wang · Stephan Zheng · Caiming Xiong · Richard Socher -
2019 Oral: Learn to Grow: A Continual Structure Learning Framework for Overcoming Catastrophic Forgetting »
Xilai Li · Yingbo Zhou · Tianfu Wu · Richard Socher · Caiming Xiong -
2019 Oral: On the Generalization Gap in Reparameterizable Reinforcement Learning »
Huan Wang · Stephan Zheng · Caiming Xiong · Richard Socher -
2019 Oral: Taming MAML: Efficient unbiased meta-reinforcement learning »
Hao Liu · Richard Socher · Caiming Xiong -
2017 Poster: A Closer Look at Memorization in Deep Networks »
David Krueger · Yoshua Bengio · Stanislaw Jastrzebski · Maxinder S. Kanwal · Nicolas Ballas · Asja Fischer · Emmanuel Bengio · Devansh Arpit · Tegan Maharaj · Aaron Courville · Simon Lacoste-Julien -
2017 Talk: A Closer Look at Memorization in Deep Networks »
David Krueger · Yoshua Bengio · Stanislaw Jastrzebski · Maxinder S. Kanwal · Nicolas Ballas · Asja Fischer · Emmanuel Bengio · Devansh Arpit · Tegan Maharaj · Aaron Courville · Simon Lacoste-Julien