In recent years, state-of-the-art methods in computer vision have utilized increasingly deep convolutional neural network architectures (CNNs), with some of the most successful models employing hundreds or even thousands of layers. A variety of pathologies such as vanishing/exploding gradients make training such deep networks challenging. While residual connections and batch normalization do enable training at these depths, it has remained unclear whether such specialized architecture designs are truly necessary to train deep CNNs. In this work, we demonstrate that it is possible to train vanilla CNNs with ten thousand layers or more simply by using an appropriate initialization scheme. We derive this initialization scheme theoretically by developing a mean field theory for signal propagation and by characterizing the conditions for dynamical isometry, the equilibration of singular values of the input-output Jacobian matrix. These conditions require that the convolution operator be an orthogonal transformation in the sense that it is norm-preserving. We present an algorithm for generating such random initial orthogonal convolution kernels and demonstrate empirically that they enable efficient training of extremely deep architectures.
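The norm-preserving condition described above can be illustrated with a small NumPy sketch. It is not necessarily the authors' exact algorithm: it assumes one simple construction in which a matrix with orthonormal rows sits at the spatial center of an otherwise zero kernel, and it assumes circular (wrap-around) padding with `c_in <= c_out`. The function names `delta_orthogonal_kernel` and `conv2d_circular` are hypothetical and chosen only for this example.

```python
import numpy as np


def delta_orthogonal_kernel(ksize, c_in, c_out, seed=0):
    """Return a (ksize, ksize, c_in, c_out) kernel that preserves norms.

    Assumption for this sketch: only the central spatial tap is non-zero,
    and it holds c_in rows of a random orthogonal (c_out x c_out) matrix.
    """
    if c_in > c_out:
        raise ValueError("this sketch requires c_in <= c_out")
    rng = np.random.default_rng(seed)
    # QR decomposition of a Gaussian matrix yields a random orthogonal Q.
    q, r = np.linalg.qr(rng.standard_normal((c_out, c_out)))
    # Flip column signs by the diagonal of R; Q stays orthogonal.
    q = q * np.sign(np.diag(r))
    kernel = np.zeros((ksize, ksize, c_in, c_out))
    # Rows of an orthogonal matrix are orthonormal, so W @ W.T = I.
    kernel[ksize // 2, ksize // 2] = q[:c_in, :]
    return kernel


def conv2d_circular(x, kernel):
    """Cross-correlate x (H, W, c_in) with kernel, using circular padding."""
    k = kernel.shape[0]
    out = np.zeros(x.shape[:2] + (kernel.shape[3],))
    for i in range(k):
        for j in range(k):
            # np.roll with shift s gives x[p - s], so this reads x[p + i - k//2].
            shifted = np.roll(x, shift=(k // 2 - i, k // 2 - j), axis=(0, 1))
            out += shifted @ kernel[i, j]
    return out


if __name__ == "__main__":
    kernel = delta_orthogonal_kernel(ksize=3, c_in=4, c_out=8)
    x = np.random.default_rng(1).standard_normal((6, 6, 4))
    y = conv2d_circular(x, kernel)
    # The two norms should agree up to floating-point error.
    print(np.linalg.norm(x), np.linalg.norm(y))
```

Because the central tap has orthonormal rows and every other tap is zero, each output pixel is an isometric image of the corresponding input pixel, so the norm of the whole feature map is unchanged by the layer.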
Author Information
Lechao Xiao (Google AI)
Yasaman Bahri (Google Brain)
Jascha Sohl-Dickstein (Google Brain)
Samuel Schoenholz (Google Brain)
Jeffrey Pennington (Google Brain)
Related Events (a corresponding poster, oral, or spotlight)
- 2018 Poster: Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks
  Wed. Jul 11th 04:15 -- 07:00 PM Room Hall B #171
More from the Same Authors
- 2023 Poster: Second-order regression models exhibit progressive sharpening to the edge of stability
  Atish Agarwala · Fabian Pedregosa · Jeffrey Pennington
- 2022 Poster: Fast Finite Width Neural Tangent Kernel
  Roman Novak · Jascha Sohl-Dickstein · Samuel Schoenholz
- 2022 Spotlight: Fast Finite Width Neural Tangent Kernel
  Roman Novak · Jascha Sohl-Dickstein · Samuel Schoenholz
- 2022 Poster: Deep equilibrium networks are sensitive to initialization statistics
  Atish Agarwala · Samuel Schoenholz
- 2022 Poster: Synergy and Symmetry in Deep Learning: Interactions between the Data, Model, and Inference Algorithm
  Lechao Xiao · Jeffrey Pennington
- 2022 Poster: Wide Bayesian neural networks have a simple weight posterior: theory and accelerated sampling
  Jiri Hron · Roman Novak · Jeffrey Pennington · Jascha Sohl-Dickstein
- 2022 Spotlight: Synergy and Symmetry in Deep Learning: Interactions between the Data, Model, and Inference Algorithm
  Lechao Xiao · Jeffrey Pennington
- 2022 Spotlight: Wide Bayesian neural networks have a simple weight posterior: theory and accelerated sampling
  Jiri Hron · Roman Novak · Jeffrey Pennington · Jascha Sohl-Dickstein
- 2022 Spotlight: Deep equilibrium networks are sensitive to initialization statistics
  Atish Agarwala · Samuel Schoenholz
- 2021 Workshop: Over-parameterization: Pitfalls and Opportunities
  Yasaman Bahri · Quanquan Gu · Amin Karbasi · Hanie Sedghi
- 2021 Poster: Learn2Hop: Learned Optimization on Rough Landscapes
  Amil Merchant · Luke Metz · Samuel Schoenholz · Ekin Dogus Cubuk
- 2021 Spotlight: Learn2Hop: Learned Optimization on Rough Landscapes
  Amil Merchant · Luke Metz · Samuel Schoenholz · Ekin Dogus Cubuk
- 2021 Poster: Tilting the playing field: Dynamical loss functions for machine learning
  Miguel Ruiz Garcia · Ge Zhang · Samuel Schoenholz · Andrea Liu
- 2021 Oral: Tilting the playing field: Dynamical loss functions for machine learning
  Miguel Ruiz Garcia · Ge Zhang · Samuel Schoenholz · Andrea Liu
- 2021 Poster: Whitening and Second Order Optimization Both Make Information in the Dataset Unusable During Training, and Can Reduce or Prevent Generalization
  Neha Wadia · Daniel Duckworth · Samuel Schoenholz · Ethan Dyer · Jascha Sohl-Dickstein
- 2021 Poster: Unbiased Gradient Estimation in Unrolled Computation Graphs with Persistent Evolution Strategies
  Paul Vicol · Luke Metz · Jascha Sohl-Dickstein
- 2021 Spotlight: Whitening and Second Order Optimization Both Make Information in the Dataset Unusable During Training, and Can Reduce or Prevent Generalization
  Neha Wadia · Daniel Duckworth · Samuel Schoenholz · Ethan Dyer · Jascha Sohl-Dickstein
- 2021 Oral: Unbiased Gradient Estimation in Unrolled Computation Graphs with Persistent Evolution Strategies
  Paul Vicol · Luke Metz · Jascha Sohl-Dickstein
- 2021 : The Mystery of Generalization: Why Does Deep Learning Work?
  Jeffrey Pennington
- 2021 Tutorial: Random Matrix Theory and ML (RMT+ML)
  Fabian Pedregosa · Courtney Paquette · Thomas Trogdon · Jeffrey Pennington
- 2020 Poster: The Neural Tangent Kernel in High Dimensions: Triple Descent and a Multi-Scale Theory of Generalization
  Ben Adlam · Jeffrey Pennington
- 2020 Poster: Infinite attention: NNGP and NTK for deep attention networks
  Jiri Hron · Yasaman Bahri · Jascha Sohl-Dickstein · Roman Novak
- 2020 Poster: Disentangling Trainability and Generalization in Deep Neural Networks
  Lechao Xiao · Jeffrey Pennington · Samuel Schoenholz
- 2019 : Poster discussion
  Roman Novak · Maxime Gabella · Frederic Dreyer · Siavash Golkar · Anh Tong · Irina Higgins · Mirco Milletari · Joe Antognini · Sebastian Goldt · Adín Ramírez Rivera · Roberto Bondesan · Ryo Karakida · Remi Tachet des Combes · Michael Mahoney · Nicholas Walker · Stanislav Fort · Samuel Smith · Rohan Ghosh · Aristide Baratin · Diego Granziol · Stephen Roberts · Dmitry Vetrov · Andrew Wilson · César Laurent · Valentin Thomas · Simon Lacoste-Julien · Dar Gilboa · Daniel Soudry · Anupam Gupta · Anirudh Goyal · Yoshua Bengio · Erich Elsen · Soham De · Stanislaw Jastrzebski · Charles H Martin · Samira Shabanian · Aaron Courville · Shorato Akaho · Lenka Zdeborova · Ethan Dyer · Maurice Weiler · Pim de Haan · Taco Cohen · Max Welling · Ping Luo · zhanglin peng · Nasim Rahaman · Loic Matthey · Danilo J. Rezende · Jaesik Choi · Kyle Cranmer · Lechao Xiao · Jaehoon Lee · Yasaman Bahri · Jeffrey Pennington · Greg Yang · Jiri Hron · Jascha Sohl-Dickstein · Guy Gur-Ari
- 2019 : Understanding overparameterized neural networks
  Jascha Sohl-Dickstein
- 2019 Workshop: Theoretical Physics for Deep Learning
  Jaehoon Lee · Jeffrey Pennington · Yasaman Bahri · Max Welling · Surya Ganguli · Joan Bruna
- 2019 : Opening Remarks
  Jaehoon Lee · Jeffrey Pennington · Yasaman Bahri · Max Welling · Surya Ganguli · Joan Bruna
- 2019 Poster: Understanding and correcting pathologies in the training of learned optimizers
  Luke Metz · Niru Maheswaranathan · Jeremy Nixon · Daniel Freeman · Jascha Sohl-Dickstein
- 2019 Poster: Guided evolutionary strategies: augmenting random search with surrogate gradients
  Niru Maheswaranathan · Luke Metz · George Tucker · Dami Choi · Jascha Sohl-Dickstein
- 2019 Oral: Guided evolutionary strategies: augmenting random search with surrogate gradients
  Niru Maheswaranathan · Luke Metz · George Tucker · Dami Choi · Jascha Sohl-Dickstein
- 2019 Oral: Understanding and correcting pathologies in the training of learned optimizers
  Luke Metz · Niru Maheswaranathan · Jeremy Nixon · Daniel Freeman · Jascha Sohl-Dickstein
- 2019 Poster: The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study
  Daniel Park · Jascha Sohl-Dickstein · Quoc Le · Samuel Smith
- 2019 Oral: The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study
  Daniel Park · Jascha Sohl-Dickstein · Quoc Le · Samuel Smith
- 2018 Poster: Dynamical Isometry and a Mean Field Theory of RNNs: Gating Enables Signal Propagation in Recurrent Neural Networks
  Minmin Chen · Jeffrey Pennington · Samuel Schoenholz
- 2018 Oral: Dynamical Isometry and a Mean Field Theory of RNNs: Gating Enables Signal Propagation in Recurrent Neural Networks
  Minmin Chen · Jeffrey Pennington · Samuel Schoenholz
- 2017 Poster: Neural Message Passing for Quantum Chemistry
  Justin Gilmer · Samuel Schoenholz · Patrick F Riley · Oriol Vinyals · George Dahl
- 2017 Talk: Neural Message Passing for Quantum Chemistry
  Justin Gilmer · Samuel Schoenholz · Patrick F Riley · Oriol Vinyals · George Dahl
- 2017 Poster: Geometry of Neural Network Loss Surfaces via Random Matrix Theory
  Jeffrey Pennington · Yasaman Bahri
- 2017 Poster: Input Switched Affine Networks: An RNN Architecture Designed for Interpretability
  Jakob Foerster · Justin Gilmer · Jan Chorowski · Jascha Sohl-Dickstein · David Sussillo
- 2017 Poster: Learned Optimizers that Scale and Generalize
  Olga Wichrowska · Niru Maheswaranathan · Matthew Hoffman · Sergio Gómez Colmenarejo · Misha Denil · Nando de Freitas · Jascha Sohl-Dickstein
- 2017 Talk: Input Switched Affine Networks: An RNN Architecture Designed for Interpretability
  Jakob Foerster · Justin Gilmer · Jan Chorowski · Jascha Sohl-Dickstein · David Sussillo
- 2017 Poster: On the Expressive Power of Deep Neural Networks
  Maithra Raghu · Ben Poole · Surya Ganguli · Jon Kleinberg · Jascha Sohl-Dickstein
- 2017 Talk: Learned Optimizers that Scale and Generalize
  Olga Wichrowska · Niru Maheswaranathan · Matthew Hoffman · Sergio Gómez Colmenarejo · Misha Denil · Nando de Freitas · Jascha Sohl-Dickstein
- 2017 Talk: On the Expressive Power of Deep Neural Networks
  Maithra Raghu · Ben Poole · Surya Ganguli · Jon Kleinberg · Jascha Sohl-Dickstein
- 2017 Talk: Geometry of Neural Network Loss Surfaces via Random Matrix Theory
  Jeffrey Pennington · Yasaman Bahri