A striking observation about iterative magnitude pruning (IMP; Frankle et al. 2020) is that, after just a few hundred steps of dense training, the method can find a sparse sub-network trainable to the same accuracy as the dense network. However, the same does not hold at step 0, i.e., at random initialization. In this work, we seek to understand how this early phase of pre-training leads to a good initialization for IMP through the lens of the data distribution. Empirically, we observe that, holding the number of pre-training iterations constant, training on a small fraction of (randomly chosen) data suffices to obtain an equally good initialization for IMP. We additionally observe that by pre-training only on "easy" training data, we can decrease the number of steps necessary to find a good initialization for IMP compared to training on the full dataset or a randomly chosen subset. Combined, these results provide new insight into the role played by data in the early phase of training.
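To make the procedure the abstract refers to concrete, below is a minimal sketch of IMP with weight rewinding: pre-train the dense network for a few hundred steps, save those weights as the rewind point, then alternate training to completion, pruning the smallest-magnitude surviving weights, and rewinding the survivors to their early-training values. The model, synthetic data, step counts, and 20% per-round pruning fraction are illustrative placeholders, not the authors' exact setup; in the paper's experiments the pre-training data would be a small random subset or the "easy" examples.

```python
# Hedged sketch of IMP with rewinding to an early-training checkpoint.
# Toy model/data and hyperparameters are assumptions for illustration only.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-ins for the real network and (sub)sampled pre-training data.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
X, y = torch.randn(512, 20), torch.randint(0, 2, (512,))
loss_fn = nn.CrossEntropyLoss()

def train(model, masks, steps, lr=0.1):
    """Train with SGD, re-zeroing pruned weights after every update."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
        with torch.no_grad():
            for p, m in zip(model.parameters(), masks):
                p.mul_(m)

masks = [torch.ones_like(p) for p in model.parameters()]

# Early phase of dense pre-training (k steps); these weights are the
# rewind point that IMP returns to after each pruning round.
train(model, masks, steps=200)
rewind_state = copy.deepcopy(model.state_dict())

# Iterative magnitude pruning: each round removes 20% of surviving weights.
for _ in range(5):
    train(model, masks, steps=1000)  # train to completion under current mask
    with torch.no_grad():
        surviving = torch.cat([(p * m).abs().flatten()
                               for p, m in zip(model.parameters(), masks)])
        n_prune = int(0.2 * int(sum(m.sum() for m in masks)))
        threshold = torch.kthvalue(surviving[surviving > 0], n_prune).values
        for p, m in zip(model.parameters(), masks):
            m.mul_((p.abs() > threshold).float())  # drop smallest-magnitude weights
        model.load_state_dict(rewind_state)        # rewind survivors to step k
        for p, m in zip(model.parameters(), masks):
            p.mul_(m)                              # keep pruned weights at zero
```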
Author Information
Mansheej Paul (Stanford University)
Brett Larsen (Stanford University)
Surya Ganguli (Stanford University)
Jonathan Frankle (MosaicML / Harvard University)
Gintare Karolina Dziugaite (Element AI, a ServiceNow Company)
More from the Same Authors
- 2021 : Studying the Consistency and Composability of Lottery Ticket Pruning Masks
  Rajiv Movva · Michael Carbin · Jonathan Frankle
- 2021 : Towards a Unified Information-Theoretic Framework for Generalization
  Mahdi Haghifam · Gintare Karolina Dziugaite · Shay Moran
- 2021 : On the Generalization Improvement from Neural Network Pruning
  Tian Jin · Gintare Karolina Dziugaite · Michael Carbin
- 2022 : Knowledge Distillation for Efficient Sequences of Training Runs
  Xingyu Liu · Alexander Leonardi · Lu Yu · Christopher Gilmer-Hill · Matthew Leavitt · Jonathan Frankle
- 2023 : Flat minima can fail to transfer to downstream tasks
  Deepansha Singh · Ekansh Sharma · Daniel Roy · Gintare Karolina Dziugaite
- 2023 : Predicting Task Forgetting in Large Language Models
  Anat Kleiman · Jonathan Frankle · Sham Kakade · Mansheej Paul
- 2023 : A strong implicit bias in SGD dynamics towards much simpler subnetworks through stochastic collapse to invariant sets
  Surya Ganguli
- 2023 : Invited talk: Lessons Learned from Studying PAC-Bayes and Generalization
  Gintare Karolina Dziugaite
- 2022 : Finding Structured Winning Tickets with Early Pruning
  Udbhav Bamba · Devin Kwok · Gintare Karolina Dziugaite · David Rolnick
- 2022 Poster: What Can Linear Interpolation of Neural Network Loss Landscapes Tell Us?
  Tiffany Vlaar · Jonathan Frankle
- 2022 Spotlight: What Can Linear Interpolation of Neural Network Loss Landscapes Tell Us?
  Tiffany Vlaar · Jonathan Frankle
- 2021 Poster: Understanding self-supervised learning dynamics without contrastive pairs
  Yuandong Tian · Xinlei Chen · Surya Ganguli
- 2021 Poster: A theory of high dimensional regression with arbitrary correlations between input features and target functions: sample complexity, multiple descent curves and a hierarchy of phase transitions
  Gabriel Mel · Surya Ganguli
- 2021 Spotlight: A theory of high dimensional regression with arbitrary correlations between input features and target functions: sample complexity, multiple descent curves and a hierarchy of phase transitions
  Gabriel Mel · Surya Ganguli
- 2021 Oral: Understanding self-supervised learning dynamics without contrastive pairs
  Yuandong Tian · Xinlei Chen · Surya Ganguli
- 2021 Poster: On the Predictability of Pruning Across Scales
  Jonathan Rosenfeld · Jonathan Frankle · Michael Carbin · Nir Shavit
- 2021 Spotlight: On the Predictability of Pruning Across Scales
  Jonathan Rosenfeld · Jonathan Frankle · Michael Carbin · Nir Shavit
- 2020 : Q&A: Jonathan Frankle
  Jonathan Frankle · Mayoore Jaiswal
- 2020 : Contributed Talk: Jonathan Frankle
  Jonathan Frankle
- 2020 Poster: Generalization via Derandomization
  Jeffrey Negrea · Gintare Karolina Dziugaite · Daniel Roy
- 2020 Poster: Linear Mode Connectivity and the Lottery Ticket Hypothesis
  Jonathan Frankle · Gintare Karolina Dziugaite · Daniel Roy · Michael Carbin
- 2020 Poster: Two Routes to Scalable Credit Assignment without Weight Symmetry
  Daniel Kunin · Aran Nayebi · Javier Sagastuy-Brena · Surya Ganguli · Jonathan Bloom · Daniel Yamins
- 2019 Workshop: Theoretical Physics for Deep Learning
  Jaehoon Lee · Jeffrey Pennington · Yasaman Bahri · Max Welling · Surya Ganguli · Joan Bruna
- 2019 : Opening Remarks
  Jaehoon Lee · Jeffrey Pennington · Yasaman Bahri · Max Welling · Surya Ganguli · Joan Bruna
- 2018 Poster: Entropy-SGD optimizes the prior of a PAC-Bayes bound: Generalization properties of Entropy-SGD and data-dependent priors
  Gintare Karolina Dziugaite · Daniel Roy
- 2018 Oral: Entropy-SGD optimizes the prior of a PAC-Bayes bound: Generalization properties of Entropy-SGD and data-dependent priors
  Gintare Karolina Dziugaite · Daniel Roy
- 2017 Poster: Continual Learning Through Synaptic Intelligence
  Friedemann Zenke · Ben Poole · Surya Ganguli
- 2017 Talk: Continual Learning Through Synaptic Intelligence
  Friedemann Zenke · Ben Poole · Surya Ganguli
- 2017 Poster: On the Expressive Power of Deep Neural Networks
  Maithra Raghu · Ben Poole · Surya Ganguli · Jon Kleinberg · Jascha Sohl-Dickstein
- 2017 Talk: On the Expressive Power of Deep Neural Networks
  Maithra Raghu · Ben Poole · Surya Ganguli · Jon Kleinberg · Jascha Sohl-Dickstein