In many practical scenarios, such as hyperparameter search or continual retraining with new data, related training runs are performed many times in sequence. Current practice is to train each of these models independently from scratch. We study the problem of exploiting the computation invested in previous runs to reduce the cost of future runs using knowledge distillation (KD). We find that augmenting future runs with KD from previous runs dramatically reduces the time required to train these models, even after accounting for the overhead of KD. We improve on these results with two strategies that reduce the overhead of KD by 80-90% with minimal effect on accuracy, yielding substantial Pareto improvements in overall cost. We conclude that KD is a promising avenue for reducing the cost of the expensive preparatory work that precedes training final models in practice.
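To make the setup concrete, the sketch below shows one way KD from a previous run could be wired into a new training run. It assumes the standard soft-target distillation objective (cross-entropy plus temperature-scaled KL divergence against a frozen teacher); the specific overhead-reduction strategies described in the abstract are not reproduced here, and names such as `teacher`, `student`, `alpha`, and `temperature` are illustrative assumptions rather than the paper's exact recipe.

```python
# Minimal sketch: distilling from a previous run's checkpoint ("teacher")
# into the model of a new run ("student"). Illustrative only.
import torch
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=4.0):
    """Weighted sum of hard-label cross-entropy and soft-target KL divergence."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return (1.0 - alpha) * hard + alpha * soft


def train_step(student, teacher, optimizer, inputs, labels):
    teacher.eval()  # the previous run's model stays frozen
    with torch.no_grad():
        # This extra teacher forward pass is the KD overhead the abstract refers to.
        teacher_logits = teacher(inputs)
    student_logits = student(inputs)
    loss = kd_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this framing, reducing KD overhead amounts to reducing how often or how expensively the teacher forward pass is performed; the two strategies studied in the paper target exactly that cost.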
Author Information
Xingyu Liu (Harvard University)
Alexander Leonardi (Harvard University)
Lu Yu (Harvard University)
Christopher Gilmer-Hill (Harvard University)
Matthew Leavitt (Facebook AI Research (FAIR))
Jonathan Frankle (MosaicML / Harvard)
More from the Same Authors
- 2021: Studying the Consistency and Composability of Lottery Ticket Pruning Masks
  Rajiv Movva · Michael Carbin · Jonathan Frankle
- 2022: Pre-Training on a Data Diet: Identifying Sufficient Examples for Early Training
  Mansheej Paul · Brett Larsen · Surya Ganguli · Jonathan Frankle · Gintare Karolina Dziugaite
- 2022 Poster: What Can Linear Interpolation of Neural Network Loss Landscapes Tell Us?
  Tiffany Vlaar · Jonathan Frankle
- 2022 Spotlight: What Can Linear Interpolation of Neural Network Loss Landscapes Tell Us?
  Tiffany Vlaar · Jonathan Frankle
- 2021 Poster: On the Predictability of Pruning Across Scales
  Jonathan Rosenfeld · Jonathan Frankle · Michael Carbin · Nir Shavit
- 2021 Spotlight: On the Predictability of Pruning Across Scales
  Jonathan Rosenfeld · Jonathan Frankle · Michael Carbin · Nir Shavit
- 2021 Poster: ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
  Stéphane d'Ascoli · Hugo Touvron · Matthew Leavitt · Ari Morcos · Giulio Biroli · Levent Sagun
- 2021 Spotlight: ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
  Stéphane d'Ascoli · Hugo Touvron · Matthew Leavitt · Ari Morcos · Giulio Biroli · Levent Sagun
- 2020: Q&A: Jonathan Frankle
  Jonathan Frankle · Mayoore Jaiswal
- 2020: Contributed Talk: Jonathan Frankle
  Jonathan Frankle
- 2020 Poster: Linear Mode Connectivity and the Lottery Ticket Hypothesis
  Jonathan Frankle · Gintare Karolina Dziugaite · Daniel Roy · Michael Carbin