

Poster in Workshop: Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models

Understanding the Gains from Repeated Self-Distillation

Divyansh Pareek · Simon Du · Sewoong Oh


Abstract: Self-Distillation (SD) is a special type of knowledge distillation where the student model has the same architecture as the teacher model. Despite using the same architecture and the same training data, SD has been empirically observed to improve performance, especially when applied repeatedly -- meaning that it is possible to extract more knowledge from a given training dataset by applying SD. For such a process, there is a fundamental question of interest: How much gain is possible by applying multiple steps of self-distillation? To investigate this gain, we propose studying the simple but canonical task of linear regression. Our analysis shows that the excess risk achieved by multi-step SD can significantly improve upon a single step of SD, reducing the excess risk by a factor as large as $d$, where $d$ is the input dimension. Empirical results on regression tasks from the UCI repository show a reduction in the learnt model's risk (MSE) by up to $47$\% by applying $2$-step SD.
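To make the setup concrete, here is a minimal sketch of repeated self-distillation in the linear-regression setting the abstract describes. It assumes each student is a ridge regressor of the same form as the teacher, trained on a convex combination of the original labels and the previous model's predictions; the mixing weight `xi`, the ridge penalty `lam`, and the function names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def repeated_self_distillation(X, y, lam=1.0, xi=0.5, steps=2):
    """Train a teacher, then repeatedly fit students with the same
    (linear) architecture on the same data, using a mix of the original
    labels and the previous model's predictions as targets."""
    theta = ridge_fit(X, y, lam)  # step 0: ordinary ridge teacher
    for _ in range(steps):
        targets = xi * y + (1.0 - xi) * (X @ theta)  # distillation targets
        theta = ridge_fit(X, targets, lam)           # student = same model class
    return theta

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
theta_star = rng.standard_normal(10)
y = X @ theta_star + 0.5 * rng.standard_normal(200)

theta_sd = repeated_self_distillation(X, y, lam=1.0, xi=0.5, steps=2)
print("train MSE after 2-step SD:", np.mean((X @ theta_sd - y) ** 2))
```

With `steps=1` this reduces to a single SD step; larger `steps` corresponds to the multi-step regime whose excess-risk gains the paper analyzes.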
