Poster at the Workshop on Theoretical Foundations of Foundation Models (TF2M)
Progressive distillation improves feature learning via implicit curriculum
Abhishek Panigrahi · Bingbin Liu · Sadhika Malladi · Andrej Risteski · Surbhi Goel
Knowledge distillation, where a student model learns from a teacher model, is a widely adopted approach to improving the training of small models. A known challenge in distillation is that a large teacher-student performance gap can hurt its effectiveness, which prior works have aimed to mitigate by providing intermediate supervision. In this work, we study a popular approach called progressive distillation, where several intermediate checkpoints of the teacher are used successively to supervise the student as it learns. Using sparse parity as a testbed, we show empirically and theoretically that these intermediate checkpoints constitute an implicit curriculum that accelerates student learning. This curriculum provides explicit supervision for learning the features underlying the task, supervision that a fully trained teacher does not provide.
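To make the setup concrete, below is a minimal sketch of progressive distillation on sparse parity. It is not the paper's exact configuration: the network widths, learning rates, checkpoint interval, and the choice of the first k coordinates as the parity support are illustrative assumptions. The key idea it illustrates is that the student is distilled against a sequence of intermediate teacher checkpoints rather than only the final teacher.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# (n, k)-sparse parity: the label is the parity of a fixed subset of k
# coordinates out of n +/-1 inputs. (Support choice here is an assumption.)
n, k = 20, 4
support = torch.arange(k)

def sample_sparse_parity(batch_size):
    x = torch.randint(0, 2, (batch_size, n)).float() * 2 - 1  # +/-1 inputs
    y = (x[:, support].prod(dim=1) > 0).long()                # parity of the support
    return x, y

def make_mlp(width):
    return nn.Sequential(nn.Linear(n, width), nn.ReLU(), nn.Linear(width, 2))

# Train a wide teacher with vanilla SGD, saving intermediate checkpoints.
teacher = make_mlp(width=1024)
t_opt = torch.optim.SGD(teacher.parameters(), lr=0.1)
checkpoints = []
for step in range(1, 5001):
    x, y = sample_sparse_parity(256)
    loss = F.cross_entropy(teacher(x), y)
    t_opt.zero_grad(); loss.backward(); t_opt.step()
    if step % 1000 == 0:
        checkpoints.append(copy.deepcopy(teacher.state_dict()))

# Progressive distillation: the narrow student matches the soft predictions of
# successive teacher checkpoints instead of only the fully trained teacher.
student = make_mlp(width=64)
s_opt = torch.optim.SGD(student.parameters(), lr=0.1)
frozen_teacher = make_mlp(width=1024)
for ckpt in checkpoints:
    frozen_teacher.load_state_dict(ckpt)
    frozen_teacher.eval()
    for _ in range(1000):
        x, _ = sample_sparse_parity(256)
        with torch.no_grad():
            t_logits = frozen_teacher(x)
        # KL divergence between student and teacher predictive distributions
        loss = F.kl_div(F.log_softmax(student(x), dim=1),
                        F.softmax(t_logits, dim=1), reduction="batchmean")
        s_opt.zero_grad(); loss.backward(); s_opt.step()
```

Replacing the loop over `checkpoints` with distillation from only the last checkpoint recovers standard (one-shot) distillation, which is the baseline the progressive scheme is compared against.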