Theoretical Characterization of Generalization in Knowledge Distillation
Abstract
This paper presents a theoretical study of generalization in cross-domain knowledge distillation. Using a high-dimensional asymptotic analysis of a linear teacher–student model, we characterize the student's excess risk under both model shift and covariate shift between the source and target domains. Our results provide a formal guarantee for the efficacy of distillation: even when the two domains differ substantially, there may still exist a regime in which the distilled student generalizes better than a student trained on the target data alone. Moreover, we identify a \textit{crossed double descent} phenomenon: the excess risk can vary non-monotonically in both the teacher's and the student's dimension-to-sample-size ratios. Together, these results give rigorous insight into when and why distillation helps across domains.
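For concreteness, one minimal instantiation of the kind of linear teacher–student setup described above is sketched below; all notation there ($\lambda$, $\beta_t^*$, $\Sigma_{\mathrm{tgt}}$, and so on) is an illustrative assumption of this sketch rather than the paper's own.

% A minimal, purely illustrative instantiation (assumed notation): the
% teacher fits \hat\beta_t from n_t source samples
%   x_i \sim \mathcal{N}(0, \Sigma_s),  y_i = x_i^\top \beta_t^* + \varepsilon_i,
% while the student fits n_s target samples
%   \tilde x_j \sim \mathcal{N}(0, \Sigma_{\mathrm{tgt}})
% whose labels mix ground truth with teacher predictions:
\[
  \hat\beta_s
  \;=\; \arg\min_{\beta \in \mathbb{R}^d}
  \sum_{j=1}^{n_s}
  \Bigl( (1-\lambda)\, y_j \;+\; \lambda\, \tilde x_j^\top \hat\beta_t
         \;-\; \tilde x_j^\top \beta \Bigr)^{2},
  \qquad \lambda \in [0,1].
\]
% Covariate shift corresponds to \Sigma_s \neq \Sigma_{\mathrm{tgt}},
% model shift to \beta_t^* \neq \beta_s^*. The student's excess risk on
% the target domain is then
\[
  \mathcal{R}(\hat\beta_s)
  \;=\; \mathbb{E}_{x \sim \mathcal{N}(0,\,\Sigma_{\mathrm{tgt}})}
        \Bigl[ \bigl( x^\top (\hat\beta_s - \beta_s^*) \bigr)^{2} \Bigr]
  \;=\; (\hat\beta_s - \beta_s^*)^\top \Sigma_{\mathrm{tgt}}\, (\hat\beta_s - \beta_s^*),
\]
% studied asymptotically as d, n_t, n_s \to \infty with the
% dimension-to-sample-size ratios d/n_t and d/n_s held fixed, the axes
% along which a crossed double descent curve would be traced.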