
A statistical perspective on distillation
Aditya Menon · Ankit Singh Rawat · Sashank Jakkam Reddi · Seungyeon Kim · Sanjiv Kumar

Wed Jul 21 07:20 AM -- 07:25 AM (PDT)

Knowledge distillation is a technique for improving a "student" model by replacing its one-hot training labels with a label distribution obtained from a "teacher" model. Despite its broad success, several basic questions (e.g., why does distillation help? why do more accurate teachers not necessarily distill better?) have received limited formal study. In this paper, we present a statistical perspective on distillation which provides answers to these questions. Our core observation is that a "Bayes teacher" providing the true class-probabilities can lower the variance of the student objective, and thus improve performance. We then establish a bias-variance tradeoff that quantifies the value of teachers that approximate the Bayes class-probabilities. This provides a formal criterion as to what constitutes a "good" teacher, namely, the quality of its probability estimates. Finally, we illustrate how our statistical perspective facilitates novel applications of distillation to bipartite ranking and multiclass retrieval.
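The variance-reduction claim can be illustrated with a small simulation. The sketch below is not the authors' code: it assumes a hypothetical toy problem with five discrete inputs, binary labels, known true class-probabilities (`p_star`), and a fixed student prediction (`q`), all invented for illustration. It compares the usual one-hot empirical log-loss against the same loss averaged under the Bayes class-probabilities, over many resampled training sets.

```python
import numpy as np

def risk_estimates(n_samples=50, n_trials=2000, seed=0):
    """Return per-trial one-hot and Bayes-distilled empirical risks."""
    rng = np.random.default_rng(seed)

    # Hypothetical toy problem (values chosen only for illustration).
    p_star = np.array([0.1, 0.3, 0.5, 0.7, 0.9])  # true P(y = 1 | x)
    q = np.array([0.2, 0.2, 0.6, 0.6, 0.8])       # fixed student predictions

    loss_pos = -np.log(q)        # log-loss incurred when y = 1
    loss_neg = -np.log(1.0 - q)  # log-loss incurred when y = 0

    # Draw n_trials independent training sets of size n_samples.
    x = rng.integers(0, len(p_star), size=(n_trials, n_samples))
    y = rng.random((n_trials, n_samples)) < p_star[x]

    # One-hot objective: the loss on the sampled label only.
    one_hot = np.where(y, loss_pos[x], loss_neg[x]).mean(axis=1)

    # Bayes-distilled objective: the loss averaged under the true
    # class-probabilities at each sampled input.
    distilled = (p_star[x] * loss_pos[x]
                 + (1.0 - p_star[x]) * loss_neg[x]).mean(axis=1)
    return one_hot, distilled

one_hot, distilled = risk_estimates()
print("mean risk (one-hot, distilled):", one_hot.mean(), distilled.mean())
print("variance  (one-hot, distilled):", one_hot.var(), distilled.var())
```

Under these assumptions both estimates target the same population risk, but the distilled one averages out the label noise at each input, so its variance across training sets is smaller. A teacher that only approximates the true probabilities would trade some of this variance reduction for bias.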

Author Information

Aditya Menon (Google Research)
Ankit Singh Rawat (Google)
Sashank Jakkam Reddi (Google)
Seungyeon Kim (Google Research)
Sanjiv Kumar (Google Research, NY)
