Enhancing Logits Distillation with Plug&Play Kendall's $\tau$ Ranking Loss
Yuchen Guan · Runxi Cheng · Kang Liu · Chun Yuan
Abstract
Knowledge distillation typically minimizes the Kullback–Leibler (KL) divergence between teacher and student logits. However, optimizing the KL divergence is difficult for the student and often leads to suboptimal solutions. We further show that the gradients induced by the KL divergence scale with the magnitude of the teacher logits, thereby diminishing updates on low-probability channels. This imbalance weakens the transfer of inter-class information and in turn limits the performance gains achievable by the student. To mitigate this issue, we propose a plug-and-play auxiliary ranking loss based on Kendall’s $\tau$ coefficient that can be seamlessly integrated into any logit-based distillation framework. It supplies inter-class relational information while rebalancing gradients toward low-probability channels. We show that the proposed ranking loss is largely invariant to channel scaling and optimizes an objective aligned with that of the KL divergence, making it a natural complement rather than a replacement. Extensive experiments on the CIFAR-100, ImageNet, and COCO datasets, and across various CNN and ViT teacher-student combinations, demonstrate that our plug-and-play ranking loss consistently boosts the performance of multiple distillation baselines.
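As a concrete illustration of the idea, the sketch below shows one way an auxiliary ranking term could be added on top of a standard KL distillation loss in PyTorch. It uses a tanh-based soft relaxation of Kendall's $\tau$ over pairwise logit differences; the function names, the sharpness parameter `alpha`, the temperature `T`, and the weight `beta` are illustrative assumptions, not the paper's exact formulation or settings.

```python
import torch
import torch.nn.functional as F

def soft_kendall_tau_loss(student_logits, teacher_logits, alpha=1.0):
    """Differentiable surrogate of Kendall's tau between student and teacher
    logit rankings (a sketch, not the paper's exact loss). Returns 1 - tau,
    so minimizing it encourages concordant pairwise class orderings.

    student_logits, teacher_logits: tensors of shape (batch, num_classes)
    alpha: sharpness of the tanh relaxation (hypothetical hyperparameter)
    """
    # Pairwise differences over the class dimension: shape (batch, K, K),
    # where entry [b, i, j] = logit_i - logit_j for sample b.
    s_diff = student_logits.unsqueeze(2) - student_logits.unsqueeze(1)
    t_diff = teacher_logits.unsqueeze(2) - teacher_logits.unsqueeze(1)

    # Soft sign of student differences, hard sign of (detached) teacher
    # differences; their product is ~+1 for concordant pairs, ~-1 otherwise.
    concordance = torch.tanh(alpha * s_diff) * torch.sign(t_diff.detach())

    K = student_logits.size(1)
    num_pairs = K * (K - 1)  # off-diagonal ordered pairs (each unordered pair counted twice)
    tau = concordance.sum(dim=(1, 2)) / num_pairs  # per-sample value in [-1, 1]

    return (1.0 - tau).mean()


def kd_loss_with_rank(student_logits, teacher_logits, T=4.0, beta=0.5):
    """Standard temperature-scaled KL distillation loss plus the auxiliary
    ranking term; T and beta are illustrative values, not the paper's settings."""
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    rank = soft_kendall_tau_loss(student_logits, teacher_logits)
    return kl + beta * rank
```

Because the ranking term depends only on pairwise logit orderings, it is insensitive to channel scaling and keeps gradient signal on low-probability classes, which is the behavior the abstract attributes to the proposed loss.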
Lay Summary
Knowledge distillation transfers capabilities from a powerful teacher model to a lightweight student model. However, existing distillation losses overlook low-probability channels and suffer from suboptimal optimization, limiting the transfer of inter-class relational knowledge and hindering performance gains. To mitigate this issue, we propose a plug-and-play auxiliary ranking loss based on Kendall’s $\tau$ coefficient. It supplies low-probability channel information and aligns optimization objectives, seamlessly integrating with most distillation frameworks. Extensive experiments across multiple datasets and various teacher-student architecture combinations demonstrate that our plug-and-play ranking loss consistently boosts the performance of multiple distillation baselines.