Do Students Debias Like Teachers? On the Distillability of Bias Mitigation Methods
Abstract
Knowledge distillation (KD) is an effective method for model compression and for transferring knowledge between models. However, its effect on a model's robustness against spurious correlations, shortcuts, and task-irrelevant features that degrade performance on out-of-distribution data remains underexplored. This study investigates the effect of knowledge distillation on natural language inference (NLI) and image classification tasks, with a focus on the transferability of ``debiasing'' capabilities from teacher models to student models. Through extensive experiments, we establish several key findings: (i) the effect of KD on debiasing performance depends on the underlying debiasing method, the relative scale of the models involved, and the size of the training set; (ii) KD effectively transfers debiasing capabilities when the teacher and student are similar in scale (number of parameters); (iii) KD may amplify the student model's reliance on spurious features, and this effect does not diminish as the teacher model scales up; and (iv) although the overall robustness of a model may remain stable post-distillation, significant variations can occur across different types of biases. Given these findings, we propose three effective solutions to improve the distillability of debiasing methods: developing high-quality data for augmentation, implementing iterative knowledge distillation, and initializing student models with weights obtained from teacher models.